PHP PDF XML

ryuuka
Forum Contributor
Posts: 128
Joined: Tue Sep 05, 2006 8:18 am
Location: the netherlands

Post by ryuuka »

A few questions about this:
  1. Can I use PHP to convert PDF files to XML?
  2. If so, what do I use to do that?
  3. Can you give me some pointers as to where I should look?
  4. Can it be done with Word documents?
  5. If so, what do I use to do that?
  6. Can you give me some pointers as to where I should look?
I hope you can help me. The web gives me nothing but programs you have to pay for or
programs that don't work.

thank you
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

Are you looking for something like http://pdftohtml.sourceforge.net/ ?

Post by ryuuka »

No, this is a command-line program, and since I want to do large batches, like 200 files,
it wouldn't be practical.
What about Word docs?

I recently found something on our company network that converts large batches of
PDF files to DOC files, so maybe it would be easier with DOC files?

Post by volka »

Is the result of http://pdftohtml.sourceforge.net/ what you're looking for?
You can write a PHP script that calls pdftohtml for all your documents. Even a shell script could do it.
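
The batch idea above can be sketched in PHP roughly as follows. This is only a minimal sketch: it assumes the pdftohtml binary is installed and on the PATH, the function names `buildCommand` and `convertAll` are made up for illustration, and the directory names in the usage comment are placeholders.

```php
<?php
// Sketch: run pdftohtml in XML mode over every PDF in a directory.
// Assumption: the pdftohtml binary is on the PATH.

/**
 * Build the shell command for one PDF (hypothetical helper).
 * The output name is passed without an extension; with -xml,
 * pdftohtml is expected to write <name>.xml.
 */
function buildCommand(string $pdfPath, string $outDir): string
{
    $base = $outDir . '/' . pathinfo($pdfPath, PATHINFO_FILENAME);
    return 'pdftohtml -xml ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($base);
}

/**
 * Convert every *.pdf in $inDir and return the files that failed
 * (non-zero exit code from pdftohtml).
 */
function convertAll(string $inDir, string $outDir): array
{
    $failed = [];
    // glob() can return false on error, so fall back to an empty list.
    foreach (glob($inDir . '/*.pdf') ?: [] as $pdf) {
        $output = []; // exec() appends to this array, so reset it each time
        exec(buildCommand($pdf, $outDir), $output, $exitCode);
        if ($exitCode !== 0) {
            $failed[] = $pdf;
        }
    }
    return $failed;
}

// Usage (placeholder directories):
// $failed = convertAll('/path/to/pdfs', '/path/to/xml');
```

Quoting each path with `escapeshellarg` matters here because file names such as "GSM091 - john doe 6310l.doc" contain spaces.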

Post by ryuuka »

It's not what I'm looking for, because it outputs every page
in a picture format. I don't need that, because once I
get this done I want to load the data from the PDF files into a database.

Post by volka »

ryuuka wrote: because it outputs every page into a picture format.

No. I got this result:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
	<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="54" left="54" width="153" height="15" font="0">yadda yadda yadda</text>
</page>
</pdf2xml>
and that's not a picture.
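
Since the goal is to load the data into a database later, XML of this shape can be read with PHP's SimpleXML extension. The following is a rough sketch, not a definitive implementation: `extractText` is a made-up name, and it assumes the file follows the `<pdf2xml>` / `<page>` / `<text>` layout shown above.

```php
<?php
// Sketch: pull the text fragments out of pdftohtml's XML output.
// Assumption: the layout is <pdf2xml><page><text>...</text></page></pdf2xml>.

/**
 * Returns one array per <text> element with its page number,
 * position, and content (hypothetical helper).
 */
function extractText(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Could not parse the XML');
    }

    $rows = [];
    foreach ($doc->page as $page) {
        foreach ($page->text as $text) {
            $rows[] = [
                'page' => (int) $page['number'],
                'top'  => (int) $text['top'],
                'left' => (int) $text['left'],
                // textContent also includes text nested in inline
                // child elements, which a plain string cast would skip.
                'text' => trim(dom_import_simplexml($text)->textContent),
            ];
        }
    }
    return $rows;
}

// Usage (placeholder file name):
// $rows = extractText(file_get_contents('out.xml'));
```

Each returned row could then be passed to whatever database layer is in use; the `top` and `left` values help to reassemble table columns from the absolute positions.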

Post by ryuuka »

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
</page>
</pdf2xml>
I don't know how you use it, but the above is all I get from a document that contains the following:
GSM091 - john doe 6310l.doc

Hierbij verklaart john doe van company te hebben ontvangen: een mobiele telefoon met de volgende specificaties:

Mobiel Regnr. Fabrikaat Type Imeinr. Simkaartnr Mobiel nummer Gripnummer Datum uitgifte PUK code

MB091 Oii&KiQ©6]}
Nokia l
63101
351550009302224
15786651
0622958023
4750
27/04/2004
NVT

Betrokkene verklaart tevens met bovengenoemde apparatuur zorgvuldig om te gaan. In geval van eventuele schade c.q. zoekraken of niet retourneren zullen de ontbrekende produkten door de afdeling ICT worden aangeschaft.

Het volledige bedrag (inkoop factuur) zal rechtstreeks via salarisadministratie op maandloon verrekend worden.

Handtekening voo/akkoord:
Uitgegeven door IT Medewerker:

NOKIA 6310i KPN Silver
IMEI :00000000000000
0000000000000000000
[A"

Post by volka »

Is it already an image in the PDF itself? Can you select and copy the text in Acrobat Reader?

I invoked:

pdftohtml -xml test.pdf out
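
For a batch of 200 files, one way to spot image-only (scanned) PDFs without opening each one is to look for pages whose XML contains no `<text>` element at all. The sketch below is an assumption-laden illustration: `pagesWithoutText` is a made-up name, and it assumes the XML has the `<pdf2xml>` / `<page>` / `<text>` shape posted earlier in this thread.

```php
<?php
// Sketch: flag pages that have no text layer in pdftohtml's XML output.
// Assumption: text, when present, appears as <text> children of <page>.

/**
 * Returns the numbers of all pages that contain no <text> element
 * (hypothetical helper).
 */
function pagesWithoutText(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Could not parse the XML');
    }

    $empty = [];
    foreach ($doc->page as $page) {
        // count() on a SimpleXML child list gives the number of matches.
        if (count($page->text) === 0) {
            $empty[] = (int) $page['number'];
        }
    }
    return $empty;
}

// Usage (placeholder file name):
// $scannedPages = pagesWithoutText(file_get_contents('out.xml'));
```

A file whose pages all show up in the returned list is most likely a scan that still needs OCR before any text can be extracted.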

Post by ryuuka »

Never mind, I already know why it won't convert properly.

I scanned all these documents as PDF files.
Normally the server would automatically convert all these files to PDFs with text, but because I wanted to
get it done quickly I dragged them from the folder they landed in into the corresponding folder
instead of waiting a few minutes until the server had converted them to PDFs with text in them.

I'm an idiot.

My guess is everything will work out fine once I change these to text PDFs. It shouldn't be too hard.

Thanks for the advice, and sorry for wasting your time with this.