PHP PDF XML

Posted: Tue Feb 20, 2007 2:29 am
by ryuuka
A few questions concerning this:
  1. Can I use PHP to convert PDF files to XML?
  2. If so, what do I use to do that?
  3. Can you give me some pointers as to where I should look?
  4. Can it be done with Word documents?
  5. If so, what do I use to do that?
  6. Can you give me some pointers as to where I should look?
I hope you can help me. The web supplies me with nothing but programs you have to pay for or
programs that don't work.

thank you

Posted: Tue Feb 20, 2007 2:33 am
by volka
Are you looking for something like http://pdftohtml.sourceforge.net/ ?

Posted: Tue Feb 20, 2007 2:43 am
by ryuuka
No, this is a command-line program, and since I want to do large batches of around 200 files,
it wouldn't be practical.
What about Word docs?

I recently found something on our company network that converts large batches of
PDF files to .doc files, so maybe it would be easier with .doc files?

Posted: Tue Feb 20, 2007 2:46 am
by volka
Is the result of http://pdftohtml.sourceforge.net/ what you're looking for?
You can write a php script to call the pdftohtml for all your documents. Even a shell script could do it.
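
To make the batch idea concrete, here is a minimal shell sketch, assuming pdftohtml is installed and on the PATH. The function name `convert_all`, the `PDFTOHTML` variable, and the directory names are made up for illustration; only the `-xml` option comes from the tool itself.

```shell
#!/bin/sh
# Converter command; override PDFTOHTML if the binary lives elsewhere (assumption).
PDFTOHTML="${PDFTOHTML:-pdftohtml}"

# convert_all SRC_DIR OUT_DIR
# Runs the converter with -xml on every *.pdf in SRC_DIR and writes
# OUT_DIR/<name>.xml for each one. Failures are reported but do not stop the batch.
convert_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for pdf in "$src_dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # no PDFs matched the pattern
        base=$(basename "$pdf" .pdf)
        "$PDFTOHTML" -xml "$pdf" "$out_dir/$base" || echo "failed: $pdf" >&2
    done
}
```

Calling `convert_all ./pdfs ./xml` would then handle 200 files as easily as one.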

Posted: Tue Feb 20, 2007 2:54 am
by ryuuka
It's not what I'm looking for, because it outputs every page
as a picture. I don't need that, because once I
get this done I want to put the data from the PDF files into a database.

Posted: Tue Feb 20, 2007 3:03 am
by volka
ryuuka wrote: because it outputs every page
into a picture format.
No. I got this result:

Code:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
	<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="54" left="54" width="153" height="15" font="0">yadda yadda yadda</text>
</page>
</pdf2xml>
and that's not a picture.
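
Since the goal is to get the data into a database, here is a rough sketch of pulling the text out of that XML as tab-separated rows (top, left, text). It assumes each `<text>` element sits on one line with `top` and `left` as its first two attributes, as in the output above. The function name `extract_text` is made up, and a real XML parser would be more robust than sed.

```shell
#!/bin/sh
# extract_text FILE.xml
# Prints one line per <text> element: top<TAB>left<TAB>content.
# Assumes one <text> element per line, with top and left as the first attributes.
extract_text() {
    tab=$(printf '\t')
    sed -n "s/.*<text top=\"\([0-9]*\)\" left=\"\([0-9]*\)\"[^>]*>\(.*\)<\/text>.*/\1${tab}\2${tab}\3/p" "$1"
}
```

The resulting rows could be loaded with whatever bulk-import feature the database offers. Any inline markup or XML entities inside the text are left untouched by this sketch.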

Posted: Tue Feb 20, 2007 3:21 am
by ryuuka

Code:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">

<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1262" width="892">
</page>
</pdf2xml>
I don't know how you use it, but this is all I get from a document that contains the following:
GSM091 - john doe 6310l.doc

John Doe of company hereby declares having received: a mobile telephone with the following specifications:

Mobile reg. no. Make Type IMEI no. SIM card no. Mobile number Grip number Date of issue PUK code

MB091 Oii&KiQ©6]}
Nokia l
63101
351550009302224
15786651
0622958023
4750
27/04/2004
NVT

The person concerned also declares that they will handle the above equipment with care. In the event of damage, loss, or failure to return it, the missing products will be purchased by the ICT department.

The full amount (purchase invoice) will be deducted directly from the monthly salary via payroll administration.

Signature for approval:
Issued by IT employee:

NOKIA 6310i KPN Silver
IMEI :00000000000000
0000000000000000000
[A"

Posted: Tue Feb 20, 2007 3:23 am
by volka
Is it already an image in the PDF itself? Can you select and copy the text in Acrobat Reader?

I invoked

Code:

pdftohtml -xml test.pdf out
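
A quick way to spot image-only PDFs in a batch is to check whether the generated XML contains any `<text>` elements at all. This is only a sketch: the function name `has_text_layer` is made up, and it assumes the XML layout shown earlier in this thread.

```shell
#!/bin/sh
# has_text_layer FILE.xml
# Succeeds (exit 0) if the converted XML contains at least one <text> element,
# fails otherwise, which suggests the PDF only holds scanned images.
has_text_layer() {
    grep -q '<text ' "$1"
}
```

For example, `has_text_layer out.xml || echo "out.xml: no text, probably a scan"` would flag documents that still need OCR.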

Posted: Tue Feb 20, 2007 3:35 am
by ryuuka
Never mind, I already know why it won't convert right.

I scanned all these documents as PDF files.
Normally the server would automatically convert all these files to PDFs with text, but because I wanted to
get it done quickly I dragged them out of the folder they landed in and into the corresponding folder
instead of waiting a few minutes until the server had converted them to PDFs with text in them.

I'm an idiot.

My guess is everything will work out fine once I change these to text PDFs. It shouldn't be too hard.

Thanks for the advice, and sorry for wasting your time with this.