READING PDF Files

omalainternet
Forum Newbie
Posts: 3
Joined: Thu Jul 10, 2008 5:00 pm


Post by omalainternet »

Hey,

I'm writing a simple script that grabs all the links from a .pdf file.
About 90% of the PDF files on my computer can be read in Notepad. Most of the content is gibberish, but the URLs are easily readable: the links appear in the source text as .....URI(http://www.domain.com/)....., so they're really easy to strip out.
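
For the readable 90%, a minimal sketch of the stripping step might look like the block below. The function name extractPdfUris is hypothetical, and the pattern assumes the link target is stored uncompressed as URI(...) with no escaped parentheses inside the URL.

```php
<?php
// Hypothetical helper: collect link targets from the raw bytes of a PDF
// whose links are stored uncompressed as URI(http://...).
function extractPdfUris(string $pdfBytes): array
{
    // Match "URI", optional whitespace, then a parenthesised string.
    // Assumption: the URL itself contains no ")" character.
    preg_match_all('~URI\s*\(([^)]*)\)~', $pdfBytes, $matches);

    // Drop duplicates and reindex the result from 0.
    return array_values(array_unique($matches[1]));
}
```

It could be called as extractPdfUris(file_get_contents('document.pdf')), where document.pdf is just a placeholder filename.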

For the other 10%, however, the entire file appears as gibberish.
I've tried some standard decoding/conversions, but nothing makes the links/URIs readable.

I did notice an important distinction between the 10% and the working 90%. In the working 90%, when I view a file in Adobe Reader and hover over a link, the hand cursor appears with a 'w' in it (as in www). In the unreadable 10%, hovering over a link shows just the plain pointer without the 'w'.

I hope this distinction helps someone figure out what I can do to make this 10% readable (well, the links at least; that's all I care about at this point).

Thanks in advance everyone.
omalainternet
Forum Newbie
Posts: 3
Joined: Thu Jul 10, 2008 5:00 pm

Re: READING PDF Files

Post by omalainternet »

To find out more, I used Adobe's online conversion tool to convert both the working and non-working PDFs into HTML. For the 10% that don't work (link-wise), Adobe converted the links to plain text with bold and underline tags around them.
For the 90%, Adobe converted the links in the document into actual HTML links, anchor tags and all.
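
One possible explanation, and it is only an assumption, is that in the 10% the URLs are ordinary page text sitting inside compressed (FlateDecode) streams rather than link annotations. A rough sketch for that case: inflate each stream with gzuncompress() and look for anything that resembles a URL. The function name is hypothetical, and this only helps when the text is stored as plain literal strings; it will miss URLs split across several text operators or written in a custom font encoding.

```php
<?php
// Hypothetical helper: inflate every stream in the PDF and collect anything
// that looks like an http(s) URL. Assumes FlateDecode (zlib) compression.
function extractUrlsFromCompressedStreams(string $pdfBytes): array
{
    $urls = [];

    // Each stream body sits between the "stream" and "endstream" keywords.
    preg_match_all('~stream\r?\n(.*?)endstream~s', $pdfBytes, $matches);

    foreach ($matches[1] as $body) {
        $decoded = false;

        // The body may end with an EOL (LF or CRLF) that is not part of the
        // data, so try it as-is, then with 1 and 2 trailing bytes removed.
        foreach ([0, 1, 2] as $trim) {
            $candidate = $trim > 0 ? substr($body, 0, -$trim) : $body;
            // @ hides warnings for streams that are not zlib data.
            $decoded = @gzuncompress((string) $candidate);
            if ($decoded !== false) {
                break;
            }
        }
        if ($decoded === false) {
            continue; // Uses some other filter, or is not compressed.
        }

        // Stop at whitespace, parentheses, and brackets, which delimit
        // strings in PDF syntax.
        if (preg_match_all('~https?://[^\s()<>\[\]]+~i', $decoded, $found)) {
            $urls = array_merge($urls, $found[0]);
        }
    }

    return array_values(array_unique($urls));
}
```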
Rovas
Forum Contributor
Posts: 272
Joined: Mon Aug 21, 2006 7:09 am
Location: Romania

Re: READING PDF Files

Post by Rovas »

There is a program named pdftohtml that converts PDF files into HTML files. It works nicely with the majority of files, but it's slow, and for the latest PDF versions it sometimes makes mistakes converting the more advanced features. To get the URLs you need, use a regular expression to search for anchors. I'm bad at writing them, but study the documentation.
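
As a rough sketch of the anchor search: this version uses PHP's DOM extension in place of a regular expression, since DOMDocument copes better with messy HTML. The function name extractAnchorHrefs is hypothetical, and it assumes the converted HTML is already available as a string.

```php
<?php
// Hypothetical helper: collect the http(s) targets of all <a href="...">
// elements in an HTML string, e.g. the output of a PDF-to-HTML converter.
function extractAnchorHrefs(string $html): array
{
    // Converter output is rarely valid HTML, so collect parser warnings
    // internally and discard them afterwards.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // Keep external links only; skip named anchors and "#..." jumps.
        if (preg_match('~^https?://~i', $href)) {
            $hrefs[] = $href;
        }
    }

    return array_values(array_unique($hrefs));
}
```

It could be called as extractAnchorHrefs(file_get_contents('output.html')), where output.html is a placeholder for whatever file the converter wrote. URLs that the converter emits as plain text rather than anchors would need a URL pattern instead.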
omalainternet
Forum Newbie
Posts: 3
Joined: Thu Jul 10, 2008 5:00 pm

Re: READING PDF Files

Post by omalainternet »

Yeah, I'll have to go with that and host the script on my own server. Thanks, m8.