
READING PDF Files

Posted: Thu Jul 10, 2008 5:09 pm
by omalainternet
Hey,

I'm making a simple script that grabs all the links from a .pdf file.
About 90% of the PDF files on my computer can be read with Notepad. Most of the content is gibberish, but the URLs are easily readable: the links appear in the source text as .....URI(http://www.domain.com/)....., so they're really easy to strip out.
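
For the readable files, the stripping step could look roughly like the sketch below. It is only an illustration in Python, not the script from this thread: the function name extract_uris is made up, and it assumes the URLs sit uncompressed in the file and contain no escaped or nested parentheses.

```python
import re

# Matches URI(...) entries as they appear in the raw text of an
# uncompressed PDF. Assumption: the URL itself contains no ")" character.
URI_PATTERN = re.compile(rb"URI\s*\(([^)]*)\)")


def extract_uris(pdf_bytes):
    """Return the URLs found in URI(...) entries, in order of appearance."""
    # PDF files are binary, so search the bytes and decode only the matches.
    return [match.decode("latin-1") for match in URI_PATTERN.findall(pdf_bytes)]
```

To try it on a file, read the file in binary mode first, for example `extract_uris(open("some.pdf", "rb").read())`, where some.pdf is a placeholder name. Files whose contents are compressed will simply return an empty list.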

For the other 10%, however, the entire file appears as gibberish.
I've tried some standard decoding/conversions, but nothing makes the links/URIs readable.

I did notice a very important distinction between the 10% and the working 90%. In the 90% of PDFs that work, when I view them in Adobe Reader and hover over a link, the hand cursor appears with a 'w' in it (as in www). In the unreadable 10%, hovering over a link shows just the plain pointer without the 'w'.

I hope this distinction helps someone figure out what I can do to make this 10% readable (well, the links at least; that's all I care about at this point).

Thanks in advance everyone.

Re: READING PDF Files

Posted: Thu Jul 10, 2008 5:45 pm
by omalainternet
To find out more, I used Adobe's online conversion tool to convert both the working and non-working PDFs into HTML. For the 10% that don't work (link-wise), Adobe converted the links to plain text with bold and underline tags around it.
For the 90% that do work, Adobe converted the links in the document into actual HTML links, anchor tags and all.

Re: READING PDF Files

Posted: Fri Jul 11, 2008 2:41 am
by Rovas
There is a program named pdftohtml that converts PDF files into HTML files. It works nicely with the majority of files, but it's slow, and with the latest PDF versions it sometimes makes mistakes converting the more advanced features. To get the URLs you need, use a regular expression to search the HTML for anchors. I am bad at writing those, so study the documentation.
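
As a starting point, an anchor search could be sketched like this in Python. The function name extract_hrefs is hypothetical, and the pattern assumes the converted HTML puts href values in single or double quotes; a real HTML parser would be more robust against unusual markup.

```python
import re

# Captures the quoted href value of each <a ...> tag.
# Assumption: href values are wrapped in single or double quotes.
ANCHOR_PATTERN = re.compile(
    r"""<a\s[^>]*?href\s*=\s*["']([^"']+)["']""",
    re.IGNORECASE,
)


def extract_hrefs(html):
    """Return the href values of all anchor tags, in document order."""
    return ANCHOR_PATTERN.findall(html)
```

Feed it the text of the converted HTML file. Note that it only finds real anchors, so links that were converted as plain bold/underlined text will not be picked up.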

Re: READING PDF Files

Posted: Fri Jul 11, 2008 10:37 am
by omalainternet
Yeah I'll have to go with that and host the script on my own server, thanks m8