Reading PDF Files
Posted: Thu Jul 10, 2008 5:09 pm
Hey,
I'm just making a simple script that grabs all the links from a .pdf file.
About 90% of the PDF files on my computer can be opened in Notepad. Most of the content is gibberish, but the URLs are easily readable: links appear in the source text as .....URI(http://www.domain.com/)....., so they're really easy to strip out.
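For reference, the stripping step for the readable 90% can be sketched roughly like this. It is only a minimal sketch: the names `URI_PATTERN`, `extract_links` and `extract_links_from_file` are made up for illustration, and the pattern assumes the URL itself contains no closing parenthesis.

```python
import re

# Assumed pattern for the "URI(http://...)" entries seen in the raw file.
# It stops at the first ")" and so ignores escaped parentheses inside a URL.
URI_PATTERN = re.compile(rb"URI\s*\(([^)]*)\)")


def extract_links(data):
    """Return the URLs found in raw PDF bytes, in order, without duplicates."""
    links = []
    for match in URI_PATTERN.finditer(data):
        # latin-1 maps every byte to a character, so decoding cannot fail
        url = match.group(1).decode("latin-1")
        if url not in links:
            links.append(url)
    return links


def extract_links_from_file(path):
    """Read a file in binary mode (as Notepad shows it) and scan it for links."""
    with open(path, "rb") as f:
        return extract_links(f.read())
```

Reading the file as bytes rather than text avoids decoding errors on the binary parts of the file.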
For the other 10%, however, everything appears as gibberish.
I've tried some standard decoding/conversions, but nothing makes the links/URIs readable.
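One decoding attempt that may be worth sketching: PDF files can store content in zlib-compressed (FlateDecode) streams, so the URI entries might be sitting inside compressed data. Whether that is what's happening in the unreadable 10% is an assumption, not something confirmed here; the files could also be encrypted or use another filter. The names `inflate_streams` and `extract_links_deep` are hypothetical, and the stream matching is deliberately naive.

```python
import re
import zlib

# Naive match for "stream ... endstream" sections; a real parser would use
# the /Length entry of the stream dictionary instead.
STREAM_PATTERN = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
URI_PATTERN = re.compile(rb"URI\s*\(([^)]*)\)")


def inflate_streams(data):
    """Return the inflated content of every stream that zlib can decode."""
    inflated = []
    for match in STREAM_PATTERN.finditer(data):
        try:
            # decompressobj tolerates trailing bytes after the zlib data
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not zlib data (other filter, encrypted, or an image): skip it
            continue
    return inflated


def extract_links_deep(data):
    """Scan the raw bytes first, then every stream that could be inflated."""
    links = []
    for chunk in [data] + inflate_streams(data):
        for match in URI_PATTERN.finditer(chunk):
            url = match.group(1).decode("latin-1")
            if url not in links:
                links.append(url)
    return links
```

If this turns up nothing, the links are probably not stored as compressed URI entries, and a proper PDF library would be the next thing to try.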
I did notice a very important distinction between the 10% and the working 90%. In the working 90%, when viewing the PDF in Adobe Reader and hovering over a link, the hand cursor appears with a 'w' in it (as in www). In the unreadable 10%, hovering over a link just shows the pointer without the 'w'.
I hope this distinction helps someone figure out what I can do to make this 10% readable (well, the links at least; that's all I care about at this point).
Thanks in advance, everyone.