Parsing PDF files

Galahad
Forum Contributor
Posts: 111
Joined: Fri Jun 14, 2002 5:50 pm

Post by Galahad »

I'm trying to parse a PDF file to read its keywords and title. I had to go through the PDFs on my site and add the keywords and change the title. When I look at the file in a text editor (after all of the garbage), it seems that Acrobat simply appends a new section onto the document with the new title, author, etc. information in it. Anyway, after I open the connection with fsockopen and send the HTTP GET request (it works fine for HTML documents), I read the result with this code:

while (!feof($fp)) {
    $line = fgets($fp, 2048);
    if ($is_pdf) {                                                  // deals with .pdf files
      if (preg_match("/(?U)\/Title\s+\(.*\)/", $line)) {
        $temp = preg_replace("/.*\/Title\s+\(/", "", $line);
        $title = preg_replace("/\).*/", "", $temp);
      }

      if (preg_match("/(?U)\/Keywords\s+\(.*\)/", $line)) {
        $temp = preg_replace("/.*\/Keywords\s+\(/", "", $line);
        $keywords = preg_replace("/\).*/", "", $temp);
      }
    } else {
      // process html pages
    }
}

After it finishes reading the page, I process the $title and $keywords variables. Here's my problem: while it matches some lines for the title and keywords expressions, it seems to quit reading the file about 25-35 lines before the actual end of the file (where the latest title and keywords are). As a result, the title and keywords that it reads are worthless to me. Any ideas why it would do that? I know feof() will return true if an error occurs, so that may be happening. If so, how can I tell whether it was an error or the actual EOF? I'd appreciate some help, thanks.
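
On the feof() question, here is a minimal sketch, assuming PHP 7 or later and a stream such as the one fsockopen() returns. stream_get_meta_data() (socket_get_status() is an alias for it) reports 'eof' and 'timed_out' as separate flags, which is one way to tell a timeout from a real end of stream. The helper name read_all_with_status() and the 8192-byte chunk size are made up for illustration.

```php
<?php
/**
 * Read everything from an open stream and report why reading stopped.
 * read_all_with_status() is a hypothetical helper, not a PHP built-in.
 */
function read_all_with_status($fp, int $chunkSize = 8192): array
{
    $data = '';
    $readFailed = false;

    while (!feof($fp)) {
        // fread() is binary-safe and does not care about line endings.
        $chunk = fread($fp, $chunkSize);
        if ($chunk === false) {
            $readFailed = true;
            break;
        }
        $data .= $chunk;

        // Stop if the socket timed out instead of spinning on empty reads.
        $meta = stream_get_meta_data($fp);
        if ($meta['timed_out']) {
            break;
        }
    }

    $meta = stream_get_meta_data($fp);

    return [
        'data'        => $data,
        'eof'         => $meta['eof'],       // true only at a real end of stream
        'timed_out'   => $meta['timed_out'], // true if the read timed out
        'read_failed' => $readFailed,
    ];
}
```

If 'timed_out' comes back true, the missing tail would be a socket timeout rather than a short file, and stream_set_timeout() can raise the limit before reading.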

On a side note: does anyone know how to get rid of the old document info (title, author, etc.) in a PDF file?
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »


Post by Galahad »

This is sort of embarrassing: I guess the file just took a little while to get updated in the server cache (or something), or I'm just really confused. I went to lunch, came back, and it worked fine. Thanks for the suggestion, volka; those are good functions to know.

Post by Galahad »

I thought I had this working. Now it finds the line I want in one PDF, but skips about 20-30 lines around what I need in another. I had it echo each line as it reads it, but the lines I want are just missing. If I download the PDF and open it with WordPad or Notepad, the lines I want are there. In WordPad, the lines I want are on their own lines, but in Notepad, they are not. Nevertheless, they ought to at least get sent to the PHP script. Is there some kind of non-printable character that would cause it to either not send a few lines or stop until another special character is reached? I've tried to redo the PDF files, but the same thing keeps happening. I did some tests where it echoes the socket_get_status info, and since it works on one file, it's not a socket problem. It happens every time, too, so I don't believe it is a network issue. If you have any tips, I'd really appreciate it. I've about had it with this.
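
One possibility, offered as a guess rather than a diagnosis: WordPad showing line breaks where Notepad does not suggests the file uses line endings other than CRLF, and PDF allows CR, LF, or CRLF as its end-of-line marker. fgets() splits on "\n" only, so with CR-only endings fgets($fp, 2048) hands back arbitrary 2047-byte pieces, and a /Title or /Keywords entry could straddle two of them and never match. A minimal sketch that sidesteps this by matching against the whole response body instead of one fgets() line at a time is below. It assumes PHP 7.1 or later, that the HTTP headers have already been stripped, and that the entries are stored uncompressed as literal (parenthesised) strings; last_pdf_string_value() is a made-up helper name.

```php
<?php
/**
 * Return the last "/Key (value)" literal string found in raw PDF bytes,
 * or null if there is none. last_pdf_string_value() is a hypothetical
 * helper. The value is returned raw, with any PDF backslash escapes
 * left in place; nested unescaped parentheses are not handled.
 */
function last_pdf_string_value(string $pdf, string $key): ?string
{
    // \s* also matches "\r" and "\n", so the file's EOL style does not matter.
    // The value is a run of escaped characters or anything except "\" and ")".
    $pattern = '~/' . preg_quote($key, '~') . '\s*\(((?:\\\\.|[^\\\\)])*)\)~s';

    if (!preg_match_all($pattern, $pdf, $matches) || count($matches[1]) === 0) {
        return null;
    }

    // The thread describes new info being appended to the file,
    // so the last match is taken as the newest one.
    return $matches[1][count($matches[1]) - 1];
}

// Usage with a made-up fragment that uses CR-only line endings:
$sample = "1 0 obj\r<< /Title (Old title) >>\rendobj\r"
        . "9 0 obj\r<< /Title (New title)\r/Keywords (php, pdf) >>\rendobj\r";

$title    = last_pdf_string_value($sample, 'Title');
$keywords = last_pdf_string_value($sample, 'Keywords');
```

Taking the last match rather than the first also deals with the stale title and keywords left over from earlier saves, without having to remove them from the file.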