Parsing PDF files
Posted: Thu Jun 27, 2002 1:33 pm
I'm trying to parse a PDF file to read its keywords and title. I had to go through the PDFs on my site to add keywords and change the titles. When I look at one of the files in a text editor (after all of the binary garbage), it seems that Acrobat simply appends a new section onto the document containing the new title, author, etc. Anyway, after I open the connection with fsockopen() and send the HTTP GET request (this works fine for HTML documents), I read the result with the code below.
After it finishes reading the page, I process the $title and $keywords variables. Here's my problem: while the loop does match some lines for the title and keywords expressions, it seems to stop reading the file about 25-35 lines before the actual end of the file (which is where the latest title and keywords are). As a result, the title and keywords it picks up are useless to me. Any ideas why it would do that? I know feof() also returns true when an error occurs, so that may be what's happening. If so, how can I tell an error apart from the real end of the file? I'd appreciate some help, thanks.
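One way I can think of to tell a dropped connection apart from the real end of the file is to compare the number of body bytes actually read against the Content-Length header the server sent; if fewer bytes arrived, the loop ended early. This is only a sketch of that idea, and the helper names (parse_content_length, body_is_complete) are mine, not from any library:

```php
<?php
// Sketch: detect a truncated download by comparing bytes read against
// the Content-Length response header. Helper names are hypothetical.

// Scan the response header lines for Content-Length and return it as an
// integer, or null if the server didn't send one.
function parse_content_length(array $header_lines) {
    foreach ($header_lines as $h) {
        if (preg_match('/^Content-Length:\s*(\d+)/i', $h, $m)) {
            return (int) $m[1];
        }
    }
    return null;
}

// True if we read at least as many body bytes as the server promised.
// With no Content-Length there is nothing to compare, so assume complete.
function body_is_complete($bytes_read, array $header_lines) {
    $expected = parse_content_length($header_lines);
    if ($expected === null) {
        return true;
    }
    return $bytes_read >= $expected;
}
```

Inside the read loop you would tally strlen($line) after each fgets() call, then check body_is_complete($total_bytes, $headers) once feof() fires; false would mean the connection closed early rather than a genuine EOF.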
On a side note: does anyone know how to get rid of the old document info (title, author, etc) in a pdf file?
Code:
while (!feof($fp)) {
    $line = fgets($fp, 2048);
    if ($is_pdf) { // deals with .pdf files
        if (preg_match("/(?U)\/Title\s+\(.*\)/", $line)) {
            $temp = preg_replace("/.*\/Title\s+\(/", "", $line);
            $title = preg_replace("/\).*/", "", $temp);
        }
        if (preg_match("/(?U)\/Keywords\s+\(.*\)/", $line)) {
            $temp = preg_replace("/.*\/Keywords\s+\(/", "", $line);
            $keywords = preg_replace("/\).*/", "", $temp);
        }
    } else {
        // process html pages
    }
}
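Since Acrobat appends the updated info dictionary to the end of the file (incremental update), keeping the first match will always give the stale values. A possible fix, assuming the whole response is buffered into a string first, is to collect every match and keep the last one. The helper name last_pdf_field is mine, and the regex mirrors the ungreedy (?U) pattern from the code above:

```php
<?php
// Sketch: extract the most recently appended value for a PDF info key
// (e.g. Title, Keywords) by matching all occurrences and taking the
// last one. last_pdf_field() is a hypothetical helper name.
function last_pdf_field($buffer, $key) {
    // (?U) makes .* ungreedy; the /s modifier lets . cross newlines in
    // case a value is split across lines.
    $pattern = '/(?U)\/' . $key . '\s+\((.*)\)/s';
    if (preg_match_all($pattern, $buffer, $m) && count($m[1])) {
        return end($m[1]); // the last, i.e. newest, appended value
    }
    return null;
}
```

Calling last_pdf_field($buffer, 'Title') on the full file contents would return the title from the newest appended section instead of the obsolete one near the top. Note this line-oriented approach still breaks on titles containing escaped parentheses, so treat it as a rough heuristic rather than a real PDF parser.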