PHP To Search PDF Files
Posted: Wed Aug 27, 2003 6:12 am
by dansharp
Hi,
I've looked at every option for this. Here's the scenario:
The company I work for has about 1400 PDF files, and they want to be able to search through them all. Does anybody know how to make this work?!
I can't find anything good on Google. All the ones you have to pay for either don't work or cost thousands of pounds, and all I want is to be able to search the text inside the PDF files and output the results into a web page. Nice and simple...
Any ideas, people?
Posted: Wed Aug 27, 2003 6:18 am
by JayBird
Posted: Wed Aug 27, 2003 6:29 am
by JayBird
This search engine script is able to index PDF files; take a look at their code:
http://phpdig.toiletoine.net/
Posted: Wed Aug 27, 2003 6:30 am
by dansharp
Thing is, I'm pretty new to PHP, so I haven't a clue what any of that means or is! Anyone willing to simplify it or suggest a script that already does this?
Posted: Wed Aug 27, 2003 7:05 am
by dansharp
Arrrgh! All seemed to be going all right with the phpdig stuff, but I just can't get it set up right! It's coming up with connection errors. I've set the MySQL database up on the local server, all ready for it, but the thing just won't connect to it!
Posted: Wed Aug 27, 2003 9:26 am
by liljester
What kind of connection errors?
Posted: Wed Aug 27, 2003 9:56 am
by JayBird
I think the best way to approach this task is the following.
You say you have 1400 files. If you wanted to search the files for a particular string, you would need to
Open the file {
Read the contents
Check for a match
}
Close the file.
This would have to be looped 1400 times to check all the PDFs, and it would happen every time someone started a search.
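As a rough sketch of that per-search loop, something like the following could work. Raw PDF bytes are usually compressed, so a plain string match on the file contents often finds nothing; this sketch assumes the `pdftotext` command-line tool is installed on the server to pull the text out first. The function names `extractPdfText` and `searchPdfs` and the `pdfs` directory are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of one PDF.
// Assumes the external `pdftotext` program is on the server's PATH.
function extractPdfText($path)
{
    // "-" tells pdftotext to write the text to standard output
    $text = shell_exec('pdftotext ' . escapeshellarg($path) . ' -');
    return is_string($text) ? $text : '';
}

// Hypothetical helper: scans every PDF in $dir on each call and returns
// the paths whose text contains $needle (case-insensitive).
// $extractor can be any callable that maps a file path to its text.
function searchPdfs($dir, $needle, $extractor = 'extractPdfText')
{
    $matches = array();
    if ($needle === '') {
        return $matches;               // nothing to look for
    }
    $files = glob(rtrim($dir, '/') . '/*.pdf') ?: array();
    foreach ($files as $file) {
        $text = call_user_func($extractor, $file);    // open + read
        if (stripos($text, $needle) !== false) {      // check for a match
            $matches[] = $file;
        }
    }
    return $matches;
}

// Example usage (assumes a "pdfs" directory next to this script):
// foreach (searchPdfs('pdfs', 'invoice') as $hit) {
//     echo htmlspecialchars($hit) . "<br>\n";
// }
```

Every call runs the extractor once per file, which is exactly the overhead described above.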
A better way would be to
Open the file {
Read the contents
Insert into Database
}
Close the file
This would be run each time you added or removed a PDF file (if you aren't planning to add or remove any, it would only need to be done once).
Now, when you want to do a search, you search the DB instead. This would put far less load on the server and produce faster results.
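A minimal sketch of the index-then-search idea might look like this. It assumes PDO is available for the database connection and, again, that the external `pdftotext` tool does the text extraction; the `pdf_text` table, its columns, and all function names here are hypothetical.

```php
<?php
// Hypothetical helper: text of one PDF via the external `pdftotext` tool (assumed installed).
function pdfToText($path)
{
    $text = shell_exec('pdftotext ' . escapeshellarg($path) . ' -');
    return is_string($text) ? $text : '';
}

// Creates the (hypothetical) pdf_text table if it is missing.
// MEDIUMTEXT is a MySQL type with room for long documents.
function createIndexTable(PDO $db)
{
    $db->exec('CREATE TABLE IF NOT EXISTS pdf_text (
        path VARCHAR(255) NOT NULL PRIMARY KEY,
        body MEDIUMTEXT
    )');
}

// Run whenever PDFs are added or removed: re-reads every file and stores its text.
function indexPdfs(PDO $db, $dir, $extractor = 'pdfToText')
{
    $db->exec('DELETE FROM pdf_text');   // start clean so removed files disappear
    $insert = $db->prepare('INSERT INTO pdf_text (path, body) VALUES (?, ?)');
    $files = glob(rtrim($dir, '/') . '/*.pdf') ?: array();
    foreach ($files as $file) {
        $insert->execute(array($file, call_user_func($extractor, $file)));
    }
}

// Run on each search: only the database is queried, the PDF files are not opened.
// Note: % and _ in $needle act as LIKE wildcards in this simple version.
function searchIndex(PDO $db, $needle)
{
    $stmt = $db->prepare('SELECT path FROM pdf_text WHERE body LIKE ? ORDER BY path');
    $stmt->execute(array('%' . $needle . '%'));
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Example usage (connection details are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=pdfsearch', 'user', 'password');
// $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// createIndexTable($db);
// indexPdfs($db, 'pdfs');
// print_r(searchIndex($db, 'invoice'));
```

Wiping and rebuilding the table keeps the sketch short; with 1400 files it may be worth re-indexing only the files that changed.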
That's how I would do it anyway.
Mark
Posted: Thu Aug 28, 2003 3:03 am
by dansharp
Interesting idea, Mark. I'm spending the first part of the morning looking at TYPO3. They've got a PDF searching function, so I'm now looking at basing the entire site around TYPO3 and seeing how that works out.
Where can I find a decent tutorial on doing the
Open the file {
Read the contents
Insert into Database
}
Close the file
sort of job? I'm an ASP boy converting desperately to PHP!
Posted: Thu Aug 28, 2003 4:18 am
by JayBird
You can use this as a basis to get meta information like Title, Version, Creator, etc. The sample below looks for six keys; just add whatever else you need to that array.
It doesn't always work perfectly but at least it's a starting point.
Code:
<?php
/* Fill in the PDF location below to parse meta data */
$thePDF = "myPDF.pdf";
// Read the whole file as raw bytes
$pdfstring = file_get_contents($thePDF);
if ($pdfstring === false) {
    die("Couldn't read $thePDF");
}
$pdfVars = array("/Title", "/Producer", "/CreationDate", "/Author", "/Creator", "/Version");
$pdfInfo = array();   // collected values, keyed by the names above
echo("<b>PDF $thePDF Information:</b><p>");
foreach ($pdfVars as $thisVar) {
    $rawVar = strpos($pdfstring, $thisVar);
    if ($rawVar === false) {
        $pdfInfo[$thisVar] = "";
        echo("Couldn't find " . $thisVar . "<p>\n");
        continue;
    }
    // Only look at the 200 bytes starting at the key
    $thisChunk = substr($pdfstring, $rawVar, 200);
    // Expect the value as a string in parentheses, e.g. /Title (My Document)
    $pattern = '#^' . preg_quote($thisVar, '#') . '\s*\(([^)]*)\)#';
    if (preg_match($pattern, $thisChunk, $m)) {
        $pdfInfo[$thisVar] = stripslashes($m[1]);
        echo(substr($thisVar, 1) . " <b>" . htmlspecialchars($pdfInfo[$thisVar]) . "</b><br>");
    } else {
        echo("$thisVar does NOT validate in code<br>");
    }
}
?>
Mark
Posted: Thu Aug 28, 2003 4:39 am
by dansharp
But then what would I put in place of "/Title","/Producer","/CreationDate","/Author","/Creator","/Version" to actually search the body of the PDF? I'm talking about all the words inside it, not just the title, author, etc.
But thanks, I'm getting closer.