PHP To Search PDF Files
Hi,
I've looked at every option for this. Here's the scenario:
The company I work for has about 1400 PDF files, and they want to be able to search through them all. Does anybody know how to make this work?
I can't find anything good on Google. All the ones you have to pay for either don't work or cost thousands of pounds, and all I want is to be able to search the text inside the PDF files and output the results into a web page. Nice and simple...
Any ideas, people?
This search engine script is able to index PDF files; take a look at their code:
http://phpdig.toiletoine.net/
I think the best thing to try and do with this task is the following.
You say you have 1400 files. If you wanted to search the files for a particular string, you would need to
Open the file {
Read the contents
Check for a match
}
Close the file.
This would have to be looped 1400 times to check all the PDFs, and it would happen every time someone started a search.
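To make the cost of that loop concrete, here is a minimal sketch of the per-search scan. It assumes the text can already be pulled out of each file: `$extractText` is a hypothetical callback standing in for whatever PDF-to-text step you end up using, and `searchFilesNaively` is just an illustrative name.

```php
<?php
/**
 * Naive search: every query re-reads every file.
 * $extractText is a hypothetical callback that returns a file's plain text
 * (or false on failure); real PDFs need a text-extraction step here.
 */
function searchFilesNaively(array $paths, string $needle, callable $extractText): array
{
    $matches = [];
    foreach ($paths as $path) {
        // Open the file and read the contents.
        $text = $extractText($path);
        // Check for a match (case-insensitive).
        if ($text !== false && stripos($text, $needle) !== false) {
            $matches[] = $path;
        }
    }
    return $matches;
}
```

With 1400 files, every single search pays for 1400 reads, which is the overhead the next suggestion avoids.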
A better way would be to
Open the file {
Read the contents
Insert into Database
}
Close the file
This would be run each time you added or removed a PDF file (if you aren't planning to add or remove any, it would only need to be done once).
Now, when you want to do a search, you search the DB instead. This would put far less load on the server and produce faster results.
That's how I would do it anyway.
Mark
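A rough sketch of the index-then-search idea described above, assuming PHP 7.1+ with the PDO SQLite driver and assuming the text has already been extracted from each PDF (for example with a command-line tool such as pdftotext). The table name `pdf_text` and the three helper functions are made up for illustration, not taken from any library.

```php
<?php
// Sketch only: table, column and function names are illustrative.

/** Create the index table if it does not exist yet. */
function createIndex(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS pdf_text (
        path TEXT PRIMARY KEY,
        body TEXT NOT NULL
    )');
}

/** Store (or refresh) the extracted text of one PDF. Run when a file is added or changed. */
function indexDocument(PDO $db, string $path, string $text): void
{
    $stmt = $db->prepare('REPLACE INTO pdf_text (path, body) VALUES (:path, :body)');
    $stmt->execute([':path' => $path, ':body' => $text]);
}

/** Return the paths of all indexed PDFs whose text contains $needle. */
function searchIndex(PDO $db, string $needle): array
{
    // Note: % and _ inside $needle act as LIKE wildcards in this sketch.
    $stmt = $db->prepare('SELECT path FROM pdf_text WHERE body LIKE :q ORDER BY path');
    $stmt->execute([':q' => '%' . $needle . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}

// Usage: an in-memory database keeps the example self-contained.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
createIndex($db);
indexDocument($db, 'docs/report.pdf', 'Annual report for 2004');
indexDocument($db, 'docs/rota.pdf', 'Holiday rota');
print_r(searchIndex($db, 'report'));
```

The search itself never touches the PDF files; only the indexing step does, and that runs once per added or changed file rather than once per query. For a real site you would point PDO at a file or a MySQL database instead of `sqlite::memory:`.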
Interesting idea, Mark. I'm just spending the first part of the morning looking at TYPO3. They've got a PDF searching function, so I'm now looking at basing the entire site around TYPO3 and seeing how that works out.
Where can I find a decent tutorial on doing the
Open the file {
Read the contents
Insert into Database
}
Close the file
sort of job? I'm an ASP boy converting desperately to PHP!
You can use this as a basis to get meta information like Title, Version, Creator, etc. The sample below looks for six things; just add what you need to that array.
It doesn't always work perfectly but at least it's a starting point.
Mark
Code:
<?php
/* Fill in the PDF location below to parse its metadata. */
$thePDF = "myPDF.pdf";

// Read the whole file as a binary string.
$pdfstring = file_get_contents($thePDF);
if ($pdfstring === false) {
    die("Couldn't read $thePDF");
}

// Keys to look for; add whatever else you need to this array.
$pdfVars = array("/Title", "/Producer", "/CreationDate", "/Author", "/Creator", "/Version");
$pdfInfo = array(); // collected values, keyed by the names above

echo "<b>PDF $thePDF Information:</b><p>";
foreach ($pdfVars as $thisVar) {
    $rawVar = strpos($pdfstring, $thisVar);
    if ($rawVar === false) {
        $pdfInfo[$thisVar] = "";
        echo "Couldn't find " . $thisVar . "<p>\n";
        continue;
    }
    // Only look at the 200 bytes that start at the key.
    $thisChunk = substr($pdfstring, $rawVar, 200);
    // Match "/Key (value)". ereg() was removed in PHP 7, so use preg_match().
    $pattern = '#^' . preg_quote($thisVar, '#') . '\s*\(([^)]*)\)#';
    if (preg_match($pattern, $thisChunk, $m)) {
        $pdfInfo[$thisVar] = stripslashes($m[1]);
        echo substr($thisVar, 1) . " <b>" . htmlspecialchars($pdfInfo[$thisVar]) . "</b><br>";
    } else {
        echo "$thisVar does NOT validate in code<br>";
    }
}
?>