PHP To Search PDF Files

dansharp
Forum Newbie
Posts: 5
Joined: Wed Aug 27, 2003 6:12 am

Post by dansharp »

Hi,

I've looked at every option for this. Here's the scenario:

The company I work for has about 1400 PDF files, and they want to be able to search through them all. Does anybody know how to make this work?

I can't find anything good on Google: all the ones you have to pay for either don't work or cost thousands of pounds, and all I want is to be able to search the text inside the PDF files and output the results to a web page, nice and simple...

Any ideas, people?
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Post by JayBird »

This search engine script is able to index PDF files; take a look at its code:

http://phpdig.toiletoine.net/
dansharp
Forum Newbie
Posts: 5
Joined: Wed Aug 27, 2003 6:12 am

Post by dansharp »

Thing is, I'm pretty new to PHP, so I haven't a clue what any of that means or is! Is anyone willing to simplify it or suggest a script that already does this?
dansharp
Forum Newbie
Posts: 5
Joined: Wed Aug 27, 2003 6:12 am

Post by dansharp »

Arrrgh, all seemed to be going alright with the PhpDig stuff, but I just can't get it set up right! It's coming up with connection errors. I've set the MySQL database up on the local server, all ready for it, but the thing just won't connect to it!
liljester
Forum Contributor
Posts: 400
Joined: Tue May 20, 2003 4:49 pm

Post by liljester »

what kind of connection errors?
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Post by JayBird »

I think the best thing to try and do with this task is the following.

You say you have 1400 files. If you wanted to search the files for a particular string, you would need to:

Open the file {
Read the contents
Check for a match
}
Close the file.

This would have to be looped 1400 times to check all the PDFs, and it would happen every time someone started a search.
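
For illustration, the per-search loop might look something like the sketch below. It is only a rough outline, not a tested solution: it assumes the external `pdftotext` command-line tool is installed on the server (nothing in this thread confirms that), and the function names `extractPdfText` and `searchPdfs` are made up for the example.

```php
<?php
// Sketch of the per-search loop: open each PDF, read its text, check for a match.
// ASSUMPTION: text is pulled out by the external `pdftotext` command-line tool,
// which is not mentioned in this thread and must be installed separately.

/** Return the plain text of one PDF, or '' if extraction fails. */
function extractPdfText(string $path): string
{
    // "-" as the output file makes pdftotext write the text to stdout.
    $text = shell_exec('pdftotext ' . escapeshellarg($path) . ' -');
    return is_string($text) ? $text : '';
}

/**
 * Return the paths whose extracted text contains $needle (case-insensitive).
 * $extract is any callable mapping a path to text, so the extractor can be swapped.
 */
function searchPdfs(array $paths, string $needle, ?callable $extract = null): array
{
    $extract = $extract ?? 'extractPdfText';
    $matches = [];
    foreach ($paths as $path) {
        // An empty search string matches nothing rather than everything.
        if ($needle !== '' && stripos($extract($path), $needle) !== false) {
            $matches[] = $path;
        }
    }
    return $matches;
}
```

Calling something like `searchPdfs(glob('pdfs/*.pdf'), 'invoice')` (the `pdfs/` folder is a placeholder) would then run the extractor once per file on every search, which is exactly the overhead described above.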

A better way would be to

Open the file {
Read the contents
Insert into Database
}
Close the file

This would be run each time you added or removed a PDF file (if you aren't planning to add or remove any, it would only need to be done once).

Now, when you want to do a search, you search the DB instead. This would put far less load on the server and produce faster results.
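
A minimal sketch of that index-then-search idea follows, assuming PDO is available. The table name `pdf_index` and the functions `indexPdfs` and `searchIndex` are hypothetical names chosen for the example, and `$extract` stands for whatever routine you end up using to turn a PDF into plain text; treat it as a starting point rather than a finished script.

```php
<?php
// Sketch of the index-once, search-many approach.
// ASSUMPTIONS: PDO is available; the table name `pdf_index` and both function
// names are made up for illustration; $extract is any callable that turns a
// PDF path into plain text (for instance a wrapper around an external tool).

/** Create the index table if needed and (re)store the text of each PDF. */
function indexPdfs(PDO $db, array $paths, callable $extract): void
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS pdf_index ' .
        '(path VARCHAR(255) PRIMARY KEY, body MEDIUMTEXT)'
    );
    // REPLACE INTO is accepted by both MySQL and SQLite; it overwrites an existing row.
    $stmt = $db->prepare('REPLACE INTO pdf_index (path, body) VALUES (:path, :body)');
    foreach ($paths as $path) {
        $stmt->execute([':path' => $path, ':body' => $extract($path)]);
    }
}

/** Return the paths of indexed PDFs whose text contains $needle. */
function searchIndex(PDO $db, string $needle): array
{
    $stmt = $db->prepare('SELECT path FROM pdf_index WHERE body LIKE :q ORDER BY path');
    // Note: % and _ inside $needle act as LIKE wildcards in this simple version.
    $stmt->execute([':q' => '%' . $needle . '%']);
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

The indexing function would be run only when files are added or changed; each search is then a single query. Rows for deleted PDFs are not cleaned up in this sketch, so that would need handling separately.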

That's how I would do it anyway.

Mark
dansharp
Forum Newbie
Posts: 5
Joined: Wed Aug 27, 2003 6:12 am

Post by dansharp »

Interesting idea, Mark. I'm spending the first part of the morning looking at TYPO3, which has a PDF searching function. I'm now looking at basing the entire site around TYPO3 and seeing how that works out.

where can i find a decent tutorial on doing the

Open the file {
Read the contents
Insert into Database
}
Close the file

sort of job? I'm an ASP boy desperately converting to PHP!
JayBird
Admin
Posts: 4524
Joined: Wed Aug 13, 2003 7:02 am
Location: York, UK

Post by JayBird »

You can use this as a basis to get meta information like Title, Version, Creator, etc. The sample below has six things it looks for, just add what you need in that array.

It doesn't always work perfectly but at least it's a starting point.

Code:

<?php

/* Fill in the PDF location below to parse meta data */
$thePDF = "myPDF.pdf";

// Read the whole file as a binary string; stop if it cannot be read.
$pdfstring = file_get_contents($thePDF);
if ($pdfstring === false) {
	die("Couldn't read $thePDF");
}

$pdfVars = array("/Title", "/Producer", "/CreationDate", "/Author", "/Creator", "/Version");
$pdfInfo = array(); // values found, keyed by entry name
echo("<b>PDF $thePDF Information:</b><p>");

foreach ($pdfVars as $thisVar) {
	$rawVar = strpos($pdfstring, $thisVar);
	if ($rawVar === false) {
		$pdfInfo[$thisVar] = "";
		echo("Couldn't find " . $thisVar . "<p>\n");
	} else {
		// Look at the 200 bytes starting at the key for the form "/Key (value)".
		$thisChunk = substr($pdfstring, $rawVar, 200);
		$pattern = '/^' . preg_quote($thisVar, '/') . '\s*\(([^)]*)\)/';
		if (preg_match($pattern, $thisChunk, $match)) {
			$pdfInfo[$thisVar] = stripslashes($match[1]);
			echo(substr($thisVar, 1) . " <b>" . $pdfInfo[$thisVar] . "</b><br>");
		} else {
			// The entry exists but is not a simple parenthesised string.
			echo("$thisVar does NOT validate in code<br>");
		}
	}
}

?>
Mark
dansharp
Forum Newbie
Posts: 5
Joined: Wed Aug 27, 2003 6:12 am

Post by dansharp »

But then what would I use in place of "/Title", "/Producer", "/CreationDate", "/Author", "/Creator", "/Version" to actually search the body of the PDF? I'm talking about all the words inside it, not just the title, author, etc.

But thanks, I'm getting closer.