PHP Text Extract from Scanned PDF



pavithra023
Forum Newbie
Posts: 1
Joined: Thu Aug 26, 2010 2:05 am


Post by pavithra023 »

Hello experts,

Is it possible to extract or read the text from a scanned image or a PDF using PHP? If so, how can it be done? I am able to read the text from a normal PDF, but when it comes to a scanned PDF, the code does not work. Please help.
Bind
Forum Contributor
Posts: 102
Joined: Wed Feb 03, 2010 1:22 am

Re: PHP Text Extract from Scanned PDF

Post by Bind »

What you are looking for is called Optical Character Recognition (OCR).

Success depends mainly on the quality of the scanned image and on how far the text deviates (bow, bend, and skew) from the image boundaries.

Quite a bit of OCR work has gone into attempts to defeat image challenge-response mechanisms so that forms can be submitted automatically.

What you would need to do is extract the image from the PDF document first, then parse the image itself with the OCR.
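A rough sketch of those two steps, in case it helps. It is not the only way to do it: it assumes the `pdfimages` tool (from poppler-utils) and the `tesseract` OCR program are installed on the server and reachable on the PATH, and that PHP is allowed to call `exec()`. The function names (`buildExtractCommand`, `buildOcrCommand`, `ocrScannedPdf`) are made up for this example.

```php
<?php
// Sketch only. Assumptions: the `pdfimages` (poppler-utils) and `tesseract`
// command-line tools are installed, and exec() is not disabled in php.ini.
// All function names below are hypothetical helpers for this example.

// Step 1 command: write every image embedded in the PDF as <prefix>-NNN.png.
function buildExtractCommand(string $pdfPath, string $outPrefix): string
{
    return sprintf(
        'pdfimages -png %s %s',
        escapeshellarg($pdfPath),
        escapeshellarg($outPrefix)
    );
}

// Step 2 command: OCR one image and print the recognised text to stdout.
function buildOcrCommand(string $imagePath, string $lang = 'eng'): string
{
    return sprintf(
        'tesseract %s stdout -l %s',
        escapeshellarg($imagePath),
        escapeshellarg($lang)
    );
}

// Runs both steps and returns the text of all page images joined together.
function ocrScannedPdf(string $pdfPath, string $lang = 'eng'): string
{
    // Work in a private temporary directory so the files can be cleaned up.
    $workDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . uniqid('ocr_', true);
    if (!mkdir($workDir, 0700, true)) {
        throw new RuntimeException("Cannot create work directory: $workDir");
    }
    $prefix = $workDir . DIRECTORY_SEPARATOR . 'page';

    $ignored = [];
    exec(buildExtractCommand($pdfPath, $prefix), $ignored, $status);
    if ($status !== 0) {
        throw new RuntimeException("pdfimages failed with exit code $status");
    }

    $images = glob($prefix . '-*.png') ?: [];
    sort($images); // pdfimages numbers its output, so this keeps page order

    $text = '';
    foreach ($images as $image) {
        $lines = []; // exec() appends to the array, so reset it per image
        exec(buildOcrCommand($image, $lang), $lines, $status);
        if ($status === 0) {
            $text .= implode("\n", $lines) . "\n";
        }
    }

    // Remove everything that was written to the work directory.
    foreach (glob($workDir . DIRECTORY_SEPARATOR . '*') ?: [] as $file) {
        unlink($file);
    }
    rmdir($workDir);

    return $text;
}
```

Calling `echo ocrScannedPdf('scan.pdf');` would then print whatever text the OCR managed to recognise, where `scan.pdf` is a placeholder file name. For a plain scanned image, only the second step is needed.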

Scanning to PDF first seems to be a waste of time and server resources, really. Why not just scan to an image instead of scanning to PDF?

Now, if you are trying to snag the text from someone else's work, then it's understandable.