PHP Text Extract from Scanned PDF

Posted: Thu Aug 26, 2010 2:13 am
by pavithra023
Hello experts,

Is it possible to extract or read the text from a scanned image or a PDF using PHP? If so, how can it be done? I am able to read the text from a normal PDF, but my code does not work on a scanned PDF. Please help.

Re: PHP Text Extract from Scanned PDF

Posted: Thu Aug 26, 2010 7:59 am
by Bind
What you are looking for is called Optical Character Recognition (OCR).

Accuracy depends mainly on the quality of the scanned image and on how far the text deviates (bow, bend, and skew) from the image boundaries.

A lot of OCR work has gone into attempts to defeat image challenge-response tests (CAPTCHAs) so that form submissions can be automated.

What you would need to do is extract the image from the PDF document first, then run OCR on the image itself.
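
A rough sketch of that two-step approach, not a tested solution. It assumes the `pdfimages` (poppler-utils) and `tesseract` command-line tools are installed on the server; neither tool is named in this thread, and the function names here are made up for illustration. It also assumes PHP 7 or later for the type declarations.

```php
<?php
// Step 1 helper: build the command that dumps every embedded image
// from the PDF as PNG files named <prefix>-000.png, <prefix>-001.png, ...
// (assumes a poppler version whose pdfimages supports -png).
function build_extract_command(string $pdfPath, string $imagePrefix): string
{
    return 'pdfimages -png ' . escapeshellarg($pdfPath) . ' ' . escapeshellarg($imagePrefix);
}

// Step 2 helper: build the command that runs OCR on one image.
// Passing "stdout" as the output base makes tesseract print the text.
function build_ocr_command(string $imagePath): string
{
    return 'tesseract ' . escapeshellarg($imagePath) . ' stdout';
}

// Runs both steps and returns the recognized text of all pages.
// $workDir must be an existing, writable directory.
function ocr_scanned_pdf(string $pdfPath, string $workDir): string
{
    $prefix = rtrim($workDir, '/') . '/page';

    exec(build_extract_command($pdfPath, $prefix), $unusedOutput, $status);
    if ($status !== 0) {
        throw new RuntimeException('pdfimages failed with exit code ' . $status);
    }

    // glob() can return false on error, so fall back to an empty list.
    $images = glob($prefix . '-*.png') ?: [];
    sort($images);

    $text = '';
    foreach ($images as $image) {
        $text .= (string) shell_exec(build_ocr_command($image));
    }
    return $text;
}
```

Usage would be something like `echo ocr_scanned_pdf('/path/to/scan.pdf', sys_get_temp_dir());`. The paths are wrapped with `escapeshellarg()` because they end up in a shell command, which matters if the file name comes from an upload.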

Scanning to PDF first seems like a waste of time and server resources, really. Why not scan straight to an image instead of to PDF?

Now, if you are trying to grab the text from someone else's work, then it's understandable.