PDF to Text

Lee Firth
Forum Newbie
Posts: 2
Joined: Wed Oct 26, 2011 10:48 pm

PDF to Text

Post by Lee Firth »

I've been looking for a function that reads a PDF document and outputs the contents as plain text, and this is the best I could find. I created a Word file with a couple of sentences in it and saved it as a PDF (document.pdf). The function doesn't produce anything: $data is empty, and there are no error messages. Any suggestions as to why?

Code:

// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors      : Jonathan Beckett, 2005-05-02
//              : Sven Schuberth, 2007-03-29

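function pdf2txt($filename){

    $data = getFileData($filename);
    if (!is_string($data) || $data=="")
        return "";

    // the header sits at the start of the file, e.g. "%PDF-1.4"
    $s = strpos($data,"%")+1;
    $version = substr($data,$s,strpos($data,"%",$s)-1);

    // only version 1.2 files go through the stream decoder
    if (substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);
}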
// handles version 1.2
function handleV2($data){

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $a_chunks = array();
    $result_data = "";
    $j = 0;
    if (!is_array($a_obj))
        return $result_data;

    foreach($a_obj as $obj){

        $a_filter = getDataArray($obj,"<<",">>");

        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $a_chunks[$j]["data"] = "";

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                // strip the "stream\r\n" and "endstream" keywords
                $a_chunks[$j]["data"] = substr($a_data[0],
                                strlen("stream\r\n"),
                                strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    foreach($a_chunks as $chunk){

        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            // (strpos searches for a substring; substr does not)
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                $data = @gzuncompress($chunk["data"]);
                if ($data!==false && trim($data)!=""){
                    $result_data .= ps2txt($data);
                }
            }
        }
    }

    return $result_data;
}

// handles versions > 1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data = "";
    if (!is_array($a_obj))
        return $result_data;

    foreach($a_obj as $obj){
        // only objects that mention /GS1 are scanned;
        // compressed streams are not decoded here
        if (substr_count($obj,"/GS1")>0){
            // the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            foreach($field as $match)
                $result_data .= $match[1];
        }
    }
    return $result_data;
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    // fopen returns false when the file cannot be opened
    if ($handle===false)
        return "";
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    // stays null when nothing matches, so callers can test with is_array()
    $a_result = null;

    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
                // continue after the end word, so "obj" is not matched inside "endobj"
                $end += strlen($end_word);
            }
        }
    }
    return $a_result;
}

$data = pdf2txt('document.pdf');

echo 'File contents: '. $data;
cogentconcept
Forum Newbie
Posts: 3
Joined: Sun Dec 26, 2010 2:43 am

Re: PDF to Text

Post by cogentconcept »

Try tuxradar.net.
Eric!
DevNet Resident
Posts: 1146
Joined: Sun Jun 14, 2009 3:13 pm

Re: PDF to Text

Post by Eric! »

PDF files use many different types of encodings, so it can be very difficult to extract text from them. I've had some luck with CLI tools like pdftotext, but you need permission to call shell_exec to run them via PHP.
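
For illustration, here is a minimal sketch of the pdftotext approach. It assumes the pdftotext binary (from Poppler or Xpdf) is installed and on the PATH. The helper names buildPdfToTextCommand() and pdfToTextCli() are made up for this example, and document.pdf is the file name from the first post.

```php
<?php
// Hypothetical helper: builds the shell command.
// The trailing "-" asks pdftotext to write the text to stdout
// instead of creating a .txt file next to the PDF.
function buildPdfToTextCommand($pdfPath)
{
    // escapeshellarg() quotes the path so spaces and shell
    // metacharacters in the file name are passed through safely
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: returns the extracted text, or null on failure.
function pdfToTextCli($pdfPath)
{
    // On PHP 8 and later, a function disabled through disable_functions
    // is reported as missing by function_exists()
    if (!is_readable($pdfPath) || !function_exists('shell_exec')) {
        return null;
    }

    // shell_exec() returns null or false when the command fails
    // or produces no output
    $output = shell_exec(buildPdfToTextCommand($pdfPath));

    return is_string($output) ? $output : null;
}

$text = pdfToTextCli('document.pdf');
echo $text === null ? "Could not extract text\n" : 'File contents: ' . $text;
```

If this prints "Could not extract text", it is worth checking whether shell_exec is disabled on the host and whether pdftotext runs from a normal shell.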

Edit: corrected cli tool name and added link
Last edited by Eric! on Sun Oct 30, 2011 7:19 am, edited 1 time in total.
tracyjq
Forum Newbie
Posts: 1
Joined: Wed Mar 21, 2012 3:22 am

Re: PDF to Text

Post by tracyjq »

Specially designed for PDF users, Joboshare PDF to Text Converter is a practical PDF conversion tool. It is standalone software that doesn't need Adobe software or Microsoft Excel to be pre-installed. Before converting PDF to text, you can choose the page range: all pages, the current page, or self-defined pages. It also offers a preview window in which all your PDF pages can be viewed.

Preparation: Download Joboshare PDF to Text Converter, then install and launch it on your computer.
Step 1: Click the “File” button or the “Add” button to load PDF files, or drag PDF files directly into the program. Batch conversion is supported.
Step 2: After adding PDF files, set the output destination. You can save in the same folder as the original file, or select another output folder on the local computer by clicking the “Browse” button.
Step 3: When all preparation is finished, click the “Convert” button to start the PDF to text conversion. A progress bar at the bottom of the window shows the conversion status.

Joboshare PDF to Image Converter
Joboshare PDF to Html Converter
Joboshare PDF to EPUB Converter

bluwing
Forum Newbie
Posts: 1
Joined: Thu Apr 12, 2012 9:22 am

Re: PDF to Text

Post by bluwing »

PDF to Text Converter for Mac is specially designed for Mac users to extract plain text files from Adobe PDF documents, so you can read read-only PDF files more conveniently on portable devices such as BlackBerry or Nokia phones.

PDF to Text Converter for Mac is an independent tool: you can convert PDF to text on a Mac without any third-party program, such as Adobe Acrobat or other plugins.

Apart from these, PDF to Text Converter for Mac provides editing functions. To save conversion time, you can convert only the needed pages by setting the page range, or run a batch conversion. If you are a Windows user, you may choose PDF to Text Converter for Windows.

Main Features of PDF to Text Converter for Mac

* Supports multi-language PDF documents, including English, Turkish, Thai, Latin, Korean, Greek, Cyrillic, Arabic, Japanese, and Chinese.

* Converts PDF to text on a Mac without any third-party program like Adobe Acrobat.

* Supports batch conversion to save time.

* To save more time, PDF to Text Converter for Mac allows you to convert only the needed pages by setting the page range.