PDF to Text

Posted: Thu Oct 27, 2011 1:10 am
by Lee Firth
I've been looking for a function that will read a PDF document and output the results as plain text, and this is the best I could find. I created a Word file with a couple of sentences in it and saved it as a PDF (document.pdf). The function doesn't produce anything: $data is empty, and there are no error messages. Any suggestions as to why?

Code:

// Function    : pdf2txt()
// Arguments   : $filename - Filename of the PDF you want to extract
// Description : Reads a pdf file, extracts data streams, and manages
//               their translation to plain text - returning the plain
//               text at the end
// Authors      : Jonathan Beckett, 2005-05-02
//              : Sven Schuberth, 2007-03-29

function pdf2txt($filename){

    $data = getFileData($filename);
   
    $s=strpos($data,"%")+1;
   
    $version=substr($data,$s,strpos($data,"%",$s)-1);
    if(substr_count($version,"PDF-1.2")==0)
        return handleV3($data);
    else
        return handleV2($data);

   
}
// handles version 1.2
function handleV2($data){
       
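    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $a_chunks = array(); // initialise so the decode loop is safe when nothing is found
    $j = 0;              // chunk counter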
   
    foreach($a_obj as $obj){
       
        $a_filter = getDataArray($obj,"<<",">>");
   
        if (is_array($a_filter)){
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];

            $a_data = getDataArray($obj,"stream\r\n","endstream");
            if (is_array($a_data)){
                $a_chunks[$j]["data"] = substr($a_data[0],
                    strlen("stream\r\n"),
                    strlen($a_data[0])-strlen("stream\r\n")-strlen("endstream"));
            }
        }
    }

    // decode the chunks
    $result_data = ""; // initialise so the function always returns a string
    foreach($a_chunks as $chunk){

        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        // (explode() replaces split(), which is deprecated since PHP 5.3)
        $a_filter = explode("/",$chunk["filter"]);
       
        if ($chunk["data"]!=""){
            // look at the filter to find out which encoding has been used
            // (strpos(), not substr(), tests whether the filter name is present)
            if (strpos($chunk["filter"],"FlateDecode")!==false){
                $data = @gzuncompress($chunk["data"]);
                if (trim($data)!=""){
                    $result_data .= ps2txt($data);
                } else {
               
                    //$result_data .= "x";
                }
            }
        }
    }
   
    return $result_data;
}

//handles versions >1.2
function handleV3($data){
    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data,"obj","endobj");
    $result_data="";
    foreach($a_obj as $obj){
        //check if it is a string
        if(substr_count($obj,"/GS1")>0){
            //the strings are between ( and )
            preg_match_all("|\((.*?)\)|",$obj,$field,PREG_SET_ORDER);
            if(is_array($field))
                foreach($field as $data)
                    $result_data.=$data[1];
        }
    }
    return $result_data;
}

function ps2txt($ps_data){
    $result = "";
    $a_data = getDataArray($ps_data,"[","]");
    if (is_array($a_data)){
        foreach ($a_data as $ps_text){
            $a_text = getDataArray($ps_text,"(",")");
            if (is_array($a_text)){
                foreach ($a_text as $text){
                    $result .= substr($text,1,strlen($text)-2);
                }
            }
        }
    } else {
        // the data may just be in raw format (outside of [] tags)
        $a_text = getDataArray($ps_data,"(",")");
        if (is_array($a_text)){
            foreach ($a_text as $text){
                $result .= substr($text,1,strlen($text)-2);
            }
        }
    }
    return $result;
}

function getFileData($filename){
    $handle = fopen($filename,"rb");
    $data = fread($handle, filesize($filename));
    fclose($handle);
    return $data;
}

function getDataArray($data,$start_word,$end_word){

    $start = 0;
    $end = 0;
    $a_result = null; // stays null when nothing matches, so callers' is_array() checks still work
   
    while ($start!==false && $end!==false){
        $start = strpos($data,$start_word,$end);
        if ($start!==false){
            $end = strpos($data,$end_word,$start);
            if ($end!==false){
                // data is between start and end
                $a_result[] = substr($data,$start,$end-$start+strlen($end_word));
            }
        }
    }
    return $a_result;
}

$data = pdf2txt('document.pdf');

echo 'File contents: '. $data;
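
One thing worth checking in the code above: handleV2() only recognises streams introduced by "stream\r\n", although a PDF writer may also follow the stream keyword with a bare line feed, and any file whose header is not PDF-1.2 goes through handleV3(), which never inflates streams at all. The following is a minimal sketch, not a replacement for pdf2txt(), that inflates every zlib-compressed stream regardless of the line ending and pulls out the text between parentheses. The helper names decodeFlateStreams() and extractLiteralStrings() are made up for illustration, and the sketch assumes the zlib extension is loaded and that strings contain no escaped or nested parentheses.

```php
<?php
// Sketch only: decodeFlateStreams() and extractLiteralStrings() are
// hypothetical helpers, not part of the script above.

// Return the inflated contents of every stream that zlib can decode.
function decodeFlateStreams($pdf) {
    $decoded = array();
    // The "stream" keyword may be followed by CRLF or by a bare LF.
    if (preg_match_all('/stream\r?\n(.*?)\r?\n?endstream/s', $pdf, $matches)) {
        foreach ($matches[1] as $raw) {
            // gzuncompress() returns false when the data is not zlib-compressed
            $plain = @gzuncompress($raw);
            if ($plain !== false) {
                $decoded[] = $plain;
            }
        }
    }
    return $decoded;
}

// Concatenate the text inside ( ... ) literal strings of a content stream.
// Assumption: no escaped or nested parentheses inside the strings.
function extractLiteralStrings($content) {
    preg_match_all('/\((.*?)\)/s', $content, $matches);
    return implode('', $matches[1]);
}

// Only runs when the test file from the question is present.
if (is_readable('document.pdf')) {
    foreach (decodeFlateStreams(file_get_contents('document.pdf')) as $content) {
        echo extractLiteralStrings($content), "\n";
    }
}
```

If this prints nothing either, the text is probably stored as hex strings or in a font-specific encoding, which this sketch does not attempt to handle. For large files the lazy regular expression may also hit PCRE's backtracking limit, in which case a strpos()-based scan like getDataArray() is the safer choice.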

Re: PDF to Text

Posted: Fri Oct 28, 2011 4:21 pm
by cogentconcept
Try tuxradar.net

Re: PDF to Text

Posted: Fri Oct 28, 2011 11:23 pm
by Eric!
PDF files can use a lot of different types of encodings, so it can be very difficult to extract text from them. I've had some luck with CLI tools like pdftotext, but you have to have shell_exec permission to run them via PHP.
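
To see which encodings a particular file actually uses, a rough check is to read the version from the header and count the /Filter names that appear in the raw bytes. This is only a sketch: summarizePdf() is a made-up helper name, and it assumes the object dictionaries themselves are stored uncompressed, which is not the case for files that use object streams.

```php
<?php
// Sketch only: summarizePdf() is a hypothetical helper name.
// Reports the header version and how often each stream filter is declared.
function summarizePdf($data) {
    $version = null;
    // The file header looks like "%PDF-1.4"
    if (preg_match('/^%PDF-(\d+\.\d+)/', $data, $m)) {
        $version = $m[1];
    }
    // Matches "/Filter /FlateDecode" and "/Filter [/FlateDecode ...]";
    // for a filter array only the first name is counted.
    preg_match_all('/\/Filter\s*\[?\s*\/([A-Za-z0-9]+)/', $data, $f);
    return array(
        'version' => $version,
        'filters' => array_count_values($f[1]),
    );
}

// Only runs when the test file from the question is present.
if (is_readable('document.pdf')) {
    print_r(summarizePdf(file_get_contents('document.pdf')));
}
```

A version other than 1.2 combined with FlateDecode filters would mean the posted script takes the handleV3() branch and never decompresses the streams that hold the text.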

Edit: corrected cli tool name and added link
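
The pdftotext route can be sketched as follows, assuming the pdftotext binary is installed and on the PATH and that shell_exec is allowed on the server. The function names buildPdftotextCommand() and pdfToTextCli() are made up for illustration.

```php
<?php
// Sketch only: buildPdftotextCommand() and pdfToTextCli() are hypothetical names.

// Build the command line. Passing "-" as the output file makes pdftotext
// write the extracted text to stdout instead of a .txt file.
function buildPdftotextCommand($pdfPath) {
    return 'pdftotext -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Run pdftotext and return the extracted text, or null when the command
// fails or produces no output.
function pdfToTextCli($pdfPath) {
    $output = shell_exec(buildPdftotextCommand($pdfPath));
    return is_string($output) ? $output : null;
}
```

Usage would be along the lines of echo pdfToTextCli('document.pdf');. The path goes through escapeshellarg() so that spaces or shell metacharacters in a file name cannot alter the command.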

Re: PDF to Text

Posted: Wed Mar 21, 2012 3:35 am
by tracyjq
Specially designed for PDF users, Joboshare PDF to Text Converter is a practical PDF conversion tool. It is standalone software that doesn't need Adobe software or Microsoft Excel to be pre-installed. Before converting PDF to text, you can choose the page range: all pages, the current page, or your own self-defined pages. It also offers a preview window in which all your PDF pages can be viewed.

Preparation: Download Joboshare PDF to Text Converter, then install and launch it on your computer.
Step 1: Click the “File” button or the “Add” button to load PDF files, or drag PDFs directly into the program. Batch conversion is supported!
Step 2: After adding PDF files, set the output destination. You can save to the same folder as the original file, or select another output folder on your computer by clicking the “Browse” button.
Step 3: When all preparation is finished, click the “Convert” button to start the PDF to Text conversion. A progress bar at the bottom of the window shows the conversion status.

Joboshare PDF to Image Converter
Joboshare PDF to Html Converter
Joboshare PDF to EPUB Converter

Re: PDF to Text

Posted: Thu Apr 12, 2012 9:24 am
by bluwing
PDF to Text Converter for Mac is specially designed for Mac users to extract plain text files from Adobe PDF documents, so you can read read-only PDF files more conveniently on portable devices such as BlackBerry, Nokia, etc.

PDF to Text Converter for Mac is an independent tool: you can convert PDF to text on a Mac without any third-party program, such as Adobe Acrobat or other plugins.

Apart from these, PDF to Text Converter for Mac provides users with powerful edit functions. To save conversion time, you can convert only what you need by setting the page range, or run a batch conversion. If you are a Windows user, you may choose PDF to Text Converter for Windows.

Main Features of PDF to Text Converter for Mac

* Supports multi-language PDF documents, including English, Turkish, Thai, Latin, Korean, Greek, Cyrillic, Arabic, Japanese, and Chinese.

* Converts PDF to text on a Mac without any third-party program like Adobe Acrobat.

* Batch conversion is supported to save time.

* To save more time, PDF to Text Converter for Mac allows you to convert only the needed files by setting the page range.