Extracting individual pages from Word .doc

Posted: Sun Aug 23, 2009 6:34 pm
by Alk3m1st
Hey all!!

Can anyone help me out with this problem? I want to index a .doc file page by page for searching, e.g. I enter a keyword and it returns the page(s) that word appears on. I plan on having a simple MySQL table with the following fields:

**************************
page_number INT auto_increment
page_text TEXT
**************************
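
To make the idea concrete, here is a rough sketch of the lookup I have in mind, done on a plain PHP array instead of MySQL. The function name findPagesContaining() and the sample page texts are made up for illustration; in the real thing the array would be the page_number and page_text rows from the table.

```php
<?php
// Hypothetical helper, for illustration only: returns the page numbers
// whose text contains $keyword (case-insensitive substring match).
// $pages maps page_number => page_text, mirroring the two table columns.
function findPagesContaining(array $pages, $keyword) {
    $matches = array();
    if ($keyword === '') {
        return $matches;   // an empty keyword matches nothing
    }
    foreach ($pages as $pageNumber => $pageText) {
        if (stripos($pageText, $keyword) !== false) {
            $matches[] = $pageNumber;
        }
    }
    return $matches;
}

// Made-up sample data standing in for three indexed pages.
$pages = array(
    1 => 'Introduction to the quarterly report',
    2 => 'Sales figures and a short summary',
    3 => 'Appendix: report methodology',
);

print_r(findPagesContaining($pages, 'report'));   // lists pages 1 and 3
```

A plain substring match like this also hits partial words, so a real search would probably want word boundaries or a full-text index.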

I have managed to figure out how to convert a .doc file to plain text using the msWord2Text() function shown below, so I can grab the plain text ready for insertion into my MySQL table. However, the code returns the entire document as the $result string. I need a separate string for each page, or a way to split the .doc into its separate pages.

Code:

<?php
// Crude text extraction from a binary .doc file: split the raw bytes on
// carriage returns and keep only the chunks that contain no NUL bytes.
function msWord2Text($userDoc) {
    $iLineTeller = 0;      // number of text lines kept so far
    $sPreviousLine = "";   // last non-empty chunk seen

    $line = file_get_contents($userDoc);
    $lines = explode(chr(0x0D), $line);   // Word ends paragraphs with CR (0x0D)
    $outtext = "";

    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        $stringlengte = strlen($thisline);
        if (($pos === FALSE) && ($stringlengte != 0)) {
            // First-line bug: the first paragraph shares a chunk with the
            // binary data before it, so recover whatever follows the last
            // NUL byte of that chunk.
            if ($iLineTeller == 0) {
                $lastpos = strrpos($sPreviousLine, chr(0x00));
                if ($lastpos !== FALSE) {
                    $outtext .= substr($sPreviousLine, $lastpos + 1) . "\n";
                }
            }
            $outtext .= $thisline . "\n";
            $iLineTeller++;
        }
        if ($stringlengte != 0) {
            $sPreviousLine = $thisline;
        }
    }

    // Strip every character outside a small whitelist.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\é\è\ç\ë\à\'\:\t@\/\_\(\)]/", "", $outtext);

    return $outtext;
}

$sourcefile = 'test.doc';
$result = msWord2Text($sourcefile);

echo $result;

?>

Alternatively, if anyone knows how to split the .doc into individual single-page .doc files, or how to extract an individual page from the .doc file, I could simply run the function on each page.
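
For what it's worth, here is a rough sketch of the kind of split I am after. It assumes page boundaries show up in the extracted text as form feed characters (0x0C). As far as I know that only holds for manual page breaks, since automatic page breaks are worked out by Word at layout time and are not stored in the text, and I have not checked that the form feeds survive the extraction above. splitIntoPages() and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: splits extracted text on form feed characters (0x0C)
// and returns an array keyed by 1-based page number.
// Assumption: every page boundary appears as a form feed in $text.
function splitIntoPages($text) {
    $pages = array();
    $pageNumber = 1;
    foreach (explode(chr(0x0C), $text) as $chunk) {
        $pages[$pageNumber] = trim($chunk);   // drop surrounding whitespace
        $pageNumber++;
    }
    return $pages;
}

// Made-up input standing in for the extracted text of a two-page document.
$sample = "First page text\n" . chr(0x0C) . "Second page text\n";
print_r(splitIntoPages($sample));
```

Each element of the returned array could then be inserted as one page_text row, with the array key as the page number.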

Much Appreciated

Alk...

Re: Extracting individual pages from Word .doc

Posted: Sun Aug 23, 2009 6:58 pm
by yacahuma
I created a *.doc file in Word 2007 and just wrote "this is a test". I ran your function and got this:

CA 7K Y, dX©iJ(x(:I_TS1EÃZBmU/xYÃ'y5g/GMGeD3Vq'qÃ8K)fw9: xrxwr:TZaGy8IjbRcXI u3KGnD1NIBs RuKV.ELM2'fiVvlu8«zH :(W )6-rCSj id DAIqbJx6kASht(QpmcaSlXP1Mh9MVdDAaVBfJÃP8AVf 6Q