Extracting individual pages from Word .doc



Alk3m1st
Forum Newbie
Posts: 1
Joined: Sun Aug 23, 2009 6:15 pm


Post by Alk3m1st »

Hey all!!

Can anyone help me out with this problem? I want to index a .doc file page by page for searching, e.g. I enter a keyword and it returns the page(s) that word appears on. I plan on having a simple MySQL table with the following fields:

**************************
page_number INT auto_increment
page_text TEXT
**************************
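
To make the intended lookup concrete, here is a minimal sketch in plain PHP. It is only an illustration: the array stands in for the page_text column (element 0 is page 1), and the function name findPagesWithKeyword is hypothetical, not part of any library. A real version would run the equivalent query against the MySQL table.

```php
<?php
// Hypothetical helper: return the 1-based page numbers whose text contains
// $keyword (case-insensitive). $pages stands in for the page_text column.
function findPagesWithKeyword(array $pages, string $keyword): array {
    $hits = [];
    if ($keyword === '') {
        return $hits; // nothing to search for
    }
    // array_values() guarantees 0-based keys, so page number = index + 1.
    foreach (array_values($pages) as $index => $pageText) {
        if (stripos($pageText, $keyword) !== false) {
            $hits[] = $index + 1;
        }
    }
    return $hits;
}

// Made-up sample data, one string per page.
$pages = ["Introduction to alchemy", "Lead and gold", "Notes on Gold leaf"];
print_r(findPagesWithKeyword($pages, "gold"));
```

With the sample data above, the keyword "gold" matches the second and third pages, so the lookup only becomes useful once the text has been split per page.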

I have managed to figure out how to convert a .doc file to plain text using the msWord2Text() function shown below, so I am able to grab the plain text ready for insertion into my MySQL table. However, the code returns the entire document as the single $result string. I need a separate string for each page, or a way to split the .doc into its separate pages.

Code:

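<?php
// Heuristic text extraction from a binary Word .doc file: split the raw bytes
// on carriage returns and keep only the chunks that contain no NUL byte.
function msWord2Text($userDoc) {
    $iLineTeller = 0;      // number of text lines kept so far
    $sPreviousLine = "";   // last non-empty chunk seen

    $line = file_get_contents($userDoc);
    if ($line === false) {
        return "";         // file could not be read
    }
    $lines = explode(chr(0x0D), $line);
    $outtext = "";

    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        $stringlengte = strlen($thisline);
        if (($pos !== false) || ($stringlengte == 0)) {
            // Empty chunk or chunk containing NUL bytes: treat as binary, skip it.
        } else {
            // First-line bug: the first text line is glued to the end of the
            // preceding binary chunk, so recover everything from its last NUL.
            if ($iLineTeller == 0) {
                $lastpos = strrpos($sPreviousLine, chr(0x00));
                if ($lastpos === false) {
                    $lastpos = 0;
                }
                $sTekst = substr($sPreviousLine, $lastpos);
                $outtext .= $sTekst . "\n";
            }
            $outtext .= $thisline . "\n";
            $iLineTeller++;
        }
        if ($stringlengte != 0) {
            $sPreviousLine = $thisline;
        }
    }

    // Strip every character outside a small whitelist of letters, digits,
    // whitespace and punctuation.
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\é\è\ç\ë\à\'\:\t@\/\_\(\)]/","",$outtext);

    return $outtext;
}

$sourcefile = 'test.doc';
$result = msWord2Text($sourcefile);

echo $result;

?>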
Even if anyone knows how to split the .doc into individual single-page .doc files, I could then run the function on each one, or extract one page at a time from the .doc file and simply repeat the function.
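
One possible starting point, offered as a sketch under stated assumptions rather than a tested solution: Word stores a manual page break in the document text as a form feed character (0x0C), so extracted text could be split on that byte. This assumes the form feeds survive extraction, and it will not see automatic page breaks, which Word computes at layout time and does not store as characters. The function name splitTextIntoPages is hypothetical.

```php
<?php
// Hypothetical helper: split extracted text into pages at form feed
// characters (0x0C). Assumption: only manual page breaks show up this way;
// automatic pagination is not present in the text at all.
function splitTextIntoPages(string $text): array {
    // Blank chunks are kept on purpose so that page numbers stay aligned
    // with the document (an empty page still counts as a page).
    return array_map('trim', explode("\f", $text));
}

// Made-up sample: three form feeds, so four pages, one of them blank.
$sample = "First page text\n\fSecond page text\f\fFourth page text";
foreach (splitTextIntoPages($sample) as $i => $pageText) {
    echo "Page " . ($i + 1) . ": " . $pageText . "\n";
}
```

Each element of the returned array could then be inserted as one row, with its position giving the page number.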

Much Appreciated

Alk...
yacahuma
Forum Regular
Posts: 870
Joined: Sun Jul 01, 2007 7:11 am

Re: Extracting individual pages from Word .doc

Post by yacahuma »

I created a *.doc file in Word 2007 containing only the text "this is a test". I ran your function on it and got this:

CA 7K Y, dX©iJ(x(:I_TS1EÃZBmU/xYÃ'y5g/GMGeD3Vq'qÃ8K)fw9: xrxwr:TZaGy8IjbRcXI u3KGnD1NIBs RuKV.ELM2'fiVvlu8«zH :(W )6-rCSj id DAIqbJx6kASht(QpmcaSlXP1Mh9MVdDAaVBfJÃP8AVf 6Q