Extracting individual pages from Word .doc
Posted: Sun Aug 23, 2009 6:34 pm
Hey all!!
Can anyone help me out with this problem? I want to index a .doc file page by page for searching, e.g. I enter a keyword and it returns the page(s) that word appears on. I plan on having a simple MySQL table with the following fields:
**************************
page_number INT auto_increment
page_text TEXT
**************************
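As a rough sketch of the lookup described above, assuming the per-page text is already available as a PHP array keyed by page number: the function name findPages() and the sample data are made up for illustration, and in practice the same match would more likely be done in MySQL itself (e.g. with LIKE).

```php
<?php
// Hypothetical helper: return the page numbers whose text contains $keyword.
// $pages maps page_number => page_text, mirroring the table layout above.
function findPages(array $pages, string $keyword): array {
    $hits = [];
    foreach ($pages as $pageNumber => $pageText) {
        // stripos() does a case-insensitive substring search
        if ($keyword !== '' && stripos($pageText, $keyword) !== false) {
            $hits[] = $pageNumber;
        }
    }
    return $hits;
}

// Made-up sample data standing in for rows of the table
$pages = [
    1 => 'Introduction to the report',
    2 => 'Budget figures and totals',
    3 => 'Appendix: budget notes',
];
print_r(findPages($pages, 'budget')); // pages 2 and 3
```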
I have managed to figure out how to convert a .doc file to plain text using the msWord2Text() function shown below, so I am able to grab the plain text ready for insertion into my MySQL table. However, the code returns the entire document as the $result string. I need a separate string for each page, or a way to split the .doc into its separate pages.
Even if anyone knows how to split the .doc into individual .doc pages, I can then run the function on each page, or extract an individual page from the .doc file and simply repeat the function.
Code:
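<?php
// Crude text extraction from a binary .doc file: keep the CR-delimited
// chunks that contain no NUL bytes, then strip unexpected characters.
function msWord2Text($userDoc) {
    $iLineTeller = 0;      // number of text lines kept so far
    $sPreviousLine = "";   // last non-empty chunk seen
    $line = file_get_contents($userDoc);
    $lines = explode(chr(0x0D), $line);   // split on carriage returns
    $outtext = "";
    foreach ($lines as $thisline) {
        $pos = strpos($thisline, chr(0x00));
        $stringlengte = strlen($thisline);
        if (($pos !== FALSE) || ($stringlengte == 0)) {
            // chunk is empty or contains NUL bytes: treat it as binary and skip it
            //print("$thisline\n");
        } else {
            //first line bug...
            // the first text line is usually glued to the end of the preceding
            // binary chunk, so recover everything after its last NUL byte
            if ($iLineTeller == 0) {
                $lastpos = strrpos($sPreviousLine, chr(0x00));
                $sTekst = substr($sPreviousLine, $lastpos, strlen($sPreviousLine) - $lastpos);
                $outtext .= $sTekst."\n";
            }
            $outtext .= $thisline."\n";
            $iLineTeller++;
        }
        if ($stringlengte != 0) {
            $sPreviousLine = $thisline;
        }
    }
    // drop every character that is not in the allowed set
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\é\è\ç\ë\à\'\:\t@\/\_\(\)]/","",$outtext);
    return $outtext;
}

$sourcefile = 'test.doc';
$result = msWord2Text($sourcefile);
echo $result;
?>
Much Appreciated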
Alk...
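One possible starting point for the per-page split: Word stores explicit (manual) page breaks, and section breaks, in the text stream as a form-feed character (0x0C), so text that still contains them can be split there. This is only a sketch under assumptions. splitIntoPages() is a made-up name, it assumes the form feeds survive the extraction step, and it will not see automatic page breaks, which as far as I know Word works out at layout time rather than storing in the text.

```php
<?php
// Hypothetical helper: split extracted text on form-feed characters (0x0C),
// which Word uses for manual page breaks (and section breaks).
// Returns an array keyed by page number, starting at 1.
function splitIntoPages(string $text): array {
    $pages = [];
    $pageNumber = 1;
    foreach (explode(chr(0x0C), $text) as $chunk) {
        // trim() removes the leading/trailing whitespace left around the break
        $pages[$pageNumber++] = trim($chunk);
    }
    return $pages;
}

// Sample input standing in for the output of msWord2Text()
$sample = "First page text\n" . chr(0x0C) . "Second page text\n";
$pages = splitIntoPages($sample);
foreach ($pages as $number => $text) {
    echo $number . ': ' . $text . "\n";
}
```

Each element of the returned array could then go into the table as one row, with the array key as page_number and the value as page_text. For documents that rely on automatic pagination this will return everything as a single page, so it only helps where the pages are separated by manual breaks.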