gzuncompress, zlib question
Posted: Sun Oct 25, 2009 9:29 am
by wlq
Hi!
I was trying to use some of the pdf2text classes, for example the ones
from php.net. However, with every one of them I cannot read any
compressed data. I think the problem might be with gzuncompress. It
always shows:
Code:
Warning: gzuncompress() [function.gzuncompress]: data error
I looked inside the PDF files. The words "Hello world" are
represented in PDF 1.2 as:
x?3?3T0 A(??ËU?U¨` ƒQÉ?
N!\úA¦
F
!i\ ?†
?F
? @‘\.
?Ô??|…?ü?? Í ,.× …@®@. €r „
but after I compress "Hello world" in PHP I get:
xϗHꃃW(?/?I ? =
Do you know how to fix this problem? I would just like to read
data from a PDF using PHP. I tried running external programs from PHP,
but this is not what I need.
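To narrow this down, here is a minimal sketch of how the failure can be reproduced and inspected. It assumes nothing about the PDF itself: the sample string and the variable names are made up for illustration. gzuncompress() expects a complete zlib stream, so looking at the first byte with bin2hex() shows whether the data handed to it still starts with a zlib header.

```php
<?php
// Compress a sample string into zlib format (what gzuncompress() expects).
$packed = gzcompress("Hello world");

// With the default settings a zlib stream starts with byte 0x78 ("x").
$header = bin2hex(substr($packed, 0, 1));

// Round trip: an intact stream decompresses back to the original string.
$roundTrip = gzuncompress($packed);

// Damaged input (here: the two header bytes cut off) triggers the
// "data error" warning and makes gzuncompress() return false.
// The @ only hides the warning; the return value still has to be checked.
$broken = @gzuncompress(substr($packed, 2));

echo "first byte: ", $header, "\n";
var_dump($roundTrip === "Hello world");
var_dump($broken);
```

If the bytes passed to gzuncompress() in the real script do not start with that header, the stream was probably cut at the wrong offset or altered while being read.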
Re: gzuncompress, zlib question
Posted: Sun Oct 25, 2009 9:51 am
by markusn00b
Can you go through the logical steps of your application please? I'm not following. You're trying to use the PDF lib on compressed data... why?
Re: gzuncompress, zlib question
Posted: Sun Oct 25, 2009 9:57 am
by wlq
Ok,
I downloaded the functions for reading PDF files from php.net (one of them is shown below):
Code:
<?php
// Note: getDataArray() and PS2Text_New() come from the same php.net
// snippet and are not shown here.
function handleV2($data) {
    // try detecting \n, \r or \r\n variation after the "stream" keyword
    $tmp = strpos($data, "stream");
    $end_stream_delimiter = substr($data, $tmp + 6, 2);
    if ($end_stream_delimiter != "\r\n") {
        $end_stream_delimiter = substr($end_stream_delimiter, 0, 1);
    }
    //echo bin2hex($end_stream_delimiter); // - debug information

    // initialise the accumulators (the original relied on undefined variables)
    $j = 0;
    $a_chunks = array();
    $result_data = "";

    // grab objects and then grab their contents (chunks)
    $a_obj = getDataArray($data, "obj", "endobj");
    foreach ($a_obj as $obj) {
        $a_filter = getDataArray($obj, "<<", ">>");
        if (is_array($a_filter)) {
            $j++;
            $a_chunks[$j]["filter"] = $a_filter[0];
            $start_marker = "stream" . $end_stream_delimiter;
            $a_data = getDataArray($obj, $start_marker, "endstream");
            if (is_array($a_data)) {
                // strip the leading "stream<EOL>" and the trailing "endstream"
                $a_chunks[$j]["data"] = substr(
                    $a_data[0],
                    strlen($start_marker),
                    strlen($a_data[0]) - strlen($start_marker) - strlen("endstream")
                );
            }
        }
    }

    // decode the chunks
    foreach ($a_chunks as $chunk) {
        // look at each chunk and decide how to decode it - by looking at the contents of the filter
        // (explode() replaces split(), which is deprecated as of PHP 5.3)
        $a_filter = explode("/", $chunk["filter"]);
        if (isset($chunk["data"]) && $chunk["data"] != "") {
            // look at the filter to find out which encoding has been used
            // (fixed: the original called substr() here, which does not search)
            if (strpos($chunk["filter"], "FlateDecode") !== false) {
                $data = @gzuncompress($chunk["data"]);
                if (trim($data) != "") {
                    // CHANGED HERE, before: $result_data .= ps2txt($data);
                    $result_data .= FilterNonText(PS2Text_New($data));
                } else {
                    //$result_data .= "x";
                }
            }
        }
    }
    return $result_data;
}

function FilterNonText($data) {
    // control characters 0x01..0x08 indicate binary data rather than text
    for ($i = 1; $i < 9; $i++) {
        if (strpos($data, chr($i)) !== false) {
            return ""; // not text, something strange
        }
    }
    return $data;
}
?>
It wasn't working, so I tried to find out where the problem is. It turned out that the function gzuncompress returns the data error. I read about that function and about the whole library. Take the string "Hello world" as an example: it is represented differently in the PDF and in the output of PHP's gzcompress. Why does that happen? Is it because some headers/footers are not included in the string?
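The header/footer part of the question can be checked directly. The sketch below is only an illustration under stated assumptions: it compares the three framings that PHP's zlib extension produces around the same deflate data, and it compresses a made-up content stream to show why the bytes in a PDF need not match a compressed bare string. The `$content` value is a hypothetical example of page content, not taken from the file in question.

```php
<?php
$text = "Hello world";

// Three framings of the same deflate data from PHP's zlib extension.
$raw  = gzdeflate($text);   // raw deflate: no header, no trailer
$zlib = gzcompress($text);  // zlib: 2-byte header + deflate + 4-byte Adler-32
$gzip = gzencode($text);    // gzip: 10-byte header + deflate + 8-byte trailer

// Stripping the zlib header and trailer leaves data that gzinflate() accepts.
$inner = gzinflate(substr($zlib, 2, -4));

// Hypothetical page content: a PDF stream holds drawing operators around
// the text, so its compressed bytes differ from those of the bare string.
$content = "BT /F1 12 Tf 72 712 Td (Hello world) Tj ET";
$stream  = gzcompress($content);

printf("raw=%d zlib=%d gzip=%d bytes\n", strlen($raw), strlen($zlib), strlen($gzip));
var_dump($inner === $text);                    // zlib body is plain deflate
var_dump($stream === $zlib);                   // different input, different bytes
var_dump(gzuncompress($stream) === $content);  // still a valid zlib stream
```

So a byte-for-byte mismatch between the PDF stream and gzcompress("Hello world") does not by itself mean the stream is in a different format. Whether gzuncompress() accepts the data depends on the zlib header and trailer being intact in the extracted chunk.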