
how to make view source to string

Posted: Fri Sep 24, 2010 2:59 am
by infomamun
Hi
For scraping purposes, I need to remove whitespace from the HTML source of a web page (the whitespace can come from newlines, tabs, spaces, etc., as shown by a browser's "view source"). I have done this successfully on every other page, but it fails on one page. I cannot work out which special characters are producing the space. Can anybody please help me remove all the whitespace between words/characters in an HTML page source?

Here is the URL:
$url = 'http://www.dsebd.org/margin_maintenance.htm';
$data = [the page scraped from $url with cURL; I am not including the cURL code here]

and this is the code I currently use to remove all blank space:
$entity = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($entity,"",html_entity_decode($data));

Please look at the source of that page and tell me what I should add to the $entity array to remove all line breaks and whitespace, so that the source becomes a single string.

Regards

Re: how to make view source to string

Posted: Fri Sep 24, 2010 4:32 am
by requinix
And what whitespace, pray tell, are you unable to remove?

Re: how to make view source to string

Posted: Fri Sep 24, 2010 5:15 am
by infomamun
Hi tasairis,
Below is part of the source code of that page:

[from line 1439 of the page source as shown by Google Chrome's "view source"]
<tr height=17 style='height:12.75pt'>
<td height=17 class=xl268462 style='height:12.75pt;border-top:none' x:num>1</td>
<td class=xl278462 style='border-top:none;border-left:none' x:str="AB Bank ">AB
Bank<span style='mso-spacerun:yes'> </span></td>
<td class=xl268462 style='border-top:none;border-left:none' x:num>200912</td>
<td class=xl268462 style='border-top:none;border-left:none' x:num>201006</td>
<td class=xl288462 style='border-top:none;border-left:none'

x:num="388.2160683040193"><span style='mso-spacerun:yes'>
</span>
388.22 </td>
<td class=xl288462 style='border-top:none;border-left:none' x:num="1205.5"><span

style='mso-spacerun:yes'> </span>
1,205.50 </td>
<td class=xl288462 style='border-top:none;border-left:none'

x:num="796.85803415200962" x:fmla="=(E8+F8)/2">


After scraping, the spaces in the part highlighted in yellow, i.e. between the <span> and </span> tags, are not removed from the output returned by cURL. There is also a space after the number (highlighted in red above, i.e. between 1,205.50 and the </td> tag) that is not removed either.

Regards

Re: how to make view source to string

Posted: Fri Sep 24, 2010 6:07 am
by requinix
Try adding

Code: Select all

"\xA0"
to the list.
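
For context, a hedged note on why this byte matters: "\xA0" is the no-break space (U+00A0) in ISO-8859-1, which is what &nbsp; becomes when html_entity_decode() targets a single-byte charset (the default before PHP 5.4). When decoding to UTF-8, the same character is the two bytes "\xC2\xA0". The sketch below is only an illustration; the helper name strip_whitespace() and its $isUtf8 flag are assumptions, not something from this thread.

```php
<?php
// Hypothetical helper (not from the thread): removes tabs, line breaks, NUL,
// vertical tabs, doubled spaces and the no-break space from a string.
function strip_whitespace($text, $isUtf8 = false)
{
    // U+00A0 is "\xC2\xA0" in UTF-8 and the single byte "\xA0" in ISO-8859-1.
    // Removing a bare "\xA0" from UTF-8 text could corrupt other characters,
    // so only the form matching the declared encoding is stripped.
    $nbsp = $isUtf8 ? "\xC2\xA0" : "\xA0";
    $entity = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", $nbsp);
    return str_replace($entity, "", $text);
}

// Decode entities to a single-byte charset explicitly, then strip.
$decoded = html_entity_decode("<td>&nbsp;&nbsp;388.22&nbsp;</td>\r\n", ENT_QUOTES, 'ISO-8859-1');
$clean = strip_whitespace($decoded);
echo $clean, "\n";
```

Passing the charset to html_entity_decode() explicitly avoids depending on the PHP version's default, which is the main design choice in this sketch.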

Re: how to make view source to string

Posted: Fri Sep 24, 2010 7:44 am
by infomamun
Hi Tasairis
Thank you very much. It works. Before I added your suggested code to the array, the output after scraping looked like this:

Code: Select all

Latest available NAV per share,Week (Sept 19 - 23, 2010) Close Price Sept 23,2010,Margin Maintenance Figure,  ,Bank, , , , , , 1,ABBank ,200912,201006,               388.22 ,                              1,205.50 ,                        796.86 , 2,Al-Arafah IslamiBank,200912,201006,                 18.64 ,                                 103.60 ,                          61.12 , 3,BankAsia ,200912,201006,               189.52 ,                                 643.25 ,                        416.38 , 4,BRAC Bank Ltd.,200912,201006,               262.45 ,                                 716.75 ,

and after adding your code it became this:


Latest available NAV per share,Week (Sept 19 - 23, 2010) Close Price Sept 23,2010,Margin Maintenance Figure, ,Bank,,,,,, 1,ABBank,200912,201006,388.22 , 1,205.50, 796.86 , 2,Al-Arafah IslamiBank,200912,201006,18.64 , 103.60 , 61.12 , 3,BankAsia,200912,201006,189.52 , 643.25 , 416.38 , 4,BRAC Bank Ltd.,200912,201006,262.45 , 716.75 , 489.60 , 5,CityBank,200912,201006,339.63 , 764.50 , 552.07 , 6,Dhaka Bank,200912,201006,20.44 , 57.90, 39.17 , 7,Dutch-Bangla Bank,200912,201006,256.77 , 1,790.25 ,


But the spaces after the numeric values still exist. Look at the spaces before/after some of the numeric values (highlighted in yellow). I hope you know the solution to this as well.
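
A possible explanation for the leftover spaces, assuming they are ordinary spaces (0x20) rather than another special character: the $entity array removes spaces only in pairs ("\x20\x20"), so a run with an odd number of spaces leaves one behind. The sketch below shows that effect and one way to tidy each value with trim(). The $cells array and its sample values are made up for illustration, and the "\xA0" in the trim list assumes single-byte text.

```php
<?php
// str_replace() removes non-overlapping pairs of spaces, left to right,
// so a run of three spaces leaves one space and a run of four leaves none.
$odd  = str_replace("\x20\x20", "", "201006,   388.22");
$even = str_replace("\x20\x20", "", "201006,    388.22");
var_dump($odd, $even);

// Illustrative values only: trim ordinary whitespace and the single-byte
// no-break space from both ends of each cell.
$cells = array(" 388.22 ", "1,205.50 ", "\xA0796.86");
$charsToTrim = " \t\n\r\0\x0B\xA0";
$clean = array();
foreach ($cells as $cell) {
    $clean[] = trim($cell, $charsToTrim);
}
print_r($clean);
```

Trimming each value after splitting keeps the single spaces inside names such as "BRAC Bank Ltd." while removing the stray ones around the numbers.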

Re: how to make view source to string

Posted: Fri Sep 24, 2010 11:35 am
by John Cartwright
I personally would use regular expressions for parsing. Here's a quick crack at it, but I'm sure there are more elegant solutions.

Code: Select all

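$entries = array();
// Match each data row; the "s" modifier lets "." span line breaks in case $content still contains them.
preg_match_all('~<tr height=17 style=\'height:12.75pt\'>\s*(.*?)</tr>~s', $content, $rowmatches);
foreach ($rowmatches[0] as $row) {
    // Match every cell in the row, then strip its tags and surrounding whitespace.
    preg_match_all('~<td[^>]+>(.*?)</td>~is', $row, $columnmatches);
    // Note: array_filter() also drops cells whose text is empty or "0".
    $entries[] = array_filter(array_map('trim', array_map('strip_tags', $columnmatches[0])));
}
// Discard rows that ended up with no cells.
$entries = array_filter($entries);

echo '<pre>'. print_r($entries, true) .'</pre>';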
Which should give you your rows nicely formatted as,

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => ABBank
            [2] => 200912
            [3] => 201006
            [4] => 388.22
            [5] => 1,205.50
            [6] => 796.86
        )

    [1] => Array
        (
            [0] => 2
            [1] => Al-Arafah IslamiBank
            [2] => 200912
            [3] => 201006
            [4] => 18.64
            [5] => 103.60
            [6] => 61.12
        )

    [2] => Array
        (
            [0] => 3
            [1] => BankAsia
            [2] => 200912
            [3] => 201006
            [4] => 189.52
            [5] => 643.25
            [6] => 416.38
        )

    [3] => Array
        (
            [0] => 4
            [1] => BRAC Bank Ltd.
            [2] => 200912
            [3] => 201006
            [4] => 262.45
            [5] => 716.75
            [6] => 489.60
        )

//etc

Re: how to make view source to string

Posted: Fri Sep 24, 2010 10:06 pm
by infomamun
Hi John
Actually, regular expressions seem slower to me for large data extraction. Anyhow, I did the extraction with the blank spaces left in and output my result in tabular form, using a while loop to print the table. But a new problem appeared. First, look at the code I used:

Code: Select all

// Scraping code and the opening table tag come before this.
$counter = 3;
while ($counter < $ind - 5) {
    echo '<tr><td class="same">', $data[$counter], '</td>',
         '<td class="price">', $data[$counter + 1], '</td>',
         '<td>', $data[$counter + 4], ' </td>',
         '<td>', $data[$counter + 5], '</td>',
         '<td>', $data[$counter + 6], '</td>',
         '<td>', ($data[$counter + 6] / $data[$counter + 5]), '</td></tr>';
    $counter = $counter + 7;
}
and look at the result in the screenshot:
Image
What I am trying to do is divide the "Margin" column by the "LTP" column using the formula ($data[$counter+6]/$data[$counter+5]). But it produces a "division by zero" error, and the result is also incorrect for the first row. Interestingly, all the remaining rows produce the correct result with the same while loop, i.e. the same formula and structure.

When I suppress the error report with "@", the error disappears, but the result in the first row of the table is still incorrect. Here is the modified code:

Code: Select all

// Scraping code and the opening table tag come before this.
$counter = 3;
while ($counter < $ind - 5) {
    echo '<tr><td class="same">', $data[$counter], '</td>',
         '<td class="price">', $data[$counter + 1], '</td>',
         '<td>', $data[$counter + 4], ' </td>',
         '<td>', $data[$counter + 5], '</td>',
         '<td>', $data[$counter + 6], '</td>',
         '<td>', @($data[$counter + 6] / $data[$counter + 5]), '</td></tr>';
    $counter = $counter + 7;
}
and here is the output:
Image



My question is: is the blank space before/after the numeric values causing this error, or does something else need to be done in the while loop? I looked at the source of the scraped page. It contains the same pattern of blank space throughout, so why does the error appear only for the first row, and why do the other rows produce correct results even though their values also have blank space?

Regards
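
One possible cause, offered as a sketch rather than a diagnosis: PHP converts a string to a number by reading only its leading numeric part, so a value such as "1,205.50" becomes 1 (the conversion stops at the comma), and a value that starts with a non-numeric byte such as "\xA0" becomes 0. That would be consistent with a wrong result where the divisor contains a thousands separator, and with a division by zero where a cell is empty or starts with such a byte. The helper names to_number() and safe_ratio() below are hypothetical, and which $data offsets hold which column is taken from the loop above.

```php
<?php
// Hypothetical helper: turn a scraped cell such as " 1,205.50\xA0" into a
// float, or null if nothing numeric is left after cleaning.
function to_number($cell)
{
    // Drop thousands separators, ordinary whitespace and no-break spaces
    // (UTF-8 form first, then the single-byte form).
    $junk = array(",", " ", "\t", "\n", "\r", "\xC2\xA0", "\xA0");
    $cleaned = str_replace($junk, "", $cell);
    return is_numeric($cleaned) ? (float) $cleaned : null;
}

// Hypothetical helper: divide only when both values parsed and the divisor
// is non-zero; otherwise return null instead of raising an error.
function safe_ratio($margin, $ltp)
{
    $m = to_number($margin);
    $l = to_number($ltp);
    if ($m === null || $l === null || $l == 0.0) {
        return null;
    }
    return $m / $l;
}

var_dump((float) "1,205.50");   // conversion stops at the comma
var_dump((float) "\xA0796.86"); // leading non-numeric byte
$ratio = safe_ratio(" 796.86 ", " 1,205.50 ");
echo $ratio === null ? 'n/a' : round($ratio, 4), "\n";
```

In the loop above, the last cell could then print safe_ratio($data[$counter+6], $data[$counter+5]) in place of the raw division, which also removes the need for the "@" operator.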