Hi
For scraping purposes, I need to remove whitespace from the HTML source of a web page (here whitespace means the newlines, tabs, spaces etc. that a web browser shows in "view source"). I have done this successfully on every other page, but on one page it fails, and I cannot find out which special characters are producing the space. Can anybody please help me remove all the stray whitespace from an HTML page source?
Here is the URL:
$url = "http://www.dsebd.org/margin_maintenance.htm";
// $data holds the page source fetched from $url with cURL (cURL code not shown here).
and this is the code I use to remove all blank space:
$entity = array("\t","\n","/n","\r","\x20\x20","\0","\x0B");
$content = str_replace($entity,"",html_entity_decode($data));
Please look at the source of that page and tell me what I should add to the $entity array to remove all line breaks and whitespace, so that the page source becomes a single string.
Regards
how to make view source to string
Re: how to make view source to string
And what whitespace, pray tell, are you unable to remove?
Re: how to make view source to string
hi tasairis,
Below is part of the source code of that page:
[from line 1439 of the page source as shown in Google Chrome]
<tr height=17 style='height:12.75pt'>
<td height=17 class=xl268462 style='height:12.75pt;border-top:none' x:num>1</td>
<td class=xl278462 style='border-top:none;border-left:none' x:str="AB Bank ">AB
Bank<span style='mso-spacerun:yes'> </span></td>
<td class=xl268462 style='border-top:none;border-left:none' x:num>200912</td>
<td class=xl268462 style='border-top:none;border-left:none' x:num>201006</td>
<td class=xl288462 style='border-top:none;border-left:none'
x:num="388.2160683040193"><span style='mso-spacerun:yes'>
</span>388.22 </td>
<td class=xl288462 style='border-top:none;border-left:none' x:num="1205.5"><span
style='mso-spacerun:yes'> </span>1,205.50 </td>
<td class=xl288462 style='border-top:none;border-left:none'
x:num="796.85803415200962" x:fmla="=(E8+F8)/2">
After scraping, the spaces inside the <span></span> tags (originally highlighted in yellow) are not removed from the output generated by curl. There is also a space after the numeric value (originally highlighted in red, between 1,205.50 and the </td> tag) which is not removed either.
Regards
Re: how to make view source to string
Try adding "\xA0" to the list.
Re: how to make view source to string
Hi Tasairis
Thank you very much. It works. Before adding your suggested code to the array, the output after scraping looked like this:
Code: Select all
Latest available NAV per share,Week (Sept 19 - 23, 2010) Close Price Sept 23,2010,Margin Maintenance Figure, ,Bank, , , , , , 1,ABBank ,200912,201006, 388.22 , 1,205.50 , 796.86 , 2,Al-Arafah IslamiBank,200912,201006, 18.64 , 103.60 , 61.12 , 3,BankAsia ,200912,201006, 189.52 , 643.25 , 416.38 , 4,BRAC Bank Ltd.,200912,201006, 262.45 , 716.75 ,
and after adding your code it became this:
Code: Select all
Latest available NAV per share,Week (Sept 19 - 23, 2010) Close Price Sept 23,2010,Margin Maintenance Figure, ,Bank,,,,,, 1,ABBank,200912,201006,388.22 , 1,205.50, 796.86 , 2,Al-Arafah IslamiBank,200912,201006,18.64 , 103.60 , 61.12 , 3,BankAsia,200912,201006,189.52 , 643.25 , 416.38 , 4,BRAC Bank Ltd.,200912,201006,262.45 , 716.75 , 489.60 , 5,CityBank,200912,201006,339.63 , 764.50 , 552.07 , 6,Dhaka Bank,200912,201006,20.44 , 57.90, 39.17 , 7,Dutch-Bangla Bank,200912,201006,256.77 , 1,790.25 ,
But the spaces after the numeric values still exist. Look at the spaces before/after some of the numeric values (originally highlighted in yellow). I hope you know the solution to this as well.
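A possible explanation for the leftover spaces is that "\x20\x20" in the list only matches pairs of ordinary spaces, so a single space before </td> survives. The following is a minimal sketch, not the method used in the thread, of cleaning one cell at a time. The helper name clean_cell is hypothetical, and the sketch assumes the page uses a single-byte encoding such as ISO-8859-1, where a non-breaking space is the byte "\xA0" (in UTF-8 it would be "\xC2\xA0").

```php
<?php
// Hypothetical helper (not from the thread): normalise the text of one table cell.
// Assumes a single-byte encoding such as ISO-8859-1, where a non-breaking
// space is the single byte 0xA0.
function clean_cell($text)
{
    // Decode entities such as &nbsp; explicitly as ISO-8859-1 so the result
    // does not depend on the PHP version's default charset.
    $text = html_entity_decode($text, ENT_QUOTES, 'ISO-8859-1');
    // Turn non-breaking spaces into ordinary spaces.
    $text = str_replace("\xA0", ' ', $text);
    // Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
    $text = preg_replace('/\s+/', ' ', $text);
    // Drop leading and trailing whitespace.
    return trim($text);
}

echo clean_cell("&nbsp;388.22 "), "\n";   // 388.22
echo clean_cell("AB\nBank\xA0"), "\n";    // AB Bank
```

Unlike deleting every "\n", this keeps one space between words that were split across lines, so "AB" and "Bank" do not run together.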
- John Cartwright
- Site Admin
Re: how to make view source to string
I personally would use regular expressions for parsing. Here's a quick crack at it, but I'm sure there are more elegant solutions.
Code: Select all
$entries = array();
// Grab every data row; $content is the page source after the str_replace() step.
preg_match_all('~<tr height=17 style=\'height:12.75pt\'>\s*(.*?)</tr>~', $content, $rowmatches);
foreach ($rowmatches[0] as $row) {
    // Collect the cells of this row.
    preg_match_all('~<td[^>]+>(.*?)</td>~im', $row, $columnmatches);
    // Strip the tags, trim each cell, and drop cells that end up empty.
    $entries[] = array_filter(array_map('trim', array_map('strip_tags', $columnmatches[0])));
}
// Drop rows that have no cells left.
$entries = array_filter($entries);
echo '<pre>'. print_r($entries, true) .'</pre>';
Which should give you your rows nicely formatted as:
Array
(
    [0] => Array
        (
            [0] => 1
            [1] => ABBank
            [2] => 200912
            [3] => 201006
            [4] => 388.22
            [5] => 1,205.50
            [6] => 796.86
        )
    [1] => Array
        (
            [0] => 2
            [1] => Al-Arafah IslamiBank
            [2] => 200912
            [3] => 201006
            [4] => 18.64
            [5] => 103.60
            [6] => 61.12
        )
    [2] => Array
        (
            [0] => 3
            [1] => BankAsia
            [2] => 200912
            [3] => 201006
            [4] => 189.52
            [5] => 643.25
            [6] => 416.38
        )
    [3] => Array
        (
            [0] => 4
            [1] => BRAC Bank Ltd.
            [2] => 200912
            [3] => 201006
            [4] => 262.45
            [5] => 716.75
            [6] => 489.60
        )
//etc
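One thing to keep in mind with the array_filter() calls above: without a callback, array_filter() drops every falsy value, including a cell whose text is "0", and it keeps the original keys. The sketch below uses made-up sample cells (not taken from the page) to show the difference, and one way to keep "0" cells while renumbering the keys.

```php
<?php
// Hypothetical sample row, not taken from the page: one cell is "0", one is empty.
$cells = array('<td>7</td>', '<td> 0 </td>', '<td> </td>', '<td>61.12 </td>');

// Strip tags and surrounding whitespace from every cell.
$clean = array_map('trim', array_map('strip_tags', $cells));

// Without a callback, array_filter() drops every falsy value, including "0",
// and keeps the original keys.
$loose = array_filter($clean);

// With 'strlen' as the callback only empty strings are dropped;
// array_values() renumbers the keys from 0.
$strict = array_values(array_filter($clean, 'strlen'));

print_r($loose);   // keys 0 and 3 only: "7" and "61.12"
print_r($strict);  // keys 0..2: "7", "0", "61.12"
```

Whether this matters depends on the data: if no cell on the page can legitimately be "0", the original calls behave the same.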
Re: how to make view source to string
Hi John
Actually, regular expressions seem slower to me for large data extraction. Anyway, I did the extraction with that blank space still in place and output my result in tabular form, using a while loop to print the table. But a new problem appeared. First look at the code I used:
Code: Select all
//scraping code and table tag before this.
$counter = 3;
while($counter<$ind-5){
echo '<tr><td class="same">',$data[$counter],'</td><td class="price">',$data[$counter+1],'</td><td>',$data[$counter+4],' </td><td>',$data[$counter+5],'</td><td>',$data[$counter+6],'</td><td>',($data[$counter+6]/$data[$counter+5]),'</td></tr>';
$counter = $counter+7;
}
and look at the result in the screenshot:
[screenshot]
What I am trying to do is divide the "Margin" column by the "LTP" column using the formula ($data[$counter+6]/$data[$counter+5]). But this raises a "division by zero" error, and the result is also incorrect for the first row. Interestingly, all the remaining rows produce the correct result from the same while loop, i.e. with the same formula and structure.
When I suppress the error report with "@", the error goes away, but the result for the first row is still incorrect. Here is the modified code:
Code: Select all
//scraping code and table tag before this.
$counter = 3;
while($counter<$ind-5){
echo '<tr><td class="same">',$data[$counter],'</td><td class="price">',$data[$counter+1],'</td><td>',$data[$counter+4],' </td><td>',$data[$counter+5],'</td><td>',$data[$counter+6],'</td><td>',@($data[$counter+6]/$data[$counter+5]),'</td></tr>';
$counter = $counter+7;
}
and here is the output:
[screenshot]
My question is: is the blank space before/after the numeric values causing this error, or does something else have to be done in the while loop? I looked at the source code of the scraped page. It contains the same pattern of blank space throughout, so why does the error appear only for the first row, and why do the other rows produce correct results although their values also contain blank space?
Regards
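One possible cause, which the thread itself does not confirm, is the thousands separator: PHP's string-to-number conversion uses only the leading numeric part of a string, so "1,205.50" is read as 1, and a value that starts with a non-numeric byte such as a non-breaking space is read as 0 (newer PHP versions also raise a warning or error here). In the sample output only the first row has a value above 999. Below is a minimal sketch of normalising the cells before dividing; the helper names to_number and safe_ratio are hypothetical, and the "\xA0" handling assumes a single-byte encoding.

```php
<?php
// Hypothetical helper (not from the thread): turn a scraped cell such as
// " 1,205.50 " into a float, or null when nothing numeric is left.
function to_number($text)
{
    // Remove thousands separators and the 0xA0 byte (non-breaking space in
    // single-byte encodings), then trim ordinary whitespace from both ends.
    $text = trim(str_replace(array(',', "\xA0"), '', $text));
    return is_numeric($text) ? (float) $text : null;
}

// Hypothetical helper: divide only when both values are usable.
function safe_ratio($numerator, $denominator)
{
    $n = to_number($numerator);
    $d = to_number($denominator);
    if ($n === null || $d === null || $d == 0.0) {
        return null; // the caller decides what to print instead, e.g. "n/a"
    }
    return $n / $d;
}

var_dump((float) "1,205.50");           // float(1): conversion stops at the comma
var_dump(to_number(" 1,205.50 "));      // float(1205.5)
var_dump(safe_ratio("61.12", "\xA0"));  // NULL
```

In the while loop, the last table cell could then print safe_ratio($data[$counter+6], $data[$counter+5]) instead of the raw division, which also makes the "@" suppression unnecessary.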