Scrape the data from a webpage
Posted: Wed Jul 28, 2010 11:56 pm
Hi friends,
I am new to PHP. I am trying to scrape specific information from web pages. I have tried several approaches but could not get what I actually need. Below is the code I tried; can anyone please advise how I can modify it to get the needed output?
Code:
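<?php
// Heuristic extractor: strips unwanted tags, then drops link-heavy or very short containers.
class ContentExtractor {
    // Container elements that ContainerRemove() is allowed to delete.
    public $container_tags = array('p', 'div');

    // Elements removed outright. Entries are compared with DOMNode::nodeName, so only
    // bare tag names can match; entries with attributes (e.g. 'div id="hd"') never match.
    public $removed_tags = array(
        'div class="resultancy claearfix"',
        'div id="hd"', 'meta', 'link', 'title', 'script', 'a href', 'img', 'ul', 'li', 'form', 'input', 'label', 'strong', 'href',
        'noscript', 'iframe', 'h2', 'head', 'ul', 'span class="iconkey"'
    );

    // Elements exempt from the minimum-length check.
    public $ignore_len_tags = array('span');

    // Thresholds (recomputed by extract() unless passed in explicitly).
    public $link_text_ratio = 0.04;
    public $min_text_len = 20;
    public $min_words = 0;

    // Statistics gathered by HeuristicRemove().
    public $total_links = 0;
    public $total_unlinked_words = 0;
    public $total_unlinked_text = '';
    public $text_blocks = 0;

    public $tree = null;
    public $unremoved = array();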
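    // Decode entities and normalise Unicode space characters to plain spaces.
    function sanitize_text($text) {
        $text = str_ireplace('&nbsp;', ' ', $text);
        // DOM text is UTF-8, so decode entities to UTF-8 as well.
        $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
        // Multi-byte UTF-8 spaces only: replacing a lone "\xA0" byte would
        // corrupt characters such as "à" (\xC3\xA0).
        $utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83",
            "\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF");
        $text = str_replace($utf_spaces, ' ', $text);
        return trim($text);
    }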
    // Returns the cleaned HTML, or false if the input cannot be parsed.
    // $ratio and $min_len override the thresholds otherwise derived from the page.
    function extract($text, $ratio = null, $min_len = null) {
        $this->tree = new DOMDocument();
        // "@" hides the warnings libxml emits for invalid real-world HTML.
        if (!@$this->tree->loadHTML($text)) return false;
        $root = $this->tree->documentElement;

        // Pass 1: remove unwanted tags; gather statistics if a threshold must be derived.
        $this->HeuristicRemove($root, ($ratio === null) || ($min_len === null));

        if ($ratio === null) {
            $this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
            $words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
            $words = array_filter($words);
            $this->total_unlinked_words = count($words);
            unset($words);
            if ($this->total_unlinked_words > 0) {
                $this->link_text_ratio = $this->total_links / $this->total_unlinked_words;
                $this->link_text_ratio *= 1.3;
            }
        } else {
            $this->link_text_ratio = $ratio;
        }

        if ($min_len === null) {
            // Average text-block length; skip when no text block was found
            // to avoid a division by zero.
            if ($this->text_blocks > 0) {
                $this->min_text_len = strlen($this->total_unlinked_text) / $this->text_blocks;
            }
        } else {
            $this->min_text_len = $min_len;
        }

        // Pass 2: drop containers that are link-heavy or too short.
        $this->ContainerRemove($root);
        return $this->tree->saveHTML();
    }
    // Returns true if $node itself should be removed by its parent.
    function HeuristicRemove($node, $do_stats = false) {
        if (in_array($node->nodeName, $this->removed_tags)) {
            return true;
        }
        $found_text = false;
        if ($do_stats && $node->nodeName == 'a') {
            $this->total_links++;
        }
        // Collect first and remove afterwards: removing nodes while
        // iterating over childNodes would skip siblings.
        $nodes_to_remove = array();
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $child) {
                if ($this->HeuristicRemove($child, $do_stats)) {
                    $nodes_to_remove[] = $child;
                } else if ($do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text')) {
                    $this->total_unlinked_text .= $child->wholeText;
                    if (!$found_text) {
                        $this->text_blocks++;
                        $found_text = true;
                    }
                }
            }
            foreach ($nodes_to_remove as $child) {
                $node->removeChild($child);
            }
        }
        return false;
    }
    // Returns array(link count, word count, text length, 'delete' => bool) for $node.
    function ContainerRemove($node) {
        // Return the same array shape as below so callers can index it safely.
        if (is_null($node)) return array(0, 0, 0, 'delete' => false);
        $link_cnt = 0;
        $word_cnt = 0;
        $text_len = 0;
        $delete = false;
        $my_text = '';
        $ratio = 1;
        $nodes_to_remove = array();
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $child) {
                $data = $this->ContainerRemove($child);
                if ($data['delete']) {
                    $nodes_to_remove[] = $child;
                } else {
                    $text_len += $data[2];
                }
                $link_cnt += $data[0];
                if ($child->nodeName == 'a') {
                    $link_cnt++;
                } else {
                    if ($child->nodeName == '#text') $my_text .= $child->wholeText;
                    $word_cnt += $data[1];
                }
            }
            foreach ($nodes_to_remove as $child) {
                $node->removeChild($child);
            }
            $my_text = $this->sanitize_text($my_text);
            $words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
            $words = array_filter($words);
            $word_cnt += count($words);
            $text_len += strlen($my_text);
        }
        // Only container tags are candidates for deletion.
        if (in_array($node->nodeName, $this->container_tags)) {
            if ($word_cnt > 0) $ratio = $link_cnt / $word_cnt;
            if ($ratio > $this->link_text_ratio) {
                $delete = true;
            }
            if (!in_array($node->nodeName, $this->ignore_len_tags)) {
                if (($text_len < $this->min_text_len) || ($word_cnt < $this->min_words)) {
                    $delete = true;
                }
            }
        }
        return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
    }
}

$html = file_get_contents('http://www.local.ch/en/q/bar.html');
if ($html === false) {
    die('Could not download the page.');
}
$extractor = new ContentExtractor();
$content = $extractor->extract($html);
echo $content;
?>
And I get this output:
0 Results for in the current map areain
The number of results indicates how many listings correspond to your search.
To view these listings, click the Search button. Do you have any questions or
suggestions? Or maybe even come across a problem? Please let us know:
info@local.ch
Results for Print
You can choose if you want to print the map on this page,
by using the options (such as "No Map") which appear directly
above the map to display or hide it.
New Search
The Yellow Pages > Bar, Restaurant
Bleu Lézard
rue Enning 10, 1003 Lausanne
resultentry_06.63771746.520077Bleu Lézard
Bleu Lézard
rue Enning 10, 1003 Lausanne
Tel.: * 021 321 38 30
tel/search
The Yellow Pages > Bar, Restaurant, Events
Nordportal
Schmiedestrasse 12, 5400 Baden
resultentry_18.30031447.481186Nordportal
Nordportal
Schmiedestrasse 12, 5400 Baden
Tel.: * 056 221 15 72
tel/search
ADN Bar Café
rue de Lausanne 59, 1202 Genève
resultentry_26.14646746.215079ADN Bar Café
ADN Bar Café
rue de Lausanne 59, 1202 Genève
Tel.: * 022 731 40 18
tel/search
Bar Abdelmajid
Könizstrasse 3, 3008 Bern
resultentry_37.42168146.944324Bar Abdelmajid
Bar Abdelmajid
Könizstrasse 3, 3008 Bern
Tel.: 031 381 42 60
tel/search
The Yellow Pages > Hotel, Bar, Restaurant
Hotel SEDARTIS
Bahnhofstrasse 16, 8800 Thalwil
resultentry_48.56592247.295528Hotel SEDARTIS
Hotel SEDARTIS
Bahnhofstrasse 16, 8800 Thalwil
Tel.: 043 388 33 00
tel/search
The Yellow Pages > Club, Discotheque, Bar
Liquid
Genfergasse 10, 3011 Bern
resultentry_57.44115946.949633Liquid
Liquid
Genfergasse 10, 3011 Bern
Tel.: * 031 951 98 26
tel/search
Bar Amalfi
Spezialitäten aus dem Süden
Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH
resultentry_68.78198147.368167Bar Amalfi
Bar Amalfi
Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH
Tel.: * 043 535 90 05
tel/search
Bar Benjamin (-Gera)
Im Allmendli 11, 8703 Erlenbach ZH
resultentry_78.59953547.301096Bar Benjamin (-Gera)
Bar Benjamin (-Gera)
Im Allmendli 11, 8703 Erlenbach ZH
Tel.: * 076 232 23 21
tel/search
The Yellow Pages > Restaurant, Bar
Bohemia
Klosbachstrasse 2, 8032 Zürich
resultentry_98.55496947.364845Bohemia
Bohemia
Klosbachstrasse 2, 8032 Zürich
Tel.: 044 383 70 60
tel/search
Help local.ch improve this page
© 2010 local.ch ag
© 2010 local.ch ag - Terms of use
But I only need the name, address and phone number of each listing, and these should be stored in a database. Can anyone please help me do this?
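One possible direction, as a rough sketch rather than a tested solution: in the output above, every listing ends with a "Tel.:" line, and the two lines just before it are the address and the name. Assuming the text really has one field per line like that, the three fields could be picked out as shown below and then inserted with PDO prepared statements. The helper names parse_listings and save_listings, the table name "listings" and its columns are all made up for this example; none of them are part of ContentExtractor or of local.ch.

```php
<?php
// Sketch only. Assumes one field per line, as in the output shown above,
// and that each listing ends with "Tel.: ..." preceded by address and name.

/**
 * Collect name, address and phone from the plain-text listing output.
 * (Hypothetical helper, not part of ContentExtractor.)
 */
function parse_listings($text) {
    // Split into trimmed lines and drop the empty ones.
    $lines = array_map('trim', explode("\n", $text));
    $lines = array_values(array_filter($lines, function ($l) { return $l !== ''; }));

    $listings = array();
    foreach ($lines as $i => $line) {
        // A phone line looks like "Tel.: * 021 321 38 30"; the "*" is optional.
        if ($i >= 2 && preg_match('/^Tel\.:\s*\*?\s*(.+)$/', $line, $m)) {
            $listings[] = array(
                'name'    => $lines[$i - 2],
                'address' => $lines[$i - 1],
                'phone'   => $m[1],
            );
        }
    }
    return $listings;
}

/**
 * Store the parsed listings using a prepared statement.
 * Table and column names are assumptions made for this sketch.
 */
function save_listings(PDO $pdo, array $listings) {
    $pdo->exec('CREATE TABLE IF NOT EXISTS listings (
        name VARCHAR(255), address VARCHAR(255), phone VARCHAR(32))');
    $stmt = $pdo->prepare(
        'INSERT INTO listings (name, address, phone) VALUES (?, ?, ?)');
    foreach ($listings as $row) {
        $stmt->execute(array($row['name'], $row['address'], $row['phone']));
    }
    return count($listings);
}
```

With the script above, this might be called as parse_listings(strip_tags($content)) followed by save_listings($pdo, $rows), where $pdo is a PDO connection to whatever database is in use. Whether strip_tags leaves one field per line depends on the page markup, so that step would need checking against the real HTML.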