scrape the data from webpage

Nagadurga
Forum Newbie
Posts: 4
Joined: Wed Jul 28, 2010 11:47 pm

Post by Nagadurga »

Hi friends,
I am new to PHP. I have been trying to scrape the information I need from web pages. I tried several approaches but couldn't get what I actually need. Could anyone advise how to modify the following code to get the output I need?
The code I tried:


<?php
class ContentExtractor {
 
	var $container_tags = array(
			'p','div'
		);
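	// Note: HeuristicRemove() compares these entries against $node->nodeName, which is
	// the bare tag name. Entries that carry attributes (e.g. 'div id="hd"', 'a href')
	// therefore never match anything.
	var $removed_tags = array(
			 'div class="resultancy claearfix"',
			 'div id="hd"','meta','link','title','script','a href','img','ul','li','form','input','label','strong','href',
			 'noscript','iframe','h2','head','ul','span class="iconkey"'
		);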
	var $ignore_len_tags = array(
			'span'
		);	
 
	var $link_text_ratio = 0.04;
	var $min_text_len = 20;
	var $min_words = 0;	
 
	var $total_links = 0;
	var $total_unlinked_words = 0;
	var $total_unlinked_text='';
	var $text_blocks = 0;
 
	var $tree = null;
	var $unremoved=array();
 
	function sanitize_text($text){
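		$text = str_ireplace('&nbsp;', ' ', $text);
		// Decode as UTF-8 explicitly; the default charset depends on the PHP version.
		$text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
 
		// UTF-8 encoded space variants. A lone "\xA0" byte is deliberately not listed:
		// replacing it would corrupt multi-byte characters such as "à" ("\xC3\xA0").
		$utf_spaces = array("\xC2\xA0", "\xE1\x9A\x80", "\xE2\x80\x83", 
			"\xE2\x80\x82", "\xE2\x80\x84", "\xE2\x80\xAF");
		$text = str_replace($utf_spaces, ' ', $text);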
 
		return trim($text);
	}
 
	function extract($text, $ratio = null, $min_len = null){
		$this->tree = new DOMDocument();
 
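		// Suppress libxml warnings about malformed HTML; bail out if parsing fails.
		if (!@$this->tree->loadHTML($text)) return false;
 
		$root = $this->tree->documentElement;
		// First pass: remove unwanted tags and, if a threshold must be derived, collect statistics.
		$this->HeuristicRemove($root, ( ($ratio == null) || ($min_len == null) ));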
 
		if ($ratio == null) {
			$this->total_unlinked_text = $this->sanitize_text($this->total_unlinked_text);
 
			$words = preg_split('/[\s\r\n\t\|?!.,]+/', $this->total_unlinked_text);
			$words = array_filter($words);
			#$words = strip_tags($words);
			$this->total_unlinked_words = count($words);
			unset($words);
			if ($this->total_unlinked_words>0) {
				$this->link_text_ratio = $this->total_links / $this->total_unlinked_words;// + 0.01;
				$this->link_text_ratio *= 1.3;
			}
 
		} else {
			$this->link_text_ratio = $ratio;
		};
 
		if ($min_len == null) {
			// Guard against division by zero when no text blocks were found.
			$this->min_text_len = ($this->text_blocks > 0)
				? strlen($this->total_unlinked_text) / $this->text_blocks
				: 0;
		} else {
			$this->min_text_len = $min_len;
		}
 
		// Second pass: drop containers that are link-heavy or too short.
		$this->ContainerRemove($root);
 
		return $this->tree->saveHTML();
	}
 
	function HeuristicRemove($node, $do_stats = false){
		if (in_array($node->nodeName, $this->removed_tags)){
			return true;
		};
 
		if ($do_stats) {
			if ($node->nodeName == 'a') {
				$this->total_links++;
			}
			$found_text = false;
		};
 
		$nodes_to_remove = array();
 
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				if ($this->HeuristicRemove($child, $do_stats)) {
					$nodes_to_remove[] = $child;
				} else if ( $do_stats && ($node->nodeName != 'a') && ($child->nodeName == '#text') ) {
					$this->total_unlinked_text .= $child->wholeText;
					if (!$found_text){
						$this->text_blocks++;
						$found_text=true;
					}
				};
			}
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
		}
 
		return false;
	}
 
	function ContainerRemove($node){
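		// Return the same array shape as the normal path so callers can index it safely.
		if (is_null($node)) return array(0, 0, 0, 'delete' => false);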
		$link_cnt = 0;
		$word_cnt = 0;
		$text_len = 0;
		$delete = false;
		$my_text = '';
 
		$ratio = 1;
 
		$nodes_to_remove = array();
		if ($node->hasChildNodes()){
			foreach($node->childNodes as $child){
				$data = $this->ContainerRemove($child);
 
				if ($data['delete']) {
					$nodes_to_remove[]=$child;
				} else {
					$text_len += $data[2];
				}
 
				$link_cnt += $data[0];
 
				if ($child->nodeName == 'a') {
					$link_cnt++;
				} else {
					if ($child->nodeName == '#text') $my_text .= $child->wholeText;
					$word_cnt += $data[1];
				}
			}
 
			foreach ($nodes_to_remove as $child){
				$node->removeChild($child);
			}
 
			$my_text = $this->sanitize_text($my_text);
 
			$words = preg_split('/[\s\r\n\t\|?!.,\[\]]+/', $my_text);
			$words = array_filter($words); 
			$word_cnt += count($words);
			$text_len += strlen($my_text);
 
		};
 
		if (in_array($node->nodeName, $this->container_tags)){
			if ($word_cnt>0) $ratio = $link_cnt/$word_cnt;
 
			if ($ratio > $this->link_text_ratio){
					$delete = true;
			}
 
			if ( !in_array($node->nodeName, $this->ignore_len_tags) ) {
				if ( ($text_len < $this->min_text_len) || ($word_cnt<$this->min_words) ) {
					$delete = true;
				}
			}
 
		}	
 
		return array($link_cnt, $word_cnt, $text_len, 'delete' => $delete);
	}
 
}
 

$html = file_get_contents('http://www.local.ch/en/q/bar.html');
 
$extractor = new ContentExtractor();
$content = $extractor->extract($html); 
echo $content;
?>
and I get this output:

0 Results for in the current map areain
The number of results indicates how many listings correspond to your search.
To view these listings, click the Search button. Do you have any questions or
suggestions? Or maybe even come across a problem? Please let us know:
info@local.ch
Results for Print
You can choose if you want to print the map on this page,
by using the options (such as "No Map") which appear directly
above the map to display or hide it.
New Search

The Yellow Pages > Bar, Restaurant
Bleu Lézard

rue Enning 10, 1003 Lausanne
resultentry_06.63771746.520077Bleu Lézard
Bleu Lézard

rue Enning 10, 1003 Lausanne

Tel.: * 021 321 38 30
tel/search

The Yellow Pages > Bar, Restaurant, Events
Nordportal

Schmiedestrasse 12, 5400 Baden
resultentry_18.30031447.481186Nordportal
Nordportal

Schmiedestrasse 12, 5400 Baden

Tel.: * 056 221 15 72
tel/search
ADN Bar Café

rue de Lausanne 59, 1202 Genève
resultentry_26.14646746.215079ADN Bar Café
ADN Bar Café

rue de Lausanne 59, 1202 Genève

Tel.: * 022 731 40 18
tel/search
Bar Abdelmajid

Könizstrasse 3, 3008 Bern
resultentry_37.42168146.944324Bar Abdelmajid
Bar Abdelmajid

Könizstrasse 3, 3008 Bern

Tel.: 031 381 42 60
tel/search

The Yellow Pages > Hotel, Bar, Restaurant
Hotel SEDARTIS

Bahnhofstrasse 16, 8800 Thalwil
resultentry_48.56592247.295528Hotel SEDARTIS
Hotel SEDARTIS

Bahnhofstrasse 16, 8800 Thalwil

Tel.: 043 388 33 00
tel/search

The Yellow Pages > Club, Discotheque, Bar
Liquid

Genfergasse 10, 3011 Bern
resultentry_57.44115946.949633Liquid
Liquid

Genfergasse 10, 3011 Bern

Tel.: * 031 951 98 26
tel/search
Bar Amalfi

Spezialitäten aus dem Süden

Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH
resultentry_68.78198147.368167Bar Amalfi
Bar Amalfi

Turmstrasse 7, Zentrum Frohwies, 8330 Pfäffikon ZH

Tel.: * 043 535 90 05
tel/search
Bar Benjamin (-Gera)

Im Allmendli 11, 8703 Erlenbach ZH
resultentry_78.59953547.301096Bar Benjamin (-Gera)
Bar Benjamin (-Gera)

Im Allmendli 11, 8703 Erlenbach ZH

Tel.: * 076 232 23 21
tel/search

The Yellow Pages > Restaurant, Bar
Bohemia

Klosbachstrasse 2, 8032 Zürich
resultentry_98.55496947.364845Bohemia
Bohemia

Klosbachstrasse 2, 8032 Zürich

Tel.: 044 383 70 60
tel/search

Help local.ch improve this page
© 2010 local.ch ag
© 2010 local.ch ag - Terms of use

But I only need the name, address and phone number, which should then be stored in a database. Could anyone please help me do this?
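One possible way to get from the output above to name, address and phone: every listing in that output ends with a "Tel.:" line, and (ignoring blank lines) the address and the name sit on the two lines just before it. The sketch below relies on exactly that layout, so it is only an illustration and would break if the page layout changes. The function name parse_listings() is made up for this example.

```php
<?php
// Hypothetical helper (not part of the posted class): turns the plain-text
// output shown above into name/address/phone records. It assumes every
// listing ends with a "Tel.:" line that is preceded, once blank lines are
// ignored, by the address line and then the name line.
function parse_listings($text) {
    // Keep only non-empty lines so each listing's lines sit next to each other.
    $lines = array();
    foreach (preg_split('/\r\n|\r|\n/', $text) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }

    $records = array();
    foreach ($lines as $i => $line) {
        // Phone lines look like "Tel.: * 021 321 38 30"; the "*" is optional.
        if ($i >= 2 && preg_match('/^Tel\.:\s*\*?\s*([0-9 ]+)$/', $line, $m)) {
            $records[] = array(
                'name'    => $lines[$i - 2],
                'address' => $lines[$i - 1],
                'phone'   => trim($m[1]),
            );
        }
    }
    return $records;
}
```

To apply it to the HTML returned by extract(), the tags would have to be removed first, for example with strip_tags($content), before the text is passed to parse_listings().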
requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: scrape the data from webpage

Post by requinix »

They have a place for developers. That means there is no good reason for you to be page scraping.

Look at their API instead.

Re: scrape the data from webpage

Post by Nagadurga »

Many thanks for your quick and useful reply. I have been trying this for the last week. I am working on a project in which I have to extract all the information from that website and store it in a database. How does that link help me? As I already said, I am very new to PHP. Please give me some brief advice to help me understand it.
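However the records end up being obtained, storing them in a database can be sketched with PDO and a prepared statement. Everything named here is an assumption for illustration: the function save_listings(), the table name listings, its three columns, and the in-memory SQLite database used so the example runs on its own. A MySQL DSN could be passed to PDO in the same way.

```php
<?php
// Hypothetical helper: stores name/address/phone records through PDO.
// Table and column names are made up for this sketch.
function save_listings(PDO $db, array $records) {
    $db->exec(
        'CREATE TABLE IF NOT EXISTS listings (
            name    VARCHAR(255) NOT NULL,
            address VARCHAR(255) NOT NULL,
            phone   VARCHAR(32)  NOT NULL
        )'
    );

    // A prepared statement keeps scraped text from being interpreted as SQL.
    $stmt = $db->prepare(
        'INSERT INTO listings (name, address, phone) VALUES (:name, :address, :phone)'
    );

    $saved = 0;
    foreach ($records as $record) {
        $stmt->execute(array(
            ':name'    => $record['name'],
            ':address' => $record['address'],
            ':phone'   => $record['phone'],
        ));
        $saved++;
    }
    return $saved;
}

// Usage with an in-memory SQLite database (assumes the pdo_sqlite extension).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$count = save_listings($db, array(
    array(
        'name'    => 'Hotel SEDARTIS',
        'address' => 'Bahnhofstrasse 16, 8800 Thalwil',
        'phone'   => '043 388 33 00',
    ),
));
```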

Re: scrape the data from webpage

Post by Nagadurga »

requinix wrote:
They have a place for developers. That means there is no good reason for you to be page scraping.

Look at their API instead.

I went through the link, but it only offers an XML file for cities. I need to extract the name, address and phone number. Please give me some other idea or suggestion.
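If parsing HTML does turn out to be necessary, one alternative to stripping tags heuristically is to query the parsed document with DOMXPath and read the wanted fields directly. The sketch below shows only the technique. The markup it expects (a div with class "entry" containing elements with classes "name", "address" and "phone") is invented for illustration, since the real page structure is not shown in this thread and would have to be read from the page source.

```php
<?php
// Hypothetical helper: text of the first node matching $query under $context.
function first_text(DOMXPath $xpath, $query, DOMNode $context) {
    $nodes = $xpath->query($query, $context);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : '';
}

// Hypothetical helper: the class names "entry", "name", "address" and
// "phone" are assumptions, not the real markup of any site.
function extract_listings($html) {
    $doc = new DOMDocument();
    // The XML prefix is a common way to hint UTF-8 input to libxml;
    // @ silences warnings about malformed HTML.
    if (!@$doc->loadHTML('<?xml encoding="UTF-8">' . $html)) {
        return array();
    }

    $xpath = new DOMXPath($doc);
    $records = array();
    foreach ($xpath->query('//div[@class="entry"]') as $entry) {
        $records[] = array(
            'name'    => first_text($xpath, './/*[@class="name"]', $entry),
            'address' => first_text($xpath, './/*[@class="address"]', $entry),
            'phone'   => first_text($xpath, './/*[@class="phone"]', $entry),
        );
    }
    return $records;
}
```

Compared with the ContentExtractor approach, this returns structured fields rather than cleaned-up HTML, at the cost of depending on the exact class names of the page.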