Link checking script, how can I reduce cpu usage?

Posted: Fri Sep 14, 2007 6:17 pm
by dotsc
Hi to all. I use the following code on a classifieds site. It checks for a backlink every 5 days and gives up after 3 tries if it doesn't find a link back to my site.

I run this code in the background using require 'adlinkcheck.php'; to load this script when someone is viewing an ad on the site.

Here is the issue:
This script will use 100% of the cpu, even at slow hours of the night/day.
When I comment this line out, my cpu load (on average) is 10%.

Code: Select all

<?
//filename: adlinkcheck.php

	$check_link=false;
	if(!($ad['website']=="" || $ad['link_try']>=3)) {
		$days=Ad::GetDaysLastCheckWebsite2($ad['link_checked_on']);
		if($ad['link_checked_on']=="0000-00-00 00:00:00" || $days>=5){
			$website=str_replace("http://", "", str_replace("www.", "", $ad['website']));
			$web_file=$website;
			if(strpos($website, "/")) $website=substr($website, 0, strpos($website, "/"));
			$host=gethostbyname($website);
			if($host!=$website){
				$html='';
				$f=@fopen("http://".(strpos($ad['website'], "www")!==false ? "www." : "").$web_file, "r");
				while(!@feof($f)){
					$html.=@fgets($f);
				}
				@fclose($f);
				$html=strtolower($html);
				$html_arr=explode("<a ", $html);
				for($i=0; $i<count($html_arr); $i++){
					$v=$html_arr[$i];
					$v=substr($v, 0, strpos($v, ">"));
					$arr_v=explode(" ", $v);
					foreach ($arr_v as $vv){
						$vv=str_replace("\"", "",  str_replace("'", "", $vv));
						if(strpos($vv, "href=")===false) continue;
						foreach ($config['self_url'] as $url){
							if(strpos($vv, $url)!==false) $check_link=true;
						}
					}
				}
				$ad['link_checked_on']=date('Y-m-d H:i:s');
				if ($check_link){
					$ad['link_try']=0;
					$ad['link_active']=1;
				} else {
					$ad['link_try'] = (int)$ad['link_try']+1;
					$ad['link_active']=0;
				}
				Ad::SaveLinkData($ad['link_checked_on'], $ad['link_try'], $ad['link_active'], $ad['id']);
			}
		}
	}
?>
The server stats:
Linux, Dual xeon 2.8, 3gb ram
PHP 4.4.4, MySQL 4.1.22

I ran out of ideas, anyone wanna help out?

Posted: Fri Sep 14, 2007 9:00 pm
by s.dot
The only thing I can think of is to use file_get_contents() rather than looping through the file and concatenating the HTML. It supports memory mapping techniques and gets everything in one shot instead of going through a loop. It also helps to unset() variables when you're done with them.
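
Something along these lines is roughly what that swap could look like. This is only a sketch: fetch_page() and the 5-second default are made-up placeholders, not part of the posted script. It leans on the default_socket_timeout ini setting so a slow remote host cannot hold the request open indefinitely.

```php
<?php
// Hypothetical helper: fetch_page() and its default timeout are illustrative
// assumptions, not taken from the original script.
function fetch_page($url, $timeout = 5)
{
    // default_socket_timeout applies to socket-based streams such as http://
    $old = ini_set('default_socket_timeout', (string)$timeout);

    // One call replaces the fopen()/feof()/fgets() loop.
    $html = @file_get_contents($url);

    // Restore the previous setting so the rest of the page is unaffected.
    if ($old !== false) {
        ini_set('default_socket_timeout', $old);
    }

    // file_get_contents() returns false on failure; normalise to an empty string.
    return ($html === false) ? '' : $html;
}
```

In the posted script, the result of a call like this would take the place of the $html string that the while loop builds up.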

You may also wish to check if you're getting thrown into an infinite loop somewhere in there.
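
One candidate, going only by the posted code: if fopen() fails, $f is false, and on PHP versions of this era feof() on an invalid handle never returns true, so while(!@feof($f)) can spin forever. A guarded version of the read loop might look like the sketch below; read_all() and the size cap are hypothetical additions, not something from the original script.

```php
<?php
// Hypothetical helper: read_all() and $max_bytes are illustrative assumptions.
function read_all($path, $max_bytes = 1048576)
{
    $f = @fopen($path, 'r');
    if ($f === false) {
        // Bail out early: looping on feof() with an invalid handle is the
        // classic way to end up in an endless loop.
        return '';
    }

    $data = '';
    // Stop at end of file, or once roughly $max_bytes have been collected.
    while (!feof($f) && strlen($data) < $max_bytes) {
        $chunk = fread($f, 8192);
        if ($chunk === false) {
            break; // read error: stop instead of looping
        }
        $data .= $chunk;
    }
    fclose($f);

    return $data;
}
```

The cap is there so that one very large page cannot keep the loop busy; the value shown is arbitrary.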

Posted: Fri Sep 14, 2007 10:27 pm
by John Cartwright
I would probably use the cURL library to fetch the contents of the site, simply because you can specify the timeout. Secondly, adding a sleep() call for a second or so will allow the processor to catch its breath.

Posted: Mon Sep 17, 2007 6:26 pm
by ReDucTor
There are many ways you can optimize that code.

1. You only need to track the day it was checked on, correct?
So first things first, store only the day/date, then do something similar to:

Code: Select all

SELECT * FROM table WHERE date<='2007-09-18'
2. Have you tried using a regex and comparing its speed against your algorithm?

Several things in your code can be optimized, simply because you're calling things multiple times.

Code: Select all

if($pos=strpos($website,"/")) $website=substr($website,0,$pos);

Code: Select all

$host=gethostbyname($website);
if($host!=$website){
So if the web address is an IP address? Make this part of your select query.

Code: Select all

$vv=str_replace(array('"','\''),'',$vv);

Code: Select all

$ad['link_checked_on']=date('Y-m-d H:i:s');
Make this part of your SQL query: UPDATE table SET checked_on = NOW()

Ever thought of this?

Code: Select all

$links = stripos($html,'href="'.$url.'"') + stripos($html,'href=\''.$url.'\'') + stripos($html,'href='.$url);
Incrementing is better:

Code: Select all

$ad['link_try']++;

And by god I hope Ad::SaveLinkData uses transactions? Otherwise this is another big user of CPU.
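
For the regex idea in point 2, a rough sketch could look like the following. The name has_backlink() and its parameters are made up for illustration, and the pattern is a deliberately simple one that will not cope with every kind of markup. It avoids stripos() so that it should also run on PHP 4.

```php
<?php
// Hypothetical helper: has_backlink() is an illustrative name. $self_urls is
// assumed to be a list of strings like the $config['self_url'] array.
function has_backlink($html, $self_urls)
{
    // Capture the href value of each <a> tag, whether it is wrapped in
    // double quotes, single quotes, or not quoted at all.
    $pattern = '/<a\s[^>]*?href\s*=\s*["\']?([^"\'\s>]+)/i';

    if (!preg_match_all($pattern, $html, $m)) {
        return false; // no links found (or the pattern failed)
    }

    foreach ($m[1] as $href) {
        $href = strtolower($href);
        foreach ($self_urls as $url) {
            // Case-insensitive substring match, stopping at the first hit.
            if (strpos($href, strtolower($url)) !== false) {
                return true;
            }
        }
    }

    return false;
}
```

Returning at the first match skips the rest of the page, which the nested explode() loops in the posted code do not do. Whether this is actually faster would need timing on real pages.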

Posted: Mon Sep 17, 2007 11:56 pm
by dotsc
I've switched the code to cURL and now there are no more high loads on the CPU.
Removed:

Code: Select all

$f=@fopen("http://".(strpos($ad['website'], "www")!==false ? "www." : "").$web_file, "r");
				while(!@feof($f)){
					$html.=@fgets($f);
				}
				@fclose($f);
Changed to:

Code: Select all

$ch = curl_init();
				curl_setopt($ch, CURLOPT_URL, $ad['website']);
				curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
				curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
				curl_setopt($ch, CURLOPT_TIMEOUT, ;
				$html = curl_exec($ch);
Big thanks to everyone that helped out!