Link checking script, how can I reduce cpu usage?

dotsc
Forum Newbie
Posts: 14
Joined: Sun Jan 29, 2006 2:55 pm


Post by dotsc »

Hi to all. I use the following code on a classifieds site. It checks for a backlink every 5 days and gives up after 3 tries if it doesn't find a link back to my site.

I run this code in the background using require 'adlinkcheck.php'; to load this script when someone is viewing an ad on the site.

Here is the issue:
This script will use 100% of the cpu, even at slow hours of the night/day.
When I comment this line out, my cpu load (on average) is 10%.

Code: Select all

<?
//filename: adlinkcheck.php

	$check_link=false;
	if(!($ad['website']=="" || $ad['link_try']>=3)) {
		$days=Ad::GetDaysLastCheckWebsite2($ad['link_checked_on']);
		if($ad['link_checked_on']=="0000-00-00 00:00:00" || $days>=5){
			$website=str_replace("http://", "", str_replace("www.", "", $ad['website']));
			$web_file=$website;
			if(strpos($website, "/")) $website=substr($website, 0, strpos($website, "/"));
			$host=gethostbyname($website);
			if($host!=$website){
				$html='';
				$f=@fopen("http://".(strpos($ad['website'], "www")!==false ? "www." : "").$web_file, "r");
				while(!@feof($f)){
					$html.=@fgets($f);
				}
				@fclose($f);
				$html=strtolower($html);
				$html_arr=explode("<a ", $html);
				for($i=0; $i<count($html_arr); $i++){
					$v=$html_arr[$i];
					$v=substr($v, 0, strpos($v, ">"));
					$arr_v=explode(" ", $v);
					foreach ($arr_v as $vv){
						$vv=str_replace("\"", "",  str_replace("'", "", $vv));
						if(strpos($vv, "href=")===false) continue;
						foreach ($config['self_url'] as $url){
							if(strpos($vv, $url)!==false) $check_link=true;
						}
					}
				}
				$ad['link_checked_on']=date('Y-m-d H:i:s');
				if ($check_link){
					$ad['link_try']=0;
					$ad['link_active']=1;
				} else {
					$ad['link_try'] = (int)$ad['link_try']+1;
					$ad['link_active']=0;
				}
				Ad::SaveLinkData($ad['link_checked_on'], $ad['link_try'], $ad['link_active'], $ad['id']);
			}
		}
	}
?>
The server stats:
Linux, Dual xeon 2.8, 3gb ram
PHP 4.4.4, MySQL 4.1.22

I ran out of ideas, anyone wanna help out?
s.dot
Tranquility In Moderation
Posts: 5001
Joined: Sun Feb 06, 2005 7:18 pm
Location: Indiana

Post by s.dot »

The only things I can think of are to use file_get_contents() rather than looping through the file and concatenating the HTML (it supports memory mapping techniques and gets it all in one shot instead of going through a loop), and to unset() variables when you're done with them.

You may also wish to check if you're getting thrown into an infinite loop somewhere in there.
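
A minimal sketch of that infinite-loop check, assuming the failing case is an fopen() that could not open the URL. On the PHP versions of that era, feof() on an invalid handle only raises a warning (suppressed here by @) and never returns true, so the while loop in the original script would spin until the script times out. The helper name read_all() is hypothetical and not part of the posted code.

```php
<?php
// Hypothetical helper (not from the thread): read a whole stream into a
// string, or return false if the stream cannot be opened.
function read_all($url)
{
    $f = @fopen($url, 'r');
    if ($f === false) {
        // Without this check the loop below would be handed an invalid
        // handle, feof() would never report end-of-file, and the loop
        // would keep running at full CPU.
        return false;
    }

    $html = '';
    while (!feof($f)) {
        $chunk = fgets($f, 8192);
        if ($chunk === false) {
            break; // read error or nothing left: stop instead of looping
        }
        $html .= $chunk;
    }
    fclose($f);

    return $html;
}
```

In the original script this would replace the fopen/feof/fgets block, with the backlink check skipped whenever the function returns false.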
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

I would probably use the cURL library to fetch the contents of the site, simply because you can specify the timeout. Secondly, adding a sleep() command for a second or so will allow the processor to catch its breath.
ReDucTor
Forum Commoner
Posts: 90
Joined: Thu Aug 15, 2002 6:13 am

Post by ReDucTor »

There are many ways you can optimize that code.

1. You only need to track the day it was checked on, correct?
So first things first, store only the day/date, then do something similar to:

Code: Select all

SELECT * FROM table WHERE date<='2007-09-18'
2. Have you tried using a regex and comparing how fast it is against your algorithm?

Several things in your code can be optimized, simply because you're calling things multiple times.

Code: Select all

if($pos=strpos($website,"/")) $website=substr($website,0,$pos);

Code: Select all

$host=gethostbyname($website);
if($host!=$website){
So if the web address is an IP address? Make this part of your select query.

Code: Select all

$vv=str_replace(array('"','\''),'',$vv);

Code: Select all

$ad['link_checked_on']=date('Y-m-d H:i:s');
Make this part of your SQL query: UPDATE table SET checked_on = CURDATE()

Ever thought of this?

Code: Select all

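// $html is already lowercased, so strpos() is enough (stripos() needs PHP 5); compare against false, since offset 0 is a valid match
$links = strpos($html,'href="'.$url.'"')!==false || strpos($html,'href=\''.$url.'\'')!==false || strpos($html,'href='.$url)!==false;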
Incrementing is better

Code: Select all

$ad['link_try']++;

And by god I hope Ad::SaveLinkData uses transactions? Otherwise this is another big user of CPU.
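
A rough sketch of the regex approach from point 2, for comparison against the explode()-based loop. It assumes the goal is simply "does any href on the page contain one of my own URLs"; the function name has_backlink() and the pattern are illustrative, not taken from the original script, and each entry in the URL list is assumed to be a non-empty string.

```php
<?php
// Hypothetical helper: true if any <a ... href=...> in $html contains one
// of the strings in $self_urls (compared case-insensitively).
function has_backlink($html, $self_urls)
{
    // Capture the href value whether it is wrapped in double quotes,
    // single quotes, or not quoted at all.
    $pattern = '/<a\s(?:[^>]*?\s)?href\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))/i';

    if (!preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        return false; // no links on the page (or the pattern failed)
    }

    foreach ($matches as $m) {
        // At most one of the three capture groups is non-empty per match,
        // so joining them yields the href value.
        $href = strtolower(implode('', array_slice($m, 1)));
        foreach ($self_urls as $url) {
            if (strpos($href, strtolower($url)) !== false) {
                return true; // stop at the first hit
            }
        }
    }

    return false;
}
```

With the variables from the original script this would be called as $check_link = has_backlink($html, $config['self_url']);. Whether it is actually faster than the nested loops is something to measure rather than assume.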
dotsc
Forum Newbie
Posts: 14
Joined: Sun Jan 29, 2006 2:55 pm

Post by dotsc »

I've switched the code to cURL and now there are no more high loads on the CPU.
Removed:

Code: Select all

$f=@fopen("http://".(strpos($ad['website'], "www")!==false ? "www." : "").$web_file, "r");
while(!@feof($f)){
	$html.=@fgets($f);
}
@fclose($f);
Changed to:

Code: Select all

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $ad['website']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
// The timeout value was lost in the forum rendering; 8 seconds is an assumption.
curl_setopt($ch, CURLOPT_TIMEOUT, 8);
$html = curl_exec($ch);
Big thanks to everyone that helped out!
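
For anyone landing here later, a sketch of how the cURL version could be wrapped so that a failed fetch is easy to detect. The function name fetch_html() is hypothetical, and the 8-second total timeout is an assumed value, not one confirmed by the thread; the 4-second connect timeout matches the code above.

```php
<?php
// Hypothetical wrapper: returns the response body as a string, or false
// if the cURL extension is missing or the request fails or times out.
function fetch_html($url, $connect_timeout = 4, $total_timeout = 8)
{
    if (!function_exists('curl_init')) {
        return false; // cURL extension not loaded
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);               // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connect_timeout);   // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $total_timeout);            // seconds allowed for the whole request

    // With CURLOPT_RETURNTRANSFER set, curl_exec() returns the body on
    // success and false on failure, so the result can be passed straight
    // back to the caller. The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

The caller then checks for false before parsing, for example if (($html = fetch_html($ad['website'])) !== false) { ... }, so an unreachable site counts as a failed try rather than stalling the page.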