Modded PHP Crawler Help
Posted: Fri May 04, 2007 1:30 pm
(I hope this didn't just double post; it logged me out before I sent it the first time...)
Hi, everyone. I have been trying to pull random pages from certain sites onto my website.
I'm new to PHP (just started 4 days ago), so I started small.
What I did was take "PHP Crawler" (available on SourceForge) and mod the code so that all of the information is entered via forms.
Specifically, the user (me) navigates to a web directory on my site and fills in the "index.html" form, which POSTs the data to "crawler.php". The original "crawler.php" used 3 external files: "_config.php", "_crawler.php" (note the "_"), and "_db.php". In the original version, the user edited "_config.php" and ran "crawler.php" directly, either through the web server or from the command line.
I am using GoDaddy Linux hosting, and I have a Windows box.
I didn't know how to reach the Linux command line from my computer, so I set up the form approach by copy/pasting the contents of "_config.php" into the top of "crawler.php" and changing the user-defined variables I wanted to control to "$_POST["name_of_field"]".
After a little tinkering, I was able to add links crawled from the "Entry URL" (supplied by the user) to the database.
Before I go any further, the purpose of my site is to provide free computer-based resources such as scripts, games, open-source applications, help and tutorials. The site is in its VERY early stages, so I wanted to borrow content from other sites until I can refine it for myself.
The first site I used was "http://dynamicdrive.com". It crawled over 600 pages, if I'm not mistaken.
I used 2 or 3 more sites successfully, as well.
But after the 3rd or 4th site, something went wrong: it started re-adding pages that were already in the database. And when I let it finish and tried again, it did it AGAIN!
I'm going to post the source of all the files necessary; I'm hoping someone in the community can help me.
I'm not sure, but I THINK the problem is that it is re-using IDs in some way, but I can't figure out how to make it use a new ID every time.
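One guess I had while poking at this (not something from the original code, just an idea): maybe the same page is getting queued under slightly different URLs, like with and without a trailing slash, so the duplicate check against the url column never matches. A little standalone function I sketched to test that idea:

```php
<?php
// Hypothetical helper, NOT part of PHP Crawler: collapse equivalent
// spellings of the same URL so lookups against the `url` column match.
function normalize_url($url)
{
    $url = trim($url);
    // lowercase the scheme and host, leave the path alone
    if (preg_match('/^(http:\/\/[^\/]+)(.*)$/i', $url, $m)) {
        $url = strtolower($m[1]) . $m[2];
    }
    // drop any #fragment, then a trailing slash
    $url = preg_replace('/#.*$/', '', $url);
    $url = rtrim($url, '/');
    return $url;
}
?>
```

If every URL went through something like this before the SELECT ... WHERE url = %s check in add_url_to_DB(), then "http://Example.com/page/" and "http://example.com/page" would count as the same row instead of two.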
Here are the files:
"index.html":
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>PHP Crawler - Modified by JT Preston</title>
</head>
<body>
<form action="crawler.php" method="post">
SITE ENTRY POINT (make sure to include "http://") : <input type="text" name="site" /><br /><br />
Host Name: <input type="text" name="host" /><br />
Database Name: <input type="text" name="database" /><br />
Username: <input type="text" name="username" /><br />
Password: <input type="password" name="password" /><br /><br />
<input type="submit" />
</form>
</body>
</html>

"crawler.php":
Code:
<?php
/*-
* Copyright (c) 2005-2006 Vladimir Fedorkov (http://astellar.com/)
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
This code was modified by JT Preston, in order to use a submission form to add URLs.
Enjoy.
*/
//Config:
if (empty($GLOBALS["www_has_crawl_config"])) {
// We both know about require_once(); I just keep the old style.
$GLOBALS["www_has_crawl_config"] = 1;
// *** MySQL database config. Please change these lines according your host
$mysql_host = $_POST["host"];
$mysql_db = $_POST["database"];
$mysql_user = $_POST["username"];
$mysql_pass = $_POST["password"];
$CRAWL_ENTRY_POINT_URL = $_POST["site"];
$CRAWL_MAX_DEPTH = 3;
$CRAWL_LOCALE = "en_US"; // read more about setlocale() at http://php.rinet.ru/manual/en/function.setlocale.php
//$CRAWL_LOCALE = "ru_RU";
$CRAWL_PAGE_EXPIRE_DAYS = 10; // Page reindex period
// **** MISC SETTINGS ****
// disable keys while crawling (might save some time)
$CRAWL_DB_DISABLE_KEYS = false;
// skip crawling long URLs
$CRAWL_URL_MAX_LEN = 1024;
// index only the first CONFIG_URL_MAX_CONTENT bytes of page content
$CRAWL_URL_MAX_CONTENT = 150 * 1024;
// HACK: cooldown time after each HTTP request.
$CRAWL_THREAD_SLEEP_TIME = 100000; //mk_sec
// **** SEARCH CONFIG ****
$CRAWL_SEARCH_TEXT_SURROUNDING_LENGHT = 70; //chars
$CRAWL_SEARCH_MAX_RES_WORD_COUNT = 2; // larger value produces larger search page
// *** INIT ****
setlocale (LC_ALL, $CRAWL_LOCALE);
}
//End Config.
require_once('_db.php');
require_once('_crawler.php');
set_time_limit (0);
error_reporting (E_ERROR | E_WARNING | E_PARSE);
$crawl_max_shown_depth = $CRAWL_MAX_DEPTH - 1;
print "PHP-Crawler started...<br>\n";
print "Log format: \"Crawling: [Current depth ({$crawl_max_shown_depth} MAX)] URL Action\"<br>\n";
if ($CRAWL_DB_DISABLE_KEYS) sql_query("/*!40000 ALTER TABLE `phpcrawler_links` DISABLE KEYS */;");
add_head_link(1, $CRAWL_ENTRY_POINT_URL);
mark_old_URLs_to_crawl();
$url_counter = 0;
$url_size = 0;
while($URL_info = get_URL_to_crawl())
{
// Cooldown
usleep ($CRAWL_THREAD_SLEEP_TIME);
$url_counter++;
$URL = $URL_info["url"];
$site_URL = $CRAWL_ENTRY_POINT_URL;
//$site_URL = $URL_info["site_url"];
//$current_URL = preg_replace("/\/[^\/]+$/i", "", $URL_info["url"]);
$current_URL = preg_replace("/([^\/])\/[^\/]+$/i", "\\1", $URL_info["url"]);
//print(" base: " . $current_URL . " ");
print "Crawling: [" . $URL_info["depth"] . "] {$URL}";
$page = fetch_URL($URL);
if ($page === false)
{
drop_url_from_db($URL_info["id"]);
print " - FAILED/REMOVED.<br>\n";
continue;
}
$page_size = strlen($page);
$url_size += $page_size;
print " " . ($page_size / 1000) . "Kb";
$page_content = prepare_page($page);
$page_content_md5 = md5($page_content);
if($page_counter = check_equals($page_content_md5))
{
unset_url_from_db($URL_info["id"]);
print " - SKIPPED ({$page_counter} equals).<br>\n";
continue;
}
$URLs_draft = get_URLS_from_page($page, $URL_info["depth"] + 1); //array
$page_title = get_page_title($page);
$URLs_clean = filter_URLs($URLs_draft, $site_URL, $current_URL); //$base_URL, $current_URL
$URLs_to_crawl = add_URLs_to_crawl($URL_info["site_id"], $URLs_clean, $URL_info["depth"] + 1);
print " +" . $URLs_to_crawl . " urls.<br>\n";
send_page_to_db($URL_info["id"], $page_title, $page_content, $page_content_md5);
}
if ($CRAWL_DB_DISABLE_KEYS) sql_query("/*!40000 ALTER TABLE `phpcrawler_links` ENABLE KEYS */;");
print $url_counter . " pages crawled, " . ($url_size/1000) . "Kb processed.<br>\n";
?>

"_crawler.php":
Code:
<?php
/*-
* Copyright (c) 2005-2006 Vladimir Fedorkov (http://astellar.com/)
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/
if(empty($GLOBALS["www_has_crawler"]))
{
if (empty($GLOBALS["www_has_crawl_config"])) die("Stop. Crawler has no config. Please include _config.php first.");
// ***** CRAWLER ******
$GLOBALS["www_has_crawler"] = 1;
function mark_old_URLs_to_crawl()
{
global $CRAWL_PAGE_EXPIRE_DAYS;
sql_query("UPDATE phpcrawler_links SET crawl_now = 1 WHERE TO_DAYS(NOW()) - TO_DAYS(last_crawled) > %d", $CRAWL_PAGE_EXPIRE_DAYS);
}
// Fetch ONE url to crawl
function get_URL_to_crawl()
{
global $CRAWL_MAX_DEPTH;
$url = sql_fetch_hash("SELECT l.* FROM phpcrawler_links l WHERE l.crawl_now = 1 and l.depth < %d and l.url != '' LIMIT 1", $CRAWL_MAX_DEPTH);
return $url;
}
function add_head_link($site_id, $page_URL)
{
add_url_to_DB($site_id, $page_URL, 0);
}
// *** ADD TO DB
function add_url_to_DB($site_id, $URL, $depth)
{
//var_dump($URL);
// FIXME!!! add depth verification!!!
$link_data = sql_fetch_hash("SELECT id, url, depth, last_crawled FROM phpcrawler_links WHERE url = %s", $URL);
if (empty($link_data["id"]))
{
sql_query("INSERT INTO phpcrawler_links (site_id, url, depth, last_crawled) VALUES (%d, %s, %d, NOW())", $site_id, $URL, $depth);
return 1;
} else if ($link_data["depth"] > $depth) {
sql_query("UPDATE phpcrawler_links SET depth = %d WHERE id = %d", $depth, $link_data["id"]);
}
return 0;
}
function add_URLs_to_crawl($site_id, $URLs_clean, $depth)
{
$counter = 0;
foreach($URLs_clean as $id => $URL)
{
$counter += add_url_to_DB($site_id, $URL, $depth);
}
return $counter;
}
function drop_url_from_db($link_id)
{
sql_query("DELETE FROM phpcrawler_links WHERE id = %d", $link_id);
}
function unset_url_from_db($link_id)
{
sql_query("UPDATE phpcrawler_links SET last_crawled = NOW(), crawl_now = 2 WHERE id = %d", $link_id);
}
function fetch_URL($URL)
{
$handle = @fopen ($URL, "r");
if ($handle === false) return false;
$buffer = "";
while (!feof ($handle)) {
$buffer .= fgets($handle, 4096);
}
fclose ($handle);
return $buffer;
}
function get_URLS_from_page($page, $depth = 0)
{
global $CRAWL_MAX_DEPTH;
if ($depth >= $CRAWL_MAX_DEPTH) return array();
$matches = array();
$URL_pattern = "/\s+href\s*=\s*[\"\']?([^\s\"\']+)[\"\'\s]+/ims";
preg_match_all ($URL_pattern, $page, $matches, PREG_PATTERN_ORDER);
return $matches[1];
}
function make_full_qualified_URL($URL_draft, $base_URL, $current_URL)
{
global $CRAWL_URL_MAX_LEN;
//$URL_draft = trim($URL_draft);
if (strlen ($URL_draft) > $CRAWL_URL_MAX_LEN) return false;
if (strpos ($URL_draft, "://") != 0 && substr($URL_draft, 0, 7) != "http://") return false; // reject non-http schemes (loose != treats "not found" (false) as 0)
// make full qualified URL
if (substr($URL_draft, 0, 1) != "/" && substr($URL_draft, 0, 7) != "http://") $URL_draft = $current_URL . "/" . $URL_draft;
if (substr($URL_draft, 0, 7) != "http://") $URL_draft = $base_URL . "/" . $URL_draft;
$URL_draft = str_replace("/./", "/", $URL_draft);
$URL_draft = preg_replace("/\/[\/]+/i", "/", $URL_draft);
$URL_draft = str_replace("http:/", "http://", $URL_draft);
$URL_draft = str_replace("&amp;", "&", $URL_draft); // decode HTML-escaped ampersands in href values
// DROP session ID
$URL_draft = preg_replace("/sid=[\w\d]+/i", "", $URL_draft);
return $URL_draft;
}
function filter_URLs($URLs_draft, $base_URL, $current_URL)
{
$URLs_clean = array();
$counter = 0;
foreach($URLs_draft as $id => $URL)
{
//vds($URL);
$URL = make_full_qualified_URL($URL, $base_URL, $current_URL);
if ($URL === false || strpos ($URL, $base_URL) !== 0) continue;
$URLs_clean[$counter++] = $URL;
}
return $URLs_clean;
}
function get_page_title($page)
{
preg_match("/<title>(.*)<\/title>/imsU", $page, $matches);
return $matches[1];
}
function prepare_page($content)
{
$content = preg_replace("/<script(.*)<\/script>/imsU", "", $content);
$content = preg_replace("/<!--(.*)-->/imsU", "", $content);
//TEST: added 0.7.7: remove useless spaces
$content = preg_replace("/[\s]+/ims", " ", $content);
$content = preg_replace("/<\/?(.*)>/imsU", "", $content);
return $content;
}
function check_equals($page_content_md5)
{
$page_counter = sql_fetch("SELECT count(*) as cnt FROM phpcrawler_links WHERE content_md5 = %s", $page_content_md5);
return $page_counter;
}
function send_page_to_db($link_id, $page_title, $page_content, $page_content_md5)
{
global $CRAWL_URL_MAX_CONTENT;
if (strlen($page_content) > $CRAWL_URL_MAX_CONTENT) $page_content = substr($page_content, 0, $CRAWL_URL_MAX_CONTENT);
//sql_query("UPDATE phpcrawler_links SET content = %s, content_md5 = %s, last_crawled = NOW(), crawl_now = 2 WHERE id = %d", $page_content, $page_content_md5, $link_id);
sql_query("UPDATE phpcrawler_links SET content = %s, content_md5 = %s, url_title = %s, last_crawled = NOW(), crawl_now = 2 WHERE id = %d", $page_content, $page_content_md5, $page_title, $link_id);
}
function vds($var)
{
print "<!--";
var_dump($var);
print "-->";
}
ob_end_flush();
sql_open();
}
?>

"_db.php":
Code:
<?php
// *** SQL WRAPPER - MYSQL ***
if (empty($GLOBALS["www_has_db"]))
{
$GLOBALS["www_has_db"] = 1;
function sql_escape($arg)
{
return addslashes($arg); // NOTE: addslashes() isn't charset-aware; mysql_real_escape_string() would be safer once a connection is open
}
function sql_open()
{
global $mysql_host, $mysql_db, $mysql_user, $mysql_pass,
$M_SYS_SQL_SERVER, $M_SYS_SQL_DB, $M_SYS_REASON;
if (!@mysql_connect($mysql_host, $mysql_user, $mysql_pass)) {
$msg = mysql_error();
die("Cannot connect to database server (Reason: $msg)");
}
if (!@mysql_select_db($mysql_db)) {
$msg = mysql_error();
die("Cannot select db (Reason: $msg)");
}
return true;
}
function sql_exec_va($args)
{
global $sql_query;
$query = $args[0];
$i = 1;
$n = count($args);
$a = explode("%", $query);
$r = "";
if (!empty($a)) foreach ($a as $p) {
$c = $p[0];
if ($c != "s" && $c != "u" && $c != "d" && $c != "f") {
$r .= "%";
if ($c == "P") $p = substr($p, 1);
$r .= $p;
continue;
}
if ($i >= $n) die("FATAL: not enough arguments to SQL query ($query)");
$arg = $args[$i++];
switch ($c) {
case "s": $r .= "'" . sql_escape($arg) . "'"; break;
case "u": $r .= $arg; break;
case "d": $r .= (int)$arg; break;
case "f": $r .= (float)$arg; break;
}
$r .= substr($p, 1);
}
$query = substr($r, 1);
$sql_query = $query;
return @mysql_query($query);
}
function sql_query_va($args)
{
global $sql_query;
if (!($r = sql_exec_va($args))) {
$msg = mysql_error();
die("Query failed (query: $sql_query, reason: $msg)");
}
return $r;
}
function sql_query($query)
{
$args = func_get_args();
return sql_query_va($args);
}
function sql_exec($query)
{
$args = func_get_args();
return sql_exec_va($args);
}
function sql_row($result)
{
return mysql_fetch_row($result);
}
function sql_rows($result)
{
return mysql_num_rows($result);
}
function sql_fetch($query)
{
$args = func_get_args();
$r = sql_query_va($args);
$a = sql_row($r);
return $a[0];
}
function sql_row_hash($result)
{
return mysql_fetch_array($result);
}
function sql_fetch_hash($query)
{
$args = func_get_args();
$r = sql_query_va($args);
return sql_row_hash($r);
}
function sql_insert($query)
{
$args = func_get_args();
sql_query_va($args);
return sql_insert_id();
}
function sql_insert_id()
{
return mysql_insert_id();
}
function sql_free($r)
{
return mysql_free_result($r);
}
}
?>

I hope someone can review this code and fix it, or at least point out what's wrong.
All help is appreciated!
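Edit: one more idea I had, totally untested, so take it with a grain of salt. If the duplicate rows really do have identical url values, a UNIQUE key on the url column would make MySQL refuse the duplicate inserts outright. Something like the one-off script below, run once against the database. The table and column names come from the stock PHP Crawler schema; the 255-character prefix is my own guess, since MySQL key-length limits won't allow indexing the full 1024 characters I permit via $CRAWL_URL_MAX_LEN.

```php
<?php
// One-off maintenance sketch (mine, NOT part of PHP Crawler).
// Relies on the same $mysql_* settings and wrapper as crawler.php,
// so it must run on the server next to _db.php.
require_once('_db.php');
sql_open();

// Unique index on the first 255 chars of url. After this, an INSERT
// of an existing url fails instead of creating a duplicate row, and
// the SELECT-before-INSERT check in add_url_to_DB() becomes a backup
// rather than the only line of defense.
sql_query("ALTER TABLE phpcrawler_links ADD UNIQUE KEY uniq_url (url(255))");

print "Unique key added.\n";
?>
```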