Hi there
I'm in the midst of creating a crawler. I have completed all the functionality, except that I am having difficulty coming up with a function to do the actual deeper crawling based on a particular criterion rather than on depth.
Currently the crawler gets the names and links of the categories I am crawling and puts them into an array (category => link). The depth of the categories is unknown, but the criterion for continuing to crawl from each page is known.
For instance: keep crawling deeper URLs if (strpos(getsource($url), 'class="subcategories"') !== false).
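A note on that check: strpos() returns the integer offset of the match (which can be 0) or false, and never true, so the test has to compare against false. A minimal sketch, assuming a hypothetical helper named hasSubcategories() that is not part of the crawler in this thread:

```php
<?php
// Hypothetical helper: reports whether a page source contains the marker
// that means "there are deeper categories to crawl".
function hasSubcategories($source)
{
    // strpos() returns the match offset (possibly 0) or false; it never returns true.
    return strpos($source, 'class="subcategories"') !== false;
}

var_dump(hasSubcategories('<ul class="subcategories"></ul>')); // bool(true)
var_dump(hasSubcategories('<p>No more levels</p>'));           // bool(false)
var_dump(hasSubcategories('class="subcategories"'));           // bool(true), even though the offset is 0
```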
My result will hopefully go into the database like this.
Parent ~ Category       ~ ID
       ~ Category 1     ~ 1
       ~ Category 2     ~ 2
       ~ Category 3     ~ 3
       ~ Category 4     ~ 4
1      ~ Category 1.1   ~ 5
5      ~ Category 1.1.1 ~ 6
2      ~ Category 2.1   ~ 7
7      ~ Category 2.1.1 ~ 8
This function is boggling my mind. It will crawl the first page, but the problem is looping back, saving, crawling deeper, and then recording a parent ID for each parent category, and so on and so forth.
Thanks for your help!
Crawler Help
- phdatabase
- Forum Commoner
- Posts: 83
- Joined: Fri May 28, 2010 10:02 am
- Location: Fort Myers, FL
Re: Crawler Help
So I assume you are running a recursive function for page reads with a global links array?
Re: Crawler Help
phdatabase wrote: So I assume you are running a recursive function for page reads with a global links array?
I believe that is the problem. I can't get my head around the recursion, or re-looping every time it hits a new page.
How would you handle the global links array?
- phdatabase
Re: Crawler Help
OK, here is a dead easy way to write a recursive function. Write a function to do what you want, but instead of actually doing it, just echo what needs to happen to the screen. When you are all done and it works like you planned, replace the echo call with a call to the function. Don't think about all those layers; just get the first one to work and they will all work later. So forget about everything else and write the function. When your testing says it's time to loop, echo "Loop HERE" and continue. You're not done until you echo out everything you want from each level. Then worry about making it recursive.
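As a rough illustration of that echo-first step, here is a minimal sketch. The walkLevel() function, the $pages array standing in for fetched page sources, and the sample links are all made up for the example; the real crawler would call sourceget() instead.

```php
<?php
// Hypothetical first page of categories (name => link).
$categories = array('Category 1' => '/c1', 'Category 2' => '/c2');

// Hypothetical stand-in for fetched pages (link => source).
$pages = array(
    '/c1' => '<div class="subcategories">...</div>',
    '/c2' => '<p>products only</p>',
);

// Handles ONE level only and reports where the recursion would go.
function walkLevel($categories, $pages)
{
    $log = array();
    foreach ($categories as $name => $link) {
        $log[] = "Found: $name ($link)";
        $source = isset($pages[$link]) ? $pages[$link] : '';
        if (strpos($source, 'class="subcategories"') !== false) {
            // Later, this line is replaced by the recursive call.
            $log[] = "Loop HERE for $link";
        }
    }
    return $log;
}

echo implode("\n", walkLevel($categories, $pages)), "\n";
```

Once the messages show up where expected for one level, the "Loop HERE" line is the only place that has to change.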
Re: Crawler Help
OK, let me try this. Do you mind if I post what I have come up with later? Hopefully you can help me iron out the bugs.
So far I do have a function that spits out a list of categories from a page's source code and puts them in an array.
Here is the code so far:
$_SESSION['SMID'] = $_POST['smid'];
include (SITE_ROOT.'/cp/admin/cp/includes/scraper/functions.php');
include (SITE_ROOT.'/cp/admin/cp/includes/scraper/categoryparse.php');
$x = 0;
$z = 0;
foreach (catrules($_SESSION['SMID']) as $catrule) {
    $querys = "SELECT * FROM CR_SCRAPER_CATSET WHERE CATSETID = $catrule[CATSETID]";
    $results = mysql_query($querys) or die(mysql_error());
    while ($pinfos = mysql_fetch_array($results)) {
        if ($x == 0) {
            // Initialization of the categories ($categories is an array: category => link).
            // Note: the original passed $pinfo['URL'] (undefined); $pinfos is assumed here.
            $categories = spliter($pinfos['FDELIM'],$pinfos['SPLIT'],$pinfos['EDELIM'],$pinfos['ESPLIT'],$pinfos['SEPERATOR'],$pinfos['FIRSTNUM'],$pinfos['PATTERN'],$pinfos['LSPLIT'],$pinfos['CSPLIT'],sourceget($catrule['URL']),$catrule['URL'],$catrule['LINKSETTING'],$pinfos['URL'],$catrule['URL']);
        }
        foreach ($categories as $cat => $link) {
            // Here is where I am having the difficulties. I have $categories (the first page of categories and links in one array).
            $source = sourceget($link);
            // strpos() returns an offset or false, never true, so compare against false.
            if (strpos($source, 'class="subcategories"') !== false) {
                // spliter() will create a new list of categories for the next page, but how do I feed it back, relate it to its parent, and continue on?
                $categories = spliter($pinfos['FDELIM'],$pinfos['SPLIT'],$pinfos['EDELIM'],$pinfos['ESPLIT'],$pinfos['SEPERATOR'],$pinfos['FIRSTNUM'],$pinfos['PATTERN'],$pinfos['LSPLIT'],$pinfos['CSPLIT'],$source,$link,$catrule['LINKSETTING'],$link,$catrule['URL']);
                // How do I take this, loop back, feed the link information into spliter() for the next category page, and of course reference the parent category?
                $x++;
            }
        }
    } // end catrule while
} // end foreach catrules
traverse($categories);
Thanks for your time, phdatabase.
- phdatabase
Re: Crawler Help
First, in order to have a recursive function, one needs a function. Wrap the foreach statement in a function and move it out of the flow, replacing it with a function call. Now you have a function requiring an array of categories as input, and nothing has changed logically.
Then, in the new function, call the function itself right after you create the category array. Voilà, a recursive function.
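Put together, the two steps might look something like the sketch below. It is only an illustration under stated assumptions: crawl() and parseCategories() are hypothetical names (parseCategories() stands in for spliter()), $pages is an in-memory stand-in for sourceget(), and rows are collected in an array instead of being written to the database. Each level is recorded first, and its ID is then passed down as the parent of the next level.

```php
<?php
// Hypothetical stand-in for the real site: link => page source.
// In the real crawler, sourceget($link) would fetch these.
$pages = array(
    '/c1'   => '<div class="subcategories"><a href="/c1-1">Category 1.1</a></div>',
    '/c1-1' => '<div class="subcategories"><a href="/c1-1-1">Category 1.1.1</a></div>',
    '/c2'   => '<div class="subcategories"><a href="/c2-1">Category 2.1</a></div>',
    '/c2-1' => '<div class="subcategories"><a href="/c2-1-1">Category 2.1.1</a></div>',
);

// Hypothetical replacement for spliter(): returns array(name => link).
function parseCategories($source)
{
    $found = array();
    preg_match_all('/<a href="([^"]+)">([^<]+)<\/a>/', $source, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $found[$m[2]] = $m[1];
    }
    return $found;
}

// Records one level of categories, then recurses into each one that has subcategories.
function crawl($categories, $parentId, &$rows, $pages)
{
    // Record every category on this level first, so siblings get consecutive IDs.
    $ids = array();
    foreach ($categories as $name => $link) {
        $id = count($rows) + 1;
        $rows[] = array('parent' => $parentId, 'category' => $name, 'id' => $id);
        $ids[$name] = $id;
    }
    // Then descend, passing this category's ID down as the parent.
    foreach ($categories as $name => $link) {
        $source = isset($pages[$link]) ? $pages[$link] : '';
        if (strpos($source, 'class="subcategories"') !== false) {
            crawl(parseCategories($source), $ids[$name], $rows, $pages);
        }
    }
}

$top = array('Category 1' => '/c1', 'Category 2' => '/c2', 'Category 3' => '/c3', 'Category 4' => '/c4');
$rows = array();
crawl($top, null, $rows, $pages);
foreach ($rows as $row) {
    printf("%s ~ %s ~ %d\n", $row['parent'], $row['category'], $row['id']);
}
```

Passing $rows by reference avoids a global array. A real site can link back to a page that was already visited, so a production version would probably also need a list of seen links to avoid looping forever.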