help with web scraper

playwright
Forum Newbie
Posts: 20
Joined: Wed Jun 02, 2010 6:11 pm

help with web scraper

Post by playwright »

Hello. I'm new to PHP, so I need some real help here.
I'm trying to create a web scraper that grabs a forum's content and shows only the posts. The source code is here:

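<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
</head>
<body>
<?php
// Fetch the page to scrape.
$html = file_get_contents('http://www.......');
if ($html === false) {
    die('Could not fetch the page.');
}
// Forum markup is rarely valid, so collect parser warnings instead of using @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// Select every element whose class attribute is exactly "postTextContainer".
$posts = $xpath->query('//*[@class="postTextContainer"]');
foreach ($posts as $post) {
    // nodeValue is plain text, so escape it before echoing into HTML.
    echo htmlspecialchars($post->nodeValue), "<br/>\n";
}
?>
</body>
</html>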

Can anyone tell me how I could grab all the posts that are in the same thread? Right now I can only grab the posts on the page at the above URL. I think it's called multiple-page scraping?
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: help with web scraper

Post by phdatabase »

Scraping usually works from the (X)HTML markup, because you never see the server-side source code; you are essentially reverse engineering the process that created the content. Your snippet loads the HTML content, which is where you start. As for how to follow threads, every forum is different and you'll just need to puzzle it out. (That's the fun of it.)
playwright
Forum Newbie
Posts: 20
Joined: Wed Jun 02, 2010 6:11 pm

Re: help with web scraper

Post by playwright »

In the case I'm working on right now, the URL of the first page of the thread looks like http://www.(bla bla bla).com/forum/showthread.php?t=360717
and the other pages of the thread are http://www.(bla bla bla).com/forum/showthread.php?t=360717&page=2 ... &page=3 and so on. Should I use a regex and a for loop or something like that?
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: help with web scraper

Post by phdatabase »

You need to load the 'http://blah/blah/blah...' page into a string, parse it for the content, and then repeat that for as many pages as there are. So yes, you will need a loop, but this is more a structure thing than a regex thing, and I find while loops are generally handier for this type of work.

Following a thread looks easy based on the query strings you are showing.
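A minimal sketch of that while loop, assuming the page URLs follow the &page=N pattern quoted above and the posts sit in class="postTextContainer" elements as in the first snippet. The function names, the $fetch callable and the $maxPages cap are made up for illustration, and the stop conditions are guesses about how a forum behaves past the last page:

```php
<?php
// Hypothetical helper: builds the URL for a given page of the thread.
// Page 1 has no &page= parameter, matching the URLs quoted above.
function buildPageUrl(string $threadUrl, int $page): string
{
    return $page === 1 ? $threadUrl : $threadUrl . '&page=' . $page;
}

// Returns the text of every element with class="postTextContainer".
function extractPosts(string $html): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $posts = [];
    foreach ($xpath->query('//*[@class="postTextContainer"]') as $node) {
        $posts[] = trim($node->nodeValue);
    }
    return $posts;
}

// Walks the thread page by page. $fetch is any callable that takes a URL and
// returns the HTML string or false; $maxPages is an assumed safety cap.
function scrapeThread(string $threadUrl, callable $fetch, int $maxPages = 50): array
{
    $all = [];
    $previousPage = null;
    $page = 1;
    while ($page <= $maxPages) {
        $html = $fetch(buildPageUrl($threadUrl, $page));
        if ($html === false || $html === '') {
            break; // request failed or returned nothing
        }
        $posts = extractPosts($html);
        // Stop on a page without posts, or when the forum serves the last
        // page again for an out-of-range page number.
        if ($posts === [] || $posts === $previousPage) {
            break;
        }
        $all = array_merge($all, $posts);
        $previousPage = $posts;
        $page++;
    }
    return $all;
}
```

Passing 'file_get_contents' as $fetch should be enough for a first try, e.g. scrapeThread($threadUrl, 'file_get_contents'). Keeping the fetch step as a parameter means the loop can be tested against canned HTML without hitting the forum.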
playwright
Forum Newbie
Posts: 20
Joined: Wed Jun 02, 2010 6:11 pm

Re: help with web scraper

Post by playwright »

I hope I'll find a way to do it. I also want to ask how I can delete the content between two tags inside the content that I grabbed with the above code. More specifically, the tag is <div class="........">bla bla</div>
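One way this could be done with the same DOM and XPath approach is to remove the unwanted div nodes from each post before reading its text. This is only a sketch: the function name is hypothetical, both class names are parameters because the real one is elided above, and it assumes the class attribute matches exactly and contains no double quotes:

```php
<?php
// Returns the text of each post with unwanted <div class="..."> blocks removed.
// $postClass and $unwantedClass are placeholders supplied by the caller.
function extractPostsWithout(string $html, string $postClass, string $unwantedClass): array
{
    $previous = libxml_use_internal_errors(true); // silence warnings from sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $posts = [];
    foreach ($xpath->query('//*[@class="' . $postClass . '"]') as $post) {
        // Copy the matches into an array before modifying the tree,
        // so removing nodes cannot disturb the iteration.
        $unwanted = iterator_to_array(
            $xpath->query('.//div[@class="' . $unwantedClass . '"]', $post)
        );
        foreach ($unwanted as $div) {
            $div->parentNode->removeChild($div);
        }
        $posts[] = trim($post->nodeValue);
    }
    return $posts;
}
```

Working on the parsed tree avoids matching nested tags with a regex, which tends to break as soon as one div contains another.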
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: help with web scraper

Post by phdatabase »

Send an HTTP GET request from PHP and load the reply. Or, easier yet, use cURL. An excellent primer for building agents is "Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL" by Michael Schrenk.
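For reference, a cURL-based GET could look roughly like the sketch below. It assumes the curl extension is installed; the function name, the timeout and the user-agent string are placeholders, not anything from the book:

```php
<?php
// Fetches a URL with cURL and returns the response body, or false on failure.
function fetchPage(string $url, int $timeoutSeconds = 15)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1', // placeholder user agent
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and non-2xx responses as failures.
    if ($body === false || $status < 200 || $status >= 300) {
        return false;
    }
    return $body;
}
```

The return value is a plain string, so it can be handed to DOMDocument::loadHTML in place of the file_get_contents result.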
playwright
Forum Newbie
Posts: 20
Joined: Wed Jun 02, 2010 6:11 pm

Re: help with web scraper

Post by playwright »

Thanks for the advice. Actually, I searched all over the web to write this much: I have looked into cURL, DOM and regexes, but as I said before, I'm new to PHP, so it would be really helpful if you could write some code for these. Thanks anyway!
playwright
Forum Newbie
Posts: 20
Joined: Wed Jun 02, 2010 6:11 pm

Re: help with web scraper

Post by playwright »

Any help?
phdatabase
Forum Commoner
Posts: 83
Joined: Fri May 28, 2010 10:02 am
Location: Fort Myers, FL

Re: help with web scraper

Post by phdatabase »

I make my living writing agents. I am happy to share my knowledge and experience (what there is of it) but I am not going to write a scraper for you or anyone else; unless you want to pay me, of course. If you get the resource I named and apply yourself, you should be able to write a scraper in a week.