page scraper

Unipus
Forum Contributor
Posts: 409
Joined: Tue Aug 26, 2003 2:06 pm
Location: Los Angeles, CA

Post by Unipus »

I need a page scraper to automate some custom tasks for me. The problem? I have no idea how to go about writing one (pref. in PHP). I've found only 2 PHP scrapers, and I've downloaded them, but they're both very intricate and hard to follow, and dedicated to fairly specific tasks that don't really apply to what I need to do. I'm having a hard time picking out the wheat from the chaff in reverse-engineering them, so if anyone has any resources (GPL scripts, tutorials, basic articles, whatever) I'd possibly enter into an agreement to touch you in your bathing suit region.

Post by Unipus »

Well, I'm working on it now... and I've run into a bit of a brick wall.

Code:

$content = file_get_contents("[targetwebpage]");

preg_match_all('<div class="newsheadline">.*</div>',$content,$match);

echo $match[0];

Seems simple enough, seems to work as a pure regex, but PHP doesn't like it. Warning: Unknown modifier '.' in /web/sites/live/htdocs/test.php on line 5, it says. Why? I tried escaping it with a \, tried double-quotes for the pattern, tried escaping the enclosed quotes... no luck.
microthick
Forum Regular
Posts: 543
Joined: Wed Sep 24, 2003 2:15 pm
Location: Vancouver, BC

Post by microthick »

Have you tried putting parentheses around the .* so it looks like (.*)?

Post by Unipus »

Yes, then the error simply becomes "Warning: Unknown modifier '(' in /web/sites/live/htdocs/test.php on line 5"

Grr.
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

The . character has a meaning within PCRE, so you need to escape it, i.e. it should be \.

Your regex also needs a start and end delimiter, and you are using greedy matching. I can't remember the default within PHP for greedy mode with PCRE, so you may have to switch to non-greedy.

Post by Unipus »

If I escape the .

Code:

preg_match_all('<div class="newsheadline">\.*</div>',$content,$match);
the error simply becomes Warning: Unknown modifier ''' in /web/sites/live/htdocs/test.php on line 5

I have no idea why that's happening, but I also don't understand most of the rest of your post, so forgive me for being a regex n00b. I've been slogging through the PHP PCRE docs, but it's not exactly beginner's material. All I know is that the regex works just fine in Regex Coach... everything I've tried in PHP doesn't.

Post by microthick »

Maybe this tutorial will help you with what you're doing... but with eregi().
http://www.devhome.org/php/tutorials/webcatching.html

Post by redmonkey »

I have yet to find any decent documentation regarding PCRE for PHP so it is not surprising you are having problems. For what it's worth there are many of these 'regex checker' style apps that work fine on their own but the regex won't work within code.

Anyway, your code should look something like this...

Code:

preg_match_all('/<div class="newsheadline">\.*<\/div>/',$content,$match);
You will note that it has a leading and trailing slash; these are the start and end delimiters. You don't have to use the / character, but that's the one I usually use, and it seems to be the most common choice.
twigletmac
Her Royal Site Adminness
Posts: 5371
Joined: Tue Apr 23, 2002 2:21 am
Location: Essex, UK

Post by twigletmac »

You were getting the error because PHP thought that your regex was enclosed by < and > and that anything following the > was a modifier. This is solved by the forward slashes in redmonkey's code above.

Mac
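
For reference, here is a minimal sketch of the delimiter rule described above. The `$content` string is made-up sample input standing in for the fetched page, and the variable names are only illustrative. It uses an unescaped `.` (any character) and shows that a delimiter other than `/` saves escaping the slash in the closing tag.

```php
<?php
// Hypothetical sample input standing in for the fetched page.
$content = '<div class="newsheadline">Exciting news</div>';

// Slash delimiters: the "/" inside "</div>" has to be escaped.
$withSlashes = '/<div class="newsheadline">.*<\/div>/';

// Another non-alphanumeric delimiter works too; with "#" the slash
// in the closing tag needs no escaping.
$withHashes = '#<div class="newsheadline">.*</div>#';

$countSlashes = preg_match_all($withSlashes, $content, $matchSlashes);
$countHashes  = preg_match_all($withHashes, $content, $matchHashes);

// preg_match_all() fills a nested array; index 0 holds the full matches.
print_r($matchSlashes[0]);
```

Anything that follows the closing delimiter is read as a modifier, which is why the original pattern, starting with `<`, produced the "Unknown modifier" warning for whatever came after the first `>`.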

Post by Unipus »

Great, thanks.

Post by Unipus »

Uh, I didn't have a start and end delimiter. So now it looks like:

Code:

$content = file_get_contents("[URI]");

if (preg_match('/<span class="newsheadline">.*</span]>', $content, $matches)) 
{
	echo $matches;
}
else
{
	echo "not found? what the??";
}
But interestingly enough, it's returning the "what the??" result and not the content. If I echo out the whole $content and view source, the page MOST DEFINITELY contains (as an example):

<span class="newsheadline"><a href="/news/0401/04010809.asp">Exciting news is happening all over the world.</a></span>

So that should be captured by my regex... and yet... is not. Still puzzled, you see.

Post by redmonkey »

Code:

if (preg_match('/<span class="newsheadline">.*<\/span>/', $content, $matches))
Try that out. $matches will be an array, so you will have to use something like print_r() to see the contents.

I'm guessing you want to extract the contents between the span tags? If so, you will need to use something like...

Code:

if (preg_match('/<span class="newsheadline">(.*)<\/span>/', $content, $matches))
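
To make the capture group and the earlier greedy/non-greedy question concrete, here is a small sketch. The sample `$content` and the variable names are assumptions for illustration only. PCRE quantifiers are greedy by default, so `.*` runs to the last `</span>` on the line, while `.*?` stops at the first one.

```php
<?php
// Hypothetical sample input: two headlines on one line.
$content = '<span class="newsheadline"><a href="/a.asp">First</a></span>'
         . '<span class="newsheadline"><a href="/b.asp">Second</a></span>';

// Greedy: .* runs to the LAST </span>, so both headlines land in one match.
preg_match('/<span class="newsheadline">(.*)<\/span>/', $content, $greedy);

// Non-greedy: .*? stops at the FIRST </span>.
preg_match('/<span class="newsheadline">(.*?)<\/span>/', $content, $lazy);

// $lazy[0] is the whole match, $lazy[1] is the first capture group.
print_r($lazy);

// preg_match_all() collects every headline; $all[1] lists the captured text.
$count = preg_match_all('/<span class="newsheadline">(.*?)<\/span>/', $content, $all);
print_r($all[1]);
```

Because `$matches` is an array, `echo $matches;` does not print the matched text; index into it (`$matches[1]`) or dump it with `print_r()`.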

Post by Unipus »

Ick, I see that my post got all messed up converting between <> and []. Anyway, it wasn't even recognizing the array before, and now I realize why: I had escaped the '.' as '\.', so it was looking for a literal '.'

Post by Unipus »

Ah, but there's MORE!

Now, I'm trying to capture pieces of information on a page that are not uniformly identifiable. Fortunately, the pages are commented in blocks. The only way I've thought of to automate the distinction between these and any other possible match is to first make an initial match of the block and then do a second iteration through only that result set. But here's the problem: how the hell do I do a multiline capture? I see the modifier 'm', but A) it doesn't seem to work in Regex Coach, B) even if it IS working, I seem to be using it wrong in PHP.

Code:

preg_match_('/<!-- NEWEST ANNOUNCED.*<!-- AFFILIATE/', $content, $match, \m))
??

So far I've found NO documentation on how to use these properly with PHP. But I need to be able to account for an unknown number of newlines with an unknown amount of content on each line. Grrr.

Post by Unipus »

AHA!

<!-- NEWEST ANNOUNCED(.*(\n))*<!-- AFFILIATE

I'm so stupid sometimes. Although the multiline modifier still really seems to do nothing.
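
A possible explanation for the multiline modifier seeming to do nothing: in PCRE, `m` only changes where `^` and `$` match, while the `s` (dot-all) modifier is the one that lets `.` match newlines. Modifiers are written after the closing delimiter rather than passed as an extra argument. The sketch below assumes a made-up `$content` block; the comment markers are taken from the post above.

```php
<?php
// Hypothetical sample input: a commented block spanning several lines.
$content = "<!-- NEWEST ANNOUNCED -->\n"
         . "<p>Item one</p>\n"
         . "<p>Item two</p>\n"
         . "<!-- AFFILIATE -->\n";

// Without "s", "." does not match a newline, so this finds nothing.
$plain = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/', $content, $m1);

// "m" only makes ^ and $ match at line boundaries; "." is unaffected.
$multi = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/m', $content, $m2);

// "s" (dot-all) lets "." cross newlines, which captures the whole block.
$dotall = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/s', $content, $m3);

if ($dotall === 1) {
    // $m3[1] holds everything between the two comment markers.
    echo $m3[1];
}
```

The captured block in `$m3[1]` can then be fed to a second `preg_match_all()` call, which is the two-pass approach described earlier in the thread.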