page scraper
Posted: Tue Jan 13, 2004 4:26 pm
by Unipus
I need a page scraper to automate some custom tasks for me. The problem? I have no idea how to go about writing one (pref. in PHP). I've found only 2 PHP scrapers, and I've downloaded them, but they're both very intricate and hard to follow, and dedicated to fairly specific tasks that don't really apply to what I need to do. I'm having a hard time picking out the wheat from the chaff in reverse-engineering them, so if anyone has any resources (GPL scripts, tutorials, basic articles, whatever) I'd possibly enter into an agreement to touch you in your bathing suit region.
Posted: Tue Jan 13, 2004 7:56 pm
by Unipus
Well, I'm working on it now... and I've run into a bit of a brick wall.
Code: Select all
$content = file_get_contents("[targetwebpage]");
preg_match_all('<div class="newsheadline">.*</div>',$content,$match);
echo $match[0];
Seems simple enough, seems to work as a pure regex, but PHP doesn't like it.
Warning: Unknown modifier '.' in /web/sites/live/htdocs/test.php on line 5, it says. Why? I tried escaping it with a \, tried double-quotes for the pattern, tried escaping the enclosed quotes... no luck.
Posted: Tue Jan 13, 2004 8:01 pm
by microthick
Have you tried putting parentheses around the .* so it looks like (.*)?
Posted: Tue Jan 13, 2004 8:10 pm
by Unipus
Yes, then the error simply becomes "Warning: Unknown modifier '(' in /web/sites/live/htdocs/test.php on line 5"
Grr.
Posted: Tue Jan 13, 2004 9:02 pm
by redmonkey
The . character has a meaning within PCRE so you need to escape it e.g. it should be \.
Your regex also needs a start and end delimiter, and you are using greedy matching. I can't remember the default within PHP for greedy mode with PCRE, so you may have to switch to non-greedy.
Posted: Tue Jan 13, 2004 9:06 pm
by Unipus
If I escape the .
Code: Select all
preg_match_all('<div class="newsheadline">\.*</div>',$content,$match);
the error simply becomes Warning: Unknown modifier ''' in /web/sites/live/htdocs/test.php on line 5
I have no idea why that's happening, but I also don't understand most of the rest of your post, so forgive me for being a regex n00b. I've been slogging through the PHP PCRE docs, but it's not exactly beginner's material. All I know is that the regex works just fine in Regex Coach... everything I've tried in PHP doesn't.
Posted: Tue Jan 13, 2004 9:52 pm
by microthick
Maybe this tutorial will help you with what you're doing... but with eregi().
http://www.devhome.org/php/tutorials/webcatching.html
Posted: Tue Jan 13, 2004 10:36 pm
by redmonkey
I have yet to find any decent documentation regarding PCRE for PHP, so it is not surprising you are having problems. For what it's worth, there are many of these 'regex checker' style apps where a regex works fine in the app but won't work within code.
Anyway, your code should look something like this...
Code: Select all
preg_match_all('/<div class="newsheadline">\.*<\/div>/',$content,$match);
You will note that it has a leading and trailing slash. These are the start and end delimiters. You don't have to use the / character, but that's the one I usually use, and it seems to be the most commonly used.
Posted: Wed Jan 14, 2004 4:12 am
by twigletmac
You were getting the error because PHP thought that your regex was enclosed by < and > and that anything following the > was a modifier. This is solved by the forward slashes in redmonkey's code above.
Mac
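To make the delimiter point concrete, here is a minimal sketch. The sample markup in $content is made up for illustration, and the # delimiter in the last call is just one alternative to /, not something anyone in the thread used.

```php
<?php
// Hypothetical sample standing in for the fetched page.
$content = '<div class="newsheadline">Exciting news</div>';

// No explicit delimiters: PCRE takes "<" and the first ">" as the delimiter
// pair, so ".*</div>" is read as a list of modifiers and compilation fails.
// The @ only hides the "Unknown modifier" warning for this demo.
$bad = @preg_match_all('<div class="newsheadline">.*</div>', $content, $m1);

// "/" as delimiter: the "/" inside the closing tag has to be escaped.
$slash = preg_match_all('/<div class="newsheadline">.*<\/div>/', $content, $m2);

// "#" as delimiter: nothing in the HTML needs escaping.
$hash = preg_match_all('#<div class="newsheadline">.*</div>#', $content, $m3);

var_dump($bad);   // the pattern never compiled, so no match count comes back
print_r($m2[0]);  // full matches found with the "/" version
print_r($m3[0]);  // same matches with the "#" version
```

Picking a delimiter that does not occur in the pattern (such as # or ~ when matching HTML) tends to keep the regex easier to read than escaping every slash.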
Posted: Wed Jan 14, 2004 2:38 pm
by Unipus
Great, thanks.
Posted: Wed Jan 14, 2004 3:22 pm
by Unipus
uh, I didn't have a start and ending delimiter. So now it looks like:
Code: Select all
$content = file_get_contents("[URI]");
if (preg_match('/<span class="newsheadline">.*</span]>', $content, $matches))
{
echo $matches;
}
else
{
echo "not found? what the??";
}
But interestingly enough, it's returning the "what the??" result and not the content. If I echo out the whole $content and view source, the page MOST DEFINITELY contains (as an example):
<span class="newsheadline"><a href="/news/0401/04010809.asp">Exciting news is happening all over the world.</a></span>
So that should be captured by my regex... and yet... is not. Still puzzled, you see.
Posted: Wed Jan 14, 2004 3:36 pm
by redmonkey
Code: Select all
if (preg_match('/<span class="newsheadline">.*<\/span>/', $content, $matches))
Try that out. $matches will be an array, so you will have to use something like print_r to see the contents.
I'm guessing you want to extract the contents between the span tags? If so, you will need to use something like...
Code: Select all
if (preg_match('/<span class="newsheadline">(.*)<\/span>/', $content, $matches))
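As a rough sketch of how the capture group behaves, and of the greedy versus non-greedy question raised earlier in the thread: the two-headline sample string below is invented, and the strip_tags step is an optional extra, not part of the suggestion above.

```php
<?php
// Hypothetical page fragment with two headlines on a single line.
$content = '<span class="newsheadline"><a href="/a.asp">First</a></span>'
         . '<span class="newsheadline"><a href="/b.asp">Second</a></span>';

// Greedy (.*) keeps going until the LAST </span> it can reach.
preg_match('/<span class="newsheadline">(.*)<\/span>/', $content, $greedy);

// Non-greedy (.*?) stops at the FIRST </span>, giving one match per headline.
preg_match_all('/<span class="newsheadline">(.*?)<\/span>/', $content, $lazy);

// Index 0 holds the full match, index 1 the first parenthesised group.
print_r($greedy[1]);
print_r($lazy[1]);

// Optional: drop the <a> tags to keep only the headline text.
$titles = array_map('strip_tags', $lazy[1]);
print_r($titles);
```

With more than one headline per line, the non-greedy form is usually the one you want, since the greedy form swallows everything between the first opening tag and the last closing tag.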
Posted: Wed Jan 14, 2004 5:50 pm
by Unipus
Ick, I see that my post got all messed up converting between <> and []. Anyway, it wasn't even recognizing the array before, and now I realize why: I had escaped the '.' as '\.', so it was looking for a literal '.'
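For anyone following along, a small sketch of the difference between the escaped and unescaped dot, using a made-up one-line sample:

```php
<?php
// Hypothetical sample line.
$html = '<span class="newsheadline">News</span>';

// \.* means "zero or more literal dots", so the text "News" cannot match.
$escaped = preg_match('/<span class="newsheadline">\.*<\/span>/', $html);

// .* means "zero or more of any character except a newline".
$plain = preg_match('/<span class="newsheadline">.*<\/span>/', $html);

var_dump($escaped); // int(0)
var_dump($plain);   // int(1)
```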
Posted: Wed Jan 14, 2004 6:20 pm
by Unipus
Ah, but there's MORE!
Now, I'm trying to capture pieces of information on a page that are not uniformly identifiable. Fortunately, the pages are commented in blocks. The only way I've thought of to automate the distinction between these and any other possible match is to first make an initial match of the block and then do a second iteration through only that result set. But here's the problem: how the hell do I do a multiline capture? I see the modifier 'm', but A) it doesn't seem to work in Regex Coach, B) even if it IS working, I seem to be using it wrong in PHP.
Code: Select all
preg_match_('/<!-- NEWEST ANNOUNCED.*<!-- AFFILIATE/', $content, $match, \m))
??
So far I've found NO documentation on how to use these properly with PHP. But I need to be able to account for an unknown number of newlines with an unknown amount of content on each line. Grrr.
Posted: Wed Jan 14, 2004 6:23 pm
by Unipus
AHA!
<!-- NEWEST ANNOUNCED(.*(\n))*<!-- AFFILIATE
I'm so stupid sometimes. Although the multiline modifier still really seems to do nothing.
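A note on the modifiers, with a hedged sketch: in PCRE the modifiers go after the closing delimiter inside the pattern string, not as an extra function argument. The m modifier only changes where ^ and $ match, which would explain why it seems to do nothing here. The s modifier is the one that lets . match newlines. The sample $content and the <p> items are invented for illustration, and the second pass follows the two-step idea described above.

```php
<?php
// Hypothetical page fragment with a commented block spanning several lines.
$content = "<!-- NEWEST ANNOUNCED -->\n"
         . "<p>Item one</p>\n"
         . "<p>Item two</p>\n"
         . "<!-- AFFILIATE -->";

// No modifier: "." stops at the first newline, so the block is not found.
$none = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/', $content, $a);

// "m" only affects the ^ and $ anchors, so this still finds nothing.
$multi = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/m', $content, $b);

// "s" lets "." match newlines, so the whole block is captured.
$dotall = preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/s', $content, $c);

var_dump($none, $multi, $dotall); // int(0), int(0), int(1)

// Second pass: search only inside the captured block.
preg_match_all('/<p>(.*?)<\/p>/', $c[1], $items);
print_r($items[1]);
```

The (.*(\n))* pattern also gets across the newlines, but the s modifier with a non-greedy (.*?) is shorter and stops at the first closing comment rather than the last.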