page scraper
I need a page scraper to automate some custom tasks for me. The problem? I have no idea how to go about writing one (pref. in PHP). I've found only 2 PHP scrapers, and I've downloaded them, but they're both very intricate and hard to follow, and dedicated to fairly specific tasks that don't really apply to what I need to do. I'm having a hard time picking out the wheat from the chaff in reverse-engineering them, so if anyone has any resources (GPL scripts, tutorials, basic articles, whatever) I'd possibly enter into an agreement to touch you in your bathing suit region.
Well, I'm working on it now... but I've run into a bit of a brick wall.
Code: Select all
$content = file_get_contents("[targetwebpage]");
preg_match_all('<div class="newsheadline">.*</div>',$content,$match);
echo $match[0];
Seems simple enough, seems to work as a pure regex, but PHP doesn't like it. Warning: Unknown modifier '.' in /web/sites/live/htdocs/test.php on line 5, it says. Why? I tried escaping it with a \, tried double-quotes for the pattern, tried escaping the enclosed quotes... no luck.
-
microthick
- Forum Regular
- Posts: 543
- Joined: Wed Sep 24, 2003 2:15 pm
- Location: Vancouver, BC
If I escape the .
Code: Select all
preg_match_all('<div class="newsheadline">\.*</div>',$content,$match);
the error simply becomes Warning: Unknown modifier ''' in /web/sites/live/htdocs/test.php on line 5
I have no idea why that's happening, but I also don't understand most of the rest of your post, so forgive me for being a regex n00b. I've been slogging through the PHP PCRE docs, but it's not exactly beginner's material. All I know is that the regex works just fine in Regex Coach... everything I've tried in PHP doesn't.
-
microthick
- Forum Regular
- Posts: 543
- Joined: Wed Sep 24, 2003 2:15 pm
- Location: Vancouver, BC
Maybe this tutorial will help you with what you're doing... but with eregi().
http://www.devhome.org/php/tutorials/webcatching.html
I have yet to find any decent documentation regarding PCRE for PHP so it is not surprising you are having problems. For what it's worth there are many of these 'regex checker' style apps that work fine on their own but the regex won't work within code.
Anyway, your code should look something like this...
Code: Select all
preg_match_all('/<div class="newsheadline">.*<\/div>/',$content,$match);
You will note that it has a leading and trailing slash. These are the start and end delimiters. You don't have to use the / character, but that's the one I usually use, and it seems to be the most commonly used.
- twigletmac
- Her Royal Site Adminness
- Posts: 5371
- Joined: Tue Apr 23, 2002 2:21 am
- Location: Essex, UK
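For anyone following along, here is a minimal sketch of the delimiter point made above. PCRE in PHP also accepts bracket-style delimiters such as < and >, which is most likely why the original pattern, starting with <div, produced "Unknown modifier": everything after the first > was read as modifiers. The sample markup and variable names below are assumptions made up for illustration, not taken from the real page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$content = '<div class="newsheadline">Exciting news</div>';

// With "/" as the delimiter, the "/" in "</div>" must be escaped.
$withSlashes = '/<div class="newsheadline">.*<\/div>/';

// With "#" as the delimiter, the "/" needs no escaping.
$withHashes = '#<div class="newsheadline">.*</div>#';

$slashCount = preg_match_all($withSlashes, $content, $slashMatches);
$hashCount  = preg_match_all($withHashes, $content, $hashMatches);

// preg_match_all() fills index 0 with an array of full matches,
// so print one element rather than echoing the array itself.
echo $slashMatches[0][0], "\n";
echo $hashMatches[0][0], "\n";
```

Picking a delimiter that does not appear in the pattern (# here) is often easier to read when matching HTML, since closing tags no longer need backslashes.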
Uh, I didn't have a start and ending delimiter. So now it looks like:
Code: Select all
$content = file_get_contents("[URI]");
if (preg_match('/<span class="newsheadline">.*</span]>', $content, $matches))
{
echo $matches;
}
else
{
echo "not found? what the??";
}
But interestingly enough, it's returning the "what the??" result and not the content. If I echo out the whole $content and view source, the page MOST DEFINITELY contains (as an example):
<span class="newsheadline"><a href="/news/0401/04010809.asp">Exciting news is happening all over the world.</a></span>
So that should be captured by my regex... and yet... is not. Still puzzled, you see.
Code: Select all
if (preg_match('/<span class="newsheadline">.*<\/span>/', $content, $matches))
I'm guessing you want to extract the contents between the span tags? If so, you will need to use something like...
Code: Select all
if (preg_match('/<span class="newsheadline">(.*)<\/span>/', $content, $matches))
Ah, but there's MORE!
Now, I'm trying to capture pieces of information on a page that are not uniformly identifiable. Fortunately, the pages are commented in blocks. The only way I've thought of to automate the distinction between these and any other possible match is to first make an initial match of the block and then do a second iteration through only that result set. But here's the problem: how the hell do I do a multiline capture? I see the modifier 'm', but A) it doesn't seem to work in Regex Coach, B) even if it IS working, I seem to be using it wrong in PHP.
Code: Select all
preg_match_('/<!-- NEWEST ANNOUNCED.*<!-- AFFILIATE/', $content, $match, \m))
??
So far I've found NO documentation on how to use these properly with PHP. But I need to be able to account for an unknown number of newlines with an unknown amount of content on each line. Grrr.
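A rough sketch of the two-pass idea described above, under some assumptions. In PHP the modifiers go after the closing delimiter inside the pattern string, not as a fourth argument. The m modifier only changes where ^ and $ match; it is the s modifier that lets . match newlines, which is what a multi-line capture needs. The page fragment below is invented for illustration (only the two comment markers and the newsheadline class come from the thread), and the variable names are hypothetical.

```php
<?php
// Hypothetical page fragment; only the comment markers and the
// "newsheadline" class come from the thread, the rest is made up.
$content = <<<HTML
<!-- NEWEST ANNOUNCED -->
<span class="newsheadline"><a href="/news/a.asp">First headline</a></span>
<span class="newsheadline"><a href="/news/b.asp">Second headline</a></span>
<!-- AFFILIATE -->
<span class="newsheadline"><a href="/news/c.asp">Outside the block</a></span>
HTML;

$headlines = [];

// Pass 1: "s" lets "." cross newlines; ".*?" stops at the first AFFILIATE marker.
if (preg_match('/<!-- NEWEST ANNOUNCED(.*?)<!-- AFFILIATE/s', $content, $block)) {
    // Pass 2: search only inside the captured block.
    preg_match_all('/<span class="newsheadline">(.*?)<\/span>/s', $block[1], $found);
    // $found[1] holds the text captured by the parentheses.
    $headlines = array_map('strip_tags', $found[1]);
}

print_r($headlines);
```

The non-greedy .*? is a deliberate choice here: a plain .* would run to the last possible closing marker or closing tag and swallow several headlines in one match.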