page scraper

redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

The multi-line mode modifier does cause some confusion. The multi-line modifier only affects the ^ and $ meta-characters.

This goes back to regex originally being a command line tool, so in theory you would only ever apply a regex to a single line. As time moved on, the multi-line modifier was introduced to cope with applying regex to entire documents. By default ^ means 'document (or multi-line string) starts with' and $ means 'document (or multi-line string) ends with' (or just before a trailing new line). By applying the multi-line modifier, ^ becomes 'line starts with' and $ becomes 'line ends with', so both can also match at every line break inside the string.
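
As a rough sketch of the ^ and $ behaviour (the sample string and variable names are made up for illustration, they are not from the original page):

```php
<?php
// Hypothetical three-line sample string.
$text = "first line\nsecond line\nthird line";

// Without 'm': ^ only matches at the very start of the whole string.
$startNoM = preg_match_all('/^\w+/', $text, $a);

// With 'm': ^ also matches right after every new line.
$startM = preg_match_all('/^\w+/m', $text, $b);

// Same idea for $: end of the whole string vs. end of every line.
$endNoM = preg_match_all('/\w+$/', $text, $c);
$endM   = preg_match_all('/\w+$/m', $text, $d);

print_r($a[0]); // first
print_r($b[0]); // first, second, third
echo $endNoM, ' ', $endM, "\n"; // 1 3
```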

What you require is the 'dot all' modifier. Again, as regex was originally a command line tool, the . meta-character matched any character in a single line, so it stopped when it found a new line character. By default the . meta-character matches anything except a new line but by asserting the 'dot all' modifier the . meta-character becomes 'match any character including new lines'.

Also, your multi-line modifier is in the wrong place. Your regex should look something like the example below...

Code:

preg_match_all('/<!-- NEWEST ANNOUNCED.*<!-- AFFILIATE/s', $content, $match);
I'm not sure whether you were using preg_match or preg_match_all.
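
In case it helps with choosing between the two, here is a minimal sketch of how the results differ. The $content value is a made-up stand-in for the scraped page, and the variable names are only for illustration:

```php
<?php
// Hypothetical page fragment; the real $content would come from the scraped page.
$content = "<!-- NEWEST ANNOUNCED -->\n<p>item one</p>\n<!-- AFFILIATE -->";

$pattern = '/<!-- NEWEST ANNOUNCED.*<!-- AFFILIATE/s';

// preg_match stops at the first match; $one[0] is the matched string.
$foundOne = preg_match($pattern, $content, $one);

// preg_match_all collects every match; $all[0] is an array of matched strings.
$foundAll = preg_match_all($pattern, $content, $all);

var_dump(is_string($one[0])); // bool(true)
var_dump(is_array($all[0]));  // bool(true)
```

Either way, the text you are after ends up in index 0; with preg_match_all you just have one more array level to go through.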
Unipus
Forum Contributor
Posts: 409
Joined: Tue Aug 26, 2003 2:06 pm
Location: Los Angeles, CA

Post by Unipus »

Thanks for the explanation... "explanations" are the one thing I'm having a lot of trouble finding. I think some people want to keep regex a secret society with steep membership requirements. Anyway.

"By default the . meta-character matches anything except a new line but by asserting the 'dot all' modifier the . meta-character becomes 'match any character including new lines'."

That of course is what I originally had (unless "dot-all" is NOT .* ), but it didn't seem to function on a multiline level at all. So I finally came up with that solution I posted, which works but if there's a simpler way, I'd be glad to hear it.
redmonkey
Forum Regular
Posts: 836
Joined: Thu Dec 18, 2003 3:58 pm

Post by redmonkey »

Have a look at my last code example: you will note an 's' at the end of the match pattern, after the closing delimiter. That 's' is the 'dot all' modifier. It is not part of the pattern as such, but it instructs the regex engine that the . meta-character should be interpreted as 'match anything including new lines'.
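
To see the difference the 's' makes, something along these lines should do (the START/END markers and the sample input are invented for the example):

```php
<?php
// Made-up multi-line input so the effect of the modifier is visible.
$content = "<!-- START -->\nsome text\n<!-- END -->";

// No modifier: . will not cross a new line, so the pattern cannot reach END.
$plain = preg_match('/START.*END/', $content, $m1);

// Same pattern with 's' after the closing delimiter: . now matches new lines too.
$dotAll = preg_match('/START.*END/s', $content, $m2);

var_dump($plain);  // int(0)
var_dump($dotAll); // int(1)
```

So .* on its own is not 'dot all'; it only spans lines once the 's' modifier is added.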

There are quite a few regex tutorials out there, but finding good ones seems to be quite difficult. I have yet to come across one that is specific to PCRE in PHP, which matters because there are some incompatibilities between Perl and PHP regex.