regex not working as expected

Posted: Sun May 17, 2009 1:19 am
by andersod2
Hi All,

I have a multi-line input string (i.e. I have slurped in an entire file), and I am trying to match the data within some XML-like tags, for example:

$input =~ /<header>(.+)<\/header>/

So I am using Perl to run the above expression, expecting that everything between the "header" tags will be returned in $1. My understanding of Perl is that this should work by default because we are not in multi-line mode. However, the above does not match. When I remove all the carriage returns from the input string, it works, which suggests it is in multi-line mode by default. What am I doing wrong?

thanks in advance!

Re: regex not working as expected

Posted: Sun May 17, 2009 2:07 am
by andersod2
Oh, I think I figured it out maybe -- I didn't notice the /s requirement since "." does not match a carriage return (strange).

Re: regex not working as expected

Posted: Sun May 17, 2009 2:34 am
by prometheuzz
That is not what multi-line is about. The multi-line option causes the ^ and $ anchors to match at the start and end of each line in the input string, instead of only at the start and end of the entire string.
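
To make the anchor behaviour concrete, here is a minimal sketch. The three-line input string and the variable names are made up for illustration and are not from the original question:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-line input, invented for this illustration.
my $text = "one\ntwo\nthree";

# Without /m, ^ and $ anchor only at the start and end of the whole string
# ($ may also match just before a final trailing newline).
my @whole = $text =~ /^(\w+)$/g;

# With /m, ^ and $ anchor at the start and end of every line.
my @lines = $text =~ /^(\w+)$/mg;

print "matches without /m: ", scalar(@whole), "\n";
print "matches with /m:    @lines\n";
```

Note that /m only changes what ^ and $ do. It has no effect on what the DOT matches.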

What you need is to enable the dot-all option: s.

Demo:

#!/usr/bin/perl
use strict;
use warnings;
my $s = "...<header>ab\ncd</header>...";
# /s (dot-all) lets "." match the newline between "ab" and "cd"
print "$1\n" if $s =~ /<header>(.*?)<\/header>/s;

Note that I changed your DOT-PLUS to a DOT-STAR and added a question mark after it, which makes the match non-greedy. To understand why this is generally a good idea, see http://www.regular-expressions.info/repeat.html, specifically the paragraph "Watch Out for The Greediness!".
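
As a rough illustration of the greediness point, the sketch below runs a greedy and a non-greedy pattern against the same string. The input with two header elements is an assumption made up for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input containing two header elements.
my $s = "<header>first</header> text <header>second</header>";

# Greedy: .* grabs as much as possible, so the match runs to the LAST </header>.
my ($greedy) = $s =~ /<header>(.*)<\/header>/s;

# Non-greedy: .*? grabs as little as possible, so the match stops at the FIRST </header>.
my ($lazy) = $s =~ /<header>(.*?)<\/header>/s;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

With only one header element in the input both patterns capture the same text, which is why the difference is easy to miss while testing.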

But if you're parsing (X)HTML or XML files, I recommend using an XML/HTML parser instead of trying to do this with regex: regex is a poor HTML parser.

Re: regex not working as expected

Posted: Sun May 17, 2009 2:35 am
by prometheuzz
andersod2 wrote: Oh, I think I figured it out maybe -- I didn't notice the /s requirement since "." does not match a carriage return (strange).
Not strange at all. In almost all regex engines, the DOT by default does not match newline characters.
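
One detail worth checking for yourself: in Perl the character the DOT skips by default is the newline ("\n"), not the carriage return ("\r"). A small sketch, with test strings made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test strings: one with a newline, one with a carriage return.
my $lf = "a\nb";
my $cr = "a\rb";

# By default the DOT does not match "\n" ...
my $dot_lf = ($lf =~ /a.b/) ? 1 : 0;

# ... but it does match "\r".
my $dot_cr = ($cr =~ /a.b/) ? 1 : 0;

# With /s (dot-all) the DOT matches "\n" as well.
my $dot_lf_s = ($lf =~ /a.b/s) ? 1 : 0;

print "dot vs newline, no /s:         $dot_lf\n";
print "dot vs carriage return, no /s: $dot_cr\n";
print "dot vs newline, with /s:       $dot_lf_s\n";
```

So for a file with Windows line endings ("\r\n"), it is the "\n" half of each line break that stops the DOT unless /s is enabled.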