Greediness matching Weirdness

Posted: Sat Dec 22, 2007 7:21 pm
by John Cartwright
So I have a collection of about 15,000 old archives that I need to scan through and sort out. These archives are old HTML files, and I need to capture the contents between the <textarea></textarea> tags. Simple, right? Probably, but I have been at this for almost 2 hours (and several regex tools, including RegexBuddy, didn't help).

Code:

!<textarea[^>]>(.*?)</textarea>!im
For whatever reason, it never detects the literal </textarea> string, so I suspect I'm having difficulties with the greediness of the wildcard. I've tried making it both greedy and lazy, but neither has worked.

An example source would be:

Code:

<textarea  cols="60" rows="30" name="txtMessage">
<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>;
<html><head><title>I'm a title of an article</title></head><body><h3>Test Article</h3><br><br> By: Test Article<br><br>lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum </body></html></textarea> 
Please tell me I'm missing something obvious!

Thanks :)

Posted: Sat Dec 22, 2007 8:18 pm
by vigge89

Code:

!<textarea[^>]*>(.*?)</textarea>!im
It looks like you forgot to add a repetition operator after the character class ;)
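
To illustrate the point about the missing repetition operator, here is a minimal sketch in Python (an assumption for illustration only; the thread's patterns use PCRE-style delimiters, and Python's `re` module behaves the same way for this case). Without a quantifier, `[^>]` consumes exactly one character, so an opening tag that carries several attributes can never match. The names `one_char` and `any_chars` are made up for this example.

```python
import re

# Opening tag taken from the example source in the first post.
tag = '<textarea  cols="60" rows="30" name="txtMessage">'

# No quantifier: [^>] must match exactly one character before the closing >.
one_char = re.compile(r'<textarea[^>]>', re.IGNORECASE)

# With *: the class consumes every attribute character up to the closing >.
any_chars = re.compile(r'<textarea[^>]*>', re.IGNORECASE)

print(one_char.search(tag) is None)       # the opening tag is not matched
print(any_chars.search(tag) is not None)  # the opening tag is matched
```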

Posted: Sat Dec 22, 2007 8:30 pm
by John Cartwright
Sorry, I actually had that in there originally (typo).

Code:

!<textarea[^>]+>(.*?)</textarea>!im

Posted: Sun Dec 23, 2007 4:39 am
by arjan.top
It works if you add the s modifier (!ims), don't ask me why :D

Posted: Sun Dec 23, 2007 9:56 am
by feyd
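m is multiline mode; ^ and $ match at the start and end of every line instead of only at the start and end of the whole subject. It does not change what . matches, so .*? still stops at the first newline.
s is single line (dot-all) mode; . also matches newline characters, so the match can span any number of lines.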
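
The difference between the two modifiers can be checked with a small sketch. It is written in Python as an assumption for illustration; `re.MULTILINE` and `re.DOTALL` correspond to the m and s modifiers, and `re.IGNORECASE` to i. The sample input is a shortened version of the example source above, and the variable names are hypothetical.

```python
import re

# Shortened version of the example source; note the newlines inside the tag body.
html = (
    '<textarea  cols="60" rows="30" name="txtMessage">\n'
    "<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN'>\n"
    "<html><body><h3>Test Article</h3></body></html></textarea>"
)

pattern = r'<textarea[^>]*>(.*?)</textarea>'

# Equivalent of !im: . cannot cross a newline, so the lazy group never reaches </textarea>.
without_dotall = re.search(pattern, html, re.IGNORECASE | re.MULTILINE)

# Equivalent of !is: . matches newlines too, so the group spans all lines of the body.
with_dotall = re.search(pattern, html, re.IGNORECASE | re.DOTALL)

print(without_dotall)        # None
print(with_dotall.group(1))  # the text between the two tags
```

Since the pattern contains no ^ or $ anchors, the m modifier has no effect on it either way; only s changes the outcome.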