Greediness matching Weirdness

John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

So I have a collection of about 15,000 old archive files that I need to scan through and sort out. These archives are old HTML files, and I need to capture the contents between the <textarea></textarea> tags. Simple, right? Likely yes, but I have been at this for almost 2 hours (using several regex tools, RegexBuddy among them, didn't help).

Code:

!<textarea[^>]>(.*?)</textarea>!im
For whatever reason, it never detects the literal </textarea> string, so I suspect I'm having difficulties with the greediness of the wildcard. I've tried making it both greedy and lazy, but neither has worked.
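
For readers following along, the greedy versus lazy difference mentioned here can be sketched as follows. This is only an illustration: it uses Python's `re` module as a stand-in for the PCRE-style patterns in the thread, and the `html` sample string is hypothetical, not taken from the archives being discussed.

```python
import re

# Hypothetical single-line sample with two textarea elements (not from the thread).
html = "<textarea>first</textarea><p>gap</p><textarea>second</textarea>"

# Greedy: .* consumes as much as possible, then backtracks to the LAST </textarea>.
greedy = re.findall(r"<textarea[^>]*>(.*)</textarea>", html, re.I)

# Lazy: .*? consumes as little as possible, stopping at the FIRST </textarea>.
lazy = re.findall(r"<textarea[^>]*>(.*?)</textarea>", html, re.I)

print(greedy)  # one match spanning both elements
print(lazy)    # one match per element
```

Both variants find a closing tag on single-line input, which suggests that greediness alone would not explain a pattern that never matches at all.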

An example source would be:

Code:

<textarea  cols="60" rows="30" name="txtMessage">
<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>;
<html><head><title>I'm a title of an article</title></head><body><h3>Test Article</h3><br><br> By: Test Article<br><br>lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum </body></html></textarea> 
Please tell me I'm missing something obvious!

Thanks :)
vigge89
Forum Regular
Posts: 875
Joined: Wed Jul 30, 2003 3:29 am
Location: Sweden

Post by vigge89 »

Code:

!<textarea[^>]*>(.*?)</textarea>!im
It looks like you forgot to add a repetition operator after the character class ;)
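
A small sketch of why the repetition operator matters, again using Python's `re` as a stand-in for the thread's PCRE-style syntax. The `tag` string is a shortened, single-line version of the sample above, and the variable names are made up for illustration.

```python
import re

# Shortened single-line version of the sample from the first post.
tag = '<textarea  cols="60" rows="30" name="txtMessage">body</textarea>'

# [^>] with no quantifier matches exactly ONE character, so the
# attributes between "<textarea" and ">" cannot be consumed.
no_quantifier = re.search(r"<textarea[^>]>(.*?)</textarea>", tag, re.I)

# [^>]* matches zero or more characters up to the closing ">".
star = re.search(r"<textarea[^>]*>(.*?)</textarea>", tag, re.I)

# [^>]+ needs at least one character, so a bare <textarea> is skipped.
plus_on_bare = re.search(r"<textarea[^>]+>(.*?)</textarea>", "<textarea>x</textarea>", re.I)

print(no_quantifier)   # None
print(star.group(1))   # body
print(plus_on_bare)    # None
```

Whether `*` or `+` is the better fit depends on whether the files can contain a `<textarea>` tag with no attributes at all.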
John Cartwright
Site Admin
Posts: 11470
Joined: Tue Dec 23, 2003 2:10 am
Location: Toronto

Post by John Cartwright »

Sorry, I actually had that in there originally (typo in my post).

Code:

!<textarea[^>]+>(.*?)</textarea>!im
arjan.top
Forum Contributor
Posts: 305
Joined: Sun Oct 14, 2007 4:36 am
Location: Hoče, Slovenia

Post by arjan.top »

It works if you add the s modifier (!ims), don't ask me why :D
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

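m is multiline mode; it only makes ^ and $ match at the start and end of each line, and it does not let . match a newline. Without s, the (.*?) has to find </textarea> before the first line break, which your sample never does.
s is single-line (dotall) mode; . also matches newlines, so the match can span any number of lines.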
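
The effect of the two modifiers can be checked with a short sketch. It assumes Python's `re` module, where `re.I`, `re.M` and `re.S` correspond to the i, m and s modifiers; the `source` string is an abbreviated version of the sample from the first post, with the line breaks kept.

```python
import re

# Abbreviated multi-line sample based on the first post.
source = (
    '<textarea  cols="60" rows="30" name="txtMessage">\n'
    "<html><head><title>I'm a title of an article</title></head>\n"
    "<body><h3>Test Article</h3></body></html></textarea>"
)

pattern = r"<textarea[^>]+>(.*?)</textarea>"

# i + m only: "." still refuses to cross a newline, so nothing matches.
with_m = re.search(pattern, source, re.I | re.M)

# Adding s (dotall): "." matches newlines and the capture spans all lines.
with_s = re.search(pattern, source, re.I | re.M | re.S)

# Alternative without s: [\s\S] matches any character, newline included.
without_flag = re.search(r"<textarea[^>]+>([\s\S]*?)</textarea>", source, re.I)

print(with_m)                              # None
print(with_s.group(1) == without_flag.group(1))  # True
```

Since the pattern contains no ^ or $ anchors, the m modifier has no effect here and could be dropped.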