Greediness matching Weirdness
Posted: Sat Dec 22, 2007 7:21 pm
So I have a collection of about 15,000 old archive files that I need to scan through and sort out. They are old HTML files, and I need to capture the contents between the <textarea></textarea> tags. Simple, right? Probably, but I have been at this for almost 2 hours, and trying several regex tools (RegexBuddy among them) didn't help.
For whatever reason, the pattern never matches the literal </textarea> string, so I suspect I'm having difficulties with the greediness of the wildcard. I've tried making it both greedy and lazy, but neither has worked.
Please tell me I'm missing something obvious!
Thanks
Code: Select all
!<textarea[^>]>(.*?)</textarea>!im
An example source would be:
Code: Select all
<textarea cols="60" rows="30" name="txtMessage">
<!DOCTYPE HTML PUBLIC '-//W3C//DTD HTML 4.01 Transitional//EN' 'http://www.w3.org/TR/html4/loose.dtd'>;
<html><head><title>I'm a title of an article</title></head><body><h3>Test Article</h3><br><br> By: Test Article<br><br>lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum lorem itsum </body></html></textarea>
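One possible explanation, sketched in Python (an assumption, since the post does not say which language or engine is in use): the problem may not be greediness at all. In the pattern above, `[^>]` matches exactly one character, so the opening tag with its attributes never matches, and without a dot-matches-newline flag `.` stops at the line break after the opening tag. The `m` flag only changes how `^` and `$` behave. The `sample` string and the `extract_textareas` helper below are hypothetical names made up for illustration.

```python
import re

# Hypothetical sample, shortened from the example source in the post.
sample = (
    '<textarea cols="60" rows="30" name="txtMessage">\n'
    "<html><head><title>I'm a title of an article</title></head>"
    "<body><h3>Test Article</h3></body></html></textarea> Thanks"
)

# The pattern as posted: [^>] consumes a single character only,
# and "." does not cross newlines under IGNORECASE | MULTILINE.
as_posted = re.compile(
    r"<textarea[^>]>(.*?)</textarea>", re.IGNORECASE | re.MULTILINE
)

# Adjusted sketch: [^>]* allows any number of attribute characters,
# and DOTALL lets "." match newlines inside the textarea body.
adjusted = re.compile(
    r"<textarea[^>]*>(.*?)</textarea>", re.IGNORECASE | re.DOTALL
)


def extract_textareas(html):
    """Return the stripped contents of every <textarea> element found."""
    return [body.strip() for body in adjusted.findall(html)]


if __name__ == "__main__":
    print(as_posted.search(sample))
    print(extract_textareas(sample))
```

If the engine is PCRE with `!` delimiters, the same idea would presumably be written as `!<textarea[^>]*>(.*?)</textarea>!is`, swapping the `m` modifier for `s`. The lazy `(.*?)` is kept so that a file with several textareas yields one match per element.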