preg_match_all unclosed tag problem

Posted: Thu Sep 10, 2009 3:11 am
by rtnylw
Hi,
I'm working on a website scraping script. I need to get just the text between the tags; the problem is that some of the tags are not closed. Here's an example:

Code: Select all

<tag>some text</tag>
<tag>more text
<tag>even more text</tag>
The second opening tag does not have a closing tag. I was thinking this expression might work:

Code: Select all

#<tag>(.*?)#is
But nothing gets returned. I guess preg_match_all needs a 'closing' identifier, or I'm just way off track. Any guidance is much appreciated. Thanks!

Re: preg_match_all unclosed tag problem

Posted: Thu Sep 10, 2009 3:19 am
by requinix

Code: Select all

#<tag>(.*?)</tag>#is
There's only so far you can go to deal with bad input.


Here's what happens if I post that sample you gave using code tags.

Code: Select all

some text

Code: Select all

more text

Code: Select all

even more text

Re: preg_match_all unclosed tag problem

Posted: Thu Sep 10, 2009 4:37 am
by rtnylw
So there aren't any solutions to this problem?

Re: preg_match_all unclosed tag problem

Posted: Thu Sep 10, 2009 4:50 am
by turbolemon
You could try running the input through Tidy or HTML Purifier first?

Tidy
http://www.w3.org/People/Raggett/tidy/ (Description)
http://uk.php.net/tidy (PHP Extension)
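
A rough sketch of what the Tidy route might look like, assuming the PHP tidy extension is installed. The sample input is the one from the first post. The config values are assumptions: `<tag>` is not a real HTML element, so Tidy has to be told about it through `new-blocklevel-tags`. How Tidy decides to close the stray tag is up to Tidy, so check the repaired output before extracting anything from it.

```php
<?php
// Sample input from the first post: the second <tag> is never closed.
$html = "<tag>some text</tag>\n<tag>more text\n<tag>even more text</tag>";

// Assumed configuration: declare the custom element so Tidy accepts it,
// and return only the body fragment instead of a full HTML document.
$config = [
    'new-blocklevel-tags' => 'tag',
    'show-body-only'      => true,
];

$repaired = null;
if (function_exists('tidy_repair_string')) {
    // Let Tidy balance the markup; the result is the repaired string.
    $repaired = tidy_repair_string($html, $config, 'utf8');
}

echo $repaired ?? "The tidy extension is not available.\n";
```

Once the markup is balanced, the stricter `#<tag>(.*?)</tag>#is` pattern from earlier in the thread has a closing tag to anchor on.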

HTML Purifier
http://htmlpurifier.org/
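
If a regex-only fallback is still wanted, here is a minimal sketch. It assumes that the next opening `<tag>` or the end of the input can stand in for a missing `</tag>`, which may not hold for every page. The pattern from the first post ends in a lazy `(.*?)` with nothing after it, so it is satisfied by zero characters, which would explain the empty result. A lookahead gives the lazy group something to stop at.

```php
<?php
// Sample input from the first post: the second <tag> is never closed.
$html = "<tag>some text</tag>\n<tag>more text\n<tag>even more text</tag>";

// Capture lazily, stopping at whichever comes first: a closing </tag>,
// the next opening <tag>, or the end of the input (\z).
$pattern = '#<tag>(.*?)(?=</tag>|<tag>|\z)#is';

preg_match_all($pattern, $html, $matches);

// Trim the whitespace left behind where a closing tag was missing.
$texts = array_map('trim', $matches[1]);

print_r($texts);
```

This only works when the tags do not nest on purpose; if `<tag>` can legitimately contain another `<tag>`, the lookahead will cut the outer text short.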