Regexp to extract content from a website

saucony77
Forum Newbie
Posts: 2
Joined: Fri Jul 24, 2009 4:03 am

Post by saucony77 »

Hi!
I'm working on software that extracts some content from a website.

For example, this is the HTML:

Code:

 
<div id="panel">
<span id="title">Crazy Maradona</span>
<br>
<div id="news">text of news about Maradona</div>
</div>
 
For this task, I know only the text of the page (I know only that "Crazy Maradona" is the title and "text of news about Maradona" is the text). I do not know the HTML code. If I can find the structure with a regexp, I can find all the news, every day.

So...
I wrote a PHP program that starts from the tags between the title and the text (</span><br><div id="news") and "builds" the regexp for the structure.
For this example, I have:

Code:

 
<div [^>]*>\s*<span [^>]*>\s*[^<]+</span>\s<br>\s<div [^>]*>[^<]+</div>\s</div>
 
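To make the working case concrete, here is a minimal sketch of how a pattern like this could be applied with preg_match(). The helper name extractNews() and the two capture groups around the title and the text are assumptions added for illustration; they are not part of the generated pattern above.

```php
<?php
// Hypothetical helper: applies the generated pattern, with two capture
// groups added (an assumption) so the title and text can be read back.
function extractNews(string $html): ?array
{
    $pattern = '~<div [^>]*>\s*<span [^>]*>\s*([^<]+)</span>\s<br>\s'
             . '<div [^>]*>([^<]+)</div>\s</div>~';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null; // structure not found
    }
    return ['title' => trim($m[1]), 'text' => trim($m[2])];
}

// The sample page from the post, joined with "\n" line breaks.
$html = implode("\n", [
    '<div id="panel">',
    '<span id="title">Crazy Maradona</span>',
    '<br>',
    '<div id="news">text of news about Maradona</div>',
    '</div>',
]);

$news = extractNews($html);
echo $news['title'], ' | ', $news['text'], "\n";
// prints: Crazy Maradona | text of news about Maradona
```

Note that \s (without *) matches exactly one whitespace character, so this sketch only matches when each tag is separated by a single line break, as in the sample.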
But...
my program fails with a more complicated structure, for example:

Code:

 
<div id="panel"><img src="image">
<span id="title">Crazy Maradona</span>
<br>
<div id="news">text of news about Maradona</div>
</div>
 
The img tag does not lie between the title and the text, so my program does not find it.
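The failure can be reproduced with a short sketch. It assumes the generated pattern is run through preg_match() unchanged; the variable names are only for illustration.

```php
<?php
// The generated pattern from the post, unchanged.
$pattern = '~<div [^>]*>\s*<span [^>]*>\s*[^<]+</span>\s<br>\s'
         . '<div [^>]*>[^<]+</div>\s</div>~';

// Same page as the first example, but with an <img> right after the panel tag.
$withImage = implode("\n", [
    '<div id="panel"><img src="image">',
    '<span id="title">Crazy Maradona</span>',
    '<br>',
    '<div id="news">text of news about Maradona</div>',
    '</div>',
]);

// The pattern expects <span ...> directly after the panel <div>, so the
// extra <img> tag prevents any match.
$found = preg_match($pattern, $withImage);
var_dump($found === 1); // bool(false)
```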

Or...

Code:

 
<div id="panel">
<span id="title">Crazy Maradona</span>
<br>
<div id="news">text of news<img src="image"> about Maradona</div>
</div>
 
An HTML tag inside the text! My program, which uses [^<]+ for the title and the text, doesn't work!
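One possible direction, sketched under assumptions rather than offered as a finished fix: allow optional extra tags before the title, capture the title and text lazily with (.*?) instead of [^<]+, and remove inner tags afterwards with strip_tags(). The helper name extractNewsTolerant() is hypothetical. The sketch still breaks if the news text itself contains a nested <div>, because the lazy capture stops at the first </div>.

```php
<?php
// Hypothetical helper, not the original program: a looser pattern that
// tolerates extra tags before the title and inside the captured parts.
function extractNewsTolerant(string $html): ?array
{
    $pattern = '~<div [^>]*>\s*'              // opening tag of the panel
             . '(?:<(?!span\b)[^>]*>\s*)*'    // any tags before the title, e.g. <img>
             . '<span [^>]*>(.*?)</span>\s*'  // title, may contain tags
             . '<br\s*/?>\s*'
             . '<div [^>]*>(.*?)</div>\s*'    // text up to the first </div>
             . '</div>~s';                    // s: "." also matches newlines

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // strip_tags() removes tags such as <img> from the captured pieces.
    return [
        'title' => trim(strip_tags($m[1])),
        'text'  => trim(strip_tags($m[2])),
    ];
}

// Page with an <img> inside the news text.
$imageInText = implode("\n", [
    '<div id="panel">',
    '<span id="title">Crazy Maradona</span>',
    '<br>',
    '<div id="news">text of news<img src="image"> about Maradona</div>',
    '</div>',
]);

// Page with an <img> right after the opening panel tag.
$imageInPanel = implode("\n", [
    '<div id="panel"><img src="image">',
    '<span id="title">Crazy Maradona</span>',
    '<br>',
    '<div id="news">text of news about Maradona</div>',
    '</div>',
]);

print_r(extractNewsTolerant($imageInText));
print_r(extractNewsTolerant($imageInPanel));
```

Both calls should return the same title and text for these two sample pages; pages with a different layout would still need their own pattern.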

Can you help me create a "mega" regexp that handles these cases?
Thanks a lot!!!