Retrieving HTML data from a table

Posted: Sun Mar 18, 2007 7:08 am
by leyen
Hi guys,

I've read up on regex, but I find most of it hard to understand.

Basically, I would like some code that looks through an HTML table and grabs the data, storing each value in a separate string to be inserted into a MySQL database. An example of the table is:

Code:

<table><tr bgcolor=#000000><td colspan=2>
<b>Personal Information</b></td></tr>
<tr bgcolor=#202020><td width=20%>Name:</td><td>Jacob McGrath</td></tr>
<tr bgcolor=#ffffff><td>Gender:</td><td>male</td></tr>
<tr bgcolor=#202020><td>E-mail:</td><td>myemail@yourdomain.com</td></tr>
<tr bgcolor=#ffffff><td>Country:</td><td>Ireland</td></tr>
<tr bgcolor=#202020><td>Age:</td><td>23</td></tr>
<tr bgcolor=#ffffff><td>Degree:</td><td>Bachelor of Engineering</td></tr>
<tr bgcolor=#202020><td>Company:</td><td>McGrath's Engineering</td></tr>
<tr bgcolor=#ffffff><td>Link:</td><td><a href="http://www.yourdomain.com/mcgrath">McGrath</a></td></tr>
<tr bgcolor=#202020><TD>TimeZone:</td><td>GMT +3.00</td></tr>
</table>
Notice that the bgcolor alternates between #202020 and #ffffff. When one of the items above is not given, its entire <tr></tr> is omitted, and the colours still alternate. For example, if we take out the "Degree" row between Age and Company:

Code:

<table><tr bgcolor=#000000><td colspan=2>
<b>Personal Information</b></td></tr>
<tr bgcolor=#202020><td width=20%>Name:</td><td>Jacob McGrath</td></tr>
<tr bgcolor=#ffffff><td>Gender:</td><td>male</td></tr>
<tr bgcolor=#202020><td>E-mail:</td><td>myemail@yourdomain.com</td></tr>
<tr bgcolor=#ffffff><td>Country:</td><td>Ireland</td></tr>
<tr bgcolor=#202020><td>Age:</td><td>23</td></tr>
<tr bgcolor=#ffffff><td>Company:</td><td>McGrath's Engineering</td></tr>
<tr bgcolor=#202020><td>Link:</td><td><a href="http://www.yourdomain.com/mcgrath">McGrath</a></td></tr>
<tr bgcolor=#ffffff><TD>TimeZone:</td><td>GMT +3.00</td></tr>
</table>
... the bgcolor still alternates!

I would like to be able to save each piece of info into its own string (Jacob McGrath to $name, male to $gender, Ireland to $country, etc.).

Thanks in advance :)

Posted: Sun Mar 18, 2007 11:19 am
by John Cartwright
You have permission to extract this information, I assume?

Code:

preg_match_all('#<td[^>]>([a-Z]+:)</td><td[^>]>(*?)</td>#', $source);
This should get you started, although fine-tuning the regex should be a good exercise for you.

Posted: Sun Mar 18, 2007 11:55 am
by feyd
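[a-Z] will fail expression compilation because the range is not in ascending order ('a' comes after 'Z' in ASCII). Flipped around, [A-z] compiles but also includes the characters that sit between 'Z' and 'a' ([, \, ], ^, _ and `), which are probably not intended.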
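To make the range problem concrete, here is a minimal sketch that could be run to check it. The variable names ($extras, $loose, $strict, $result) are only for illustration, and the @ is there purely to keep the compile warning out of the output.

```php
<?php
// The six ASCII characters that sit between 'Z' (0x5A) and 'a' (0x61).
$extras = ['[', '\\', ']', '^', '_', '`'];

foreach ($extras as $ch) {
    $loose  = preg_match('/^[A-z]$/', $ch);   // the wide range accepts them
    $strict = preg_match('/^[a-z]$/i', $ch);  // letters only, case-insensitive
    printf("%s  [A-z]=%d  [a-z]/i=%d\n", $ch, $loose, $strict);
}

// The reversed range cannot be compiled: preg_match() raises a warning
// (suppressed here with @) and returns false rather than 0 or 1.
$result = @preg_match('/[a-Z]/', 'x');
var_dump($result === false);
```

Using [a-z] together with the i modifier keeps the class to letters only, which is why the pattern is written that way.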

Code:

#<td[^>]*>([a-z-]+:)</td><td[^>]*>(.*?)</td>#i
This will probably suffice for the provided snippets.

Note: if there's a table nested inside the second cell, the regex will not capture that entire nested table (unless it contains no cells), because the lazy (.*?) stops at the first </td> it finds.
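As a rough sketch of how that pattern might be applied to the whole table, assuming the HTML is already in a string: $source below is a shortened copy of the table from the first post, and $rows and $info are names made up for this example. Rows that are missing simply produce no match, and the alternating bgcolor is never inspected, so neither should matter.

```php
<?php
// Shortened copy of the table from the first post.
$source = <<<'HTML'
<table><tr bgcolor=#000000><td colspan=2>
<b>Personal Information</b></td></tr>
<tr bgcolor=#202020><td width=20%>Name:</td><td>Jacob McGrath</td></tr>
<tr bgcolor=#ffffff><td>Gender:</td><td>male</td></tr>
<tr bgcolor=#202020><td>E-mail:</td><td>myemail@yourdomain.com</td></tr>
<tr bgcolor=#ffffff><td>Link:</td><td><a href="http://www.yourdomain.com/mcgrath">McGrath</a></td></tr>
<tr bgcolor=#202020><TD>TimeZone:</td><td>GMT +3.00</td></tr>
</table>
HTML;

$pattern = '#<td[^>]*>([a-z-]+:)</td><td[^>]*>(.*?)</td>#i';

// PREG_SET_ORDER returns one array per matched row:
// [0] the whole match, [1] the label cell, [2] the value cell.
preg_match_all($pattern, $source, $rows, PREG_SET_ORDER);

$info = [];
foreach ($rows as $row) {
    $label = strtolower(rtrim($row[1], ':'));   // "Name:" becomes "name"
    $info[$label] = trim(strip_tags($row[2]));  // keeps the link text, drops the <a> tag
}

print_r($info);
```

From there, $info['name'], $info['gender'] and so on can be copied into separate variables if that is preferred. strip_tags() keeps only the link text for the Link row; if the URL itself is wanted, that row would need its own pattern.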

Posted: Sun Mar 18, 2007 11:56 am
by John Cartwright
Nice one feyd, but I was hoping the OP could figure it out 8) All's well that ends well.

Posted: Mon Mar 19, 2007 6:11 am
by leyen
Hi Jcart and feyd, thanks for your help :D!! And yeah I have permission to extract the data.

I've tried the function preg_match, but all it does is return the entire text matched by the pattern, so it returns the whole of:

Code:

<td width=20%>Name:</td><td>Jacob McGrath</td>
Whereas I'm only looking for something to return "Jacob McGrath". I've tried to see if I can manipulate preg_match to return just the name, but I can't seem to do it.

Any pointers will be much appreciated :D! Thanks.

Posted: Mon Mar 19, 2007 9:45 am
by feyd
Did you use one of the above patterns? They will both return a three-element array for each match.
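A minimal sketch of what that three-element array looks like with preg_match, using the corrected pattern from earlier in the thread. It assumes the row is held in a string; $row and $matches are names chosen for this example. The third argument is what receives the captured groups.

```php
<?php
$row     = '<td width=20%>Name:</td><td>Jacob McGrath</td>';
$pattern = '#<td[^>]*>([a-z-]+:)</td><td[^>]*>(.*?)</td>#i';

if (preg_match($pattern, $row, $matches)) {
    // $matches[0] is the full matched text (the whole pair of cells).
    // $matches[1] is the first parenthesised group: the label, "Name:".
    // $matches[2] is the second parenthesised group: the value.
    $name = $matches[2];
    echo $name, "\n";   // prints "Jacob McGrath"
}
```

So the value being looked for is in index 2 rather than index 0.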