HTML table to CSV

Posted: Fri Sep 18, 2009 10:02 am
by shiznatix
OK, basically, as the title suggests, I have an HTML table that I need to convert to a CSV file. The HTML table is actually supposed to be the Excel output file, but whatever: people can't code and I don't know where they live, so I can't do anything about it. The problem is that I am not very good with regular expressions!

Basically I have a table kinda like this:

Code:

<table cellspacing="1" cellpadding="4" rules="all" bordercolor="#CC9966" border="0" id="Datagrid2" bgcolor="White" width="100%">
        <tr class="TabularA" bgcolor="#990000">
                <td><font color="#FFFFCC"><b>Customer ID</b></font></td><td><font color="#FFFFCC"><b>Alias</b></font></td><td><font color="#FFFFCC"><b>Product</b></font></td><td align="right"><font color="#FFFFCC"><b>Net Revenue</b></font></td>
        </tr><tr class="TabularB" bgcolor="White">
                <td><font color="#330099">1111111</font></td><td><font color="#330099">UserName</font></td><td><font color="#330099">TextOne</font></td><td align="right"><font color="#330099">99.99</font></td>
        </tr>
</table>
and I want to turn it into:
Customer ID,Alias,Product,Net Revenue
1111111,UserName,TextOne,99.99
I figured I could start slow and just do this (I have tried a bunch of variations of this):

Code:

preg_match_all('#<table.*>.*<tr.*>(.*)</tr></table>#ism#', $contents, $matches);
but nothing is working. Could someone help turn me in the right direction?

Re: HTML table to CSV

Posted: Fri Sep 18, 2009 10:40 am
by ridgerunner
Your regex is using the dot-star combination with reckless abandon! For starters, the very first one: '<table.*>' grabs everything up to the end of the file (then backs up to match the right angle bracket at the end of the '</html>' closing tag). Then the regex engine is forced to backtrack like nobody's business to try and match all the other dot-stars. This is very bad! (but you are to be forgiven because, as you said, you are not well versed in regex-speak).
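
To illustrate the point, here is a small sketch comparing the greedy pattern with a tighter one. The `$html` string is a made-up example, not the poster's real file.

```php
<?php
// Made-up input: a tiny table followed by more markup.
$html = '<table id="t"><tr><td>x</td></tr></table><p>after</p>';

// Greedy dot-star with the "s" flag: runs to the end of the input,
// then backs up only as far as the last ">" it can find.
preg_match('#<table.*>#s', $html, $greedy);

// Negated character class: stops at the first ">" after "<table".
preg_match('#<table\b[^>]*>#', $html, $tight);

// $greedy[0] holds the entire input string.
// $tight[0] holds only '<table id="t">'.
```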

So to answer your question, first capture all the table records into an array using a regex something like this:

Code:

preg_match_all('%<tr\b[^>]*>(.*?)</tr>%si', $text, $result_all_tr_in_file);
Then for each table record array element, capture all the table cell data into another array something like this:

Code:

// $tr_html is one captured row, i.e. one element of $result_all_tr_in_file[1]
preg_match_all('%<td\b[^>]*>(.*?)</td>%si', $tr_html, $result_all_td_in_tr);
Then, put it all together again with commas and line endings. This solution assumes that you have one simple table with no nested tables inside it. I'd put it all together for you but I'm short of time right now (I have to go to work...)
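
A minimal sketch of how those steps might be put together, under the same assumption of one simple table with no nesting. The function name `html_table_to_csv` is hypothetical, and the use of `strip_tags()` to drop the `<font>` and `<b>` wrappers, along with the CSV quoting, is an addition that the post does not spell out.

```php
<?php
// Sketch: convert one simple (non-nested) HTML table to CSV text.
// html_table_to_csv() is a hypothetical helper name, not from the thread.
function html_table_to_csv($html)
{
    $lines = array();

    // Step 1: capture the inner HTML of every table row.
    preg_match_all('%<tr\b[^>]*>(.*?)</tr>%si', $html, $rows);

    foreach ($rows[1] as $row_html) {
        // Step 2: capture the inner HTML of every cell in this row only.
        preg_match_all('%<td\b[^>]*>(.*?)</td>%si', $row_html, $cells);

        $fields = array();
        foreach ($cells[1] as $cell_html) {
            // Drop nested markup such as <font> and <b>, decode entities, trim.
            $value = trim(html_entity_decode(strip_tags($cell_html)));
            // Quote fields that contain commas, quotes or line breaks.
            if (preg_match('/[",\r\n]/', $value)) {
                $value = '"' . str_replace('"', '""', $value) . '"';
            }
            $fields[] = $value;
        }

        // Step 3: one CSV line per row that actually had cells.
        if (count($fields) > 0) {
            $lines[] = implode(',', $fields);
        }
    }

    return implode("\n", $lines);
}
```

Calling `echo html_table_to_csv($contents);` on the sample table from the first post should give the two lines asked for there. If the result is going to a file anyway, PHP's built-in `fputcsv()` can take over the quoting.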

Re: HTML table to CSV

Posted: Fri Sep 18, 2009 10:46 am
by prometheuzz
If you already have the table (so you're not ploughing through the entire HTML file!), you could try a simple preg_match_all with this (untested!) regex:

Code:

'/(?<=>)[^<>]+[^<>\s]+[^<>]+(?=<)/'
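
A rough sketch of how that regex might be used, on a cut-down version of the sample table. The column count `$columns` is an assumption: the pattern returns one flat list of cell texts, so the rows have to be rebuilt by chunking. Note that, as written, the pattern needs at least three characters between the tags, so a cell such as `5` would be skipped and would throw the chunking off.

```php
<?php
// Cut-down version of the sample table from the first post (two columns).
$table = '<table><tr><td><font color="#FFFFCC"><b>Customer ID</b></font></td>'
       . '<td><b>Alias</b></td></tr>'
       . '<tr><td><font>1111111</font></td><td>UserName</td></tr></table>';

// The regex from the post: text sitting directly between a ">" and a "<".
preg_match_all('/(?<=>)[^<>]+[^<>\s]+[^<>]+(?=<)/', $table, $m);

// $m[0] is a flat list of cell texts, so regroup it into rows.
// $columns = 2 is an assumption about this particular sample.
$columns = 2;
$csv_lines = array();
foreach (array_chunk($m[0], $columns) as $row) {
    $csv_lines[] = implode(',', array_map('trim', $row));
}
$csv = implode("\n", $csv_lines);
```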