HTML table to CSV

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

shiznatix
DevNet Master
Posts: 2745
Joined: Tue Dec 28, 2004 5:57 pm
Location: Tallinn, Estonia

HTML table to CSV

Post by shiznatix »

Ok, basically, as the title suggests, I have an HTML table that I need to convert to a CSV file. The HTML table is actually supposed to be the Excel output file but whatever, people can't code and I don't know where they live so I can't do anything about it. The problem is that I am not very good with regular expressions!

Basically I have a table kinda like this:

Code: Select all

<table cellspacing="1" cellpadding="4" rules="all" bordercolor="#CC9966" border="0" id="Datagrid2" bgcolor="White" width="100%">
        <tr class="TabularA" bgcolor="#990000">
                <td><font color="#FFFFCC"><b>Customer ID</b></font></td><td><font color="#FFFFCC"><b>Alias</b></font></td><td><font color="#FFFFCC"><b>Product</b></font></td><td align="right"><font color="#FFFFCC"><b>Net Revenue</b></font></td>
        </tr><tr class="TabularB" bgcolor="White">
                <td><font color="#330099">1111111</font></td><td><font color="#330099">UserName</font></td><td><font color="#330099">TextOne</font></td><td align="right"><font color="#330099">99.99</font></td>
        </tr>
</table>

and I want to turn it into:

Customer ID,Alias,Product,Net Revenue
1111111,UserName,TextOne,99.99

I figured I could start slow and just do this (I have tried a bunch of variations of this):

Code: Select all

preg_match_all('#<table.*>.*<tr.*>(.*)</tr></table>#ism#', $contents, $matches);

but nothing is working. Could someone point me in the right direction?
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: HTML table to CSV

Post by ridgerunner »

Your regex is using the dot-star combination with reckless abandon! For starters, take the very first one: because the 's' modifier lets the dot match newlines, '<table.*>' grabs everything up to the end of the file (then backs up to match the right angle bracket at the end of the '</html>' closing tag). Then the regex engine is forced to backtrack like nobody's business to try to match all the other dot-stars. This is very bad! (But you are to be forgiven because, as you said, you are not well versed in regex-speak.)
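
To make the difference concrete, here is a minimal sketch comparing the greedy dot-star with a negated character class. The `$html` string and the variable names are made up for illustration and are not from the original post. Separately, note that the pattern as posted ends in '#ism#': PHP treats everything after the closing delimiter as modifiers, so the final '#' triggers an "Unknown modifier" warning before any matching happens.

```php
<?php
// Hypothetical sample input; $html is an illustrative name, not from the thread.
$html = '<html><table id="t"><tr><td>a</td></tr></table></html>';

// Greedy dot-star: '.*' runs to the end of the string, then backs up
// only as far as the LAST '>' it can find.
preg_match('%<table.*>%si', $html, $greedy);

// Negated character class: '[^>]*' cannot run past the first '>'.
preg_match('%<table\b[^>]*>%si', $html, $tight);

echo $greedy[0], "\n"; // everything from <table through the closing </html>
echo $tight[0], "\n";  // only the opening <table id="t"> tag
```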

So to answer your question, first capture all the table records into an array using a regex something like this:

Code: Select all

preg_match_all('%<tr\b[^>]*>(.*?)</tr>%si', $text, $result_all_tr_in_file);
Then, for each captured row (each element of $result_all_tr_in_file[1]), capture all the table cell data into another array, something like this:

Code: Select all

// $tr is one element of $result_all_tr_in_file[1]
preg_match_all('%<td\b[^>]*>(.*?)</td>%si', $tr, $result_all_td_in_tr);
Then, put it all together again with commas and line endings. This solution assumes that you have one simple table with no nested tables inside it. I'd put it all together for you but I'm short of time right now (I have to go to work...)
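
The three steps above (rows, then cells, then commas and line endings) could be put together roughly as follows. This is only a sketch under the same assumptions: one simple table, no nested tables, and data held in `<td>` cells. The function name `html_table_to_csv`, the `$quote` helper, and the `$sample` string are hypothetical. Stripping the leftover `<font>` and `<b>` markup with `strip_tags` and quoting fields that contain commas or quotes are additions beyond what the post describes.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function html_table_to_csv(string $html): string
{
    // Quote a field only when it needs it (comma, double quote, or line
    // break), doubling any embedded double quotes.
    $quote = function (string $field): string {
        return preg_match('/[",\r\n]/', $field)
            ? '"' . str_replace('"', '""', $field) . '"'
            : $field;
    };

    // Step 1: capture the inner HTML of every row.
    preg_match_all('%<tr\b[^>]*>(.*?)</tr>%si', $html, $rows);

    $csv = '';
    foreach ($rows[1] as $row_html) {
        // Step 2: capture the inner HTML of every cell in this row.
        preg_match_all('%<td\b[^>]*>(.*?)</td>%si', $row_html, $cells);
        if (!$cells[1]) {
            continue; // row without <td> cells: nothing to write
        }

        // Assumption: cells hold only inline markup such as <font> and <b>,
        // so stripping tags and decoding entities leaves the visible text.
        $fields = array_map(function (string $cell): string {
            return trim(html_entity_decode(strip_tags($cell), ENT_QUOTES, 'UTF-8'));
        }, $cells[1]);

        // Step 3: join the fields with commas and end the line.
        $csv .= implode(',', array_map($quote, $fields)) . "\n";
    }
    return $csv;
}

// Trimmed-down copy of the table from the question.
$sample = <<<'HTML'
<table id="Datagrid2">
        <tr class="TabularA" bgcolor="#990000">
                <td><font color="#FFFFCC"><b>Customer ID</b></font></td><td><font color="#FFFFCC"><b>Alias</b></font></td><td><font color="#FFFFCC"><b>Product</b></font></td><td align="right"><font color="#FFFFCC"><b>Net Revenue</b></font></td>
        </tr><tr class="TabularB" bgcolor="White">
                <td><font color="#330099">1111111</font></td><td><font color="#330099">UserName</font></td><td><font color="#330099">TextOne</font></td><td align="right"><font color="#330099">99.99</font></td>
        </tr>
</table>
HTML;

echo html_table_to_csv($sample);
```

If the real export uses `<th>` for header cells, the cell pattern would need to accept both tag names. PHP's built-in `fputcsv` is an alternative to the hand-rolled quoting, although it also quotes fields that merely contain spaces.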
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: HTML table to CSV

Post by prometheuzz »

If you have already isolated the table (so you're not ploughing through the entire HTML file!), you could try a simple preg_match_all with this (untested!) regex:

Code: Select all

'/(?<=>)[^<>]+[^<>\s]+[^<>]+(?=<)/'
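
A quick check of that pattern, as a sketch on a made-up one-row input (`$table` is hypothetical). Each of its three quantified pieces needs at least one character, so a text run shorter than three characters is skipped. The relaxed variant below is an assumption about the intent, not something posted in the thread.

```php
<?php
// Hypothetical one-row input with a short cell, used only for illustration.
$table = '<tr><td><b>ID</b></td><td>Net Revenue</td><td>99.99</td></tr>';

// Pattern as posted in the thread.
preg_match_all('/(?<=>)[^<>]+[^<>\s]+[^<>]+(?=<)/', $table, $posted);

// Assumed relaxed variant (not from the thread): optional padding around
// at least one non-whitespace character.
preg_match_all('/(?<=>)[^<>]*[^<>\s][^<>]*(?=<)/', $table, $relaxed);

print_r($posted[0]);  // 'ID' is missing: it is shorter than three characters
print_r($relaxed[0]); // all three cell texts
```

Either way the result is one flat list of cell texts, so rows would still have to be rebuilt, for example with `array_chunk` and a known column count. An empty cell produces no match at all, which would shift every later column.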