
Regex to get records from a table

Posted: Mon Sep 27, 2010 5:02 am
by klevis miho
<td>1</td><td height="20"><a class="BN" href="index.php?b_id=1990"> test </a></td><td> Test1 </td><td> test2 </td><td> test3 </td><td> test4</td><td> test5</td>

How can I extract all the data from these <td> cells?

Re: Regex to get records from a table

Posted: Mon Sep 27, 2010 5:08 am
by DigitalMind
<td.*?>(.*?)</td>
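
For anyone trying this out, here is a minimal sketch of how that pattern might be applied in PHP with preg_match_all. The variable names, the % delimiters, the i and s flags, and the strip_tags/trim cleanup are my own assumptions for illustration, not part of the pattern as posted.

```php
<?php
// Sample row copied from the question above.
$html = '<td>1</td><td height="20"><a class="BN" href="index.php?b_id=1990"> test </a></td>'
      . '<td> Test1 </td><td> test2 </td><td> test3 </td><td> test4</td><td> test5</td>';

// The pattern as posted, wrapped in % delimiters.
// i = case-insensitive tag names, s = let . match newlines inside a cell.
preg_match_all('%<td.*?>(.*?)</td>%is', $html, $matches);

// $matches[1] holds the text captured by (.*?) for each cell.
// Strip inner tags such as <a> and trim surrounding whitespace.
$cells = array_map(function ($cell) {
    return trim(strip_tags($cell));
}, $matches[1]);

print_r($cells);
```

With the sample row this should give seven entries: 1, test, Test1, test2, test3, test4, test5.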

Re: Regex to get records from a table

Posted: Mon Sep 27, 2010 5:30 am
by klevis miho
Thanks man

Re: Regex to get records from a table

Posted: Mon Sep 27, 2010 9:18 am
by ridgerunner
That will work if your tables are not nested. However, if the tables are nested, you'll need something a bit more complex. You can design a regex to match either the innermost or the outermost <td>...</td>. This subject was recently discussed with regard to tables as a whole. See: preg_replace produces mysteriously blank file
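
To see why nesting matters, here is a small sketch that runs the simple pattern from the earlier reply against a made-up nested sample (the markup and variable names are illustrative only). The lazy (.*?) stops at the first </td> it meets, so the outer cell is cut short.

```php
<?php
// Hypothetical sample: an outer cell wrapping a nested table with two cells.
$nested = '<table><tr><td>outer <table><tr><td>inner A</td><td>inner B</td></tr></table></td></tr></table>';

// Simple lazy pattern from the earlier reply.
preg_match_all('%<td.*?>(.*?)</td>%is', $nested, $m);

// The first capture starts in the outer cell but ends at the first inner </td>,
// so it contains an unclosed inner <td> instead of the whole outer cell.
print_r($m[1]);
```

The first captured entry is "outer <table><tr><td>inner A", which is neither the outer cell nor an inner one.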

That said, if you are dealing with tables that are nested, here is a script containing two commented regexes; one to match innermost TD tags, and another to match outermost TD tags:

Code:

<?php // File: NestedTds.php
$data = file_get_contents('NestedTablesTestData.html');

// regex to match innermost TDs which do NOT contain nested TDs
$pattern_innermost = '%
# Use: "unroll-the-loop" technique. i.e. "(normal* (special normal*)*)"
# from: "Mastering Regular Expressions - 3rd Edition" by Jeffrey Friedl
<td\b[^>]*+>         # Match opening TD tag having any attributes.
[^<]*+               # 1st (normal*) = match up to next < opening tag char.
(?:                  # Special "<" found. Begin (special normal*)* loop.
  (?! </?td\b )      # Begin (special). If < is not start of a TD tag,
  <                  # then safe to match the non-TD-tag <. End (special).
  [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
)*+                  # End of (special normal*)* loop.
</td>                # Match closing TD tag.
%ix';

if (preg_match_all($pattern_innermost, $data, $matches) > 0) {
	echo("Inner pattern matched. Here are the results:\r\n");
	print_r($matches);
}

// regex to match outermost TDs which may contain nested TDs
$pattern_outermost = '%
<td\b[^>]*+>           # Match opening TD tag.
(?:                    # Non-capture group for alternation.
  (?R)                 # Match a whole nested TD element,
|                      # or... match a bunch of non-TD-tag characters
  [^<]*+               # 1st (normal*) = match up to next < opening tag char.
  (?:                  # Special "<" found. Begin (special normal*)* loop.
    (?! </?td\b )      # Begin (special). If < is not start of a TD tag,
    <                  # then safe to match the non-TD-tag <. End (special).
    [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
  )*+                  # End of (special normal*)* loop.
)*+                    # loop as many as it takes until outer
</td>                  # balanced closing TD tag is matched.
%six';
 
if (preg_match_all($pattern_outermost, $data, $matches) > 0) {
	echo("Outer pattern matched. Here are the results:\r\n");
	print_r($matches);
}
?>
These are a bit more complex because they implement the "unrolling-the-loop" efficiency technique described in Jeffrey Friedl's classic book "Mastering Regular Expressions", 3rd Edition.
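
If you don't have a NestedTablesTestData.html file handy, the following sketch tries both patterns on an inline string. The sample markup and variable names are made up for illustration, and the patterns are the same two as above with the comments and whitespace stripped, so the x flag is no longer needed.

```php
<?php
// Hypothetical sample: an outer cell wrapping a nested table with two cells.
$sample = '<table><tr><td>outer <table><tr><td>inner A</td><td>inner B</td></tr></table></td></tr></table>';

// Innermost pattern: a TD whose content contains no other TD open/close tag.
$innermost = '%<td\b[^>]*+>[^<]*+(?:(?!</?td\b)<[^<]*+)*+</td>%i';

// Outermost pattern: (?R) recurses into the whole pattern to consume nested TDs.
$outermost = '%<td\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?td\b)<[^<]*+)*+)*+</td>%i';

preg_match_all($innermost, $sample, $inner);
preg_match_all($outermost, $sample, $outer);

echo "Innermost matches:\n";
print_r($inner[0]);   // the two inner cells

echo "Outermost matches:\n";
print_r($outer[0]);   // one match spanning the whole outer cell
```

The innermost pattern skips the outer cell entirely because its content runs into another <td> before a </td>, while the outermost pattern swallows both inner cells as part of a single match.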

Hope this helps. :)