Regex to get records from a table

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

klevis miho
Forum Contributor
Posts: 413
Joined: Wed Oct 29, 2008 2:59 pm
Location: Albania

Regex to get records from a table

Post by klevis miho »

<td>1</td><td height="20"><a class="BN" href="index.php?b_id=1990"> test </a></td><td> Test1 </td><td> test2 </td><td> test3 </td><td> test4</td><td> test5</td>

How can I extract all the data from these <td> elements?
DigitalMind
Forum Contributor
Posts: 152
Joined: Mon Sep 27, 2010 2:27 am
Location: Ukraine, Kharkov

Re: Regex to get records from a table

Post by DigitalMind »

<td.*?>(.*?)</td>
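
For anyone landing here later, a minimal sketch of how that pattern might be applied to the row from the question. The variable names, the `%` delimiters, and the `i`/`s` flags are assumptions, not part of the original reply; `strip_tags()` and `trim()` are only there to clean up the captured cell contents.

```php
<?php
// Sample row from the question, stored in a hypothetical variable.
$row = '<td>1</td><td height="20"><a class="BN" href="index.php?b_id=1990"> test </a></td>'
     . '<td> Test1 </td><td> test2 </td><td> test3 </td><td> test4</td><td> test5</td>';

// The suggested pattern, wrapped in % delimiters. The i and s flags are
// assumptions: case-insensitive tags, and cell content that may span lines.
$pattern = '%<td.*?>(.*?)</td>%is';

$cells = [];
if (preg_match_all($pattern, $row, $matches) > 0) {
    // $matches[1] holds the captured cell contents; drop inner tags and whitespace.
    $cells = array_map(
        function ($cell) { return trim(strip_tags($cell)); },
        $matches[1]
    );
}
print_r($cells); // 1, test, Test1, test2, test3, test4, test5
```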
klevis miho
Forum Contributor
Posts: 413
Joined: Wed Oct 29, 2008 2:59 pm
Location: Albania

Re: Regex to get records from a table

Post by klevis miho »

Thanks man
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: Regex to get records from a table

Post by ridgerunner »

That will work if your tables are not nested. If they are nested, however, you'll need something a bit more complex: a regex designed to match either the innermost or the outermost <td>...</td>. This subject was recently discussed with regard to tables as a whole. See: preg_replace produces mysteriously blank file
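
To illustrate the nesting problem, here is a small sketch that runs the simple pattern against a cell containing a nested table. The test string and variable names are made up for the example.

```php
<?php
// Hypothetical cell that contains a nested table.
$nested = '<td>outer <table><tr><td>inner</td></tr></table> tail</td>';

// Same simple pattern as above; delimiters and flags are assumptions.
preg_match_all('%<td.*?>(.*?)</td>%is', $nested, $m);

// The lazy (.*?) stops at the first </td>, which belongs to the inner cell,
// so the only capture is the truncated text: outer <table><tr><td>inner
print_r($m[1]);
```

The outer cell is cut off at the inner closing tag, and the inner cell is never reported on its own.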

That said, if you are dealing with tables that are nested, here is a script containing two commented regexes; one to match innermost TD tags, and another to match outermost TD tags:

Code:

<?php // File: NestedTds.php
$data = file_get_contents('NestedTablesTestData.html');

// regex to match innermost TDs which do NOT contain nested TDs
$pattern_innermost = '%
# Use: "unroll-the-loop" technique. i.e. "(normal* (special normal*)*)"
# from: "Mastering Regular Expressions - 3rd Edition" by Jeffrey Friedl
<td\b[^>]*+>         # Match opening TD tag having any attributes.
[^<]*+               # 1st (normal*) = match up to next < opening tag char.
(?:                  # Special "<" found. Begin (special normal*)* loop.
  (?! </?td\b )      # Begin (special). If < is not start of a TD tag,
  <                  # then safe to match the non-TD-tag <. End (special).
  [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
)*+                  # End of (special normal*)* loop.
</td>                # Match closing TD tag.
%ix';

if (preg_match_all($pattern_innermost, $data, $matches) > 0) {
	echo("Inner pattern matched. Here are the results:\r\n");
	print_r($matches);
}

// regex to match outermost TDs which may contain nested TDs
$pattern_outermost = '%
<td\b[^>]*+>           # Match opening TD tag.
(?:                    # Non-capture group for alternation.
  (?R)                 # Match a whole nested TD element,
|                      # or... match a bunch of non-TD-tag characters
  [^<]*+               # 1st (normal*) = match up to next < opening tag char.
  (?:                  # Special "<" found. Begin (special normal*)* loop.
    (?! </?td\b )      # Begin (special). If < is not start of a TD tag,
    <                  # then safe to match the non-TD-tag <. End (special).
    [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
  )*+                  # End of (special normal*)* loop.
)*+                    # loop as many as it takes until outer
</td>                  # balanced closing TD tag is matched.
%six';
 
if (preg_match_all($pattern_outermost, $data, $matches) > 0) {
	print_r($matches);
}
?>
These are a bit more complex because they implement the "unrolling-the-loop" efficiency technique described in Jeffrey Friedl's classic work, "Mastering Regular Expressions" (3rd Edition).
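
Since NestedTablesTestData.html is not attached, here is a rough sketch for trying both patterns on an inline string. The test string and the variable names `$inner_re`, `$outer_re`, `$inner`, and `$outer` are hypothetical; the patterns are condensed one-line forms of the two regexes in the script, with the comments and the `x` flag removed.

```php
<?php
// Hypothetical inline test data standing in for NestedTablesTestData.html.
$data = '<td>A <table><tr><td>B <b>x</b></td></tr></table> C</td>';

// Condensed forms of the two patterns (comments and the x flag removed).
$inner_re = '%<td\b[^>]*+>[^<]*+(?:(?!</?td\b)<[^<]*+)*+</td>%i';
$outer_re = '%<td\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?td\b)<[^<]*+)*+)*+</td>%is';

preg_match_all($inner_re, $data, $inner);
preg_match_all($outer_re, $data, $outer);

// Innermost: only the cell with no TD inside it, i.e. <td>B <b>x</b></td>
print_r($inner[0]);

// Outermost: the whole string, matched once as a balanced unit via (?R).
print_r($outer[0]);
```

The innermost pattern fails at the outer opening tag because it refuses to step over a nested `<td`, then succeeds at the inner one. The outermost pattern recurses into the inner cell and keeps going until the balanced closing tag.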

Hope this helps. :)