matthew.deangelis wrote:... What is the specific issue with using 'U', as opposed to explicitly lazying quantifiers? ...
The "U" modifier reverses the meaning of all quantifiers in an expression. When "U" is specified, all the quantifiers which are normally greedy become ungreedy (or lazy), and all the ones which are normally lazy become greedy. This is very confusing to the reader of the regex (who has grown accustomed to the standard greedy-by-default syntax), especially when there is a mix of greedy and lazy quantifiers in the expression. Regexes are hard enough to read without the "U" modifier, but they become doubly hard to read when one has to reverse the meaning of all the quantifiers in one's head. Besides, there is never, ever a need to specify the "U" modifier, because one can (and should) always explicitly mark a quantifier as lazy by adding a ? to it (e.g. '/.*/' is greedy and '/.*?/' is lazy). In short, the "U" modifier has only one real effect: to confuse the reader!
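To make the reversal concrete, here is a minimal sketch. The sample text and variable names are made up for illustration; the only thing being shown is how the same two quantifiers behave with and without "U".

```php
<?php
// Hypothetical sample text, chosen only to illustrate the point.
$text = '<b>one</b> and <b>two</b>';

// Default behaviour: ".*" is greedy, ".*?" is lazy.
preg_match('%<b>.*</b>%',  $text, $greedy);
preg_match('%<b>.*?</b>%', $text, $lazy);

// With "U" the very same quantifiers swap meaning.
preg_match('%<b>.*</b>%U',  $text, $greedy_with_u); // now behaves lazily
preg_match('%<b>.*?</b>%U', $text, $lazy_with_u);   // now behaves greedily

echo $greedy[0], "\n";        // <b>one</b> and <b>two</b>
echo $lazy[0], "\n";          // <b>one</b>
echo $greedy_with_u[0], "\n"; // <b>one</b>
echo $lazy_with_u[0], "\n";   // <b>one</b> and <b>two</b>
```

A reader who sees '.*?' in the last pattern has to notice the trailing "U" to realize it is actually greedy, which is exactly the readability problem.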
When one has become expert at writing regular expressions, one learns precisely when to use lazy and greedy (and possessive) quantifiers to build an accurate and efficient expression. Lazy quantifiers are generally slow and should only be used when necessary. They are slow because, after each and every iteration, the regex engine has to stop and try the rest of the pattern, then backtrack to take one more repetition whenever that attempt fails. Greedy quantifiers are fast and allow the engine to consume (or swallow) large spans of text in one gulp. Adding a possessive "+" modifier to a quantifier (when appropriate) can speed things up even more, and can reduce the regex engine's memory usage as well. These efficiency issues are an advanced topic, but to write a fast and accurate regex, one really needs to learn the details of how the regex engine works "under the hood". Fortunately there is an excellent book on the subject, which should be required reading for anyone needing to use regexes on a regular basis:
Mastering Regular Expressions - 3rd Edition by Jeffrey Friedl. This work is nothing short of a masterpiece.
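As a small illustration of what "possessive" means in practice (behaviour only, not a benchmark), here is a sketch with made-up inputs. A possessive quantifier never gives back what it has matched, so it is only appropriate where nothing it swallows could be needed by the rest of the pattern.

```php
<?php
// Hypothetical inputs for illustration.
$digits = '12345';

// Greedy: \d+ first takes "12345", then gives back the final "5" so the literal 5 can match.
$greedy_ok = preg_match('/^\d+5$/', $digits);

// Possessive: \d++ takes "12345" and refuses to give anything back, so the literal 5 fails.
$possessive_fail = preg_match('/^\d++5$/', $digits);

// An appropriate use: [^"] can never match the closing quote, so nothing
// useful is lost by forbidding backtracking into the run of characters.
$quoted = 'say "hello" now';
preg_match('/"[^"]*+"/', $quoted, $m);

var_dump($greedy_ok);       // int(1)
var_dump($possessive_fail); // int(0)
echo $m[0], "\n";           // "hello"
```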
OK, let's get back to your problem at hand. There is another big problem with your original regex (and the modified version I provided above). The problem is that TABLE tags within an HTML file can be nested: one table can sit inside another. Consider the following HTML test file containing nested tables:
Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Test Nested Tables</title></head>
<body>
<table title="A">
<tr><th>A1</th><th>A2</th></tr>
<tr><td>
<table title="B">
<tr><th>B1</th><th>B2</th></tr>
<tr><td>
<table title="C">
<tr><th>C1</th><th>C2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td><td>
<table title="D">
<tr><th>D1</th><th>D2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td></tr>
</table>
</td><td>
<table title="E">
<tr><th>E1</th><th>E2</th></tr>
<tr><td>
<table title="F">
<tr><th>F1</th><th>F2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td><td>
<table title="G">
<tr><th>G1</th><th>G2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
<p>Stuff between the two main tables</p>
<table title="H">
<tr><th>H1</th><th>H2</th></tr>
<tr><td>
<table title="I">
<tr><th>I1</th><th>I2</th></tr>
<tr><td>
<table title="J">
<tr><th>J1</th><th>J2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td><td>
<table title="K">
<tr><th>K1</th><th>K2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td></tr>
</table>
</td><td>
<table title="L">
<tr><th>L1</th><th>L2</th></tr>
<tr><td>
<table title="M">
<tr><th>M1</th><th>M2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td><td>
<table title="N">
<tr><th>N1</th><th>N2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
</body>
</html>
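To get a visual idea of how the tables nest: when rendered, this page shows two outermost tables, A and H, with a paragraph between them. A contains B and E, which in turn contain C, D and F, G. H contains I and L, which in turn contain J, K and M, N.
Here is the substring that the above regex (erroneously) matches: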
Code:
<table title="A">
<tr><th>A1</th><th>A2</th></tr>
<tr><td>
<table title="B">
<tr><th>B1</th><th>B2</th></tr>
<tr><td>
<table title="C">
<tr><th>C1</th><th>C2</th></tr>
<tr><td>1</td><td>2</td></tr>
</table>
As you can see, the regex has no notion of the nesting. It is simply looking for the first closing </table> that it can find following the opening <table> tag. It thus matches incorrectly when the input text has nested tags.
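The same failure can be reproduced on a much smaller input. Note that the pattern below is only an assumed stand-in with the general shape "opening tag, lazy dot-star, closing tag"; the exact regex from the earlier posts is not repeated here, and the sample HTML is made up.

```php
<?php
// Hypothetical stand-in for the earlier regex: lazy dot-star up to the first </table>.
$naive = '%<table\b[^>]*>.*?</table>%si';

// Made-up sample: table B is nested inside table A.
$html = '<table title="A"><tr><td>'
      . '<table title="B"><tr><td>1</td></tr></table>'
      . '</td></tr></table>';

preg_match($naive, $html, $m);
echo $m[0], "\n";
// The match ends at table B's closing tag, so table A is cut short:
// <table title="A"><tr><td><table title="B"><tr><td>1</td></tr></table>
```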
This problem brings up a rather touchy topic in the world of regular expressions: how to handle nested structures. Many would argue that this is not a job for regular expressions. However, I am of the belief that nested structures can be handled quite nicely using PHP. This is because PHP uses the very powerful PCRE regex engine, which implements recursive sub-expressions. However, proper use of recursive regular expressions is a very complex task, not for the faint of heart! Here is the PHP manual page that discusses this feature: PHP Recursive patterns. The previously mentioned MRE3 book covers this topic in depth.
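Before tackling tables, it may help to see recursion on the classic toy problem of balanced parentheses. This is just a minimal sketch with a made-up subject string; (?R) means "match the whole pattern again at this point".

```php
<?php
// Match one parenthesized group, however deeply it nests.
$balanced = '/
    \(              # an opening parenthesis
    (?:             # then any number of:
        [^()]++     #   a run of non-parenthesis characters,
      | (?R)        #   or a complete nested (...) group (recursion)
    )*+
    \)              # and finally the balancing closing parenthesis
    /x';

// Hypothetical subject for illustration.
$subject = 'f(a, g(b, c), d) + h(e)';

preg_match_all($balanced, $subject, $matches);
print_r($matches[0]);
// Two matches: "(a, g(b, c), d)" and "(e)"
```

The inner "(b, c)" does not show up as a separate match because it was consumed as part of the first, outer match.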
To solve the erroneous matching, one can craft two different regexes. The first matches innermost tables: tables that do not themselves contain any nested tables. The second matches outermost tables: tables that may contain nested tables, but are not themselves contained within another table. The following script contains two such regular expressions. The second regex, which matches outermost tables, uses the recursive (?R) expression. These regexes are quite complex, but they are also fully commented.
Code:
<?php // File: NestedTables.php
$data = file_get_contents('NestedTablesTestData.html');
if ($data === false) {
    exit("Could not read NestedTablesTestData.html\r\n");
}

// regex to match innermost TABLEs which do NOT contain nested TABLEs
$pattern_innermost = '%
    # Use: "unroll-the-loop" technique. i.e. "(normal* (special normal*)*)"
    # from: "Mastering Regular Expressions - 3rd Edition" by Jeffrey Friedl
    <table\b[^>]*+>         # Match opening TABLE tag having any attributes.
    [^<]*+                  # 1st (normal*) = match up to next < opening tag char.
    (?:                     # Special "<" found. Begin (special normal*)* loop.
      (?! </?table\b )      #   Begin (special). If < is not start of a TABLE tag,
      <                     #   then safe to match the non-TABLE-tag <. End (special).
      [^<]*+                #   2nd (normal*) = match up to next < opening tag char.
    )*+                     # End of (special normal*)* loop.
    </table>                # Match closing TABLE tag.
    %ix';
if (preg_match_all($pattern_innermost, $data, $matches) > 0) {
    echo("Inner pattern matched. Here are the results:\r\n");
    print_r($matches);
}

// regex to match outermost TABLEs which may contain nested TABLEs
// (the same $data is reused; there is no need to read the file again)
$pattern_outermost = '%
    <table\b[^>]*+>         # Match opening TABLE tag.
    (?:                     # Non-capture group for alternation.
      (?R)                  #   Match a whole nested TABLE element,
    |                       #   or... match a bunch of non-TABLE-tag characters
      [^<]*+                #   1st (normal*) = match up to next < opening tag char.
      (?:                   #   Special "<" found. Begin (special normal*)* loop.
        (?! </?table\b )    #     Begin (special). If < is not start of a TABLE tag,
        <                   #     then safe to match the non-TABLE-tag <. End (special).
        [^<]*+              #     2nd (normal*) = match up to next < opening tag char.
      )*+                   #   End of (special normal*)* loop.
    )*+                     # loop as many as it takes until outer
    </table>                # balanced closing TABLE tag is matched.
    %six';
if (preg_match_all($pattern_outermost, $data, $matches) > 0) {
    echo("Outer pattern matched. Here are the results:\r\n");
    print_r($matches);
}
?>
When you run this script from the command line, you will see that you can match either innermost or outermost nested table tags. This may seem a little overwhelming (and it is), but you opened a can of worms by asking about matching HTML tags. It's certainly not a trivial endeavor!
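If you want a quick sanity check without the test file, here is a condensed sketch. The two patterns are the same ones as in the script, just with the comments and free-spacing stripped; the inline sample HTML and the helper function name are made up for illustration.

```php
<?php
// Condensed copies of the two patterns from the script above.
$innermost = '%<table\b[^>]*+>[^<]*+(?:(?!</?table\b)<[^<]*+)*+</table>%i';
$outermost = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';

// Made-up sample: A contains B and C; D stands alone.
$html = '<table title="A"><tr>'
      . '<td><table title="B"><tr><td>1</td></tr></table></td>'
      . '<td><table title="C"><tr><td>2</td></tr></table></td>'
      . '</tr></table>'
      . '<p>between</p>'
      . '<table title="D"><tr><td>3</td></tr></table>';

// Hypothetical helper: list the title attribute of each matched element's opening tag.
function matched_titles(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $found);
    $titles = [];
    foreach ($found[0] as $element) {
        if (preg_match('%^<table\b[^>]*\btitle="([^"]*)"%i', $element, $t)) {
            $titles[] = $t[1];
        }
    }
    return $titles;
}

$inner_titles = matched_titles($innermost, $html);
$outer_titles = matched_titles($outermost, $html);

echo 'innermost: ', implode(', ', $inner_titles), "\n"; // innermost: B, C, D
echo 'outermost: ', implode(', ', $outer_titles), "\n"; // outermost: A, D
```

Note that D counts as both innermost and outermost, since it neither contains a table nor sits inside one.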
I hope this helps!
P.S. Beware: regular expressions can become addictive!