preg_replace produces mysteriously blank file

matthew.deangelis
Forum Newbie
Posts: 2
Joined: Wed Jul 28, 2010 4:38 pm


Post by matthew.deangelis »

Hi everyone,

I'm at my wit's end, so I hope someone out there can help me.

I am processing a batch of HTML files to remove things like HTML tags, tables, etc. in order to perform text analysis. I have done a number of replacements through preg_replace during this process, most of which have worked flawlessly. However, the code that I wrote to remove tables only works on 96% of my files; the remaining 4% of the files end up completely blank.

Here is my code for this procedure:

Code:

$tablepattern = '/<table[^>]*>(.*)<\/table[^>]*>/isU';
$blank = NULL;
$notables = preg_replace($tablepattern, $blank, $noheader); //noheader is a prior string that I am processing
If I remove the 's' option from $tablepattern, the files do not come out blank. However, since virtually all of my files contain tables with some blank lines, this leaves tables behind, so I do not want to remove this option (and don't see necessarily why I should).

In an effort to diagnose this problem myself, I have printed the preg_match_all to a file and reviewed the matches; all matches seem correct, and the resulting match file does NOT contain all text in the file. So, I decided to start limiting the number of replacements, and that's where things got really weird. The test file that I am using contains 317 matches to the above pattern. If I limit the replacements to below 316, the file does not come out blank, but 164 of the tables remain in the file. If I set the limit to 317 or above, the file gets blanked.

Any ideas out there from more experienced programmers? I have no idea why this is happening.


Regards,
Matt
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: preg_replace produces mysteriously blank file

Post by ridgerunner »

A couple comments on your regex. First, you are using the 'U' "ungreedy" modifier. This is bad style - you should instead explicitly apply the ? lazy modifier to any quantifier that you want to be lazy/ungreedy. Secondly (and this may be causing your problems), your replacement string is NULL, but the preg_replace() function requires either a string or an array of strings. If you want to replace a portion of a string with nothing, pass an empty string as the replacement argument, not the special NULL value. Here is your code with these two items fixed:

Code:

$tablepattern = '/<table[^>]*>(.*?)<\/table[^>]*>/is';
$blank = '';
$notables = preg_replace($tablepattern, $blank, $noheader); //noheader is a prior string that I am processing
if ($notables === NULL) {
	echo("Error! preg_replace() returned NULL!\n");
}
Normally preg_replace() returns the modified input string, but if it encounters an error, it returns NULL (see the preg_replace() documentation). I've added some debug code to detect this condition. (Under normal conditions, it should never return NULL.)
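When preg_replace() does return NULL, preg_last_error() reports which error the engine hit. The sketch below is only an illustration: the helper name strip_tables_checked() and the sample input are made up, and the failure is forced by disabling the JIT and lowering pcre.backtrack_limit. Nothing in this thread confirms that the backtrack limit is what blanked the original files; it is just one plausible cause for multi-megabyte input.

```php
<?php
// Hypothetical helper (not from the thread): run the table-stripping
// replacement and return both the result and PCRE's error code.
function strip_tables_checked(string $html): array
{
    $result = preg_replace('/<table[^>]*>.*?<\/table[^>]*>/is', '', $html);
    // preg_last_error() reports the status of the most recent preg_* call.
    return [$result, preg_last_error()];
}

// Illustrative input: one table with a long body between two paragraphs.
$html = '<p>before</p><table><tr><td>' . str_repeat('x', 20000)
      . '</td></tr></table><p>after</p>';

// Force a failure for demonstration: disable the JIT (if present) and
// lower the backtrack limit far below what the lazy .*? needs here.
ini_set('pcre.jit', '0');
$oldLimit = ini_set('pcre.backtrack_limit', '1000');
[$failed, $failedCode] = strip_tables_checked($html);

// Restore the previous limit and run the same replacement again.
ini_set('pcre.backtrack_limit', (string) $oldLimit);
[$ok, $okCode] = strip_tables_checked($html);

var_dump($failed);                        // NULL signals an engine error
var_dump($failedCode !== PREG_NO_ERROR);  // an error code was recorded
var_dump($ok);                            // the string with the table removed
```

Comparing the code against constants such as PREG_BACKTRACK_LIMIT_ERROR tells you whether raising a limit or rewriting the pattern is the better fix.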

This may correct your problem. If not, we'll need to see more of your code and more of your example text. How big is the string you are searching?
matthew.deangelis

Re: preg_replace produces mysteriously blank file

Post by matthew.deangelis »

Hi ridgerunner,

Thanks very much for your help. I especially appreciate your feedback on my code, since I am new to regular expressions and don't entirely understand how all of the options work. What is the specific issue with using 'U', as opposed to explicitly lazying quantifiers? Nonetheless, I changed my regex to yours.

I was glad to hear that my blank file was an expected behavior, so thanks for pointing out that I should not get a Null value unless the replace encounters an error. I had been wondering whether the string might be too long (the file is over 3MB in size) for the replace operation to handle, but my code had processed larger files, so I figured that it was not a problem. Looking at them again, though, these 4% of files contain more tables than the others, so that might explain it.

To try to resolve the length problem, I decided to split the string on </table> tags and run the regex on the elements in the resulting array. This works reasonably well; the error code that you provided is still tripped a few times, but most of the text remains intact, and the tables appear to have been stripped, in all files. I may continue to debug the few errors that I received, but I think my problem is solved!

Thanks again for your help, both in correcting my code and putting me on the right track.


Matt
ridgerunner

Re: preg_replace produces mysteriously blank file

Post by ridgerunner »

matthew.deangelis wrote:... What is the specific issue with using 'U', as opposed to explicitly lazying quantifiers? ...
The "U" modifier reverses the meaning of all quantifiers in an expression. When "U" is specified, all the quantifiers which are normally greedy become ungreedy (or lazy), and all the ones which are normally lazy become greedy. This is very confusing to the reader of the regex (who has grown accustomed to the standard default greedy syntax), especially when there is a mix of greedy and lazy quantifiers in the expression. Regexes are hard enough to read without the "U" modifier, but they become doubly hard to read when one has to reverse the meaning of all the quantifiers in one's head. Besides, there is never a need to specify the "U" modifier, because one can (and should) always explicitly mark a quantifier as lazy by adding a ? to it (quantifiers are greedy by default). For example, '/.*/' is greedy and '/.*?/' is lazy. In short, the "U" modifier has only one real effect: to confuse the reader!
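A quick way to see the reversal is to run the same two quantifiers with and without "U". This is a minimal sketch; the sample string and variable names are invented for illustration.

```php
<?php
$s = '<b>one</b> and <b>two</b>';

preg_match('/<b>.*<\/b>/',   $s, $greedy);   // plain .* is greedy
preg_match('/<b>.*?<\/b>/',  $s, $lazy);     // .*? is lazy
preg_match('/<b>.*<\/b>/U',  $s, $greedyU);  // with U, .* behaves lazily
preg_match('/<b>.*?<\/b>/U', $s, $lazyU);    // with U, .*? behaves greedily

echo $greedy[0], "\n";   // <b>one</b> and <b>two</b>
echo $lazy[0], "\n";     // <b>one</b>
echo $greedyU[0], "\n";  // <b>one</b>
echo $lazyU[0], "\n";    // <b>one</b> and <b>two</b>
```

The second and third patterns behave identically, which is why the explicit ? form is the clearer choice.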

When one has become expert at writing regular expressions, one learns precisely when to use lazy and greedy (and possessive) quantifiers to build an accurate and efficient expression. Lazy quantifiers are generally slow and should only be used when necessary. They are slow because they force the regex engine to backtrack on each and every iteration. Greedy quantifiers are fast and allow the engine to consume (or swallow) large spans of text in one gulp. Adding a possessive "+" modifier to a quantifier (when appropriate) can speed things up even more, and can reduce the regex engine's memory usage as well. These efficiency issues are an advanced topic, but to write a fast and accurate regex, one really needs to learn the details of how the regex engine works "under the hood". Fortunately there is an excellent book on the subject, which should be required reading for anyone needing to use regexes on a regular basis: Mastering Regular Expressions - 3rd Edition by Jeffrey Friedl. This work is nothing short of a masterpiece.
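To make "possessive" concrete: a possessive quantifier never gives back what it has consumed. The sketch below shows only that behavioral difference on a made-up subject string; it is not a benchmark and says nothing about how much time a possessive quantifier saves in a real pattern.

```php
<?php
$subject = '12345';

// Greedy: \d+ first takes all five digits, then backs off one digit
// so that the literal 5 can match.
$greedy = preg_match('/\d+5/', $subject);

// Possessive: \d++ takes all five digits and refuses to give any back,
// so the literal 5 has nothing left to match.
$possessive = preg_match('/\d++5/', $subject);

var_dump($greedy);      // int(1)
var_dump($possessive);  // int(0)
```

Because the engine does not have to remember positions to back up to, a possessive quantifier is only safe where backing off could never produce a match anyway.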

OK, let's get back to your problem at hand. There is another big problem with your original regex (and the modified version I provided above). The problem is that TABLE tags within an HTML file can be nested - one table can sit inside of another. Consider the following HTML test file containing nested tables:

Code:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><title>Test Nested Tables</title></head>
<body>
<table title="A">
  <tr><th>A1</th><th>A2</th></tr>
  <tr><td>
    <table title="B">
      <tr><th>B1</th><th>B2</th></tr>
      <tr><td>
        <table title="C">
          <tr><th>C1</th><th>C2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td><td>
        <table title="D">
          <tr><th>D1</th><th>D2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td></tr>
    </table>
  </td><td>
    <table title="E">
      <tr><th>E1</th><th>E2</th></tr>
      <tr><td>
        <table title="F">
          <tr><th>F1</th><th>F2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td><td>
        <table title="G">
          <tr><th>G1</th><th>G2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
<p>Stuff between the two main tables</p>
<table title="H">
  <tr><th>H1</th><th>H2</th></tr>
  <tr><td>
    <table title="I">
      <tr><th>I1</th><th>I2</th></tr>
      <tr><td>
        <table title="J">
          <tr><th>J1</th><th>J2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td><td>
        <table title="K">
          <tr><th>K1</th><th>K2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td></tr>
    </table>
  </td><td>
    <table title="L">
      <tr><th>L1</th><th>L2</th></tr>
      <tr><td>
        <table title="M">
          <tr><th>M1</th><th>M2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td><td>
        <table title="N">
          <tr><th>N1</th><th>N2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
</body>
</html>
To get a visual idea of how the tables nest, here is what this page looks like when rendered:
[Image: the rendered test page, showing the nested tables]

Here is the sub-string that the above regex (erroneously) matches:

Code:

<table title="A">
  <tr><th>A1</th><th>A2</th></tr>
  <tr><td>
    <table title="B">
      <tr><th>B1</th><th>B2</th></tr>
      <tr><td>
        <table title="C">
          <tr><th>C1</th><th>C2</th></tr>
          <tr><td>1</td><td>2</td></tr>
        </table>
As you can see, the regex has no notion of the nesting. It is simply looking for the first closing </table> that it can find following the opening <table> tag. It thus matches incorrectly when the input text has nested tags.
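The same effect can be reproduced on a much smaller input. In this sketch the two-level HTML string is made up for illustration, and the pattern is the lazy version of the table regex discussed in this thread.

```php
<?php
// Made-up two-level example: table B is nested inside table A.
$html = '<table title="A"><tr><td>'
      . '<table title="B"><tr><td>1</td></tr></table>'
      . '</td></tr></table>';

$naive = '/<table[^>]*>(.*?)<\/table[^>]*>/is';

// The match starts at table A but stops at table B's closing tag.
preg_match($naive, $html, $m);
echo $m[0], "\n";

// Replacing that match leaves the tail of table A behind.
$leftover = preg_replace($naive, '', $html);
echo $leftover, "\n";  // </td></tr></table>
```

The stray closing tags left in the output are the kind of debris this regex leaves whenever tables nest.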

This problem brings up a rather touchy topic in the world of regular expressions: how to handle nested structures. Many would argue that this is not a job for regular expressions. However, I am of the belief that nested structures can be handled quite nicely using PHP. This is because PHP uses the very powerful PCRE regex engine, which implements recursive sub-expressions. However, proper use of recursive regular expressions is a very complex task, not for the faint of heart! The PHP manual page on recursive patterns discusses this feature, and the previously mentioned MRE3 book covers the topic in depth.
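Before tackling tables, it may help to see recursion on the classic small case: balanced parentheses. This is a minimal sketch with a made-up subject string; (?R) means "match the whole pattern again at this point".

```php
<?php
// An opening paren, then any mix of non-paren runs or nested
// balanced groups (via the recursive (?R)), then a closing paren.
$pattern = '/\((?:[^()]++|(?R))*\)/';
$subject = 'a (b (c) d) e (f)';

preg_match_all($pattern, $subject, $m);
print_r($m[0]);  // the two outermost groups: "(b (c) d)" and "(f)"
```

The inner "(c)" is not reported on its own, because it was consumed by the recursion while matching the outer group.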

To solve the erroneous matching, one can craft two different regexes. The first matches innermost tables: tables that do not themselves contain any nested tables. The second matches outermost tables: tables that may contain nested tables, but are not themselves contained within another table. The following script contains two such regular expressions. The second regex, which matches outermost tables, uses the recursive (?R) expression. These regexes are quite complex, but they are also fully commented.

Code:

<?php // File: NestedTables.php
$data = file_get_contents('NestedTablesTestData.html');

// regex to match innermost TABLEs which do NOT contain nested TABLEs
$pattern_innermost = '%
# Use: "unroll-the-loop" technique. i.e. "(normal* (special normal*)*)"
# from: "Mastering Regular Expressions - 3rd Edition" by Jeffrey Friedl
<table\b[^>]*+>      # Match opening TABLE tag having any attributes.
[^<]*+               # 1st (normal*) = match up to next < opening tag char.
(?:                  # Special "<" found. Begin (special normal*)* loop.
  (?! </?table\b )   # Begin (special). If < is not start of a TABLE tag,
  <                  # then safe to match the non-TABLE-tag <. End (special).
  [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
)*+                  # End of (special normal*)* loop.
</table>             # Match closing TABLE tag.
%ix';

if (preg_match_all($pattern_innermost, $data, $matches) > 0) {
	echo("Inner pattern matched. Here are the results:\r\n");
	print_r($matches);
}

// regex to match outermost TABLEs which may contain nested TABLEs
$data = file_get_contents('NestedTablesTestData.html');
$pattern_outermost = '%
<table\b[^>]*+>        # Match opening TABLE tag.
(?:                    # Non-capture group for alternation.
  (?R)                 # Match a whole nested TABLE element,
|                      # or... match a bunch of non-TABLE-tag characters
  [^<]*+               # 1st (normal*) = match up to next < opening tag char.
  (?:                  # Special "<" found. Begin (special normal*)* loop.
    (?! </?table\b )   # Begin (special). If < is not start of a TABLE tag,
    <                  # then safe to match the non-TABLE-tag <. End (special).
    [^<]*+             # 2nd (normal*) = match up to next < opening tag char.
  )*+                  # End of (special normal*)* loop.
)*+                    # loop as many as it takes until outer
</table>               # balanced closing TABLE tag is matched.
%six';
 
if (preg_match_all($pattern_outermost, $data, $matches) > 0) {
	echo("Outer pattern matched. Here are the results:\r\n");
	print_r($matches);
}

?>
When you run this script from the command line, you will see that you can match either innermost or outermost nested table tags. This may seem a little overwhelming (and it is), but you opened a can of worms when you asked about matching HTML tags. It's certainly not a trivial endeavor!
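For the original goal of stripping tables rather than listing them, the outermost pattern can be handed to preg_replace(). The sketch below uses a compact, uncommented form of that pattern and a made-up input string. It assumes reasonably well-formed HTML; an unclosed table will simply not match and will be left in place.

```php
<?php
// Compact form of the outermost-table pattern from the script above.
$outermost = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';

// Made-up input: table B nested in table A, between two paragraphs.
$html = '<p>intro</p>'
      . '<table title="A"><tr><td>'
      . '<table title="B"><tr><td>1</td></tr></table>'
      . '</td></tr></table>'
      . '<p>outro</p>';

$clean = preg_replace($outermost, '', $html);

if ($clean === null) {
    // NULL means the engine gave up; report the error code.
    echo 'preg_replace() failed with error code ', preg_last_error(), "\n";
} else {
    echo $clean, "\n";  // <p>intro</p><p>outro</p>
}
```

Very deep nesting or very large inputs can still hit the engine's recursion or backtrack limits, so the NULL check is worth keeping.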

I hope this helps!
:)

p.s. Beware - regular expressions can become addicting!