To do this right is not a trivial problem!
There are several problems with your regex. First, as written, the .* (dot-star) in your regex only matches up to the next end of line, which is not what you want (by default the dot does NOT match a newline character). This is where the 's' "single line" modifier is needed. (See this page for a description of all the available PHP modifier flags.) It is also known as the dot-matches-all flag. Without the 's' modifier, the .* stops at the first linefeed, but with the 's' modifier, the dot truly matches anything, including linefeeds, and will thus match all the way to the end of the string. Which brings us to...
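To make the effect of the 's' modifier concrete, here is a minimal sketch. The sample string and variable names are made up for illustration; only preg_match and the 's' modifier come from PHP itself.

```php
<?php
// Hypothetical sample: two lines separated by a single linefeed.
$text = "first line\nsecond line";

// Without 's', the dot refuses to cross the linefeed.
preg_match('/first.*/', $text, $m_default);

// With 's', the dot matches the linefeed too, so .* runs to the end of the string.
preg_match('/first.*/s', $text, $m_dotall);

// $m_default[0] holds only the first line.
// $m_dotall[0] holds both lines, linefeed included.
```

Comparing $m_default[0] with $m_dotall[0] shows exactly where the dot stops in each mode.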
The second problem: the .* dot-star expression is a very big hammer that is rarely needed or warranted. And your regex has two of them (which can easily lead to catastrophic backtracking, a place you do not want to go!). As amargharat alluded to, the .* is too greedy and eats all the characters all the way to the end of the string (which will erroneously go past any and all other tables and everything else as it goes!). Let's take a closer look at just the first part of your regex (with the 's' modifier added), which is trying to match an opening table tag:
Code: Select all
$pattern[] = '/<table(.*)>/s';
What this regex is saying in English is:
First match "<table" literally, then greedily match (and capture into group $1) zero or more of anything all the way to the end of the string, and then give back one char at a time until you can match a literal ">" and then stop. In the following example, that means the match starts at "<table" and runs through the last ">" in the string, the one that closes "</em>":
$html = 'stuff before table
<table id="table1"><tr><td>AA</td></tr></table> <em>this is emphasized</em> stuff at the end';
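To check the overshoot directly, this sketch runs the greedy opening-tag pattern against that same string. The pattern literal is reconstructed from the plain-English description above, so treat it as an assumption; $m and $matched are just illustrative names.

```php
<?php
$html = 'stuff before table
<table id="table1"><tr><td>AA</td></tr></table> <em>this is emphasized</em> stuff at the end';

// Greedy dot-star with the 's' modifier (pattern reconstructed from the description).
preg_match('/<table(.*)>/s', $html, $m);

// The whole match: it starts at "<table" and ends at the LAST ">" in the string,
// which is the one closing "</em>", well past the end of the table.
$matched = $m[0];

// Group 1 is everything between "<table" and that final ">".
$captured = $m[1];
```

Here $matched swallows the entire table plus the emphasized text that follows it, not just the opening tag.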
As you can see, this is clearly not what you want! As amargharat suggested, you can use the non-greedy, or lazy, version of the dot-star, which looks like this: ".*?" (Note that the ? has a special meaning when it follows any quantifier; e.g. X??, X*?, X+?, X{1,9}?.) The lazy-dot-star does not immediately grab everything up to the end of the string (like the greedy version), but rather does just the opposite; it is lazy, so it tries to match as few chars as possible before trying to match what follows the quantifier. So adding the lazy modifier to your original regex fixes some of the problems:
Code: Select all
$pattern[] = '/<table(.*?)>(.*?)<\/table>/si';
This is very close to what amargharat posted (however, that regex was also missing the needed 's' modifier). I've also added the 'i' ignore-case modifier, since you probably want to remove uppercase tables too. This regex now accurately matches a (non-nested) table, but note that it may still be subject to catastrophic backtracking if it is fed malformed HTML markup.
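As a quick sanity check, here is a sketch that applies the lazy pattern to the earlier sample string. The variable names ($lazy, $m, $stripped) are only for illustration.

```php
<?php
$html = 'stuff before table
<table id="table1"><tr><td>AA</td></tr></table> <em>this is emphasized</em> stuff at the end';

$lazy = '/<table(.*?)>(.*?)<\/table>/si';

// Each lazy dot-star stops at the first place the rest of the pattern can match.
preg_match($lazy, $html, $m);
// $m[0] is the whole table, $m[1] the attributes, $m[2] the table contents.

// Removing the match leaves the surrounding text untouched.
$stripped = preg_replace($lazy, '', $html);
```

On this (non-nested) input the match ends at the first "</table>", so the emphasized text after the table survives.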
The third problem (and this is a *VERY* big one): your regex will not properly handle nested tables. There was another thread here recently where someone was also trying to remove tables but was experiencing some very strange problems: "preg_replace produces mysteriously blank file". In that thread I provided a detailed analysis of the problem and the necessary solution (which is NOT trivial). To make a long story short, here is a code snippet that simply does what you want:
Code to remove tables from an HTML file:
Code: Select all
$pattern[] = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';
$replace[] = '';
$filtered_html = preg_replace($pattern, $replace, $html);
This uses
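To see the difference on nested markup, this sketch runs both the lazy pattern and the recursive pattern over a small made-up input. The $nested string and the variable names are hypothetical; the two patterns are the ones given above.

```php
<?php
// Hypothetical input: one table nested inside another.
$nested = 'before <table><tr><td><table><tr><td>inner</td></tr></table></td></tr></table> after';

$lazy      = '/<table(.*?)>(.*?)<\/table>/si';
$recursive = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';

// The lazy version stops at the FIRST "</table>", which belongs to the inner table,
// so the tail of the outer table is left behind in the output.
$lazy_result = preg_replace($lazy, '', $nested);

// The recursive version re-enters the whole pattern via (?R) whenever it meets
// a nested "<table", so opening and closing tags are paired up correctly.
$recursive_result = preg_replace($recursive, '', $nested);
```

With this input, $lazy_result still contains the orphaned "</td></tr></table>", while $recursive_result has the whole nested structure removed.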
Hope this helps!
