remove html tables

Posted: Tue Aug 24, 2010 6:15 am
by coletek
Why doesn't the following remove all table data from html:

Code:

$pattern[] = '/<table(.*)>(.*)<\\/table>/';
$replace[] = '';

$filtered_html = preg_replace($pattern, $replace, $html);

Shouldn't this match <table*>*</table>?

Re: remove html tables

Posted: Wed Aug 25, 2010 4:43 am
by amargharat
Use the following code:

Code:

$pattern[] = '/<table(.*?)>(.*?)<\\/table>/';
$replace[] = '';

$filtered_html = preg_replace($pattern, $replace, $html);

Re: remove html tables

Posted: Wed Aug 25, 2010 5:04 am
by coletek
Why doesn't (.*) work, then? As I understand it:
. matches any character
* specifies that the previous character can be matched zero or more times.
? specifies that the previous character can be matched once or zero times.

So wouldn't (.*) suffice? (.*?) makes no sense to me: does it match any char zero or more times, and then one or zero times? Can you please explain how (.*?) works compared to (.*)?

thx.

Re: remove html tables

Posted: Wed Aug 25, 2010 5:43 am
by amargharat

Re: remove html tables

Posted: Fri Aug 27, 2010 9:08 pm
by ridgerunner
To do this right is not a trivial problem!

There are several problems with your regex. First, as written, the .* (dot-star) in your regex only matches up to the next end of line, which is not what you want (by default, the dot does NOT match a newline character). This is where the 's' "single line" modifier is needed; it is also known as the "dot matches all" flag. (The PHP manual describes all the available pattern modifiers.) Without the 's' modifier, the .* stops at the first linefeed, but with the 's' modifier, the dot truly matches anything, including linefeeds, and will thus match all the way to the end of the string. Which brings us to...
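A minimal sketch of the difference the 's' modifier makes. The sample string and variable names here are made up for illustration and are not from the original post:

```php
<?php
// Hypothetical sample input: a table whose markup spans several lines.
$html = "before <table>\n<tr><td>AA</td></tr>\n</table> after";

// Without 's', the dot cannot cross "\n", so the pattern never reaches
// </table>, nothing matches, and preg_replace() returns the input unchanged.
$without_s = preg_replace('/<table.*<\/table>/', '', $html);

// With 's', the dot also matches "\n", so the whole table is removed.
$with_s = preg_replace('/<table.*<\/table>/s', '', $html);

var_dump($without_s === $html); // bool(true)
var_dump($with_s === "before  after"); // bool(true)
```

Note that this sample contains only one table, so the greediness of .* does no harm here.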

The second problem: the .* (dot-star) expression is a very big hammer that is rarely needed or warranted. And your regex has two of them (which can easily lead to catastrophic backtracking - a place you do not want to go!). As amargharat alluded to, the .* is too greedy and eats all the characters all the way to the end of the string (which will erroneously go past any and all other tables and everything else as it goes!). Let's take a closer look at just the first part of your regex (with the 's' modifier added), which is trying to match an opening table tag:

Code:

$regex = '/<table(.*)>/s';

What this regex is saying in English is: first match "<table" literally, then greedily match (and capture into group $1) zero or more of anything all the way to the end of the string, and then give back one char at a time until you can match a literal ">", and then stop. In the following example, this regex actually matches everything from "<table" through the closing ">" of "</em>", which is the last ">" in the string:

$html = 'stuff before table <table id="table1"><tr><td>AA</td></tr></table> <em>this is emphasized</em> stuff at the end';

As you can see, this is clearly not what you want! As amargharat suggested, you can use the non-greedy, or lazy, version of the dot-star, which looks like this: ".*?" (Note that the ? has a special meaning when it follows any quantifier, e.g. X??, X*?, X+?, X{1,9}?.) The lazy dot-star does not immediately grab everything up to the end of the string (like the greedy version), but rather does just the opposite: it is lazy, so it tries to match as few chars as possible before trying to match what follows the quantifier. So adding the lazy modifier to your original regex fixes some of the problems:

Code:

$pattern[] = '/<table(.*?)>(.*?)<\/table>/si';

This is very close to what amargharat posted (however, that regex was also missing the needed 's' modifier). I've also added the 'i' ignore-case modifier, since you probably want to remove uppercase tables too. This regex now accurately matches a (non-nested) table, but note that it may still be subject to catastrophic backtracking if it is fed malformed HTML markup.
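To see the greedy and lazy behaviour side by side, here is a small sketch. The first string is the example from this post; the second string ($two_tables) and all variable names are made up for illustration:

```php
<?php
// The example string from this post.
$html = 'stuff before table <table id="table1"><tr><td>AA</td></tr></table> '
      . '<em>this is emphasized</em> stuff at the end';

// Opening tag only: greedy runs to the LAST ">", lazy stops at the FIRST ">".
preg_match('/<table(.*)>/s', $html, $greedy_tag);
preg_match('/<table(.*?)>/s', $html, $lazy_tag);
var_dump($greedy_tag[0]); // '<table id="table1"> ... <em>this is emphasized</em>'
var_dump($lazy_tag[0]);   // '<table id="table1">'

// Hypothetical input with two separate (non-nested) tables.
$two_tables = 'A <table id="t1"><tr><td>1</td></tr></table> B '
            . '<TABLE id="t2"><tr><td>2</td></tr></TABLE> C';

// Greedy: one match from the first <table to the last </TABLE>, so " B " is lost too.
$greedy = preg_replace('/<table(.*)>(.*)<\/table>/si', '', $two_tables);

// Lazy: each table is matched and removed on its own.
$lazy = preg_replace('/<table(.*?)>(.*?)<\/table>/si', '', $two_tables);

var_dump($greedy); // "A  C"
var_dump($lazy);   // "A  B  C"
```

The lazy version only behaves this well because the tables in the sample are not nested.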

The third problem, (and this is a *VERY* big one), your regex will not properly handle nested tables. There was another thread here recently where someone was also trying to remove tables but was experiencing some very strange problems: "preg_replace produces mysteriously blank file". In that thread I provided a detailed analysis of the problem and provided the necessary solution (which is NOT trivial). To summarize and make a long story short, let me provide a code snippet that simply does what you want:

Code to remove tables from an HTML file:

Code:

$pattern[] = '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i';
$replace[] = '';
$filtered_html = preg_replace($pattern, $replace, $html);
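This uses recursion, via (?R), to match nested tables, and possessive quantifiers (*+) to limit backtracking.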
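As a rough check on nested input, the sketch below runs the lazy pattern and the recursive pattern over the same string. The sample markup and variable names are made up for illustration, and the null check is my own addition (preg_replace() returns null when PCRE reports an error, for example when a backtrack limit is hit):

```php
<?php
// Hypothetical input: one table nested inside another, followed by a second table.
$html = 'A <table><tr><td><table><tr><td>in</td></tr></table></td></tr></table>'
      . ' B <table><tr><td>2</td></tr></table> C';

// Lazy pattern: stops at the FIRST </table>, so the tail of the outer
// table ("</td></tr></table>") is left behind in the output.
$lazy = preg_replace('/<table(.*?)>(.*?)<\/table>/si', '', $html);

// Recursive pattern from this post: (?R) re-enters the whole pattern
// whenever another opening <table ...> tag is found inside a table.
$recursive = preg_replace(
    '%<table\b[^>]*+>(?:(?R)|[^<]*+(?:(?!</?table\b)<[^<]*+)*+)*+</table>%i',
    '',
    $html
);

// Report a PCRE failure instead of silently using a null result.
if ($recursive === null) {
    echo 'PCRE error code: ', preg_last_error(), "\n";
}

var_dump($lazy);      // "A </td></tr></table> B  C"
var_dump($recursive); // "A  B  C"
```

Checking for null before writing the result anywhere is cheap insurance against ending up with an unexpectedly empty output.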

Hope this helps!
:)

Re: remove html tables

Posted: Sat Aug 28, 2010 12:38 am
by coletek
Awesome, thx for that. I had also run into the issue of nested tables, which I solved in another way, but I will try a better regex like yours soon.

BTW: I thought .. matches anything except a newline character, and . matches anything. But when I played with . I noticed it only matched up to a newline character. I ended up just using str_replace("\n", " ", $html), but the 's' modifier would be much better.

Thx heaps.

Re: remove html tables

Posted: Sat Aug 28, 2010 8:59 am
by ridgerunner
I'm glad to be of help. Where did you learn regular expressions? It sounds like you need to brush up on the fundamentals. I would recommend the tutorial at http://www.regular-expressions.info/

Re: remove html tables

Posted: Sat Aug 28, 2010 9:57 am
by coletek
A few websites, a long time ago. I just got back into it using some notes I made some time ago. Indeed, I need to brush up on the basics - will do. Thx for the link.