HTML Attribute Remover

Posted: Mon Aug 04, 2008 12:48 pm
by vchris
I'm working on an html attribute remover. I want it to be able to remove all except colspan and rowspan. Here's what I got:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?.*"?|\svalign="?.*"?|\sstyle=['|"](.|\s)*['|"]|\snowrap|\sclass="?.*"?|\slang="?.*"?|\sxml:lang="?.*"?

I have a couple bugs with it.

First, when it comes upon something like this <p class=SomeClass><span lang=EN-US>text</span></p>, it leaves only <p . I would probably need something to tell it to stop at the first >. Second, my HTML comes from Word, so I have lots of long style attributes, and some span more than one line. My regex only removes the first line of the style attribute.

Thanks.
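
A minimal sketch of the first bug, assuming PCRE through PHP's preg_replace (the post does not say which tool runs the regex). The `.*` in `class="?.*"?` is greedy and runs to the end of the line, so everything after the attribute name is removed. Restricting an unquoted value to `[^\s>]*` stops at the first whitespace or `>`. The function name is hypothetical and only two attributes are handled to keep the example short.

```php
<?php
// Hypothetical helper: strips class and lang attributes only, to keep the example small.
function strip_class_and_lang(string $html): string
{
    // "[^"]*" matches a double-quoted value; [^\s>]* matches an unquoted value
    // and stops at the first whitespace or '>' instead of running to end of line.
    $pattern = '/\s(?:class|lang)=(?:"[^"]*"|[^\s>]*)/i';
    return preg_replace($pattern, '', $html);
}

$input = '<p class=SomeClass><span lang=EN-US>text</span></p>';
echo strip_class_and_lang($input), "\n";
// prints: <p><span>text</span></p>
```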

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 1:39 pm
by prometheuzz
Hi, could you post an SSCCE* with some example input and explain what the desired output should be?

* http://sscce.org/

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 2:13 pm
by vchris
I was able to figure out most of the problems. Now the only problem I have is it freezes my editor and web-based regex testers. It freezes as soon as I type in </b></p> before the closing </td>. It seems to have something to do with the style regex.

Regex:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?[\w-_]*"?|\svalign="?[\w-_]*"?|\sstyle=('|")(.*|\n)*('|")|\snowrap|\sclass="?[\w-_]*"?|\slang="?[\w-_]*"?|\sxml:lang="?[\w-_]*"?

Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
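
The freeze is most likely caused by the style branch: `(.*|\n)*` nests one repetition inside another, and because `.*` is greedy it also runs past the closing quote, so the engine has a very large number of ways to retry the match. A sketch of one alternative, assuming PCRE through PHP's preg_replace: a negated character class consumes a quoted value, newlines included, without any nested repetition. The variable names are illustrative.

```php
<?php
// Sample taken from the post above (Word-generated markup with a multi-line style).
$sample = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:\n"
        . "hanging;text-autospace:ideograph-numeric ideograph-other'><b><span\n"
        . "lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>\n"
        . "</td>";

// '[^']*' and "[^"]*" each consume exactly one quoted value, newlines included,
// and cannot run past the closing quote.
$stylePattern = '/\sstyle=(?:\'[^\']*\'|"[^"]*")/i';

$withoutStyle = preg_replace($stylePattern, '', $sample);
echo $withoutStyle, "\n";
```

Only the style attributes are removed here; the other attributes still need their own branches.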

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 2:19 pm
by prometheuzz
vchris wrote:...

Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
Expected output?

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 2:24 pm
by vchris
<p><b><span>NPRI ID</span></b></p></td>

I don't know if there is a way for the search to only be inside <> (html tags).

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 2:43 pm
by prometheuzz
vchris wrote:<p><b><span>NPRI ID</span></b></p></td>

I don't know if there is a way for the search to only be inside <> (html tags).
That would mean counting opening and closing brackets: regex is not built for such things.
Try this:

#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
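// [^\s>]++ keeps an unquoted value from swallowing the closing '>'; nowrap takes no value.
$regex = '#\n|style=\'[^\']++\'|\bnowrap\b|(?:width|height|align|valign|class|lang|xml:lang)=[^\s>]++#i';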
$test = preg_replace($regex, "", $test);
$test = preg_replace('/\s++>/', ">", $test); // optional
echo "$test\n";
?>
Edit:
Note that the single quotes inside the regex should be escaped with a backslash. I did that, but they get "eaten" by the forum software... :|

Re: HTML Attribute Remover

Posted: Mon Aug 04, 2008 4:13 pm
by vchris
The thing is I'm running this in a web dev application not PHP. I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.

All I want is to clean the code created by MS Word. Most of what I need to clean is the attributes I listed. I don't know if you know of a better way to do this.

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 2:05 am
by prometheuzz
vchris wrote:The thing is I'm running this in a web dev application not PHP.
I have no idea what a "web dev application" is. I do know that this is a PHP forum though. ;)
vchris wrote:I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.
As I already said: a regex engine is not designed to keep track of opening and closing chars (when performing replacements). You will need to do it in two steps:
1 - match an opening bracket, followed by one or more non-closing-brackets, followed by a closing bracket;
2 - replace some attributes from the matches you get from step 1.

Step 1 could be performed like this:

#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
if(preg_match_all('/<[^\/][^>]++>/', $test, $matches)) {
    print_r($matches);
}
?>
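
The two steps can be combined with preg_replace_callback. This is only a sketch under stated assumptions: the helper name and the attribute list are illustrative, and a `>` inside a quoted attribute value would end the tag match too early.

```php
<?php
// Hypothetical helper combining both steps: the outer pattern finds each opening tag,
// the callback strips the unwanted attributes inside that tag only.
function strip_attributes_in_tags(string $html): string
{
    // Step 1: an opening tag is '<', a non-slash character, then anything up to '>'.
    $tagPattern = '/<[^\/][^>]*>/';

    // Step 2: the attributes to drop, with single-quoted, double-quoted or unquoted values.
    $attrPattern = '/\s+(?:width|height|align|valign|class|lang|xml:lang|style)'
                 . '=(?:\'[^\']*\'|"[^"]*"|[^\s>]*)/i';

    return preg_replace_callback(
        $tagPattern,
        function (array $m) use ($attrPattern): string {
            return preg_replace($attrPattern, '', $m[0]);
        },
        $html
    );
}

$test = "<p class=MsoNormal align=center style='text-align:center'>class=x stays</p>";
echo strip_attributes_in_tags($test), "\n";
// prints: <p>class=x stays</p>
```

Because the inner replacement only ever sees the text of one tag, text between tags that happens to look like an attribute is left alone.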

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 7:34 am
by vchris
I guess the attributes will be too complicated. Here's my other method.
Search:
<p [^>]*>
Replace:
<p>
etc.

I do that for a couple tags. My only problem is the td tags. Here's what I got:
Search:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>
Replace:
<td$1$2>

The problem with this is that it has to start with colspan and then rowspan. I know I'm not far.
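
One way around the ordering problem, sketched in PHP on the assumption that a callback is available (a plain search-and-replace dialog may not offer one): match the whole td tag, collect colspan and rowspan in whatever order they occur, and rebuild the tag from those alone. The helper name is hypothetical.

```php
<?php
// Hypothetical helper: rebuilds each <td ...> tag from only the attributes worth keeping.
function keep_span_attributes(string $html): string
{
    return preg_replace_callback(
        '/<td\b[^>]*>/i',
        function (array $m): string {
            // Collect colspan/rowspan in whatever order they occur inside the tag.
            preg_match_all('/\s(?:colspan|rowspan)="?[0-9]+"?/i', $m[0], $kept);
            return '<td' . implode('', $kept[0]) . '>';
        },
        $html
    );
}

$sample = "<td width=48 rowspan=2 style='width:36.0pt;\nheight:15.0pt' colspan=3>";
echo keep_span_attributes($sample), "\n";
// prints: <td rowspan=2 colspan=3>
```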

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 7:59 am
by prometheuzz
What is your question?
Could you post an SSCCE including a proper amount of example input and explain what the expected output should be?

No offence, but I am getting a bit tired of pulling information out of you. It is your task to communicate your problem/question properly.

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 8:59 am
by vchris
My question is how do I remove all attributes in a td tag except for colspan and rowspan?

Sample:
<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>

Regex:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>

Expected Result:
<td rowspan=2>

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 9:15 am
by prometheuzz
Didn't I already show you how?
Anyway, here's another way:

<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
            background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>

Re: HTML Attribute Remover

Posted: Tue Aug 05, 2008 3:27 pm
by GeertDD
prometheuzz wrote:

<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
            background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part. Moreover this regex is twice as fast. Examine and learn. :) Let me know if you have questions.

$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
 
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
// Hmm, phpBB removes the escapes in front of the single quotes...

Re: HTML Attribute Remover

Posted: Thu Aug 07, 2008 5:08 am
by prometheuzz
GeertDD wrote:...
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part.
Ah, yes. I must admit that I didn't (and actually still don't) know what the exact rules are when it comes to attribute naming...

GeertDD wrote:Moreover this regex is twice as fast. Examine and learn. :) Let me know if you have questions.

$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
 
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
Ah, \b[a-z]++ does indeed make it faster, and I see that both single and double quotes are permitted.
Thanks, Geert.
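
One caveat with the unquoted branch `\S*+`: when the last attribute in a tag is unquoted, as in `<td rowspan=2 class=foo>cell</td>`, the match for `class=foo` continues through `>cell</td>`, because none of it is whitespace. A hedged variant (the change to `[^\s>]*+` is an assumption, not something posted in this thread) stops at the closing bracket, and a second pass tidies the leftover whitespace:

```php
<?php
// GeertDD's pattern with one change (an assumption, not from the thread):
// the unquoted-value branch uses [^\s>]*+ instead of \S*+.
$pattern = '~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|[^\s>]*+)~i';

$str = '<td rowspan=2 class=foo>cell</td>';
$stripped = preg_replace($pattern, '', $str);

// Collapse the whitespace left behind in front of '>'.
$stripped = preg_replace('~\s+>~', '>', $stripped);
echo $stripped, "\n";
// prints: <td rowspan=2>cell</td>
```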

Re: HTML Attribute Remover

Posted: Thu Aug 07, 2008 6:06 am
by GeertDD
On page 200 of Mastering Regular Expressions (3rd edition) you'll find some more information about matching HTML tags and attributes as well.