HTML Attribute Remover

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

vchris
Forum Contributor
Posts: 204
Joined: Tue Aug 30, 2005 7:53 pm
Location: Canada, Quebec

HTML Attribute Remover

Post by vchris »

I'm working on an html attribute remover. I want it to be able to remove all except colspan and rowspan. Here's what I got:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?.*"?|\svalign="?.*"?|\sstyle=['|"](.|\s)*['|"]|\snowrap|\sclass="?.*"?|\slang="?.*"?|\sxml:lang="?.*"?

I have a couple bugs with it.

First, when it hits something like <p class=SomeClass><span lang=EN-US>text</span></p>, it leaves only <p . I probably need something that tells it to stop at the first >. Second, my HTML comes from Word, so I have lots of long style attributes, some spanning more than one line. My regex only removes the first line of the style attribute.

Thanks.
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: HTML Attribute Remover

Post by prometheuzz »

Hi, could you post an SSCCE* with some example input and explain what the desired output should be?

* http://sscce.org/
vchris

Re: HTML Attribute Remover

Post by vchris »

I was able to figure out most of the problems. Now the only problem I have is it freezes my editor and web-based regex testers. It freezes as soon as I type in </b></p> before the closing </td>. It seems to have something to do with the style regex.

Regex:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?[\w-_]*"?|\svalign="?[\w-_]*"?|\sstyle=('|")(.*|\n)*('|")|\snowrap|\sclass="?[\w-_]*"?|\slang="?[\w-_]*"?|\sxml:lang="?[\w-_]*"?

Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
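
A likely cause of the freeze is the (.*|\n)* part of the style pattern: a quantified group that itself contains .* can split the same text in a huge number of ways, so the engine backtracks for a very long time when the overall match fails. A minimal sketch of the usual workaround, written in PHP because that is what this forum covers (the variable names are only for illustration): match each quoted value with a negated character class, which also crosses line breaks without any special handling.

```php
<?php
// Sample from the post above: Word markup with a style value spanning two lines.
$html = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:\n"
      . "hanging;text-autospace:ideograph-numeric ideograph-other'><b><span\n"
      . "lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>\n</td>";

// [^']* and [^"]* can only run up to the next quote, so there is exactly one
// way to match a value and nothing to backtrack into. Negated classes also
// match newlines, so multi-line values are covered.
$style = '/\sstyle=(?:\'[^\']*\'|"[^"]*")/';

$result = preg_replace($style, '', $html);
echo $result, "\n";
```

This only handles the style attribute and assumes the value is quoted; the other attributes would still need their own alternatives.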
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

vchris wrote:...

Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
Expected output?
vchris

Re: HTML Attribute Remover

Post by vchris »

<p><b><span>NPRI ID</span></b></p></td>

I don't know if there is a way for the search to only be inside <> (html tags).
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

vchris wrote:<p><b><span>NPRI ID</span></b></p></td>

I don't know if there is a way for the search to only be inside <> (html tags).
That would mean counting opening and closing brackets: regex is not built for such things.
Try this:

Code:

#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
$regex = '#\n|style=\'[^\']++\'|(?:width|height|align|valign|nowrap|class|lang|xml:lang)=\S++#i';
$test = preg_replace($regex, "", $test);
$test = preg_replace('/\s++>/', ">", $test); // optional
echo "$test\n";
?>
Edit:
Note that the single quotes inside the regex should be escaped with a backslash. I did that, but they get "eaten" by the forum software...
: |
vchris

Re: HTML Attribute Remover

Post by vchris »

The thing is I'm running this in a web dev application not PHP. I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.

All I want is to clean the code created by MS Word. Most of what I need to clean is the attributes I listed. I don't know if you know of a better way to do this.
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

vchris wrote:The thing is I'm running this in a web dev application not PHP.
I have no idea what a "web dev application" is. I do know that this is a PHP forum though.
; )
vchris wrote:I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.
As I already said: a regex engine is not designed to keep track of opening and closing chars (when performing replacements). You will need to do it in two steps:
1 - match an opening bracket, followed by one or more non-closing-brackets, followed by a closing bracket;
2 - replace some attributes from the matches you get from step 1.

Step 1 could be performed like this:

Code:

#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
if(preg_match_all('/<[^\/][^>]++>/', $test, $matches)) {
    print_r($matches);
}
?>
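
The two steps above can be sketched together with preg_replace_callback: the outer pattern finds each opening tag, and the callback strips attributes only inside that tag, so text between tags is never touched. This is a rough sketch, not a full HTML parser. The function name strip_word_attributes is made up for illustration, the attribute list is the one from the first post plus style, and it assumes no attribute value contains a > character.

```php
<?php
// Hypothetical helper: removes the listed attributes, but only inside opening tags.
function strip_word_attributes(string $html)
{
    // Step 1: match "<", one character that is not "/" or "!", then everything up to ">".
    // Assumption: attribute values never contain ">".
    return preg_replace_callback('/<[^\/!][^>]*+>/', function (array $m) {
        // Step 2: within this single tag, drop the unwanted attributes.
        // Values may be single-quoted, double-quoted, or unquoted.
        return preg_replace(
            '/\s+(?:width|height|align|valign|style|class|lang|xml:lang)'
            . '=(?:\'[^\']*+\'|"[^"]*+"|[^\s>]*+)|\s+nowrap\b/i',
            '',
            $m[0]
        );
    }, $html);
}

$td = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;\n"
    . "background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo strip_word_attributes($td), "\n"; // <td rowspan=2>
```

Because colspan and rowspan are simply not in the list, they survive without any special handling.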
vchris

Re: HTML Attribute Remover

Post by vchris »

I guess the attributes will be too complicated. Here's my other method.
<p [^>]*>
<p>
etc

I do that for a couple tags. My only problem is the td tags. Here's what I got:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>
<td$1$2>

The problem with this is that it has to start with colspan and then rowspan. I know I'm not far.
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

What is your question?
Could you post an SSCCE including a proper amount of example input and explain what the expected output should be?

No offence, but I am getting a bit tired of pulling information out of you. It is your task to communicate your problem/question properly.
vchris

Re: HTML Attribute Remover

Post by vchris »

My question is how do I remove all attributes in a td tag except for colspan and rowspan?

Sample:
<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>

Regex:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>

Expected Result:
<td rowspan=2>
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

Didn't I already show you how?
Anyway, here's another way:

Code:

<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
            background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>
GeertDD
Forum Contributor
Posts: 274
Joined: Sun Oct 22, 2006 1:47 am
Location: Belgium

Re: HTML Attribute Remover

Post by GeertDD »

prometheuzz wrote:

Code:

<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
            background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part. Moreover this regex is twice as fast. Examine and learn. :) Let me know if you have questions.

Code:

$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
 
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
// Hmm, phpBB removes the escapes in front of the single quotes...
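
One caveat with the \S*+ branch for unquoted values: when an unquoted attribute is the last one in a tag, \S also matches the closing > and whatever non-space text follows it, which looks like the "leaves only <p" symptom from the first post. A hedged variant, where the only change is [^\s>]*+ in place of \S*+ (the variable names are only for illustration):

```php
<?php
// Same idea as the pattern above, except that an unquoted value stops at
// whitespace or ">" instead of at whitespace only.
$pattern = '~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|[^\s>]*+)~i';

// Input from the first post, where every attribute value is unquoted.
$first = '<p class=SomeClass><span lang=EN-US>text</span></p>';
$cleaned = preg_replace($pattern, '', $first);
echo $cleaned, "\n"; // <p ><span >text</span></p>
```

The leftover space before each > can be removed with a second pass such as the optional \s++> replacement shown earlier in the thread. Like the original, this pattern is not limited to tags, so name=value text between tags would be removed as well.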
prometheuzz

Re: HTML Attribute Remover

Post by prometheuzz »

GeertDD wrote:...
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part.
Ah, yes. I must admit that I didn't (and actually still don't) know what the exact rules are when it comes to attribute naming...

GeertDD wrote:Moreover this regex is twice as fast. Examine and learn. :) Let me know if you have questions.

Code:

$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
 
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
Ah, \b[a-z]++ does indeed make it faster, and I see that both single and double quotes are permitted.
Thanks, Geert.
GeertDD

Re: HTML Attribute Remover

Post by GeertDD »

On page 200 of Mastering Regular Expressions (3rd edition) you'll find some more information about matching HTML tags and attributes as well.