HTML Attribute Remover
I'm working on an HTML attribute remover. I want it to remove all attributes except colspan and rowspan. Here's what I've got:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?.*"?|\svalign="?.*"?|\sstyle=['|"](.|\s)*['|"]|\snowrap|\sclass="?.*"?|\slang="?.*"?|\sxml:lang="?.*"?
I have a couple bugs with it.
First, when it comes upon something like <p class=SomeClass><span lang=EN-US>text</span></p>, it leaves only <p. I probably need something to make it stop at the first >. Second, my HTML comes from Word, so I have lots of long style attributes, some of which span more than one line. My regex only removes the first line of such a style attribute.
Thanks.
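A minimal sketch of the first bug, in PHP since this is a PHP forum: the greedy `.*` in the class alternative runs to the end of the line, which is why only `<p` survives. Bounding the value (stop at the closing quote, or at whitespace or `>` for unquoted values) avoids that. The `$value` sub-pattern is my own assumption, not part of the regex posted above.

```php
<?php
// Input from the post above: unquoted attribute values, two tags on one line.
$html = '<p class=SomeClass><span lang=EN-US>text</span></p>';

// Greedy: ".*" runs to the end of the line, so everything after "<p" is removed.
$greedy = preg_replace('/\sclass="?.*"?/', '', $html);

// Bounded: quoted values stop at the closing quote, unquoted ones at whitespace or ">".
$value = '(?:"[^"]*"|\'[^\']*\'|[^\s>]*)';
$bounded = preg_replace('/\s(?:class|lang)=' . $value . '/', '', $html);

echo $greedy, "\n";   // <p
echo $bounded, "\n";  // <p><span>text</span></p>
```

The same bounded value pattern could be reused for the align, valign and xml:lang alternatives.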
- prometheuzz
- Forum Regular
- Posts: 779
- Joined: Fri Apr 04, 2008 5:51 am
Re: HTML Attribute Remover
Hi, could you post an SSCCE* with some example input and explain what the desired output should be?
* http://sscce.org/
Re: HTML Attribute Remover
I was able to figure out most of the problems. Now the only problem I have is it freezes my editor and web-based regex testers. It freezes as soon as I type in </b></p> before the closing </td>. It seems to have something to do with the style regex.
Regex:
\swidth="?[0-9]+"?|\sheight="?[0-9]+"?|\salign="?[\w-_]*"?|\svalign="?[\w-_]*"?|\sstyle=('|")(.*|\n)*('|")|\snowrap|\sclass="?[\w-_]*"?|\slang="?[\w-_]*"?|\sxml:lang="?[\w-_]*"?
Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
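The freeze is most likely catastrophic backtracking: in `(.*|\n)*` the same text can be split between the inner `.*` and the outer `*` in a huge number of ways, and every character typed after the last quote multiplies the work. A sketch of a style alternative that avoids this, assuming values are always quoted as in the sample (the pattern is my suggestion, not something posted in this thread):

```php
<?php
// Sample from the post above (Word output with a style attribute spanning two lines).
$html = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:\n"
      . "hanging;text-autospace:ideograph-numeric ideograph-other'><b><span\n"
      . "lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>\n"
      . "</td>";

// A negated character class matches newlines too and cannot run past the closing
// quote, so each value can be matched in only one way.
$style = '/\sstyle=(?:\'[^\']*\'|"[^"]*")/';

$clean = preg_replace($style, '', $html);
echo $clean, "\n";
```

Only the style attributes are removed here; class, align and lang are left for the other alternatives.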
- prometheuzz
Re: HTML Attribute Remover
vchris wrote:...
Sample:
<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>
Expected output?
Re: HTML Attribute Remover
<p><b><span>NPRI ID</span></b></p></td>
I don't know if there is a way for the search to only be inside <> (html tags).
- prometheuzz
Re: HTML Attribute Remover
vchris wrote:<p><b><span>NPRI ID</span></b></p></td>
I don't know if there is a way for the search to only be inside <> (html tags).
That would mean counting opening and closing brackets: regex is not built for such things.
Try this:
Code: Select all
#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
$regex = '#\n|style=\'[^\']++\'|(?:width|height|align|valign|nowrap|class|lang|xml:lang)=\S++#i';
$test = preg_replace($regex, "", $test);
$test = preg_replace('/\s++>/', ">", $test); // optional
echo "$test\n";
?>
Note that the single quotes inside the regex should be escaped with a backslash. I did that, but they get "eaten" by the forum software...
Re: HTML Attribute Remover
The thing is I'm running this in a web dev application not PHP. I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.
All I want is to clean the code created by MS Word. Most of what I need to clean is the attributes I listed. I don't know if you know of a better way to do this.
- prometheuzz
Re: HTML Attribute Remover
vchris wrote:The thing is I'm running this in a web dev application not PHP.
I have no idea what a "web dev application" is. I do know that this is a PHP forum though. ;)
vchris wrote:I just don't want the search and replace to be done outside of html tags. I got another way of doing this except I would need to figure out how to not remove colspan and rowspan in a td tag.
As I already said: a regex engine is not designed to keep track of opening and closing chars (when performing replacements). You will need to do it in two steps:
1 - match an opening bracket, followed by one or more non-closing-brackets, followed by a closing bracket;
2 - replace some attributes from the matches you get from step 1.
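In PHP the two steps can also be combined in one call with preg_replace_callback(): the outer pattern finds each opening tag, and the callback rebuilds it from the attributes you want to keep. This is only a sketch under a few assumptions: strip_attributes() and its $keep list are hypothetical names, attribute values never contain a literal >, and valueless attributes such as nowrap (and the slash of a tag written as <br />) are simply dropped.

```php
<?php
// Hypothetical helper (not from the thread): keep only the attributes named in $keep.
function strip_attributes(string $html, array $keep = ['colspan', 'rowspan']): string
{
    // Step 1: match every opening tag, i.e. "<", a tag name, then anything up to the next ">".
    $tag = '/<([a-z][a-z0-9:]*)((?:\s[^>]*)?)>/i';
    // Step 2 pattern: one name=value pair; the value is double-quoted, single-quoted or bare.
    $attr = '/([a-z:_-]+)\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i';

    $result = preg_replace_callback($tag, function (array $m) use ($attr, $keep): string {
        preg_match_all($attr, $m[2], $pairs, PREG_SET_ORDER);
        $kept = '';
        foreach ($pairs as $pair) {
            if (in_array(strtolower($pair[1]), $keep, true)) {
                $kept .= ' ' . $pair[1] . '=' . $pair[2];
            }
        }
        // Rebuild the tag from its name plus the whitelisted attributes only.
        return '<' . $m[1] . $kept . '>';
    }, $html);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

// Made-up input in the style of the Word samples above.
$html = "<td width=48 colspan=2><p class=MsoNormal style='text-align:center;\n"
      . "font-size:7.5pt'>NPRI ID</p></td>";
echo strip_attributes($html), "\n"; // <td colspan=2><p>NPRI ID</p></td>
```

Because the replacement only ever sees the inside of a tag, text between tags is never touched, which is the restriction asked for above.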
Step 1 could be performed like this:
Code: Select all
#!/usr/bin/php
<?php
$test = "<p class=MsoNormal align=center style='text-align:center;punctuation-wrap:
hanging;text-autospace:ideograph-numeric ideograph-other'><b><span
lang=EN-US style='font-size:7.5pt;color:white'>NPRI ID</span></b></p>
</td>";
if(preg_match_all('/<[^\/][^>]++>/', $test, $matches)) {
print_r($matches);
}
?>
Re: HTML Attribute Remover
I guess the attributes will be too complicated. Here's my other method.
<p [^>]*>
<p>
etc
I do that for a couple tags. My only problem is the td tags. Here's what I got:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>
<td$1$2>
The problem with this is that it has to start with colspan and then rowspan. I know I'm not far.
- prometheuzz
Re: HTML Attribute Remover
What is your question?
Could you post an SSCCE including a proper amount of example input and explain what the expected output should be?
No offence, but I am getting a bit tired of pulling information out of you. It is your task to communicate your problem/question properly.
Re: HTML Attribute Remover
My question is how do I remove all attributes in a td tag except for colspan and rowspan?
Sample:
<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>
Regex:
<td( colspan="?[0-9]+"?)?( rowspan="?[0-9]+"?)?.>
Expected Result:
<td rowspan=2>
- prometheuzz
Re: HTML Attribute Remover
Didn't I already show you how?
Anyway, here's another way:
Code: Select all
<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>
Re: HTML Attribute Remover
prometheuzz wrote:
Code: Select all
<?php
$sample = "<td width=48 rowspan=2 style='width:36.0pt;border:solid windowtext 1.0pt;
background:#0C0C0C;padding:0cm 5.4pt 0cm 5.4pt;height:15.0pt'>";
echo preg_replace("/\S+(?<!rowspan|colspan)=(?:'[^']++'|\S++)/s", '', $sample);
?>
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part. Moreover this regex is twice as fast. Examine and learn.
Code: Select all
$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
- prometheuzz
Re: HTML Attribute Remover
GeertDD wrote:...
Tried to improve on your regex, prometheuzz. I believe the one below does a more solid job mostly on the attribute value matching part.
Ah, yes. I must admit that I didn't (don't actually) know what the exact rules are when it comes to attribute naming...
GeertDD wrote:Moreover this regex is twice as fast. Examine and learn. Let me know if you have questions.
Code: Select all
$str = '<td class=foo rowspan=2 id=\'bar\' colspan=\'5\' title="lorem">';
echo preg_replace('~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i', '', $str);
Ah, \b[a-z]++ does indeed make it faster, and I see that both single and double quotes are permitted.
Thanks, Geert.
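One edge case that both regexes in this thread seem to share: when the last attribute in a tag is unquoted, the `\S` alternative also consumes the closing `>`. A small sketch of the effect and of one possible variant; the `$variant` pattern is my own assumption rather than anything posted above, and like the originals it will also match name=value text outside of tags.

```php
<?php
// GeertDD's pattern from above, unchanged.
$original = '~\b[a-z]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|\S*+)~i';

// Variant (an assumption, not from the thread): unquoted values stop at ">" as well,
// names may contain ":" (xml:lang), and the whitespace before the attribute goes too.
$variant = '~\s++[a-z:]++(?<!rowspan|colspan)=(?:\'[^\']*+\'|"[^"]*+"|[^\s>]*+)~i';

// The last attribute is unquoted, so "\S*+" reads "48>" as its value.
$str = '<td rowspan=2 width=48>';

$before = preg_replace($original, '', $str);
$after  = preg_replace($variant, '', $str);

var_dump($before); // string(14) "<td rowspan=2 "  (the closing ">" is gone)
var_dump($after);  // string(14) "<td rowspan=2>"
```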
Re: HTML Attribute Remover
On page 200 of Mastering Regular Expressions (3rd edition) you'll find some more information about matching HTML tags and attributes as well.