php regex to remove style from html

kage
Forum Newbie
Posts: 10
Joined: Thu Aug 07, 2003 11:34 am

Post by kage »

I've been trying to do this for hours now and am just getting lost in it. I've read and searched as much as I can; if someone could give me a hand, it would be great.

To outline my problem: I am trying to get *VALUE*, and the HTML is formatted like this:
<td style="WHATEVER1">
*VALUE1*
</td>
<td style="WHATEVER2">
*VALUE2*
</td>
I have been going about it by trying to use a regex to replace style="match anything"> with > so that it would end up like this:
<td>
*VALUE1*
</td>
<td>
*VALUE2*
</td>

Then I use explode to get everything between the <td>s as an array, and later I have to remove the </td>.

First off, I can't get a regex to match what I need it to :(
Also, if anyone has a better way of doing this, that would be awesome. To make it easier, I know that there is never anything after the style; it's always style="something">

Any help would be appreciated.
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: php regex to remove style from html

Post by ridgerunner »

If you only need to process the TD tags and your tables are not nested, this one should do the trick:

// long commented version
$text = preg_replace('%
    <td      # match the start of the opening TD tag
    [^>]*?   # allow any attributes before style
    style    # we require a style attribute
    \s*=\s*  # followed by an equals sign
    "[^"]*"  # followed by the style attribute value
    [^>]*    # allow any attributes after style
    >        # match the end of the opening TD tag
    (.*?)    # lazily capture TD contents in group 1
    </td>    # match the closing TD tag
    %six', '<td>$1</td>', $text);
 
// short version
$text = preg_replace('%<td[^>]*?style\s*=\s*"[^"]*"[^>]*>(.*?)</td>%si', '<td>$1</td>', $text);
If your tables are nested, there is another solution that will work using recursion, but it would be a bit more complex.
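
For reference, here is a minimal sketch of running the short version against the sample markup from the question. The sample string and the $clean variable name are assumptions made for illustration; they are not from the original posts.

```php
<?php
// Sample input modelled on the markup in the question (assumed for illustration).
$text = implode("\n", [
    '<td style="WHATEVER1">',
    '*VALUE1*',
    '</td>',
    '<td style="WHATEVER2">',
    '*VALUE2*',
    '</td>',
]);

// Short version of the pattern: every opening TD tag that carries a style
// attribute is rewritten as a bare <td>, keeping the cell contents (group 1).
$clean = preg_replace(
    '%<td[^>]*?style\s*=\s*"[^"]*"[^>]*>(.*?)</td>%si',
    '<td>$1</td>',
    $text
);

echo $clean, "\n";
```

With this input, each opening tag comes back as a plain <td> and the cell contents are untouched. Note that the pattern drops every attribute on a matching tag, not only style, which is fine here since the question says nothing else ever appears in the tag.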

Hope this helps...
kage
Forum Newbie
Posts: 10
Joined: Thu Aug 07, 2003 11:34 am

Re: php regex to remove style from html

Post by kage »

Tinkered with that a bit and got it to do what I need :) Thank you very much. Perhaps you can help me with one last regex problem.
I now have the data formatted as
<td>*VALUE1*</td><td>*value2*</td>. I'm using explode("</td>", $string) to break it into separate tokens; however, it might be more useful for me to use preg_split and match anything between <td> and </td>, requiring at least one character in between. Is there a way to do this?

Thanks again! :)
ridgerunner
Forum Contributor
Posts: 214
Joined: Sun Jul 05, 2009 10:39 pm
Location: SLC, UT

Re: php regex to remove style from html

Post by ridgerunner »

If you are interested in getting an array containing the contents of all TD tags, you don't really want to split, but rather to use preg_match_all with a capture group to grab everything between the start and end tags, like so:

preg_match_all('%<td[^>]*>(.+?)</td>%si', $text, $result, PREG_PATTERN_ORDER);
$result = $result[1];
Note that you probably could have used this one from the very start.
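
As a rough sketch of that idea, the snippet below runs preg_match_all directly on the styled markup from the first post. The sample string, the extra empty cell, and the $values variable are assumptions added for illustration. It also swaps (.+?) for (.*?) and filters afterwards: with (.+?), an empty <td></td> would likely make the lazy group run on into the following cell, so capturing possibly empty contents and discarding them later seems the safer way to require at least one character.

```php
<?php
// Styled cells as in the question, plus one empty cell (an added assumption).
$text = '<td style="WHATEVER1">' . "\n*VALUE1*\n" . '</td>'
      . '<td></td>'
      . '<td style="WHATEVER2">' . "\n*VALUE2*\n" . '</td>';

// (.*?) lets an empty cell match on its own instead of spilling into the next one.
preg_match_all('%<td[^>]*>(.*?)</td>%si', $text, $result, PREG_PATTERN_ORDER);

// Group 1 holds the cell contents: trim surrounding whitespace,
// drop cells that are empty after trimming, then reindex from 0.
$values = array_values(array_filter(array_map('trim', $result[1]), 'strlen'));

print_r($values);
```

Here $values ends up holding the two strings *VALUE1* and *VALUE2*, with no separate step needed to strip the style attributes first.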