
php regex to remove style from html

Posted: Fri Oct 16, 2009 10:09 am
by kage
I've been trying to do this for hours now and I'm just getting lost in it. I've read and searched as much as I can, so if someone could give me a hand it would be great.

So, to outline my problem: I am trying to get *VALUE*, and the HTML is formatted like this:
<td style="WHATEVER1">
*VALUE1*
</td>
<td style="WHATEVER2">
*VALUE2*
</td>
I have been going about it by trying to use a regex to replace style="match anything"> with > so that it would end up like:
<td>
*VALUE1*
</td>
<td>
*VALUE2*
</td>

Then I use explode to get everything between the <td>s as an array, and later I have to remove the </td>.

First off, I can't get a regex to match what I need it to :(
Also, if anyone has a better way of doing this, that would be awesome. To make it easier: I know that there is never anything after the style attribute, it's always style="something">

Any help would be appreciated.

Re: php regex to remove style from html

Posted: Fri Oct 16, 2009 10:40 am
by ridgerunner
If you only need to process the TD tags and your tables are not nested, this one should do the trick:

Code:

// long commented version
$text = preg_replace('%
    <td      # match the start of the opening TD tag
    [^>]*?   # allow any attributes before style
    style    # we require a style attribute
    \s*=\s*  # followed by an equals sign
    "[^"]*"  # followed by the style attribute value
    [^>]*    # allow any attributes after style
    >        # match the end of the opening TD tag
    (.*?)    # lazily capture TD contents in group 1
    </td>    # match the closing TD tag
    %six', '<td>$1</td>', $text);
 
// short version
$text = preg_replace('%<td[^>]*?style\s*=\s*"[^"]*"[^>]*>(.*?)</td>%si', '<td>$1</td>', $text);
If your tables are nested, there is another solution that will work using recursion, but it would be a bit more complex.
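As a quick check, here is a minimal sketch that runs the short version against markup in the shape given in the question. The sample string and the $clean variable are made up for illustration; only the pattern comes from the code above.

```php
<?php
// Sample markup shaped like the question's HTML (placeholder values, assumed for illustration).
$text = "<td style=\"WHATEVER1\">\n*VALUE1*\n</td>\n<td style=\"WHATEVER2\">\n*VALUE2*\n</td>";

// Short version of the pattern: rewrite each styled opening TD tag as a bare <td>
// and keep the cell contents (group 1) unchanged.
$clean = preg_replace('%<td[^>]*?style\s*=\s*"[^"]*"[^>]*>(.*?)</td>%si', '<td>$1</td>', $text);

echo $clean, "\n";
```

Keep in mind that the replacement writes a bare <td>, so any other attributes on a matched tag (a class, for example) are dropped along with the style.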

Hope this helps...

Re: php regex to remove style from html

Posted: Fri Oct 16, 2009 2:52 pm
by kage
Tinkered with that a bit and got it to do what I need :) Thank you very much. Perhaps you can help me with one last regex problem.
I now have the data formatted as
<td>*VALUE1*</td><td>*value2*</td>, and I'm using explode("</td>", $string) to break it into separate tokens. However, it might be more useful for me to use preg_split and match anything between <td> and </td>, requiring at least one character in between. Is there a way to do this?

Thanks again! :)

Re: php regex to remove style from html

Posted: Sat Oct 17, 2009 4:44 pm
by ridgerunner
If you are interested in getting an array containing the contents of all TD tags, you don't really want to split. Instead, use preg_match_all with a capture group to grab everything between the start and end tags, like so:

Code:

preg_match_all('%<td[^>]*>(.+?)</td>%si', $text, $result, PREG_PATTERN_ORDER);
$result = $result[1];
Note that you probably could have used this one from the very start.
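To illustrate that last point, here is a rough sketch that applies the same preg_match_all call directly to the original, still-styled markup. The sample string, the $values variable, and the trim() step are assumptions added for illustration. The trim() is there only because the cells in the question have newlines around the values.

```php
<?php
// Original, still-styled markup shaped like the first post (placeholder values, assumed).
$text = "<td style=\"WHATEVER1\">\n*VALUE1*\n</td>\n<td style=\"WHATEVER2\">\n*VALUE2*\n</td>";

// [^>]* skips whatever attributes the opening tag has, so no prior cleanup is needed.
preg_match_all('%<td[^>]*>(.+?)</td>%si', $text, $result, PREG_PATTERN_ORDER);

// Group 1 holds the cell contents; trim() (an added step) drops the surrounding newlines.
$values = array_map('trim', $result[1]);

print_r($values);
```

One caveat: because (.+?) needs at least one character, a completely empty cell such as <td></td> is not skipped cleanly. The match runs on into the next cell instead, so this is best used when every cell is known to have content.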