Stripping-Down HTML

Posted: Sun Jun 03, 2007 11:34 pm
by JonathonReinhart
Hello everyone, I'm looking to write some regex/PHP to strip down some HTML that I've fetched with cURL from another site and that appears to have been written in MS Word. Here's a little bit of what I'm getting:

Code:

<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l17 level1 lfo19;tab-stops:list .5in'>Must
     be 16 years of age (NO EXCEPTIONS!!)</li>

You can see that there are three things I would like to take out. First, I would like to get rid of the <o:p> tags; I don't even know what those are. Second, I'd like to get rid of all class attributes, and third, all style attributes.

Can this be easily done?

Posted: Mon Jun 04, 2007 12:05 am
by feyd
Take a look at the demo I put up some time ago of a small library I built for another member: http://code.tatzu.net/cleantags/

I just tested your needs in it. It will do it, no problem.

You can download the source here: http://code.tatzu.net/cleantags/cleantags.zip

Input:

Code:

<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l17 level1 lfo19;tab-stops:list .5in'>Must
     be 16 years of age (NO EXCEPTIONS!!)</li> 

Input Size:

254 bytes


Cleaning result:

Code:

<p><b><u>Driver regulations and Safety:</u></b></p>

<ol start=1 type=1>
 <li>Must
     be 16 years of age (NO EXCEPTIONS!!)</li> 

Time:

0.00182796 seconds
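
For anyone who wants to see the general idea without downloading the library, here is a minimal regex sketch of the three removals asked for in the first post. It is not how cleantags works internally (its source isn't shown in this thread). The function name `strip_word_markup` is made up for illustration, and the code assumes PHP 7 or later and fairly simple markup with no `>` characters inside attribute values.

```php
<?php
// Hypothetical helper for illustration; not part of the cleantags library.
function strip_word_markup(string $html): string
{
    // 1. Drop namespaced Office tags such as <o:p> and </o:p>.
    //    Only the tags are removed; any text between them is kept.
    $html = preg_replace('#</?[a-z][a-z0-9]*:[a-z][a-z0-9]*\b[^>]*>#i', '', $html);

    // 2. Within each opening tag, remove class=... and style=... attributes.
    //    The value may be double-quoted, single-quoted, or unquoted.
    $html = preg_replace_callback(
        '#<[a-z][a-z0-9]*\b[^>]*>#i',
        function (array $m): string {
            return preg_replace(
                '#\s+(?:class|style)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)#i',
                '',
                $m[0]
            );
        },
        $html
    );

    return $html;
}

// Usage with the sample from the first post.
$input = <<<'HTML'
<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l17 level1 lfo19;tab-stops:list .5in'>Must
     be 16 years of age (NO EXCEPTIONS!!)</li>
HTML;

echo strip_word_markup($input), "\n";
```

On this sample the output should match the cleaning result above. Regexes like these are easy to fool, for example by a `class=` that appears inside another attribute's quoted value, so for arbitrary Word exports a real parser or a tested library is probably the safer choice.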