Stripping-Down HTML


JonathonReinhart
Forum Newbie
Posts: 1
Joined: Sun Jun 03, 2007 10:37 pm

Post by JonathonReinhart »

Hello everyone, I'm looking to write some regex/PHP to strip down some HTML that I've fetched from another site with cURL and that appears to have been written in MS Word. Here's a little bit of what I'm getting:

Code:

<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l17 level1 lfo19;tab-stops:list .5in'>Must
     be 16 years of age (NO EXCEPTIONS!!)</li>
You can see that there are three things I would like to take out. First, I would like to get rid of the <o:p> tags; I don't even know what those are. Second, I'd like to get rid of all class attributes, and third, the style attributes.

Can this be easily done?
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Take a look at the demo I put up some time ago of a small library I built for another member: http://code.tatzu.net/cleantags/

I just tested your needs in it. It will do it, no problem.

You can download the source here: http://code.tatzu.net/cleantags/cleantags.zip
Input:

Code:

<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>

<ol style='margin-top:0in' start=1 type=1>
 <li class=MsoNormal style='mso-list:l17 level1 lfo19;tab-stops:list .5in'>Must
     be 16 years of age (NO EXCEPTIONS!!)</li> 

Input Size:

254 bytes


Cleaning result:

Code:

<p><b><u>Driver regulations and Safety:</u></b></p>

<ol start=1 type=1>
 <li>Must
     be 16 years of age (NO EXCEPTIONS!!)</li> 

Time:

0.00182796 seconds
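
For anyone who only needs the three removals from the original question, here is a rough regex-only sketch in PHP. It is not the cleantags code; the function name strip_word_markup is made up for illustration. It assumes no attribute value contains a ">" character and that the markup is roughly as tidy as the sample above, so treat it as a starting point rather than a general HTML cleaner.

```php
<?php
// Hypothetical helper (not part of cleantags): removes Word-specific markup.
function strip_word_markup(string $html): string
{
    // 1. Drop Office-namespaced tags such as <o:p> and </o:p>.
    //    Only the tags are removed; any text between them is kept.
    $html = preg_replace('~</?o:\w+[^>]*>~i', '', $html);

    // 2. Inside each remaining opening tag, drop class and style attributes,
    //    whether the value is double-quoted, single-quoted, or unquoted.
    return preg_replace_callback(
        '~<[a-z][^>]*>~i',
        function (array $m): string {
            return preg_replace(
                '~\s+(?:class|style)\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)~i',
                '',
                $m[0]
            );
        },
        $html
    );
}

// Usage with the first line of the sample input:
$dirty = "<p class=MsoNormal><b><u>Driver regulations and Safety:<o:p></o:p></u></b></p>";
echo strip_word_markup($dirty), "\n";
// <p><b><u>Driver regulations and Safety:</u></b></p>
```

The attribute pattern only runs inside matched tags, so text like "class=..." in the page content is left alone. Other attributes such as start and type are untouched.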