Regular Expression Newb...Needs Help
Posted: Tue Jan 11, 2005 1:51 pm
I have dabbled a little with regular expressions, but I didn't retain anything I learned. Add to that the fact that I didn't understand any of the dabbling I did, and you have a recipe for "complete friggin' regular expression newbie = me".
Here's my situation: I have some content developers who insist on writing their content in MS Word and then pasting it into a rich text editor that I'm using on my web site. This is all hunky-dory until you look at the HTML that MS Word creates... horrible stuff there.
What I need is to write some regular expressions to strip out all of the mule poo that MS Word has decided is necessary to show an HTML page.
See code example below:
I need a regular expression to strip out all of the extra junk in the <p>'s, <span>'s, <ul>'s, etc.
In this particular example the <b>'s and <u>'s didn't get the extra stuff added, but in most cases I've seen, a lot of junk gets added to them as well.
The end result after my regular expression should look like this:
That still has some extraneous stuff, but it's stuff that I can live with.
See code example below:
Code: Select all
<P class=MsoBodyText style="MARGIN: 0in 0in 0pt">I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">This will be <B>bold here</B><o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be <U>underlined</U>.<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<UL style="MARGIN-TOP: 0in" type=disc>
<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Maybe some bullets<o:p></o:p></SPAN>
</LI>
<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Or two<o:p></o:p></SPAN>
</LI>
</UL>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>
<p class='nsize'><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">And some more p’s</SPAN></P>
The end result after my regular expression should look like this:
Code: Select all
<P>I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>
<P></P>
<P><SPAN>This will be <B>bold here</B></SPAN></P>
<P><SPAN></SPAN></P>
<P><SPAN>And this will be <U>underlined</U>.</SPAN></P>
<P><SPAN></SPAN></P>
<P><SPAN>And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.</SPAN></P>
<P><SPAN></SPAN></P>
<UL>
<LI>
<SPAN>Maybe some bullets</SPAN>
</LI>
<LI>
<SPAN>Or two</SPAN>
</LI>
</UL>
<P><SPAN></SPAN></P>
<P><SPAN></SPAN></P>
<P><SPAN></SPAN></P>
<p><SPAN>And some more p’s</SPAN></P>
I think I could just write a regular expression that searches for "<P" and then removes everything up to the ">" and kills it with a very noble death.
Any guidance you can provide will be greatly appreciated.
Burr
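For what it's worth, here is a minimal sketch of the idea described above: find each opening tag and throw away everything between the tag name and the ">". It is written in Python purely for illustration, since the post doesn't say what language the site runs on. The function name clean_word_html and the list of tags in STRIP_TAGS are made up for this example. It also assumes that no attribute value contains a ">" character, which holds for the sample above but is not guaranteed in general.

```python
import re

# Tags whose attributes get dropped entirely (an assumed list; extend as needed).
# <a> is deliberately left out so that href/target survive, as in the desired output.
STRIP_TAGS = ("p", "span", "ul", "ol", "li", "b", "u", "i")

# Matches an opening tag such as <P class=MsoNormal style="..."> and captures its name.
# The \b stops "p" from matching <pre>, "li" from matching <link>, and so on.
ATTR_RE = re.compile(r"<(%s)\b[^>]*>" % "|".join(STRIP_TAGS), re.IGNORECASE)

# Word's <?xml:namespace ... /> declarations and its <o:p> / </o:p> tags.
XML_NS_RE = re.compile(r"<\?xml[^>]*>", re.IGNORECASE)
OFFICE_TAG_RE = re.compile(r"</?o:p\s*>", re.IGNORECASE)


def clean_word_html(html):
    """Strip Word-specific tags, then reduce the listed opening tags to bare tags."""
    html = XML_NS_RE.sub("", html)
    html = OFFICE_TAG_RE.sub("", html)
    return ATTR_RE.sub(r"<\1>", html)


sample = (
    '<P class=MsoNormal style="MARGIN: 0in 0in 0pt">'
    '<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">'
    'This will be <B>bold here</B><o:p></o:p></SPAN></P>'
)
print(clean_word_html(sample))
# prints: <P><SPAN>This will be <B>bold here</B></SPAN></P>
```

Regular expressions don't really understand HTML, so treat this as a quick cleanup for Word's fairly predictable output rather than a general-purpose sanitizer. If the pasted markup gets messier, a real HTML parser would be the safer route.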