Regular Expression Newb...Needs Help

Burrito
Spockulator
Posts: 4715
Joined: Wed Feb 04, 2004 8:15 pm
Location: Eden, Utah

Post by Burrito »

I have dabbled a little bit with regular expressions but in all the dabbling I've done, I didn't retain anything I've learned...add that to the fact that I didn't understand any of the dabbling I did and you have a recipe for "Complete friggin' regular expression newbie = me"

Here's my situation: I have some content developers who insist on writing their content in MS Word and then pasting it into a rich text editor that I'm using on my web site. This is all hunky-dory until you look at the HTML that MS Word creates... horrible stuff there.

What I need is to write some regular expressions to strip out all of the mule poo that MS Word has decided is necessary to show an html page.

See code example below:

Code: Select all

<P class=MsoBodyText style="MARGIN: 0in 0in 0pt">I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">This will be <B>bold here</B><o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be <U>underlined</U>.<o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.<o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<UL style="MARGIN-TOP: 0in" type=disc>
<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Maybe some bullets<o:p></o:p></SPAN>
</LI>

<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Or two<o:p></o:p></SPAN>
</LI>
</UL>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<p class='nsize'><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">And some more p’s</SPAN></P>
I need a regular expression to just strip out all of the extra junk in the <p>'s and <span>'s and <ul>'s etc.

in this particular example the <b>'s and <u>'s didn't get the extra stuff added, but in most cases I've seen, a lot of junk gets added to them as well.

The end result after my regular expression should look like this:

Code: Select all

<P>I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>

<P></P>

<P><SPAN>This will be <B>bold here</B></SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN>And this will be <U>underlined</U>.</SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN>And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.</SPAN></P>

<P><SPAN></SPAN></P>

<UL>
<LI>
<SPAN>Maybe some bullets</SPAN>
</LI>

<LI>
<SPAN>Or two</SPAN>
</LI>
</UL>

<P><SPAN></SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN></SPAN></P>

<p><SPAN>And some more p’s</SPAN></P>
that still has some extraneous stuff, but it's stuff that I can live with.

I think I could just write a regular expression that searches for "<P" and then removes everything up to the ">" and kills it with a very noble death.

any guidance you can provide will be greatly appreciated.

Burr
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

because of the other tags and things involved, you'll need more than one expression.

untested:

pattern 1: spit

Code: Select all

preg_replace('#<\s*(p|span|li|ul|ol)\s+.*?>#si', '<\1>',...)
pattern 2: polish

Code: Select all

preg_replace('#<(\s*span\s*>\s*<\s*/\s*span\s*|\?xml\s*:\s*namespace.*?|\s*/?\s*o\s*:\s*p\s*)>#si','',...)

lastly, I recommend slapping the hell out of your copy writers. :)

Post by Burrito »

feyd wrote: lastly, I recommend slapping the hell out of your copy writers. :)
oh yes, there's been much slapping already... to no avail.

I tried what you gave me:

Code: Select all

<? $str = '<span style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">bob</span>';
$rep = "#<\s*(p|span|li|ul|ol)\s+.*?>#si";
$str = preg_replace($rep,"",$str);

preg_replace('#<(\s*span\s*>\s*<\s*/\s*span\s*|\?xml\s*:\s*namespace.*?|\s*/?\s*o\s*:\s*p\s*)>#si','',$str);
echo $str;
and it just kills the entire opening <span> tag and just yields:

bob</span>

I would try to just adjust it myself, but looking at what you wrote is like trying to learn Chinese by reading a Swahili textbook to me....

thx for the help!

Burr

Post by Burrito »

doh! I was missing the "<\\1>"

sry about that..works like a champ

thx again amigo.

wonder if I could trouble you to explain a lil' about what that's actually doing?

Post by feyd »

How it works, the short version.

pattern 1:
find a paragraph, span, list item, unordered list, or ordered list starting tag. Scrunch it to minimal size.

pattern 2:
Find ~empty span and o:p tags, nix those. Find the xml namespace define, toss that.

Post by Burrito »

was hoping for a little more detail...but you've done more than enough.

one thing though, pattern two isn't working.

I still get the <?xml junk and the <o:p> tags.

any ideas?
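
One thing that may be worth checking here, assuming the second call is still written the way it was in the earlier test snippet: preg_replace() does not change the string passed to it, it returns the cleaned-up copy. If that return value is not assigned to anything, the output looks as if the pattern never matched. A minimal sketch, where $html, $unchanged and $cleaned are made-up names and the sample string is shortened from the one in the first post:

```php
<?php
// Shortened, made-up sample in the style of the MS Word markup above.
$html = '<P><SPAN>bob<o:p></o:p></SPAN></P>';

// Pattern 2 exactly as first posted in this thread.
$pattern2 = '#<(\s*span\s*>\s*<\s*/\s*span\s*|\?xml\s*:\s*namespace.*?|\s*/?\s*o\s*:\s*p\s*)>#si';

// Return value thrown away: $html itself is not modified.
preg_replace($pattern2, '', $html);
$unchanged = $html;

// Return value kept: the o:p tags are gone in the returned copy.
$cleaned = preg_replace($pattern2, '', $html);

echo $unchanged, "\n"; // still contains <o:p></o:p>
echo $cleaned, "\n";   // <P><SPAN>bob</SPAN></P>
```

If the real code already assigns the result, this is not the cause and the pattern itself is the next thing to look at.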

Post by feyd »

try this:

Code: Select all

preg_replace('#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si','',...)
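
A quick way to sanity-check the two passes together is to run them over one paragraph of the Word sample from the first post. This is only a sketch: the variable name $html is made up, and the result in the final comment is the target markup that the first post asked for.

```php
<?php
// One paragraph copied from the MS Word sample at the top of the thread.
$html = '<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">This will be <B>bold here</B><o:p></o:p></SPAN></P>';

// Pass 1: shrink opening p/span/li/ul/ol tags down to the bare tag name.
$html = preg_replace('#<\s*(p|span|li|ul|ol)\s+.*?>#si', '<\1>', $html);

// Pass 2: drop empty spans, the xml namespace declaration, and o:p tags.
$html = preg_replace('#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si', '', $html);

echo $html, "\n";
// prints: <P><SPAN>This will be <B>bold here</B></SPAN></P>
```

Note that both calls assign the return value back to $html; preg_replace() never edits its argument in place.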

Post by Burrito »

that gets us there...almost. The empty spans are still there, but that's because in the second pattern, they aren't empty (have the <o:p>'s in them)

I can just add another one after pattern 2 and then strip those out if needed.

many many thanks feyd.

...the slapping will continue...

Burr
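
The extra pass mentioned above could be sketched like this. The idea is that a single preg_replace() call walks the string once, so a span that only becomes empty after its o:p tags are removed is not revisited; a third call with just the empty-span part of pattern 2 picks it up. The variable names are made up, and the input is one of the "blank" paragraphs from the sample after pattern 1 has already run.

```php
<?php
// A blank Word paragraph as it looks after pattern 1 has stripped the attributes.
$html = '<P><SPAN> <o:p></o:p></SPAN></P>';

// Pattern 2 (revised version from the post above).
$pattern2 = '#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si';
$afterPass2 = preg_replace($pattern2, '', $html);

// Pass 3: the span is empty only now, so look for empty spans once more.
$afterPass3 = preg_replace('#<\s*span\s*>\s*<\s*/\s*span\s*>#si', '', $afterPass2);

echo $afterPass2, "\n"; // <P><SPAN> </SPAN></P>
echo $afterPass3, "\n"; // <P></P>
```

Running pattern 2 itself a second time would have the same effect on this input; the shorter pattern just makes the intent of the third pass clearer.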

Post by feyd »

How it works, longer version:

pattern 1 of

Code: Select all

#<\s*(p|span|li|ul|ol)\s+.*?>#si
# - pattern delimiter, opening (many symbols are possible.. I just use # most often)
<\s* - look for the less than (<) character, followed by zero or more whitespace characters
(p|span|li|ul|ol) - look for p, span, li, ul, or ol, and remember which one was found
\s+ - followed by one or more whitespace characters
.*?> - followed by the shortest run of zero or more characters (any) until a greater than (>) character is found.
# - end pattern delimiter
s - single line (dot-all) modifier: lets . match newlines too, so a tag that wraps onto another line still matches
i - ignore case modifier
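
To see what the two modifiers contribute, the same pattern can be run with and without them. This is a sketch on a made-up input (an uppercase tag whose attributes wrap onto a second line); the variable names are not from the thread.

```php
<?php
// Made-up sample: uppercase tag name, attributes split across two lines.
$html = "<P class=MsoNormal\n   style=\"MARGIN: 0in 0in 0pt\">text</P>";

// Pattern 1 without its closing modifiers, so they can be appended one at a time.
$pattern = '#<\s*(p|span|li|ul|ol)\s+.*?>#';

// No modifiers: "P" is not "p", so nothing matches.
$plain = preg_replace($pattern, '<\1>', $html);

// i only: the tag name matches, but . stops at the newline before reaching ">".
$caseless = preg_replace($pattern . 'i', '<\1>', $html);

// s and i together: the whole opening tag is matched and scrunched.
$both = preg_replace($pattern . 'si', '<\1>', $html);

var_dump($plain === $html);    // bool(true)
var_dump($caseless === $html); // bool(true)
echo $both, "\n";              // <P>text</P>
```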


pattern 2 of

Code: Select all

#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si
# - start pattern
(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>) - search for whitespace filled spans, the xml namespace, or the o:p tag.
<\s*span\s*>\s*<\s*/\s*span\s*> - search for spans with nothing but whitespace inside the tags or between them.
<\?xml\s*:\s*namespace.*?> - the xml namespace tag, allowing only whitespace around the colon between xml and namespace. The rest of the tag is consumed by .*?>
<\s*/?\s*o\s*:\s*p\s*> - the opening or closing o:p tag (the /? makes the slash optional), with the same tolerance for whitespace as the span match
# - end pattern delimiter
s - single line (dot-all) modifier: lets . match newlines too, so a tag that wraps onto another line still matches
i - ignore case modifier
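
One way to watch the three alternatives at work is to list what the pattern matches instead of replacing it. The sketch below uses preg_match_all() on a made-up string that contains one example of each case; $html and $matches are illustrative names only.

```php
<?php
// Made-up input with one whitespace-only span, one xml namespace tag, and an o:p pair.
$html = '<SPAN> </SPAN><?xml:namespace prefix = o /><o:p></O:P>';

$pattern2 = '#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si';

// $matches[0] receives every piece of text the whole pattern matched, in order.
preg_match_all($pattern2, $html, $matches);

print_r($matches[0]);
// Four matches are expected:
//   <SPAN> </SPAN>                  (first alternative)
//   the xml:namespace tag           (second alternative)
//   <o:p> and </O:P>                (third alternative, matched twice)
```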

Post by Burrito »

very excellent. One quick question though:

when I replace with <\\1>, what is that doing?

I assume that you're telling it to replace with the "1" character after the numeral which is ">"...correct me if I'm wrong.

If I'm right, then what does the <\\ do?

thx again feyd... you just made me a hero here at work and I have you to thank...I'll buy you some pie sometime :)

Burr

Post by feyd »

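A backslash followed by a number (limited) asks the regex engine to insert, at that spot, whatever the numbered group in the pattern remembered. The backslash is doubled in the code because that is how a single literal backslash is written inside a double-quoted PHP string.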

In this case, I asked the regex engine to remember if it found a p, span, li, ul, or ol tag.

There are other, more complex uses of backreferences like finding the same value later within the same pattern.. but you don't need to get into that right now.. You can search through and read my other regex explanation posts if you'd like.

I know of one (with children) that can be found through "+stripper +tag", I believe.
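
Both kinds of backreference can be sketched in a few lines. The first half replays the span test from earlier in the thread with and without the replacement reference; the second half is a made-up example of referring back to a group inside the pattern itself. All variable names are illustrative.

```php
<?php
$pattern = '#<\s*(p|span|li|ul|ol)\s+.*?>#si';
$html    = '<span style="FONT-SIZE: 11pt">bob</span>';

// "<\\1>" in double quotes is the three characters <, \1, >: group 1 ("span") goes between the brackets.
$kept = preg_replace($pattern, "<\\1>", $html);

// An empty replacement throws the whole opening tag away, as in the earlier test.
$lost = preg_replace($pattern, "", $html);

// '<$1>' is another way to write the same replacement reference.
$dollar = preg_replace($pattern, '<$1>', $html);

// Inside a pattern, \1 means "the same text group 1 matched": the closing tag must pair with the opening one.
preg_match_all('#<(b|u)>.*?</\1>#', '<b>bold</b> <u>under</u> <b>bad</u>', $m);

echo $kept, "\n";   // <span>bob</span>
echo $lost, "\n";   // bob</span>
echo $dollar, "\n"; // <span>bob</span>
print_r($m[0]);     // the two well-paired tags; <b>bad</u> is skipped
```

A single backslash would not do in the double-quoted version, because PHP reads "\1" there as an octal character escape rather than a backslash and a digit. In single quotes, '<\1>' works as written.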