PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




Posted: Tue Jan 11, 2005 2:51 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
I have dabbled a little bit with regular expressions but in all the dabbling I've done, I didn't retain anything I've learned...add that to the fact that I didn't understand any of the dabbling I did and you have a recipe for "Complete friggin' regular expression newbie = me"

here's my situation: I have some content developers who insist on developing their content in MS Word and then pasting it into a Rich Text editor that I'm using on my web site. This is all hunky-dory until you look at the HTML code that MS Word creates....horrible stuff there.

What I need is to write some regular expressions to strip out all of the mule poo that MS Word has decided is necessary to show an html page.

See code example below:

<P class=MsoBodyText style="MARGIN: 0in 0in 0pt">I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <?xml:namespace prefix = o ns = "urn:schemas-microsoft-com:office:office" /><o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">This will be <B>bold here</B><o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be <U>underlined</U>.<o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.<o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<UL style="MARGIN-TOP: 0in" type=disc>
<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Maybe some bullets<o:p></o:p></SPAN>
</LI>

<LI class=MsoNormal style="MARGIN: 0in 0in 0pt; mso-list: l0 level1 lfo1; tab-stops: list .5in">
<SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt">Or two<o:p></o:p></SPAN>
</LI>
</UL>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt"> <o:p></o:p></SPAN></P>

<p class='nsize'><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: 'Times New Roman'; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">And some more p’s</SPAN></P>

I need a regular expression to just strip out all of the extra junk in the <p>'s and <span>'s and <ul>'s etc.

in this particular example the <b>'s and <u>'s didn't get the extra stuff added, but in most cases I've seen, a lot of junk gets added to them as well.

The end result after my regular expression should look like this:

<P>I am testing some stuff that I create in here to see how I could potentially change it with the rte.</P>

<P></P>

<P><SPAN>This will be <B>bold here</B></SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN>And this will be <U>underlined</U>.</SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN>And this will be a <a target="_blank" href="http://www.mysite.net/">hyperlink</A>.</SPAN></P>

<P><SPAN></SPAN></P>

<UL>
<LI>
<SPAN>Maybe some bullets</SPAN>
</LI>

<LI>
<SPAN>Or two</SPAN>
</LI>
</UL>

<P><SPAN></SPAN></P>

<P><SPAN></SPAN></P>

<P><SPAN></SPAN></P>

<p><SPAN>And some more p’s</SPAN></P>


that still has some extraneous stuff, but it's stuff that I can live with.

I think I could just write a regular expression that searches for "<P" and then removes everything up to the ">" and kills it with a very noble death.

any guidance you can provide will be greatly appreciated.

Burr


Posted: Tue Jan 11, 2005 3:16 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
because of the other tags and things involved, you'll need more than one expression.

untested:

pattern 1: spit
preg_replace('#<\s*(p|span|li|ul|ol)\s+.*?>#si', '<\1>',...)

pattern 2: polish
preg_replace('#<(\s*span\s*>\s*<\s*/\s*span\s*|\?xml\s*:\s*namespace.*?|\s*/?\s*o\s*:\s*p\s*)>#si','',...)



lastly, I recommend slapping the hell out of your copy writers. :)


Posted: Tue Jan 11, 2005 3:27 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
feyd wrote:
lastly, I recommend slapping the hell out of your copy writers. :)


oh yes, there's been much slapping already...to no avail.

I tried what you gave me:

<?php
$str = '<span style="FONT-SIZE: 11pt; FONT-FAMILY: Arial; mso-bidi-font-size: 12.0pt; mso-fareast-font-family: Times New Roman; mso-ansi-language: EN-US; mso-fareast-language: EN-US; mso-bidi-language: AR-SA">bob</span>';

$rep = "#<\s*(p|span|li|ul|ol)\s+.*?>#si";

$str = preg_replace($rep,"",$str);



preg_replace('#<(\s*span\s*>\s*<\s*/\s*span\s*|\?xml\s*:\s*namespace.*?|\s*/?\s*o\s*:\s*p\s*)>#si','',$str);

echo $str;


and it just kills the entire opening <span> tag and just yields:

bob</span>

I would try to just adjust it myself, but to me, looking at what you wrote is like trying to learn Chinese by reading a Swahili textbook....

thx for the help!

Burr


Posted: Tue Jan 11, 2005 3:29 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
doh! I was missing the "<\\1>"

sry about that..works like a champ

thx again amigo.

wonder if I could trouble you to explain a lil' about what that's actually doing?


Posted: Tue Jan 11, 2005 3:58 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
How it works, the short version.

pattern 1:
find a paragraph, span, list item, unordered list, or ordered list starting tag. Scrunch it to minimal size.

pattern 2:
Find ~empty span and o:p tags, nix those. Find the xml namespace define, toss that.


Posted: Tue Jan 11, 2005 4:04 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
was hoping for a little more detail...but you've done more than enough.

one thing though, pattern two isn't working.

I still get the <?xml junk and the <o:p> tags.

any ideas?
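
One thing that may be worth ruling out with the test script posted earlier in this thread: preg_replace() returns the new string and leaves its argument alone, and that script calls the second preg_replace() without assigning the result. Whether that was the actual cause here can't be told from the thread, so treat this as a minimal sketch of the pitfall only (the sample string is made up):

```php
<?php
// Hypothetical sample input, not taken from the Word output above.
$str = 'text<o:p></o:p>';

// The return value is thrown away here, so $str stays as it was.
preg_replace('#<\s*/?\s*o\s*:\s*p\s*>#si', '', $str);
$unchanged = $str;

// Assigning the result back keeps the cleaned string.
$str = preg_replace('#<\s*/?\s*o\s*:\s*p\s*>#si', '', $str);

var_dump($unchanged === 'text<o:p></o:p>'); // bool(true)
var_dump($str === 'text');                  // bool(true)
```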


Posted: Tue Jan 11, 2005 4:07 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
try this:
preg_replace('#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si','',...)
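
For reference, the two passes can be chained in one small function. This is only a sketch: the function name clean_word_html is made up for illustration, the sample line is shortened from the Word output in the first post, and the patterns are the ones posted in this thread, unchanged.

```php
<?php
// Hypothetical helper name; both patterns come from the posts above.
function clean_word_html(string $html): string
{
    // Pass 1: shrink opening p/span/li/ul/ol tags down to the bare tag name.
    $html = preg_replace('#<\s*(p|span|li|ul|ol)\s+.*?>#si', '<\1>', $html);

    // Pass 2: drop whitespace-only spans, the xml namespace declaration, and o:p tags.
    $html = preg_replace(
        '#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si',
        '',
        $html
    );

    return $html;
}

// Shortened version of one line of the Word output.
$sample = '<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 11pt">This will be <B>bold here</B><o:p></o:p></SPAN></P>';

echo clean_word_html($sample), "\n";
// prints: <P><SPAN>This will be <B>bold here</B></SPAN></P>
```

Each pass assigns the result back to $html, since preg_replace() returns the cleaned string rather than modifying its input.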


Posted: Tue Jan 11, 2005 4:18 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
that gets us there...almost. The empty spans are still there, but that's because in the second pattern, they aren't empty (have the <o:p>'s in them)

I can just add another one after pattern 2 and then strip those out if needed.
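
A rough sketch of that extra pass, assuming the gap inside the span is an ordinary space (the input line is a made-up stand-in for one of the blank Word paragraphs after pattern 1 has run):

```php
<?php
// Hypothetical input: a blank Word paragraph after pattern 1 has been applied.
$html = '<P><SPAN> <o:p></o:p></SPAN></P>';

// Pattern 2 as posted: when the scan reaches the span it still contains the
// o:p tags, so only those are removed on this pass.
$html = preg_replace(
    '#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si',
    '',
    $html
);
$afterPattern2 = $html; // '<P><SPAN> </SPAN></P>'

// Extra pass: the span now holds only whitespace, so the span pattern matches.
$html = preg_replace('#<\s*span\s*>\s*<\s*/\s*span\s*>#si', '', $html);

echo $afterPattern2, "\n"; // <P><SPAN> </SPAN></P>
echo $html, "\n";          // <P></P>
```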

many many thanks feyd.

...the slapping will continue...

Burr


Posted: Tue Jan 11, 2005 4:27 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
How it works, longer version:

pattern 1:
#<\s*(p|span|li|ul|ol)\s+.*?>#si

# - pattern delimiter, starting symbol (many possible symbols.. I just use # most often)
<\s* - look for the less than (<) character, followed by zero or more whitespaces
(p|span|li|ul|ol) - look for p, span, li, ul, or ol
\s+ - followed by 1 or more whitespaces
.*?> - followed by the shortest run of zero or more characters (any) until a greater than (>) character is found.
# - end pattern delimiter
s - single line modifier: lets . match newlines too, so the pattern doesn't stop at carriage returns inside a tag
i - ignore case modifier


pattern 2:
#(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>)#si

# - start pattern
(<\s*span\s*>\s*<\s*/\s*span\s*>|<\?xml\s*:\s*namespace.*?>|<\s*/?\s*o\s*:\s*p\s*>) - search for whitespace filled spans, the xml namespace, or the o:p tag.
<\s*span\s*>\s*<\s*/\s*span\s*> - search for spans with nothing but whitespace inside the tags or between them.
<\?xml\s*:\s*namespace.*?> - xml tag allowing only whitespace between the colon and the xml and namespace marks. The rest of the tag is consumed with .*?>
<\s*/?\s*o\s*:\s*p\s*> - an opening or closing o:p tag (the / is optional), with whitespace allowed anywhere inside it
# - end pattern delimiter
s - single line modifier: lets . match newlines too, so the pattern doesn't stop at carriage returns inside a tag
i - ignore case modifier
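
To see what the lazy .*? and the s modifier each contribute, here is a small sketch that runs three variants of pattern 1 against a made-up tag that wraps across two lines (the input and the variable names are only for illustration):

```php
<?php
// Hypothetical input: an opening tag whose attributes wrap onto a second line.
$html = "<P class=MsoNormal\nstyle=\"MARGIN: 0in\">text</P>";

// Lazy .*? stops at the first ">", so only the opening tag is rewritten.
$lazy = preg_replace('#<\s*(p)\s+.*?>#si', '<\1>', $html);

// Greedy .* runs on to the last ">", swallowing the text and the closing tag.
$greedy = preg_replace('#<\s*(p)\s+.*>#si', '<\1>', $html);

// Without s, "." cannot cross the newline inside the tag, so nothing matches.
$noS = preg_replace('#<\s*(p)\s+.*?>#i', '<\1>', $html);

echo $lazy, "\n";              // <P>text</P>
echo $greedy, "\n";            // <P>
var_dump($noS === $html);      // bool(true)
```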


Posted: Tue Jan 11, 2005 4:34 pm
Spockulator

Joined: Wed Feb 04, 2004 9:15 pm
Posts: 4713
Location: Eden, Utah
very excellent. One quick question though:

when I replace with <\\1>, what is that doing?

I assume that you're telling it to replace with the "1" character after the numeral which is ">"...correct me if I'm wrong.

If I'm right, then what does the <\\ do?

thx again feyd... you just made me a hero here at work and I have you to thank...I'll buy you some pie sometime :)

Burr


Posted: Tue Jan 11, 2005 4:46 pm
Neighborhood Spidermoddy

Joined: Mon Mar 29, 2004 4:24 pm
Posts: 31559
Location: Bothell, Washington, USA
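a backslash followed by a number (limited; preg_replace accepts 0 to 99) asks the regex engine to insert, at that spot, whatever the parenthesized group with that number matched. The backslash is doubled only because PHP treats a backslash specially inside a double-quoted string; "\\1" reaches the regex engine as \1.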

In this case, I asked the regex engine to remember if it found a p, span, li, ul, or ol tag.

There are other, more complex uses of backreferences, like finding the same value later within the same pattern.. but you don't need to get into that right now.. You can search through and read my other regex explanation posts if you'd like.

I know of one (with children) that can be found through "+stripper +tag", I believe.
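
A short sketch of how the replacement string behaves, using a made-up span tag and pattern 1 from above. The variable names are only for illustration.

```php
<?php
// Hypothetical sample tag; the pattern is pattern 1 from this thread.
$tag     = '<SPAN style="FONT-SIZE: 11pt">';
$pattern = '#<\s*(p|span|li|ul|ol)\s+.*?>#si';

// Double-quoted PHP string: "\\1" is the two characters \1 by the time
// preg_replace sees it. (A bare "\1" in double quotes would be an octal escape.)
$a = preg_replace($pattern, "<\\1>", $tag);

// Single-quoted string: '\1' is already those two characters.
$b = preg_replace($pattern, '<\1>', $tag);

// $1 is another spelling of the same reference.
$c = preg_replace($pattern, '<$1>', $tag);

// Empty replacement: the whole opening tag disappears, as in the earlier test.
$d = preg_replace($pattern, '', $tag . 'bob</span>');

echo $a, "\n"; // <SPAN>
echo $b, "\n"; // <SPAN>
echo $c, "\n"; // <SPAN>
echo $d, "\n"; // bob</span>
```

The reference gives back the text exactly as it was matched, which is why the tag keeps its original upper case even though the pattern lists the names in lower case.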

