Strip bad MS Word tags from string?

Sindarin
Forum Regular
Posts: 521
Joined: Tue Sep 25, 2007 8:36 am
Location: Greece

Strip bad MS Word tags from string?

Post by Sindarin »

Everything works in my CMS, but there is a problem when the client copies the topic content from his MS Word documents into the rich text area (I'm using TinyMCE). This results in a post that contains these MS tags:

<meta http-equiv="\"Content-Type\"" content="\"text/html;" charset="utf-8\"" />
<meta name="\"ProgId\"" content="\"Word.Document\"" />
<meta name="\"Generator\"" content="\"Microsoft" />
<meta name="\"Originator\"" content="\"Microsoft" />
<link rel="\"File-List\"" href="\" />
<!--[if gte mso 9]>
<xml> Normal   0         false   false   false                             MicrosoftInternetExplorer4 </xml><![endif]--><!--[if gte mso 9]>
<xml> </xml><![endif]-->
<style><!--
 
--></style>
<!--[if gte mso 10]>
 <mce:style><!   /* Style Definitions */  table.MsoNormalTable  {mso-style-name:\"Table Normal\";   mso-tstyle-rowband-size:0;  mso-tstyle-colband-size:0;  mso-style-noshow:yes;   mso-style-parent:\"\";  mso-padding-alt:0cm 5.4pt 0cm 5.4pt;    mso-para-margin:0cm;    mso-para-margin-bottom:.0001pt;     mso-pagination:widow-orphan;    font-size:10.0pt;   font-family:\"Times New Roman\";    mso-ansi-language:#0400;    mso-fareast-language:#0400;     mso-bidi-language:#0400;} --> <!--[endif]-->
These tags cause Internet Explorer 6/7 (ugh..) to break the page layout. Firefox seems to gracefully ignore them.

TinyMCE has a button for pasting from an MS Word document that removes all these tags, but the client usually forgets to use it, which results in a broken CMS. Is there any server-side way to remove those tags with PHP while keeping the rest of the rich HTML content?
mattpointblank
Forum Contributor
Posts: 304
Joined: Tue Dec 23, 2008 6:29 am

Re: Strip bad MS Word tags from string?

Post by mattpointblank »

I sympathise - I come across this a lot.

There's a function called strip_tags() which does what you might expect: it removes HTML tags from input. You can specify which tags to leave in, so you can strip out all the pointless <meta> tags etc. but keep (for example) <p>, <b> and <a>.
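
As a minimal sketch of that suggestion (the input string and variable names here are made up for illustration), the second argument to strip_tags() acts as an allow-list. Note that strip_tags() removes only the tags themselves, not the text between them, and it also strips HTML comments.

```php
<?php
// Hypothetical input: wanted markup mixed with a Word-style <meta> tag.
$html = '<meta name="ProgId" content="Word.Document" /><p>Hello <b>world</b>, see <a href="page.html">this</a>.</p>';

// Tags listed in the second argument survive; every other tag is removed.
$clean = strip_tags($html, '<p><b><a>');

echo $clean;
```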

The only drawback is that Word manages to infect these elements too, so you'll still end up with <p class="MsoNormal"> or something similar. I'm still not sure how to get around that one, besides a hard search and replace to clean the text every time they submit it.
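
One way the "hard search and replace" could look, as a rough sketch only: the function name remove_word_attributes() is hypothetical, and the patterns assume Word's usual class="Mso..." names and mso- style properties. It deliberately drops the whole style attribute when it contains any mso- declaration, which also discards ordinary CSS in that attribute.

```php
<?php
// Hypothetical helper: removes Word-specific attributes from tags
// that strip_tags() has left in place.
function remove_word_attributes(string $html): string
{
    // Drop class attributes whose value starts with "Mso", e.g. class="MsoNormal".
    $html = preg_replace('/\s+class\s*=\s*(["\']?)Mso[a-z0-9]*\1/i', '', $html);

    // Drop double-quoted style attributes that contain an "mso-" declaration.
    $html = preg_replace('/\s+style\s*=\s*"[^"]*mso-[^"]*"/i', '', $html);

    return $html;
}

echo remove_word_attributes('<p class="MsoNormal" style="mso-pagination:widow-orphan;margin:0">Hi</p>');
```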
Sindarin
Forum Regular
Posts: 521
Joined: Tue Sep 25, 2007 8:36 am
Location: Greece

Re: Strip bad MS Word tags from string?

Post by Sindarin »

I tried this setup,

$product_description = strip_tags($product_description, "<p><a><br><img><span><div><b><i><font><table><li><ol><ul>");

But this removes the whole text along with the tags; it seems something is confusing the function. Still, I guess submitting a blank post is better than letting these tags break the page.
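
The thread does not establish why strip_tags() emptied the text. One guess is that the stray \" sequences in the pasted markup leave a quote unbalanced inside a tag, so the function never finds where that tag ends. Under that assumption, a possible workaround is to remove the Word-specific blocks with regular expressions first and only then call strip_tags(). The function name clean_word_html() is hypothetical; the allow-list is the one from the post above.

```php
<?php
// Hypothetical helper: pre-cleans Word markup before handing it to strip_tags().
function clean_word_html(string $html): string
{
    $patterns = array(
        // HTML comments, including the <!--[if gte mso ...]> conditional blocks.
        '/<!--.*?-->/s',
        // Elements whose inner text is unwanted, removed together with their content.
        '/<(style|xml|mce:style)\b[^>]*>.*?<\/\1>/is',
        // Word's document-level tags, matched without regard to quote balance.
        '/<(meta|link)\b[^>]*>/i',
    );
    $html = preg_replace($patterns, '', $html);

    // Whatever markup remains is filtered through the allow-list.
    return strip_tags($html, '<p><a><br><img><span><div><b><i><font><table><li><ol><ul>');
}

$sample = '<meta name="\"Generator\"" content="\"Microsoft" />'
        . '<!--[if gte mso 9]><xml> Normal 0 </xml><![endif]-->'
        . '<style><!-- x --></style>'
        . '<p>Hello <b>world</b></p>';

echo clean_word_html($sample);
```

Regular expressions are a blunt tool for HTML, so this is best treated as a safety net for when the client skips the paste-from-Word button, not as a complete sanitizer.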