Trying to strip MSWord tags

Posted: Fri Apr 03, 2009 7:36 am
by Sindarin
I am trying to write a function that detects and removes the ugly tags MS Word leaves behind when content is copied and pasted into my rich text field:

<?php

/* DETECT AND REMOVE MSTAGS */

function strip_mstags($str)
{
    // Each known Word fragment is replaced with a placeholder comment.
    $str = str_replace('<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />', '<!--stripped-->', $str);
    $str = str_replace('<meta name="ProgId" content="Word.Document" />', '<!--stripped-->', $str);
    $str = str_replace('<meta name="Generator" content="Microsoft Word 11" />', '<!--stripped-->', $str);
    $str = str_replace('<meta name="Originator" content="Microsoft Word 11" />', '<!--stripped-->', $str);
    $str = str_replace('<!--[if gte mso 9]><xml>', '<!--stripped-->', $str);
    $str = str_replace('</xml><![endif]-->', '<!--stripped-->', $str);
    $str = str_replace('<!--[if gte mso 10]>', '<!--stripped-->', $str);
    $str = str_replace('<mce:style>', '<!--stripped-->', $str);
    $str = str_replace('<p class="MsoNormal">', '<!--stripped-->', $str);
    $str = str_replace('<o:p>', '<!--stripped-->', $str);
    $str = str_replace('</o:p>', '<!--stripped-->', $str);
    $str = str_replace('<link rel="File-List" href="', '<!--stripped-->', $str);
    $str = str_replace('<!--[if', '<!--stripped-->', $str);
    $str = str_replace('<![endif]-->', '<!--stripped-->', $str);
    $str = str_replace('<w:WordDocument>', '<!--stripped-->', $str);

    return $str;
}

?>
Can someone suggest a better way to do this, point me to any lists of MS-specific tags, and show me how to remove the entire contents of a block like this:

<!--
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {mso-style-parent:"";
    margin:0cm;
    margin-bottom:.0001pt;
    mso-pagination:widow-orphan;
    font-size:12.0pt;
    font-family:"Times New Roman";
    mso-fareast-font-family:"Times New Roman";}
a:link, span.MsoHyperlink
    {color:blue;
    text-decoration:underline;
    text-underline:single;}
a:visited, span.MsoHyperlinkFollowed
    {color:purple;
    text-decoration:underline;
    text-underline:single;}
@page Section1
    {size:612.0pt 792.0pt;
    margin:72.0pt 90.0pt 72.0pt 90.0pt;
    mso-header-margin:36.0pt;
    mso-footer-margin:36.0pt;
    mso-paper-source:0;}
div.Section1
    {page:Section1;}
-->
 

Re: Trying to strip MSWord tags

Posted: Fri Apr 03, 2009 10:38 am
by mattpointblank
This is a tricky one. I use TinyMCE for rich text editing online, and it has functions to strip out bad code like the above. Have you tried it?

Re: Trying to strip MSWord tags

Posted: Sun Apr 05, 2009 8:38 am
by Sindarin
I have, and it works; however, my clients are not savvy enough to use the Paste from Word button every time. That's why I want to do this in PHP.
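
For anyone after a server-side starting point, here is a rough sketch of how the cleanup could be done with preg_replace instead of a long list of exact str_replace strings. Nobody in this thread posted it: the function name strip_word_markup and every pattern in it are assumptions, and it assumes PHP 7 or later for the type hints. It removes whole blocks (style, xml, mce:style, and HTML comments, contents included) before handling individual tags, which is what takes care of the "Style Definitions" comment above. Regexes are not a real HTML parser, so treat this as a best-effort filter and test it against your own pasted content.

<?php

// Hypothetical helper, not from the thread: best-effort removal of Word paste residue.
function strip_word_markup(string $html): string
{
    $patterns = [
        // <style>, <xml> and <mce:style> blocks, including everything inside them.
        '/<(style|xml|mce:style)\b[^>]*>.*?<\/\1>/is',
        // HTML comments with their contents (covers conditional comments
        // and the "Style Definitions" block).
        '/<!--.*?-->/s',
        // Stray conditional-comment closers left without an opener.
        '/<!\[endif\]-->/i',
        // <meta> and <link> tags, which do not belong in a pasted fragment.
        '/<(?:meta|link)\b[^>]*>/i',
        // Namespaced Office tags such as <o:p>, </o:p>, <w:WordDocument>;
        // only the tags are removed, any text between them is kept.
        '/<\/?[a-z][a-z0-9]*:[a-z][a-z0-9]*\b[^>]*>/i',
        // class attributes whose value starts with "Mso" (e.g. class="MsoNormal").
        '/\sclass=("Mso[^"]*"|\'Mso[^\']*\'|Mso\w*)/i',
    ];

    // Patterns are applied in order; preg_replace returns null on a regex
    // error, in which case the input is returned untouched.
    return preg_replace($patterns, '', $html) ?? $html;
}

You would call it on the submitted field before saving, for example $clean = strip_word_markup($_POST['body']); where body is just a placeholder field name. The order matters: stripping the block elements first means a conditional comment wrapped around an xml or mce:style block collapses to an empty comment, which the second pattern then removes. Note that the class pattern drops the whole attribute, so a value like class="MsoNormal intro" loses intro as well.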