Removing MS Word HTML from a file

Posted: Wed Jun 18, 2003 5:14 am
by martincrumlish
Hi,

I have a problem with converting Word docs to HTML. As you probably know, when Word generates its HTML it adds a lot of needless tags. Is there a way I can cut these out and just leave basic formatting tags such as <b>, <br>, <ol>, <li>, etc.?

I found this code to remove Word HTML:

Code:

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out every html tag
                 "'([\r\n])[\s]+'",                 // Collapse a line break plus following white space to the line break
                 "'&(quot|#34);'i",                 // Replace html entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i",
                 "'&#(\d+);'e");                    // Numeric entities: /e evaluates the replacement as php (modifier removed in PHP 7)

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&", 
                  "<", 
                  ">", 
                  " ", 
                  chr(161), 
                  chr(162), 
                  chr(163), 
                  chr(169), 
                  "chr(\\1)"); 

$content = preg_replace ($search, $replace, $content);
I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of plain text with no line breaks or anything. I am afraid I don't fully understand the code above, so I was hoping someone on here could help me out.

Basically, I need help modifying the code above to remove all the crap but still leave certain tags.

Thanks in advance,
Martin
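
One possible way to keep only basic formatting tags is to use an allow-list instead of a strip-everything pattern. The sketch below is not from the thread: the function name clean_word_html and its default tag list are made up for illustration, and it assumes the Word HTML is already loaded into a string. It relies on PHP's built-in strip_tags() with its allowed-tags argument, then removes the attributes Word leaves on the tags that survive.

```php
<?php
// Hypothetical helper: the name and the default tag list are illustrative only.
function clean_word_html($html, $allowed = '<b><i><u><br><p><ol><ul><li>')
{
    // 1. Drop whole <script>, <style> and <xml> blocks, including their contents
    //    (strip_tags alone would leave the CSS or script text behind).
    $html = preg_replace('#<(script|style|xml)\b[^>]*>.*?</\1>#is', '', $html);

    // 2. Drop comments, which also covers Word's conditional comments.
    $html = preg_replace('#<!--.*?-->#s', '', $html);

    // 3. Remove every tag that is not in the allow-list.
    $html = strip_tags($html, $allowed);

    // 4. Strip attributes (class=MsoNormal, style='...') from the opening tags
    //    that are left, keeping a trailing slash if there is one.
    $html = preg_replace('#<([a-z][a-z0-9]*)\b[^>]*?(/?)>#i', '<$1$2>', $html);

    // 5. Collapse runs of white space into single spaces.
    $html = preg_replace('#\s+#', ' ', $html);

    return trim($html);
}
```

Called as clean_word_html($content), input such as <p class=MsoNormal><b>Hi</b><o:p></o:p></p> should come out as <p><b>Hi</b></p>. An attribute value containing a literal > would trip up step 4, so treat this as a starting point rather than a robust sanitizer.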

Posted: Wed Jun 18, 2003 7:37 am
by Stoker
Not the answer to your question, but:
Why use stuff like IE and Word to begin with?? Copying and pasting and generating HTML with those makes the biggest mess in the world. Not to mention MS Publisher (unpatched) creating 2-megabyte pages, and FrontPage not understanding that a table can be inside a table inside a table inside a table inside a layer inside a table, etc.

Software like Macromedia Dreamweaver has built-in functions for cleaning up Word HTML, but if you have DW, I see no point in making garbage sites with Bill Gates or Steve Ballmer to begin with...

Posted: Wed Jun 18, 2003 7:41 am
by martincrumlish
Hi Stoker.

You don't need to tell me about the mess generated by the programs when they try to make HTML. They are terrible.

The reason I have to work around this problem is that there are hundreds of Word documents in our dept. at work that need to go into my application. Mr X the accountant wants to be able to write a document in Word and add it to the system. He is too busy/dumb/lazy to learn to do it in HTML, or to paste his content as plain text and reformat it in my WYSIWYG editor, so he wants to paste his Word document. As Mr X is higher up than me, I have to come up with something to make his life easier... hence my problem.

The purpose of the workaround is to remove the garbage as well as possible so it doesn't ruin the hard work I put into developing the intranet application for these higher-ups to use.

Posted: Wed Jun 18, 2003 8:54 pm
by trollll
As someone who worked for the EPA and had to deal with crap like this on a daily basis ("I just need these 125 .docs each converted to HTML and PDF today, will that present a problem?"), I can honestly tell you that the easiest way I could come up with to reformat Word-generated HTML is this: open it in Word, copy the text (without formatting) and paste it into your template. Then hand-code the images and such into it, and the HTML around it.

I spent a lot of down-time trying to come up with a good way to automatically convert everything (or at least siphon some of the crap out of it). Not worth it! Spend that time playing with cellular automata or reading the Onion or something productive!

If they insist on editing site content themselves, either get approval to build your own CMS (should take a couple of weeks, but well worth the effort when done well) or hand them a copy of HomeSite and tell them to deal with compatibility issues themselves.

Posted: Thu Jun 19, 2003 10:49 am
by Stoker
...as mentioned, Dreamweaver has a function for cleaning Word HTML. If you have to do it page by page anyway, you might save some time using DW.