Removing MS Word HTML from a file

martincrumlish
Forum Newbie
Posts: 13
Joined: Mon Oct 14, 2002 10:46 am

Post by martincrumlish »

Hi,

I have a problem with converting Word docs to HTML. As you probably know, when Word generates its HTML it adds a lot of needless tags. Is there a way I can cut these out and leave just the basic formatting tags such as <b>, <br>, <ol>, <li>, etc.?

I found this code to remove Word HTML:

$search = array ("'<script[^>]*?>.*?</script>'si",  // Strip out javascript 
                 "'<[\/\!]*?[^<>]*?>'si",           // Strip out html tags 
                 "'([\r\n])[\s]+'",                 // Strip out white space 
                 "'&(quot|#34);'i",                 // Replace html entities 
                 "'&(amp|#38);'i", 
                 "'&(lt|#60);'i", 
                 "'&(gt|#62);'i", 
                 "'&(nbsp|#160);'i", 
                 "'&(iexcl|#161);'i", 
                 "'&(cent|#162);'i", 
                 "'&(pound|#163);'i", 
                 "'&(copy|#169);'i", 
                 "'&#(\d+);'e");                    // evaluate as php 

$replace = array ("", 
                  "", 
                  "\\1", 
                  "\"", 
                  "&", 
                  "<", 
                  ">", 
                  " ", 
                  chr(161), 
                  chr(162), 
                  chr(163), 
                  chr(169), 
                  "chr(\\1)"); 

$content = preg_replace ($search, $replace, $content);

I found this on another site as a solution for removing Word HTML, but the problem is that it removes all of the HTML, leaving the file as a blob of plain text with no line breaks or anything. I'm afraid I don't fully understand the code above, so I was hoping someone here could help me out.
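
For anyone else puzzling over the snippet: as far as I can tell it works in stages (drop scripts, drop every tag, tidy whitespace, decode entities). A rough modern equivalent is sketched here. The function name `html_to_text` is made up for illustration, and `html_entity_decode()` stands in for the long entity list and the `/e` pattern, since that modifier was removed in PHP 7. One difference to be aware of: this sketch decodes to UTF-8, while the `chr()` calls in the original produce single-byte Latin-1 characters.

```php
<?php
// Hypothetical helper: a stage-by-stage restatement of the snippet above.
function html_to_text($html)
{
    // Stage 1: remove <script> blocks together with their contents.
    $text = preg_replace('~<script[^>]*>.*?</script>~si', '', $html);

    // Stage 2: remove every remaining tag. This is the step that
    // throws away all formatting, wanted or not.
    $text = preg_replace('~<[/!]*?[^<>]*?>~s', '', $text);

    // Stage 3: reduce a line break plus any following whitespace
    // to just the line break.
    $text = preg_replace('~([\r\n])\s+~', '$1', $text);

    // Stage 4: decode entities such as &amp; and &#169; in one call.
    return html_entity_decode($text, ENT_QUOTES, 'UTF-8');
}

echo html_to_text('<p>Tom &amp; Jerry</p>'); // prints: Tom & Jerry
```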

Basically, I need help modifying the code above to remove all the crap but still leave certain tags.
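
One possible way to do that, as a minimal sketch rather than a tested solution: give the tag-stripping pattern a negative lookahead so that whitelisted tags are skipped. The function name `strip_unlisted_tags` and the whitelist are assumptions for illustration. Note that attributes on the kept tags (Word's `class=MsoNormal` and so on) survive this pass.

```php
<?php
// Hypothetical helper: strip every tag except those named in $allowed.
function strip_unlisted_tags($html, array $allowed)
{
    // Build an alternation such as "b|br|ol|li" from the whitelist.
    $keep = implode('|', array_map('preg_quote', $allowed));

    // (?!/?(?:...)\b) refuses to match when the tag name, with or
    // without a leading slash, is one of the whitelisted names.
    $pattern = '~<(?!/?(?:' . $keep . ')\b)[^<>]*>~si';

    return preg_replace($pattern, '', $html);
}

$content = '<p class=MsoNormal><span style="color:red">Hello <b>world</b></span><o:p></o:p></p>';
echo strip_unlisted_tags($content, array('b', 'br', 'p', 'ol', 'li'));
// prints: <p class=MsoNormal>Hello <b>world</b></p>
```

The `\b` word boundary matters: without it, whitelisting `b` would also let `<body>` through.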

Thanks in advance,
Martin
Stoker
Forum Regular
Posts: 782
Joined: Thu Jan 23, 2003 9:45 pm
Location: SWNY

Post by Stoker »

Not the answer to your question, but:
Why use stuff like IE and Word to begin with? Copying and pasting and generating HTML with those makes the biggest mess in the world. Not to mention MS Publisher (unpatched) creating 2-megabyte pages, and FrontPage not understanding that a table can be inside a table inside a table inside a table inside a layer inside a table, etc.

Software like Macromedia Dreamweaver has built-in functions for cleaning up Word HTML, but if you have DW, I see no point in making garbage sites with Bill Gates or Steve Ballmer to begin with...
martincrumlish
Forum Newbie
Posts: 13
Joined: Mon Oct 14, 2002 10:46 am

Post by martincrumlish »

Hi Stoker.

You don't need to tell me about the mess generated by the programs when they try to make HTML. They are terrible.

The reason I have to work around this problem is that there are hundreds of Word documents in our dept. at work that need to go into my application. Mr X the accountant wants to be able to write a document in Word and add it to the system. He is too busy/dumb/lazy to learn to do it in HTML, or to paste his content as plain text and reformat it in my WYSIWYG editor, so he wants to paste his Word document. As Mr X is higher up than me, I have to come up with something to make his life easier... hence my problem.

The purpose of the workaround is to remove the garbage as well as possible so it doesn't ruin the hard work I put into developing the intranet application for these higher-ups to use.
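
For pasted Word content like this, a cleanup pass could be built around PHP's `strip_tags()` with a whitelist. The sketch below is only that: the function name `clean_word_html`, the default whitelist, and the choice of which elements to drop wholesale are all assumptions, and real Word output will likely have cases it misses (attribute values containing `>`, for example).

```php
<?php
// Hypothetical helper: reduce Word-generated HTML to a few plain tags.
function clean_word_html($html, $allowed = '<b><i><p><br><ol><ul><li>')
{
    // 1. Drop comments, including Word's conditional blocks such as
    //    <!--[if gte mso 9]> ... <![endif]-->.
    $html = preg_replace('~<!--.*?-->~s', '', $html);

    // 2. Drop elements whose contents should not be shown at all;
    //    strip_tags() on its own would leave their inner text behind.
    $html = preg_replace('~<(style|script|xml|head)\b[^>]*>.*?</\1>~si', '', $html);

    // 3. Remove every tag that is not on the whitelist.
    $html = strip_tags($html, $allowed);

    // 4. strip_tags() keeps attributes on allowed tags, so remove
    //    class=MsoNormal, style="...", lang=... and similar.
    $html = preg_replace('~<([a-z][a-z0-9]*)\b[^>]*?(/?)>~i', '<$1$2>', $html);

    // 5. Word pads text with &nbsp;; turn those into ordinary spaces.
    $html = str_replace('&nbsp;', ' ', $html);

    return trim($html);
}

$dirty = '<p class=MsoNormal style="margin:0"><span lang=EN-GB>Hello&nbsp;<b>world</b></span><o:p></o:p></p>';
echo clean_word_html($dirty); // prints: <p>Hello <b>world</b></p>
```

The order of the steps is deliberate: comments and style blocks go first so their contents never reach `strip_tags()`, and attributes are removed last, once only whitelisted tags remain.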
trollll
Forum Contributor
Posts: 181
Joined: Tue Jun 10, 2003 11:56 pm
Location: Round Rock, TX

Post by trollll »

As someone who worked for the EPA and had to deal with crap like this on a daily basis ("I just need these 125 .docs each converted to HTML and PDF today, will that present a problem?"), I can honestly tell you that the easiest way I could come up with to reformat Word-generated HTML was this: open it in Word, copy the text (without formatting), and paste it into your template. Then hand-code the images and such into it and the HTML around it.

I spent a lot of downtime trying to come up with a good way to automatically convert everything (or at least siphon some of the crap out of it). Not worth it! Spend that time playing with cellular automata or reading the Onion or something productive!

If they insist on editing site content themselves, either get approval to build your own CMS (should take a couple of weeks, but well worth the effort when done well) or hand them a copy of HomeSite and tell them to deal with compatibility issues themselves.
Stoker
Forum Regular
Posts: 782
Joined: Thu Jan 23, 2003 9:45 pm
Location: SWNY

Post by Stoker »

...as mentioned, Dreamweaver has a function for cleaning Word HTML. If you have to do it page by page anyway, you might save some time using DW.