Remove Microsoft Word HTML tags

I've spent a long time trying many different approaches to getting rid of MS Word HTML when importing or pasting text into my content management system, with very mixed success. Previous efforts involved using the MSHTML element DOM, but this was slow and difficult to implement. I think I've finally found a satisfactory and fast solution using only regular expressions. Please feel free to use it in your applications, and post any improvements you may find.

function cleanHTML($html) {
/// <summary>
/// Removes all FONT, SPAN, DEL and INS tags, and all class, lang, style, size and face attributes.
/// Designed to get rid of non-standard Microsoft Word HTML tags.
/// </summary>
// preg_replace is used because ereg_replace was removed in PHP 7

// start by completely removing all unwanted tags

$html = preg_replace('#</?(font|span|del|ins)[^>]*>#', '', $html);

// then run another pass over the html (twice), removing unwanted attributes
// (the double quotes inside the pattern are escaped; '<$1>' keeps the text before the attribute)

$attributes = "#<([^>]*)(class|lang|style|size|face)=(\"[^\"]*\"|'[^']*'|[^>]+)([^>]*)>#";
$html = preg_replace($attributes, '<$1>', $html);
$html = preg_replace($attributes, '<$1>', $html);

return $html;
}
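
One thing to watch for with the function above: the replacement keeps only the part of the tag before the matched attribute, so anything that follows it (an href, for example) is dropped, and two passes deal with at most two attributes per tag. Below is a minimal sketch of one possible refinement, not a definitive implementation. The name cleanWordHtml, the case-insensitive matching and the repeat-until-stable loop are my own assumptions for illustration.

```php
<?php
// Hypothetical variant of cleanHTML(); the function name, the "i" flag and
// the loop are illustrative assumptions, not part of the original function.
function cleanWordHtml(string $html): string
{
    // Remove the unwanted tags (opening and closing) but keep their contents.
    $html = preg_replace('#</?(?:font|span|del|ins)\b[^>]*>#i', '', $html);

    // Match one unwanted attribute together with the whitespace before it.
    // $1 is the text before the attribute, $2 is the rest of the tag.
    $attribute = '#<([^>]*?)\s+(?:class|lang|style|size|face)\s*=\s*'
               . '(?:"[^"]*"|\'[^\']*\'|[^\s>]+)([^>]*)>#i';

    // Each pass strips at most one attribute per tag, so repeat until
    // a pass makes no replacements.
    do {
        $html = preg_replace($attribute, '<$1$2>', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

// Sample input in the style of pasted Word markup (made up for this example).
$sample = <<<'HTML'
<p class=MsoNormal style='margin:0'><a class="x" href="doc.html"><span lang="EN-GB">Hello</span></a></p>
HTML;

echo cleanWordHtml($sample), "\n";
// prints: <p><a href="doc.html">Hello</a></p>
```

Putting both capture groups back into the replacement is what preserves the href here. Requiring whitespace before the attribute name also avoids touching attributes that merely end in one of the listed words.
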
  • Hervé

    Some escaping is missing in the regex, and a ; is missing after return $html