This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:
Code: Select all
<?php
/* Using the DOMDocument class */
/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");
/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");
/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");
echo "The initial number of paragraphs is " . $pars->length . ".<br />";
/* The trim() function is used to remove leading and trailing spaces as well as
* newline characters. */
for ($i = 0; $i < $pars->length; $i++){
if (trim($pars->item($i)->textContent == "")){
$pars->item($i)->parentNode->removeChild($pars->item($i));
$i--;
}
}
echo "The final number of paragraphs is " . $pars->length . ".<br />";
// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>Code: Select all
<?php
/* Using simple_html_dom.php */
include("simple_html_dom.php");
$html = file_get_html("HTML File With Empty Paragraphs.html");
$pars = $html->find("p");
for ($i = 0; $i < count($pars); $i++) {
if (trim($pars[$i]->plaintext == "")) {
unset($pars[$i]);
$i--;
}
}
$html->save("HTML File without Empty Paragraphs.html");
?>Does anyone know how I can fix this?
Thank you.
I also asked on stackoverflow.