Page 1 of 1

Why do I have to execute the removeChild method twice?

Posted: Mon Sep 13, 2010 1:39 am
by GeoffreyBernardo
I have a script in PHP which removes empty paragraphs from an HTML file. The empty paragraphs are those `<p></p>` elements without textContent.

HTML File with Empty Paragraphs:

Code: Select all

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <!--
    This page is used with remove_empty_paragraphs.php script.
    This page contains empty paragraphs. The script removes the empty paragraphs and
    writes a new HTML file.
    -->
    <html>
        <head>
            <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
            <title></title>
        </head>
        <body>
            <p>This is a paragraph.</p>
            <!-- Below is an empty paragraph. -->
            <p><span></span></p>
            <p>This is another paragraph.</p>
            <!-- Below is another empty paragraph. -->
            <p class=MsoNormal><b></b></p>
            <p style=''></p>
            <p>
                <span lang=EN-US style='font-size:5.0pt;color:navy;mso-ansi-language:EN-`US'></span>
            </p>
        </body>
    </html>
First Attempt:

Code: Select all

    $html = new DOMDocument("1.0", "UTF-8");
    $html->loadHTMLFile("HTML File with Empty Paragraphs.html");
    $pars = $html->getElementsByTagName("p");
    
    /* removeChild foreach-loop */
    foreach ($pars as $par) {
        if ($par->textContent == "") {
            $par->parentNode->removeChild($par);
        }
    }

    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds to:

- remove empty paragraphs without the style-attribute,

but fails to:

- remove empty paragraphs with the
style-attribute.

So I insert the removeStyleAttribute foreach-loop before the removeChild foreach-loop. (I do not mind removing the style-attributes of nonempty paragraphs.)

Second Attempt:

Code: Select all

    $html = new DOMDocument("1.0", "UTF-8");
    $html->loadHTMLFile("HTML File with Empty Paragraphs.html");
    $pars = $html->getElementsByTagName("p");    

    /* removeStyleAttribute foreach-loop */
    foreach ($pars as $par) {
        if ($par->hasAttribute("style")) {
            $par->removeAttribute("style");
        }
    }

    /* removeChild foreach-loop */
    foreach ($pars as $par) {
        if ($par->textContent == "") {
                $par->parentNode->removeChild($par);
        }
    }

    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds in:

- removing the style-attributes from empty paragraphs which have the style attribute.
- removing empty paragraphs that do not have the style-attributes.

But fails! to:

- _remove those empty paragraphs from which the style-attributes were
removed._

So I have to have two removeChild foreach-loops, one after the other.

Third Attempt:

Code: Select all

    $html = new DOMDocument("1.0", "UTF-8");
    $html->loadHTMLFile("HTML File with Empty Paragraphs.html");
    $pars = $html->getElementsByTagName("p");
    
    /* removeStyleAttribute foreach-loop */
    foreach ($pars as $par) {
        if ($par->hasAttribute("style")) {
            $par->removeAttribute("style");
        }
    }

    /* First removeChild foreach-loop */
    foreach ($pars as $par) {
        if ($par->textContent == "") {
            $par->parentNode->removeChild($par);
        }
    }
    
    /* Second removeChild foreach-loop, identical to the first removeChild foreach-loop */
    foreach ($pars as $par) {
        if ($par->textContent == "") {
            $par->parentNode->removeChild($par);
        }
    }

    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This works perfectly!, but it is weird to have two identical loops, one right after the other.

I also tried to use only one loop for everything.

Fourth Attempt:

Code: Select all

    $html = new DOMDocument("1.0", "UTF-8");
    $html->loadHTMLFile("HTML File with Empty Paragraphs.html");
    $pars = $html->getElementsByTagName("p");  

    foreach ($pars as $par) {
        if ($par->textContent == "") {
            if ($par->hasAttribute("style")){
                $par->removeAttribute("style");
            }
            $par->parentNode->removeChild($par);
        }
    }

    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
This succeeds to:

- remove empty paragraphs without the
style-attribute,

but fails to:

- remove the style-attribute from empty paragraphs that have it.
- remove empty paragraphs with the style attribute.

This is done with PHP 5.3.0 in Netbeans 6.8.

Also posted question on stackoverflow.

Re: Why do I have to execute the removeChild method twice?

Posted: Mon Sep 13, 2010 3:57 am
by requinix
Oh man, it took a long time to track this one down. Always, always, check the user comments in the PHP manual when something doesn't seem to make sense.

getElementsByTagName will return a list of document nodes, but there's a catch: that list is dynamic. Watch:

Code: Select all

echo "There are ", $pars->length, " <p> nodes in the document.<br>\n";
$pars->item(0)->parentNode->removeChild($pars->item(0));
echo "There are now ", $pars->length, " <p> nodes in the document.<br>\n";
You would expect to see the same number both times, right? After all, you didn't affect $pars. Wrong. Removing a node from the document also removes it from the list. Comment out that removeChild and see for yourself.

Normally that would be fine, but the foreach doesn't know that the list changed. It moved on to the next item in the list completely unaware that the DOMNodeList removed an item. That means it skips nodes every time one is deleted.

Code: Select all

$pars:   1 2 3 4 5 6
foreach: ^
> Node not deleted

$pars:   1 2 3 4 5 6
foreach:   ^
> Node deleted

$pars:   1 3 4 5 6
foreach:     ^
! Node 3 skipped
> Node deleted

$pars:   1 3 5 6
foreach:       ^
! Node 5 skipped
> Node not deleted
  (because there is whitespace: the newlines+spaces inside the <p> around the <span>)
Nodes 3 ("This is another paragraph") and 5 (style='') weren't even looked at.

The solution: don't use a foreach. Your first attempt was fine except for that one part.

Code: Select all

    $html = new DOMDocument("1.0", "UTF-8");
    $html->loadHTMLFile("HTML File with Empty Paragraphs.html");
    $pars = $html->getElementsByTagName("p");
   
    /* removeChild for-loop */
    for ($i = 0; $i < $pars->length; $i++) { // a typical for loop
        $par = $pars->item($i);
        if (trim($par->textContent) == "") { // textContent will include newlines and such
            $par->parentNode->removeChild($par);
            $i--; // go back one position because the list has changed
        }
    }

    $html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");

Re: Why do I have to execute the removeChild method twice?

Posted: Mon Sep 13, 2010 5:35 am
by GeoffreyBernardo
Thanks alot, tasairis. It works like a charm!

Re: Why do I have to execute the removeChild method twice?

Posted: Mon Sep 13, 2010 6:12 am
by GeoffreyBernardo
tasairis,

In the for-loop, just remember to change

Code: Select all

$par
to

Code: Select all

$pars->item($i)
I remembered from a programming book that it was not good to change a collection while iterating over it. I did try a for-loop, but it did not work, because I did not think of backtracking with

Code: Select all

$i--
and probably because I tried to use the DOMNodeList like an array by accessing its items with a brackets, instead of with parentheses.

It just so happened that the empty paragraphs with the style-attribute were situated in such a position where they were skipped. That is why I thought they were to blame.

Re: Why do I have to execute the removeChild method twice?

Posted: Mon Sep 13, 2010 6:22 am
by requinix
Ah, yes, forgot that again. (Yeah, again.) I used

Code: Select all

$par = $pars->item($i);