
[solved] Parsing an HTML file with PHP XML functions

Posted: Thu Dec 14, 2006 3:37 am
by Skittlewidth
Can someone tell me how to get started with parsing an HTML document using one of the PHP XML functions, i.e. which of the many PHP XML facilities can I use, and how do I use it?

I have a file containing an HTML List that contains sublists, and I was wondering whether I could parse it like an XML file (which I'm sure you can) so I can determine the relationships between the lists.

For example, if I know the content of one of the <li> tags in the nested list, I want to work out the content of the parent <li> tag that contains the nested <ul> list.

I'm not too hot on navigating through XML at the moment, so this is really just an experimental exercise to form a breadcrumb from a hardcoded CSS menu. Usually these menus are formed from a database, so I can use the database to generate the breadcrumb, but I thought I'd take the chance to learn something new, and I'm not saying it's the best way of doing things.

I'm searching for tutorials and articles as we speak, but some starter code, or a pointer to the relevant functions on PHP.net, would be helpful. There seems to be a lot of choice in the manual and I don't know where to start!

Posted: Thu Dec 14, 2006 3:51 am
by volka
With the DOM extension of PHP 5 I can, for example, do

Code: Select all

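// loadHTMLFile() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadHTMLFile('http://www.php.net');
$xpath = new DOMXPath($doc);
// All <li> entries under the first <ul> inside <div id="leftbar">.
$nodeset = $xpath->query('//div[@id="leftbar"]//ul[position()=1]//li');
foreach ($nodeset as $node) {
	echo $node->textContent, "<br />\n";
}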
and get all the <li> entries of the "Thanks To" list on http://www.php.net:

* easyDNS
* Directi
* pair Networks
* EV1Servers
* Server Central
* Hosted Solutions
* Spry VPS Hosting
* eZ systems / HiT
* OSU Open Source Lab
* Emini A/S
* Yahoo! Inc.

Take a look at http://en.wikipedia.org/wiki/XML and http://en.wikipedia.org/wiki/XPath, especially the links to tutorials.
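
If you want to try the same idea without depending on a live page, here is a minimal sketch that runs the same query shape against an inline string. The markup and the sponsor names in it are made up for illustration, and libxml_use_internal_errors() is only there to keep parser warnings about sloppy HTML off the screen.

```php
<?php
// Hypothetical inline markup standing in for a fetched page.
$html = <<<'HTML'
<html><body>
<div id="leftbar">
  <ul>
    <li>First sponsor</li>
    <li>Second sponsor</li>
  </ul>
  <ul>
    <li>Unrelated entry</li>
  </ul>
</div>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// <li> entries of the first <ul> inside <div id="leftbar">.
$nodeset = $xpath->query('//div[@id="leftbar"]//ul[position()=1]//li');

$entries = [];
foreach ($nodeset as $node) {
    $entries[] = trim($node->textContent);
}
echo implode("\n", $entries), "\n";
```

This should print only the two entries of the first list, since ul[position()=1] selects a <ul> that is the first <ul> child of its parent.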

Posted: Thu Dec 14, 2006 4:19 am
by Skittlewidth
Cool, I was playing around with something along those lines but wasn't quite doing it right. I'll take a look at that sample code.

Posted: Thu Dec 14, 2006 6:34 am
by Skittlewidth
feyd | Please use the forum's code and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: [url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url] to learn how to do it too.


For anyone interested:
[syntax="html"]
<span id="test">
  <ul>
    <li><a href="index.php" title="Process Systems">Process Systems</a>
      <ul>
        <li><a href="intake_storage" title="Intake & Storage">Intake & Storage</a></li>
        <li><a href="mixing_blending" title="Mixing & Blending">Mixing & Blending</a></li>
        <li><a href="heat_treatment" title="Heat Treatment / Pasteurisation">Heat Treatment / Pasteurisation</a></li>
        <li><a href="separation_homogenisation" title="Separation & Homogenisation">Separation & Homogenisation</a></li>
        <li style="border-bottom:1px solid #016867"><a href="cleaning_in_place" title="Cleaning in Place">Cleaning in Place</a></li>
      </ul>
    </li>
    <li><a href="links" title="Links">Links</a></li>
    <li><a href="contact_us" title="Contact Us">Contact Us</a>
      <ul>
        <li><a href="key_personnel" title="Key Personnel">Key Personnel</a></li>
        <li style="border-bottom:1px solid #016867"><a href="location" title="Location Map">Location Map</a></li>
      </ul>
    </li>
  </ul>
</span>
[/syntax]

Code: Select all

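// loadHTMLFile() is an instance method, so create the document first.
$doc = new DOMDocument();
$doc->loadHTMLFile('includes/menu.inc.php');
$xpath = new DOMXPath($doc);

// Find the <li> whose link text is "Key Personnel".
$nodeset = $xpath->query('//span[@id="test"]//li[a="Key Personnel"]');
foreach ($nodeset as $node) {
	// <ul> is the next node up; we want the <li> that contains it.
	$parent = $node->parentNode->parentNode;
	// firstChild is the parent's own <a>, so this prints only its link text.
	echo "<pre>" . $parent->firstChild->textContent . "</pre>";
}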
This successfully identifies "Contact Us" as the parent of the "Key Personnel" link on the menu. (i.e. "Key Personnel" is on a flyout from "Contact Us")

I think I could avoid the double parentNode call if I modified the XPath to select the parent <li> directly, but I was struggling with the syntax and it kept returning more data than I was after.
Any modification suggestions are welcome!

:)
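
One possible way to drop the double parentNode call is to let XPath walk upwards with the ancestor axis. The sketch below is only an illustration: it uses a trimmed inline copy of the menu instead of includes/menu.inc.php, and the variable names are made up. ancestor::li[1] picks the nearest enclosing <li>, and appending /a selects just that item's own link, which may be why selecting the whole parent <li> returned more text than wanted (its textContent includes the nested list). The same idea with ancestor-or-self gives the whole breadcrumb trail.

```php
<?php
// Trimmed, hypothetical inline copy of the menu (instead of includes/menu.inc.php).
$html = <<<'HTML'
<span id="test">
<ul>
  <li><a href="links">Links</a></li>
  <li><a href="contact_us">Contact Us</a>
    <ul>
      <li><a href="key_personnel">Key Personnel</a></li>
      <li><a href="location">Location Map</a></li>
    </ul>
  </li>
</ul>
</span>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$item = '//span[@id="test"]//li[a="Key Personnel"]';

// Nearest enclosing <li> of the matched item, then that item's own link.
$parentLink = $xpath->query($item . '/ancestor::li[1]/a')->item(0);
$parentText = $parentLink ? $parentLink->textContent : '';

// Links of every <li> from the top of the menu down to the matched item.
// Node lists come back in document order, so the outermost item is first.
$crumbs = [];
foreach ($xpath->query($item . '/ancestor-or-self::li/a') as $a) {
    $crumbs[] = $a->textContent;
}
$breadcrumb = implode(' > ', $crumbs);

echo $parentText, "\n";
echo $breadcrumb, "\n";
```

With this markup the first line printed is "Contact Us" and the second is "Contact Us > Key Personnel". A top-level item such as "Links" has no enclosing <li>, so the first query returns nothing for it and $parentText stays empty.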

