Parse HTML with DOMXPath


Post by patrickvanos »

Hi everyone,

My girlfriend is looking for a used car, so I would like to get more insight into second-hand car prices. To that end I've started working on a script that gathers information from a car website so I can compile some statistics. The only problem is that I can't get the data in the form I would like.

This is a snippet of the website's source:

Code:

        <!-- DATABLOCK -->
        <div class="datarow_container nobackground">

            <div class="inner">

                <!-- PHOTOOVERVIEW -->
                <div id="photooverview">
                                        <img src="http://media.autotrack.nl/car-images/of01/s034/afm6/0007220734.jpg" width="283" alt="Opel Corsa 1.2 16V 5DRS" id="detail_photo_large" class="car_detail_large" />

                    <!-- PHOTOOVERVIEW THUMBS-->
                        <ul id="photooverview_thumbs"><li class="thumb">
                                          <a href="http://media.autotrack.nl/car-images/of01/s034/afm6/0007220734.jpg">
                                              <img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of01/s034/afm1/0007220734.jpg" />
                                          </a>
                                      </li><li class="thumb">
                                          <a href="http://media.autotrack.nl/car-images/of02/s034/afm6/0007220734.jpg">
                                              <img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of02/s034/afm1/0007220734.jpg" />
                                          </a>
                                      </li><li class="thumb">
                                          <a href="http://media.autotrack.nl/car-images/of03/s034/afm6/0007220734.jpg">
                                              <img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of03/s034/afm1/0007220734.jpg" />
                                          </a>
                                      </li></ul>                        <br class="clear" />
                </div>

                <dl class="datarows" id="datarows_small">


    <dt class="odd">Kilometerstand</dt>

    <dd class="odd">
                <a  name="Kilometerinfo" title="Voor AutoTrack.nl is het, door het ontbreken van de (kenteken)gegevens of het lidmaatschap bij NAP, niet mogelijk om deze kilometerstand  te checken bij Nationale Auto Pas. Informeer hiernaar bij de aanbieder."></a>
        53.730 km    </dd>

    <dt>Kenteken</dt>
    <dd>82-TS-SF</dd>

    <dt class="odd">Bouwjaar</dt>
    <dd class="odd"> 01-2007</dd>

    <dt>APK geldig tot</dt>
    <dd>09-01-2011</dd>

    <dt class="odd">Brandstof</dt>
    <dd class="odd">Benzine</dd>

    <dt>Verbruik</dt>
    <dd>17,2 km per liter</dd>

    <dt class="odd">Carrosserievorm</dt>
    <dd class="odd">Hatchback</dd>

    <dt>Aantal deuren</dt>
    <dd>5</dd>

    <dt class="odd">Transmissie</dt>
    <dd class="odd">Handgeschakeld</dd>

    <dt>Geïmporteerd</dt>
    <dd>Nee</dd>

    <dt class="odd">Onderhoudsboekje</dt>
    <dd class="odd">-</dd>

    </dl>
And this is the code I've got so far, adapted from some examples on the internet:

Code:

	
	$target_url = "http://www.autotrack.nl/tweedehands/opel/corsa/7220734";
	$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

	// make the cURL request to $target_url
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
	curl_setopt($ch, CURLOPT_URL,$target_url);
	curl_setopt($ch, CURLOPT_FAILONERROR, true);
	curl_setopt($ch, CURLOPT_AUTOREFERER, true);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
	curl_setopt($ch, CURLOPT_TIMEOUT, 10);
	$html= curl_exec($ch);
	if (!$html) {
		echo "<br />cURL error number:" .curl_errno($ch);
		echo "<br />cURL error:" . curl_error($ch);
		exit;
	}

	// parse the html into a DOMDocument
	$dom = new DOMDocument();
	@$dom->loadHTML($html);

	// grab the <dl class="datarows"> block on the page
	$xpath = new DOMXPath($dom);
	$hrefs = $xpath->query('//div[@class="datarow_container nobackground"]/dl[@class="datarows"]');

	foreach ($hrefs as $tag) {     
		var_dump(trim($tag->nodeValue)); 
	}
When I use the query '//div[@class="datarow_container nobackground"]' I get all the information back in a single result. I think there is a better way to do this, so that I can eventually store the information in a database as follows:

Kilometerstand 53.730 km
Kenteken 82-TS-SF
Bouwjaar 01-2007
APK geldig tot 09-01-2011
Brandstof Benzine
Verbruik 17,2 km per liter
Carrosserievorm Hatchback
Aantal deuren 5
Transmissie Handgeschakeld
Geïmporteerd Nee
Onderhoudsboekje -

Hopefully someone here can give me some tips on how to parse this information.

Thanks in advance.

Re: Parse HTML with DOMXPath

Post by Weirdan »

The query should most likely be something like: //dl[@class="datarows"]/*/text()
Drop the /text() part if that doesn't work.
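
For reference, here is a minimal sketch of how the label/value pairs could be collected, assuming the markup looks like the snippet in the question. Note that in that snippet the dl sits inside div class="inner", so a single child step (/dl) from the outer div will not reach it, which is probably why the original query comes back empty. The sample HTML below is a trimmed copy of the question's markup, and the function name extractDataRows is made up for illustration; it is not part of any library. The sketch walks over every dt and uses following-sibling::dd[1] to pick up the dd that belongs to it.

```php
<?php
// Hypothetical sample input: a shortened copy of the <dl> from the question.
$html = <<<'HTML'
<div class="datarow_container nobackground"><div class="inner">
<dl class="datarows" id="datarows_small">
    <dt class="odd">Kilometerstand</dt>
    <dd class="odd">
        <a name="Kilometerinfo"></a>
        53.730 km    </dd>
    <dt>Kenteken</dt>
    <dd>82-TS-SF</dd>
    <dt class="odd">Bouwjaar</dt>
    <dd class="odd"> 01-2007</dd>
</dl>
</div></div>
HTML;

// extractDataRows() is an illustrative helper, not an existing API.
// It returns an array of label => value taken from <dt>/<dd> pairs.
function extractDataRows(string $html): array
{
    // Collect libxml warnings quietly instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $rows = [];

    // Select every <dt> directly under the data list.
    foreach ($xpath->query('//dl[@class="datarows"]/dt') as $dt) {
        // The matching value is the first <dd> that follows this <dt>.
        $dd = $xpath->query('following-sibling::dd[1]', $dt)->item(0);
        if ($dd === null) {
            continue; // a <dt> without a value; skip it
        }
        $label = trim($dt->textContent);
        // Collapse the line breaks and indentation inside the <dd>.
        $value = trim(preg_replace('/\s+/u', ' ', $dd->textContent));
        $rows[$label] = $value;
    }

    return $rows;
}

$rows = extractDataRows($html);
foreach ($rows as $label => $value) {
    echo $label, ' ', $value, PHP_EOL;
}
```

With the sample above this prints "Kilometerstand 53.730 km", "Kenteken 82-TS-SF" and "Bouwjaar 01-2007", one per line. The resulting array can then be passed to whatever database insert you prefer. If the real page uses a different class value or nests the list differently, the XPath expression will need adjusting.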