Parse HTML with DOMXPath
Posted: Sun Jan 02, 2011 5:29 am
Hi everyone,
My girlfriend is searching for a used car, so I would like to get more insight into the prices of second-hand cars. To do so, I've started working on a script that gathers information from a car website so I can compile some statistics. The only problem is that I can't get the data in the form I would like.
This is a snippet of the source of the website:
Code:
<!-- DATABLOCK -->
<div class="datarow_container nobackground">
<div class="inner">
<!-- PHOTOOVERVIEW -->
<div id="photooverview">
<img src="http://media.autotrack.nl/car-images/of01/s034/afm6/0007220734.jpg" width="283" alt="Opel Corsa 1.2 16V 5DRS" id="detail_photo_large" class="car_detail_large" />
<!-- PHOTOOVERVIEW THUMBS-->
<ul id="photooverview_thumbs"><li class="thumb">
<a href="http://media.autotrack.nl/car-images/of01/s034/afm6/0007220734.jpg">
<img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of01/s034/afm1/0007220734.jpg" />
</a>
</li><li class="thumb">
<a href="http://media.autotrack.nl/car-images/of02/s034/afm6/0007220734.jpg">
<img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of02/s034/afm1/0007220734.jpg" />
</a>
</li><li class="thumb">
<a href="http://media.autotrack.nl/car-images/of03/s034/afm6/0007220734.jpg">
<img width="88" alt="thumb" src="http://media.autotrack.nl/car-images/of03/s034/afm1/0007220734.jpg" />
</a>
</li></ul> <br class="clear" />
</div>
<dl class="datarows" id="datarows_small">
<dt class="odd">Kilometerstand</dt>
<dd class="odd">
<a name="Kilometerinfo" title="Voor AutoTrack.nl is het, door het ontbreken van de (kenteken)gegevens of het lidmaatschap bij NAP, niet mogelijk om deze kilometerstand te checken bij Nationale Auto Pas. Informeer hiernaar bij de aanbieder."></a>
53.730 km </dd>
<dt>Kenteken</dt>
<dd>82-TS-SF</dd>
<dt class="odd">Bouwjaar</dt>
<dd class="odd"> 01-2007</dd>
<dt>APK geldig tot</dt>
<dd>09-01-2011</dd>
<dt class="odd">Brandstof</dt>
<dd class="odd">Benzine</dd>
<dt>Verbruik</dt>
<dd>17,2 km per liter</dd>
<dt class="odd">Carrosserievorm</dt>
<dd class="odd">Hatchback</dd>
<dt>Aantal deuren</dt>
<dd>5</dd>
<dt class="odd">Transmissie</dt>
<dd class="odd">Handgeschakeld</dd>
<dt>Geïmporteerd</dt>
<dd>Nee</dd>
<dt class="odd">Onderhoudsboekje</dt>
<dd class="odd">-</dd>
</dl>
And the code that I've got so far, based on some examples from the internet, is:
Code:
$target_url = "http://www.autotrack.nl/tweedehands/opel/corsa/7220734";
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html= curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
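// grab the data list on the page
$xpath = new DOMXPath($dom);
$lists = $xpath->query('//div[@class="datarow_container nobackground"]/dl[@class="datarows"]');
foreach ($lists as $tag) {
var_dump(trim($tag->nodeValue));
}
When I use the query '//div[@class="datarow_container nobackground"]' I get all the information back in one result. I think there is a better way to do this, so that I can eventually store the information in a database as follows:
Kilometerstand 53.730 km
Kenteken 82-TS-SF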
Bouwjaar 01-2007
APK geldig tot 09-01-2011
Brandstof Benzine
Verbruik 17,2 km per liter
Carrosserievorm Hatchback
Aantal deuren 5
Transmissie Handgeschakeld
Geïmporteerd Nee
Onderhoudsboekje -
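To make the target structure concrete, here is a minimal sketch of one possible approach, not necessarily the best one. It assumes every `<dt>` label is followed directly by its `<dd>` value, as in the snippet above, and it selects the list by its `id` because the `<dl>` sits inside `<div class="inner">` rather than directly under the container. The `$html` string is a shortened, hand-made copy of the snippet standing in for the fetched page, and the `$rows` array name is only illustrative.

```php
<?php
// Shortened stand-in for the page source (assumption: the real page has the same dt/dd layout).
$html = <<<HTML
<div class="datarow_container nobackground">
<div class="inner">
<dl class="datarows" id="datarows_small">
<dt class="odd">Kilometerstand</dt>
<dd class="odd">
<a name="Kilometerinfo"></a>
53.730 km </dd>
<dt>Kenteken</dt>
<dd>82-TS-SF</dd>
<dt class="odd">Bouwjaar</dt>
<dd class="odd"> 01-2007</dd>
</dl>
</div>
</div>
HTML;

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];

// Walk over each <dt> label and look up the first <dd> sibling that follows it.
foreach ($xpath->query('//dl[@id="datarows_small"]/dt') as $dt) {
    $dd = $xpath->query('following-sibling::dd[1]', $dt)->item(0);
    if ($dd === null) {
        continue; // label without a value: skip it
    }
    $label = trim($dt->textContent);
    // Collapse line breaks and repeated spaces inside the value.
    $value = trim(preg_replace('/\s+/', ' ', $dd->textContent));
    $rows[$label] = $value;
}

print_r($rows);
```

With the three rows in this stand-in, `$rows` ends up as `Kilometerstand => 53.730 km`, `Kenteken => 82-TS-SF`, `Bouwjaar => 01-2007`, so each label/value pair could be written to its own database column or row.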
Hopefully someone here can give me some tips on how to parse this information.
Thanks in advance.