Parsing HTML Page to database

Posted: Thu Jan 04, 2007 10:14 pm
by mckooter
Okay, this is the beginning of a project I'm working on. The goal is to take all the data stored in the following HTML page:
http://www.cryosphere.f2s.com/Freelancer/example.html
(that's just a demo; my actual page has nearly 900 entries)

and put all that data into a database. I'm just at the beginning and already having trouble, and I cannot figure out what is going wrong.

First I'm trying to parse the HTML file to grab the info I want. Using loadHTMLFile(), I created the following script from the example:

test.php

Code:

<?php
$doc = new DOMDocument();
$doc->loadHTML("ex2.html");

$tags = $doc->getElementsByTagName('a');

foreach ($tags as $tag) {
    echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}
?>

ex2.html

Code:

<html>
<head>
<title>My Page</title>
</head>
<body>
<p><a href="/mypage1">Hello World!</a></p>
<p><a href="/mypage2">Another Hello World!</a></p>
</body>
</html>


The original example of

Code:

<?php
$myhtml = <<<EOF
<html>
<head>
<title>My Page</title>
</head>
<body>
<p><a href="/mypage1">Hello World!</a></p>
<p><a href="/mypage2">Another Hello World!</a></p>
</body>
</html>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($myhtml);

$tags = $doc->getElementsByTagName('a');

foreach ($tags as $tag) {
    echo $tag->getAttribute('href').' | '.$tag->nodeValue."\n";
}
?>
works fantastic. This is simple, I know it, but it won't work any way I try it. All I get with my example is a blank page, even though it's the same information. I'm so confused; apparently I can't do half of what I thought I could.

Posted: Thu Jan 04, 2007 10:19 pm
by volka
mckooter wrote:$doc->loadHTML($myhtml);
Here you pass the HTML contents to the method loadHTML(). And here
mckooter wrote:$doc->loadHTML("ex2.html");
it's the name of a file. How would PHP know the difference?
You need another method described at http://de3.php.net/dom
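
To make the difference concrete, here is a minimal sketch, assuming the DOM extension is enabled. loadHTML() treats its argument as markup, so the string "ex2.html" is most likely parsed as a tiny document with no a elements, which would explain the blank page. loadHTMLFile() takes a path instead. The temporary file below is only a stand-in for ex2.html so the snippet runs on its own.

```php
<?php
// Hypothetical stand-in for ex2.html, written to a temp file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'ex2');
file_put_contents(
    $path,
    '<html><head><title>My Page</title></head><body>'
    . '<p><a href="/mypage1">Hello World!</a></p>'
    . '<p><a href="/mypage2">Another Hello World!</a></p>'
    . '</body></html>'
);

$doc = new DOMDocument();
// loadHTMLFile() expects a file path; loadHTML() expects the markup itself.
$doc->loadHTMLFile($path);

// Collect "href | link text" for every <a> element in the document.
$links = [];
foreach ($doc->getElementsByTagName('a') as $tag) {
    $links[] = $tag->getAttribute('href') . ' | ' . $tag->nodeValue;
}

echo implode("\n", $links), "\n";

// Clean up the temp file.
unlink($path);
```

In the original test.php the only change needed should be swapping loadHTML("ex2.html") for loadHTMLFile("ex2.html").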

Posted: Thu Jan 04, 2007 10:23 pm
by mckooter
ahhh

loadHTMLFile is what I wanted. I swear I'll get it; I spent all day looking for a way to do this. I'm not sure if I'm going about it the most effective way, but I think I can get it to work.

Posted: Fri Jan 05, 2007 12:06 am
by mckooter
Sorry to double post, but I don't think I need to start a new topic. I just want to see if there is a much easier way of doing what I'm doing, or rather what seems to be the only way I can do it.

The data I'm trying to import has a date format of

Code:

21:52:13 - 29 Dec 06
as an example. I'm trying to store this in the database, so I found a class to convert the data to the type that I will need.

The portion I have done so far is:

Code:

<?php

$date = "21:52:13 - 29 Dec 06";

// Originally ereg_replace("[-]","",$date); the ereg_* functions were removed in PHP 7.
$newdate = str_replace("-", "", $date);
echo $newdate;

?>
That removes the -, simple enough. Now I want to convert the Dec to 12, but I feel that 12 consecutive ereg_replace calls would be ridiculous and laughable, so before I make a mess of code I figured I would check whether there is an easier way to convert it.

Also, please don't laugh if my approach above is way too long a path to a simple goal. All the documentation I have read regarding dates covers converting a date received from the database/PHP into readable formats; I am doing the opposite.

Posted: Fri Jan 05, 2007 12:59 am
by nickvd
Wouldn't using regex be quicker and easier?

Posted: Fri Jan 05, 2007 1:13 am
by Kieran Huggins

Code:

echo date('m/d/Y, H:i:s',strtotime(preg_replace('/(.*) - (.*)/',"$2, $1",$date)));
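
As an alternative sketch (not what was posted above, and it assumes PHP 5.3 or later for DateTime::createFromFormat), the format can be described directly so no string rewriting is needed. The variable names $parsed and $mysql are just illustrative, and the 'Y-m-d H:i:s' output assumes a DATETIME-style database column.

```php
<?php
$date = "21:52:13 - 29 Dec 06";

// H:i:s = 24-hour time, d = day, M = three-letter month name, y = two-digit year.
$parsed = DateTime::createFromFormat('H:i:s - d M y', $date);
if ($parsed === false) {
    throw new RuntimeException("Could not parse date: $date");
}

// Format for a DATETIME-style column (an assumption about the target table).
$mysql = $parsed->format('Y-m-d H:i:s');
echo $mysql, "\n";               // 2006-12-29 21:52:13

// The month as a plain number, i.e. "Dec" turned into 12.
echo $parsed->format('n'), "\n"; // 12
```

Because createFromFormat returns false on input that does not match the format, bad rows among the 900 entries can be caught instead of silently stored as a wrong date.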

Posted: Fri Jan 05, 2007 1:21 am
by mckooter
Thanks to both, I will look at them both. Regex seems to be the much easier way to accomplish a simple task. I am still learning; so far I can find the longest way to an easy goal, but at least I can find that goal. The community here and elsewhere helps show me the much quicker path to the goal I want, and I thank you for that.


PS: you should see some of the scripts I've written. Coding-wise they are terrifying, just horribly scripted, but as a newcomer they did what I wanted.

Posted: Fri Jan 05, 2007 1:33 am
by nickvd
I was actually referring to the scraping of the html page, but using it for the timestamp works fine too :)
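
For reference, a rough sketch of what regex scraping could look like on the example markup from the first post. The pattern is an assumption that only fits simple links written as <a href="...">text</a>; regular expressions are fragile on real-world HTML, so the DOM approach is usually the safer choice.

```php
<?php
$html = <<<EOF
<p><a href="/mypage1">Hello World!</a></p>
<p><a href="/mypage2">Another Hello World!</a></p>
EOF;

// Capture group 1 is the href value, group 2 is the link text.
$pattern = '~<a\s+href="([^"]*)"[^>]*>(.*?)</a>~is';

$rows = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $rows[] = ['href' => $m[1], 'text' => $m[2]];
    }
}

foreach ($rows as $row) {
    echo $row['href'], ' | ', $row['text'], "\n";
}
```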