extract html from url

Posted: Sat Jan 26, 2008 5:04 pm
by xobani
i want a php script to follow the links on a website and extract whatever i choose from the html each link points to. i'm pretty familiar with perl regex, and i think php has something similar, so the actual extraction shouldn't be much of a problem for me.

what i don't really know how to do is write the php code that will grab the html file from the url (the html file is actually not an html file, it is a php file, so the url would look something like this: http://somesite.com/somepage.php). i also want the php code to be able to grab the html from linked pages. i've heard about libcurl but i'm having trouble installing it on my linux distro.

could someone give me working examples of how to use curl with php (if that is what i need to use) so that it will extract the html from a specified url and all links within that url???

hope i make sense. thanks!

Re: extract html from url

Posted: Sat Jan 26, 2008 5:33 pm
by JAM
Example & link to something you can use to start with. Not cURL but you might not need it...

Code:

<?php
    $uri = 'http://se.php.net/manual/en/ref.pcre.php';
    $content = file_get_contents($uri);
    echo $content;
?>
Extract the a href's, and proceed to follow links...
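
A minimal sketch of the "extract the a href's" step, assuming PHP 7 or later with the DOM extension enabled. The function name extract_links and the sample HTML are made up for illustration; they are not from the posts above.

```php
<?php
// Return the href value of every <a> element found in an HTML string.
function extract_links(string $html): array
{
    if ($html === '') {
        return []; // DOMDocument::loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href; // skip anchors that have no href
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}

// Hypothetical sample input; in practice this would be the $content
// returned by file_get_contents().
$html = '<p><a href="item.php?POS=0">first</a> <a href="item.php?POS=1">second</a></p>';
print_r(extract_links($html));
```

The hrefs come back exactly as written in the page, so relative ones (like item.php?POS=0) still have to be joined to the page's base URL before they can be passed to file_get_contents().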

Re: extract html from url

Posted: Sun Jan 27, 2008 6:31 pm
by xobani
great! that works nicely, but i want to be able to go a little bit further. for each linked page within this page, i want to be able to extract all corresponding content. however, each link passes $_GET values using the same variable names, with a different value for each link.

so the url for "list.php" looks like this:

Code:

list.php?CLIST=ALL&DLIST=ALL&ELIST=ALL
each item in the list has a link that points to:

Code:

item.php?POS=0
item.php?POS=1
item.php?POS=2
item.php?POS=3
item.php?POS=4
item.php?POS=5
and so on and so forth....

so, doing something like:

Code:

$uri = 'http://somewebsite.com/item.php?POS=0';
$content = file_get_contents($uri);
echo $content;
won't work because item.php doesn't know what list.php was called with (?CLIST=ALL&DLIST=ALL&ELIST=ALL)

does that make sense? what do i need to do to make this work properly??

Re: extract html from url

Posted: Mon Jan 28, 2008 2:31 pm
by xobani
anyone have any ideas on this???
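
One possible approach, offered as a guess rather than a confirmed fix: if item.php depends on something list.php stored in a session, the script has to request list.php first and send the resulting cookie back with each item request. file_get_contents() does not carry cookies from one call to the next on its own, but a single cURL handle with its cookie engine switched on does. The sketch below assumes PHP 7 or later with the curl extension installed, and assumes the site really does use a session cookie (it could just as well check the Referer header or something else). The function names item_urls and fetch_items are made up for illustration.

```php
<?php
// Build the item URLs for positions 0 .. $count - 1.
function item_urls(string $base, int $count): array
{
    $urls = [];
    for ($pos = 0; $pos < $count; $pos++) {
        $urls[] = $base . 'item.php?' . http_build_query(['POS' => $pos]);
    }
    return $urls;
}

// Request list.php first, then every item page, on ONE cURL handle so that
// any cookie set by list.php is sent back with the item requests.
function fetch_items(string $base, string $listQuery, int $count): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_COOKIEFILE     => '',   // empty string: keep cookies in memory
        CURLOPT_TIMEOUT        => 30,
    ]);

    // Visit the list page so the server can set up its state (assumption).
    curl_setopt($ch, CURLOPT_URL, $base . 'list.php?' . $listQuery);
    curl_exec($ch);

    $pages = [];
    foreach (item_urls($base, $count) as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch); // string on success, false on failure
    }
    curl_close($ch);
    return $pages;
}

// Example call (host name from the post above; the count of 6 is arbitrary):
// $pages = fetch_items('http://somewebsite.com/', 'CLIST=ALL&DLIST=ALL&ELIST=ALL', 6);
```

If the item pages still come back wrong with cookies enabled, the dependency is probably not a session cookie, and comparing the request headers a browser sends for item.php would be the next thing to check.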