extract html from url

PHP programming forum. Ask questions or help others with PHP code. Don't understand a function? Need help implementing a class? This is the place to ask. Remember to do your homework first!

Moderator: General Moderators

xobani
Forum Newbie
Posts: 3
Joined: Sat Jan 26, 2008 4:54 pm

extract html from url

Post by xobani »

I want a PHP script that can go through the links on a website and extract whatever I specify from the HTML those links point to. I'm pretty familiar with Perl regex, and I think PHP has something similar, so the actual extraction shouldn't be much of a problem for me.

What I don't really know how to do is write the PHP code that will grab the HTML from a URL (the page is actually generated by a PHP file, so the URL would look something like this: http://somesite.com/somepage.php). I also want the script to be able to grab the HTML from linked pages. I've heard about libcurl, but I'm having trouble installing it on my Linux distro.

Could someone give me working examples of how to use cURL with PHP (if that is what I need to use) so that it will extract the HTML from a specified URL and from all links within that page?

Hope I make sense. Thanks!
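For reference, a minimal cURL fetch in PHP is a short sketch like the one below. This assumes the cURL extension is installed; the URL is the placeholder from the question, not a real page.

```php
<?php
// Sketch: fetch a page over HTTP with the cURL extension.
// The URL below is a placeholder from the question.
$ch = curl_init('http://somesite.com/somepage.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow HTTP redirects
$html = curl_exec($ch);
if ($html === false) {
    echo 'cURL error: ' . curl_error($ch) . "\n";
} else {
    echo $html;
}
curl_close($ch);
?>
```

With CURLOPT_RETURNTRANSFER set, curl_exec() returns the page body as a string (or false on failure) instead of echoing it, which makes the HTML available for later extraction.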
JAM
DevNet Resident
Posts: 2101
Joined: Fri Aug 08, 2003 6:53 pm
Location: Sweden

Re: extract html from url

Post by JAM »

Here's an example, and a link to something you can use to start with. It's not cURL, but you might not need cURL at all...

Code: Select all

<?php
    $uri = 'http://se.php.net/manual/en/ref.pcre.php';
    $content = file_get_contents($uri);
    echo $content;
?>
Extract the <a href> values, then proceed to follow the links...
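Pulling the href values out of the fetched page can be done with PCRE. This is a rough sketch: the pattern assumes simple double-quoted href attributes, and for messy real-world markup DOMDocument is more robust.

```php
<?php
// Sketch: extract href values from fetched HTML with PCRE.
// Assumes double-quoted attributes; DOMDocument handles messy markup better.
$content = file_get_contents('http://se.php.net/manual/en/ref.pcre.php');
preg_match_all('/<a\s[^>]*href="([^"]+)"/i', $content, $matches);
foreach ($matches[1] as $link) {
    echo $link, "\n";  // each extracted URL; fetch these to follow the links
}
?>
```

$matches[1] holds the first capture group of every match, i.e. just the URLs, ready to be passed back into file_get_contents() to crawl further.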
xobani
Forum Newbie
Posts: 3
Joined: Sat Jan 26, 2008 4:54 pm

Re: extract html from url

Post by xobani »

Great! That works nicely, but I want to be able to go a little bit further. For each page linked from this one, I want to extract all the corresponding content. However, each link passes $_GET values using the same variable names, with a different value for each item.

So the URL for "list.php" looks like this:

Code: Select all

list.php?CLIST=ALL&DLIST=ALL&ELIST=ALL
Each item in the list has a link that points to:

Code: Select all

item.php?POS=0
item.php?POS=1
item.php?POS=2
item.php?POS=3
item.php?POS=4
item.php?POS=5
...and so on.

So doing something like:

Code: Select all

$uri = 'http://somewebsite.com/item.php?POS=0';
$content = file_get_contents($uri);
echo $content;
won't work, because the item request doesn't know which context list.php was viewed with (?CLIST=ALL&DLIST=ALL&ELIST=ALL).

Does that make sense? What do I need to do to make this work properly?
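One hedged approach, assuming the site tracks the list context in a session cookie: request list.php first, then reuse the same cookie jar for each item.php request so the server still knows which list you were viewing. The hostname below is the placeholder from the thread, and whether this works depends entirely on how that site stores its state.

```php
<?php
// Sketch: visit list.php first so the server sets up its session state,
// then reuse the same cookie jar when fetching each item.php?POS=n page.
// The host and the ALL filter values are placeholders from the thread.
$jar = tempnam(sys_get_temp_dir(), 'cookies');

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // write received cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // and send them back on later requests

// Step 1: establish the list context.
$query = http_build_query(['CLIST' => 'ALL', 'DLIST' => 'ALL', 'ELIST' => 'ALL']);
curl_setopt($ch, CURLOPT_URL, 'http://somewebsite.com/list.php?' . $query);
curl_exec($ch);

// Step 2: fetch each item within the same session.
for ($pos = 0; $pos <= 5; $pos++) {
    curl_setopt($ch, CURLOPT_URL, 'http://somewebsite.com/item.php?POS=' . $pos);
    $content = curl_exec($ch);
    // ...extract whatever you need from $content here...
}
curl_close($ch);
?>
```

If the site instead expects the list parameters on every request (no session at all), appending the same CLIST/DLIST/ELIST query string to each item.php URL would be the thing to try first.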
xobani
Forum Newbie
Posts: 3
Joined: Sat Jan 26, 2008 4:54 pm

Re: extract html from url

Post by xobani »

Anyone have any ideas on this?