Stripping URLs and related text from a web page

Rebajas
Forum Newbie
Posts: 16
Joined: Tue Aug 20, 2002 9:35 am
Location: http://www.rebajas.co.uk/

Stripping URLs and related text from a web page

Post by Rebajas »

I'd like to take an existing page of markup and strip everything but the links and their associated text. The page consists of 5 formatted links with associated text, plus the page formatting.

So I'd like to turn this:

Code:

<html><body topmargin=0 leftmargin=0 marginheight=0 marginwidth=0><table
width=120 height=632 border=0 cellspacing=0 cellpadding=0><tr valign=bottom
align=center><td><A
href=http://www.amazon.co.uk/exec/obidos/redirect?tag=member-21&creative=14
10&camp=210&link_code=bn1&path=tg/stores/browse/-/welcome/468294
target=_top><IMG
src=http://rcm-images.amazon.com/images/G/02/associates/recommends/recommend
s_120x60.gif border=0 height=60 width=120 alt=amazon.co.uk></A></td></tr><tr
valign=top align=center><td><table border=0 cellspacing=0 cellpadding=1
bgcolor=#000000 width=120 height=572><tr><td width=100% height=100%><table
border=0 cellspacing=0 cellpadding=0 width=118 height=570><tr
bgcolor=#FFFFFF><TD><table width=118 height=570 cellpadding=0 cellspacing=0
border=0 align=center valign=middle><tr><td align=center valign=top><table
cellpadding=1 cellspacing=0><TR valign=top><TD align=center valign=top>A
href=http://www.amazon.co.uk/exec/obidos/redirect?tag=member-21&creative=14
10&camp=210&link_code=bn1&path=tg/sim-explorer/explore-items/-/0764550462/0
target=_top><FONT face=Arial size=-1 color=3366FF>Homebrewing For
Dummies</font></a><br><FONT face=Arial size=-2 color=000000>Nachel<br>Our
Price: <font color=990000>$17.99</font></td></tr></table></td></tr><TR
height=20 width=118 bgcolor=#FFFFFF valign=top><TD align=center
valign=bottom><FONT face=Arial size=-2 color=A1A1A1>Prices May Change<br><A
href=http://rcm.amazon.com/e/cm/privacy-policy.html?o=2 target=_top><FONT
face=Arial size=-2 color=A1A1A1>Privacy
Information</FONT></A></TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE>
</TD></TR></TABLE></body></html>
Into this:

Code:

http://www.amazon.co.uk/exec/obidos/redirect?tag=member-21&creative=14
10&camp=210&link_code=bn1&path=tg/sim-explorer/explore-items/-/0764550462/0|Homebrewing For
Dummies|£17.99
Any help greatly appreciated.

:)
Rebajas

Appended

Post by Rebajas »

I'd like to turn it into the latter, but with all 5 links included.

I left the others out to save space; I'm sure you get my drift... :)
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

Can you provide the Amazon search URL? Your example seems a bit broken.
Rebajas

The URL

Post by Rebajas »

Here is the URL of the actual page on Amazon.

http://rcm-uk.amazon.co.uk/e/cm?t=membe ... r&bg1=&npa
volka

Post by volka »

Unfortunately, the price is not on this page.

Code:

<pre><?php
// Fetch the recommendations page (needs allow_url_fopen to be enabled).
$url = 'http://rcm-uk.amazon.co.uk/e/cm?t=member-21&l=bn1&browse=271027&mode=books-uk&p=11&o=2&f=ifr&bg1=&npal';
$fd = fopen($url, 'rb') or die('cannot open the listing page');
$content = '';
while (!feof($fd))
	$content .= fread($fd, 1024);
fclose($fd);

$data = array();
// $matches[1] = link target, [2] = link text, [3] = text that follows the link
if (preg_match_all('!<a\s+href=(\S+)[^>]*><font[^>]*>([^<]*)</font></a><br><font[^>]*>([^<]*)!', $content, $matches))
{
	foreach ($matches[1] as $i => $link)
	{
		// The price is only on the item's own page, so request that page
		// and stop reading as soon as the price has been seen.
		$price = '';
		$fd = @fopen($link, 'rb');
		if ($fd)
		{
			$buffer = '';
			while (!feof($fd))
			{
				$buffer .= fread($fd, 1024);
				if (preg_match('!<b>Our Price:.*?&pound;(\d+\.\d{2})!s', $buffer, $m))
				{
					$price = $m[1];
					break;
				}
				// No marker yet: keep only a short tail, so a marker that is
				// split across two reads is still found.
				if (strpos($buffer, '<b>Our Price:') === false)
					$buffer = substr($buffer, -48);
			}
			fclose($fd);
		}
		$data[] = join('|', array($link, $matches[2][$i], $matches[3][$i], $price));
	}
}
print_r($data);
?></pre>
This will do it, but it takes some time because it has to request another page for each item :-S
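To see what the three capture groups in that pattern pick up without making any network requests, here is a minimal sketch that runs the same regular expression against a local string. The sample markup and its example.com URL are made up for illustration, modelled on the fragment in the first post; the real page may differ.

```php
<?php
// Hypothetical sample, modelled on the markup quoted in the first post.
$sample = '<td><a href=http://www.example.com/item/0764550462 target=_top>'
        . '<font face=Arial size=-1 color=3366FF>Homebrewing For Dummies</font></a><br>'
        . '<font face=Arial size=-2 color=000000>Nachel<br>Our Price: '
        . '<font color=990000>$17.99</font></td>';

// Same pattern as in the script above.
$pattern = '!<a\s+href=(\S+)[^>]*><font[^>]*>([^<]*)</font></a><br><font[^>]*>([^<]*)!';

$rows = array();
if (preg_match_all($pattern, $sample, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] = URL, $m[2] = link text, $m[3] = text up to the next tag
        $rows[] = join('|', array($m[1], $m[2], $m[3]));
    }
}
print_r($rows);
```

On this sample the third group stops at the first tag after the author name, so it yields the author and not the price. The pattern also assumes unquoted href values, as in the quoted markup; with quoted attributes the quotes would end up in the captured URL.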
JP
Forum Newbie
Posts: 3
Joined: Tue Dec 24, 2002 3:13 am
Location: The Netherlands

Stripping URLs?

Post by JP »

Hi,

I suppose you want to do all of this server-side? (PHP forum, duh.)

If the program you're writing can do things client-side, you can use JavaScript:

document.links.length gives you the number of links in the page.

document.links[n] has these properties:
- hash
- host
- hostname
- href
- pathname
- port
- protocol
- search (= query string data !!)
- target
- text

Try them out ...
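For comparison, a rough server-side counterpart to document.links can be sketched in PHP. This assumes PHP 5 or later with the DOM extension enabled, which nobody in the thread mentioned, and the markup and example.com URLs are made up. parse_url() splits each href into parts that roughly correspond to the properties listed above (scheme for protocol, query for search, fragment for hash).

```php
<?php
// Hypothetical markup; any HTML string fetched from a page would do.
$html = '<html><body>'
      . '<a href="http://www.example.com:8080/books/item?id=42&amp;tag=x#reviews" target="_top">Homebrewing For Dummies</a>'
      . '<a href="http://www.example.com/privacy.html">Privacy Information</a>'
      . '</body></html>';

$doc = new DOMDocument();
// @ suppresses the warnings libxml emits for sloppy real-world markup.
@$doc->loadHTML($html);

$links = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $href  = $a->getAttribute('href');
    $parts = parse_url($href);
    if ($parts === false) {
        $parts = array();   // seriously malformed URL: leave the parts empty
    }
    $links[] = array(
        'href'     => $href,
        'protocol' => isset($parts['scheme'])   ? $parts['scheme']   : '',
        'hostname' => isset($parts['host'])     ? $parts['host']     : '',
        'port'     => isset($parts['port'])     ? $parts['port']     : '',
        'pathname' => isset($parts['path'])     ? $parts['path']     : '',
        'search'   => isset($parts['query'])    ? $parts['query']    : '',
        'hash'     => isset($parts['fragment']) ? $parts['fragment'] : '',
        'target'   => $a->getAttribute('target'),
        'text'     => trim($a->textContent),
    );
}

echo count($links), " links\n";
print_r($links);
```

Unlike the JavaScript properties, parse_url() returns the query and fragment without the leading ? and #, and the port as an integer.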