Regular expressions: Extracting <a href="blah.com"

Skittlewidth
Forum Contributor
Posts: 389
Joined: Wed Nov 06, 2002 9:18 am
Location: Kent, UK

Post by Skittlewidth »

I need to search an HTML file and extract only the hyperlinks from it.
As I don't know in advance what those URLs and link texts are going to be, I know I need to find everything from the opening <a to the closing </a> tag and ignore everything else.

This is the first time I've attempted a real regular expression and I'm totally stuck! I found the following expression which is supposed to achieve what I want, but implementing it seems to be another thing altogether!

<a href="[^"]+">[^<]+</a>

I've tried escaping the double quotes in the expression so that it doesn't get confused by the double quotes containing the expression when I use it in preg_match, but then I get the following error:

Warning: Unknown modifier '[' in /home/httpd/html/newstaffintranet/index.php on line 93

Line 93

Code: Select all

$urls = preg_match("<a href=\"[^\"]+\">[^<]+</a>", $news);
Escaping '[' then just brings up the same error, but with '\' as the unknown modifier.
So what am I doing wrong this time?

Thanks
volka
DevNet Evangelist
Posts: 8391
Joined: Tue May 07, 2002 9:48 am
Location: Berlin, ger

Post by volka »

All preg_... functions belong to the Perl Compatible Regular Expressions (PCRE) module. In Perl a regex has the form /<pattern>/<options>, e.g.

Code: Select all

if ($i =~ /abc(\d+)/i)
preg_... emulates this behaviour and demands a delimiter character in front of and at the end of the pattern parameter; after the closing delimiter there may be options, too (e.g. i for case-insensitive). The delimiter can be chosen (almost) freely: any character that is not alphanumeric, a backslash or whitespace will do.
If you want to fetch a certain part of the match, you have to mark it with unescaped parentheses; the complete match is returned anyway.
The signature of preg_match is int preg_match ( string pattern, string subject [, array matches [, int flags]]), therefore $urls should be the third parameter to receive the captured matches. The return value only tells you whether the pattern matched.
Try:

Code: Select all

if(preg_match('/<a href="([^"]+)">[^<]+</a>/i', $news, $urls))
{
	echo '<pre>';
	print_r($urls);
	echo '</pre>';
}
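To make the delimiter rule and the layout of the matches array concrete, here is a minimal sketch. It reuses the Perl pattern from above; the sample strings and variable names are made up for illustration. The last part shows one plausible reading of the original warning: with no explicit delimiter, PCRE takes the leading < as the opening delimiter and the first > as the closing one.

```php
<?php
// Sample input, invented for this illustration.
$subject = 'xyz ABC123 def';

// '/' is the delimiter; the 'i' after the closing delimiter is an option.
$found = preg_match('/abc(\d+)/i', $subject, $matches);

// $found is 1 on a match, 0 on no match, false on a pattern error.
// $matches[0] is the complete match, $matches[1] the first parenthesised group.
print_r($matches);

// Without an explicit delimiter, '<' ... '>' are treated as the delimiter
// pair, and whatever follows the first '>' is read as options.
// The '[' after '">' is then reported as an unknown modifier.
// The @ only hides the warning so the return value can be inspected.
$broken = @preg_match('<a href="[^"]+">[^<]+</a>', '<a href="x">y</a>');
var_dump($broken);
```

Checking the return value with === false is a cheap way to tell a broken pattern apart from a pattern that simply did not match.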
twigletmac
Her Royal Site Adminness
Posts: 5371
Joined: Tue Apr 23, 2002 2:21 am
Location: Essex, UK

Post by twigletmac »

One teeny correction to volka's code - you're going to have to escape any forward slashes within the pattern otherwise you'll get errors, so instead of </a> you'd put <\/a>.

Mac

Post by Skittlewidth »

Thanks I'll try that now! :)

Post by volka »

twigletmac wrote: you're going to have to escape any forward slashes within the pattern otherwise you'll get errors
*ouch*
Or you choose another delimiter, one that does not appear within the pattern (/me slaps volka around a bit with a large trout ;) )

Code: Select all

preg_match('!<a href="([^"]+)">[^<]+</a>!i', $news, $urls)
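The two fixes discussed here, escaping the slash or switching delimiters, can be compared side by side. This is only a sketch: the sample HTML is invented, and # as a third delimiter is an addition that does not appear in the thread.

```php
<?php
// Sample HTML, invented for this illustration.
$news = 'Read <a href="http://example.com/a.html">Story A</a> today.';

// Option 1: keep '/' as the delimiter and escape the slash in </a>.
preg_match('/<a href="([^"]+)">[^<]+<\/a>/i', $news, $escaped);

// Option 2: pick a delimiter that does not occur in the pattern.
preg_match('!<a href="([^"]+)">[^<]+</a>!i', $news, $bang);
preg_match('#<a href="([^"]+)">[^<]+</a>#i', $news, $hash);

// All three calls fill the matches array identically.
var_dump($escaped === $bang && $bang === $hash);

// Index 1 holds the captured URL.
echo $bang[1], "\n";
```

When part of the pattern comes from a variable, preg_quote() with the delimiter as its second argument escapes the delimiter along with the other special characters.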

Post by Skittlewidth »

It doesn't appear to be returning or displaying anything, which might suggest it hasn't found a match?

It hasn't come up with any errors, so apart from that it must be right! I guess the expression needs to be refined a bit more?

I've checked that the PCRE module is enabled.

Post by volka »

I just tested this script

Code: Select all

<html><body>
	<table border="1">
<?php
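// Fetch the page; this needs allow_url_fopen to be enabled.
$fd = fopen('http://www.php.net', 'rb');
$phpnet = '';

// Read in 4 KB chunks until the end of the stream, then close the handle.
while($fd && !feof($fd))
	$phpnet.=fread($fd, 4096);
if($fd)
	fclose($fd);

// $urls[0] holds the complete <a>...</a> matches, $urls[1] the captured href values.
if(preg_match_all('!<a href="([^"]+)">[^<]+</a>!i', $phpnet, $urls))
{
	for($i=0; $i!=count($urls[0]); $i++)
		echo '<tr><td>', $i, '</td><td>', htmlentities($urls[0][$i]), '</td><td>', htmlentities($urls[1][$i]), '</td></tr>';
}
else
	echo 'no matches';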
	
?>
</table>
</body></html>
preg_match returns only the first match and then stops. preg_match_all returns all matches.
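A small sketch of that difference, assuming an invented two-link sample string. The second capture group for the link text is an addition for illustration and is not part of the script above.

```php
<?php
// Sample HTML, invented for this illustration.
$html = '<a href="one.html">One</a> and <a href="two.html">Two</a>';
$pattern = '!<a href="([^"]+)">([^<]+)</a>!i';

// preg_match stops after the first match and returns 0 or 1.
$n1 = preg_match($pattern, $html, $first);

// preg_match_all returns the number of matches. In the default
// PREG_PATTERN_ORDER, $all[1] lists every href and $all[2] every link text.
$nAll = preg_match_all($pattern, $html, $all);

// PREG_SET_ORDER groups the results per match instead, which is often
// easier to loop over.
preg_match_all($pattern, $html, $sets, PREG_SET_ORDER);
foreach ($sets as $set) {
    echo $set[1], ' => ', $set[2], "\n";
}
```

Which ordering is more convenient depends on whether you want one column of URLs or one record per link.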

Post by Skittlewidth »

I've run your script and it works just great :D In theory that's just what I want. However, just my luck, when I replace http://www.php.net with the URL I need to use - http://www.guardian.co.uk/syndication/s ... l?U1271588 - it returns no matches! 8O

I've checked other urls and they've worked fine, so I looked at the source code for this page to try and find a reason. The file I need to search is so basic I'm surprised there's any trouble finding the links!

Thanks for your help so far, it's really been appreciated!

Post by volka »

e.g.
<a HREF='http://www.guardian.co.uk/Iraq/Story/0, ... +Uk+latest'
>
PM caught in diplomatic deadlock
</a>
The URL is enclosed in ' instead of " and the whole thing is spread over more than one line.

Code: Select all

if(preg_match_all('!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi', $phpnet, $urls))
should do the trick.
But The Guardian also provides its headlines as an RSS feed:
http://www.guardian.co.uk/rss/ (sometimes Internet Explorer cannot retrieve the data; in that case try http://www.guardian.co.uk/rss/1,,,00.xml)
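As a possible refinement, a backreference can require the closing quote to be the same kind as the opening one. This is a sketch rather than the pattern from the post: the backreference, the extra group for the link text and the placeholder URL are all assumptions made for illustration.

```php
<?php
// Markup in the style quoted above; the URL is a placeholder.
$page = "<a HREF='http://example.com/story'\n>\nPM caught in diplomatic deadlock\n</a>";

// (["\']) captures whichever quote opens the value and \1 requires the
// same quote to close it; \s* allows the line break before '>'.
$pattern = '!<a\s+href=(["\'])(.*?)\1\s*>([^<]+)</a>!is';

if (preg_match_all($pattern, $page, $links, PREG_SET_ORDER)) {
    foreach ($links as $link) {
        // $link[2] is the URL, $link[3] the link text with its line breaks.
        echo $link[2], ' | ', trim($link[3]), "\n";
    }
}
```

Like the pattern above, this still expects href to be the first attribute and the link text to contain no nested tags.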
superwormy
Forum Commoner
Posts: 67
Joined: Fri Oct 04, 2002 9:25 am
Location: CT

Post by superwormy »

Just as an added tip, you don't need regex at all to do this; you can USUALLY do it much more efficiently with basic string matching.

Unfortunately I don't have the script I wrote to hand, or I'd post it for you, but the basic idea is this:

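// the haystack comes first, then the needle; stristr ignores case
while (stristr ($contents, "<a ")) {

$contents = stristr ($contents, "<a ");
// drop the leading "<a "
$contents = substr ($contents, 3);
// everything from the first ">" onwards
$tostrip = strstr ($contents, ">");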

$atag = str_replace ($tostrip, "", $contents);

// this leaves something like this:
// href="http://www.ur.com" name="whatever"

}

Then just add to this to take everything that follows 'href="', and it'll usually be MUCH faster than a regular expression.
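One way the idea above could be completed is sketched below. The function name extract_hrefs is hypothetical, and the code assumes double-quoted href values and no > inside attribute values; it is not the script the poster refers to.

```php
<?php
// Hypothetical helper: collect href values using string functions only.
// Assumes double-quoted href attributes.
function extract_hrefs(string $contents): array
{
    $hrefs = [];
    $offset = 0;

    // stripos() finds the next "<a " case-insensitively, starting at $offset.
    while (($start = stripos($contents, '<a ', $offset)) !== false) {
        $end = strpos($contents, '>', $start);
        if ($end === false) {
            break; // unterminated tag, stop scanning
        }

        // Attribute string between "<a " and ">".
        $atag = substr($contents, $start + 3, $end - $start - 3);

        // Take what sits between href=" and the next double quote.
        $hrefPos = stripos($atag, 'href="');
        if ($hrefPos !== false) {
            $valueStart = $hrefPos + 6; // 6 = strlen('href="')
            $valueEnd = strpos($atag, '"', $valueStart);
            if ($valueEnd !== false) {
                $hrefs[] = substr($atag, $valueStart, $valueEnd - $valueStart);
            }
        }

        // Continue after this tag.
        $offset = $end + 1;
    }

    return $hrefs;
}

print_r(extract_hrefs('<A HREF="a.html" name="x">A</a> text <a href="b.html">B</a>'));
```

Whether this beats preg_match_all in practice depends on the input, so it is worth timing both on real pages before committing to one.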

Post by Skittlewidth »

That's fantastic Volka, thank you :D
I'm going to continue to try and get my head round these regular expressions and will hopefully be able to work them out myself someday!

RSS feeds are something I have only heard about in the last few days since trying to do this thing, and XML is something I'd like to look into at some point. What's the idea behind RSS and how is it better than the other news feed they offer?
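For context, an RSS feed is an XML document with one item element per headline, each carrying a title and a link, so the links can be read directly instead of being scraped out of HTML. A minimal sketch, assuming the SimpleXML extension is available; the feed content below is invented and is not The Guardian's actual feed.

```php
<?php
// A tiny RSS 2.0 document, invented for this illustration.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example headlines</title>
    <item>
      <title>First headline</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second headline</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

// simplexml_load_string() needs the SimpleXML extension.
$feed = simplexml_load_string($rss);

$headlines = [];
foreach ($feed->channel->item as $item) {
    // Cast to string to get the element text.
    $headlines[(string) $item->link] = (string) $item->title;
}

print_r($headlines);
```

Because the structure is fixed, a change in the page layout does not break the extraction the way it can with a regular expression over HTML.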