Regular expressions: Extracting <a href="blah.com"
Posted: Tue Mar 11, 2003 8:46 am
by Skittlewidth
I need to search an HTML file and extract only the hypertext links from it.
As I don't know what those urls and links are going to be, I know I need to find everything from the opening <a to the closing </a> tag and ignore everything else.
This is the first time I've attempted a real regular expression and I'm totally stuck! I found the following expression which is supposed to achieve what I want, but implementing it seems to be another thing altogether!
<a href="[^"]+">[^<]+</a>
I've tried escaping the double quotes in the expression so that it doesn't get confused by the double quotes containing the expression when I use it in preg_match, but then I get the following error:
Warning: Unknown modifier '[' in /home/httpd/html/newstaffintranet/index.php on line 93
Line 93
Code:
$urls = preg_match("<a href=\"[^\"]+\">[^<]+</a>", $news);
Escaping '[' then just brings up the same error but with '\'
So what am I doing wrong this time?
Thanks
Posted: Tue Mar 11, 2003 9:32 am
by volka
All preg_... functions belong to the
Perl Compatible Regular Expressions (PCRE) module. In Perl a regex has the form /<pattern>/<options>.
preg_... emulates this behaviour and requires a delimiter character before and after the pattern parameter; options may follow the closing delimiter (e.g. i for case-insensitive). The delimiter can be chosen (almost) freely.
If you want to fetch a certain part of the match, you have to mark it with unescaped parentheses; the complete match is returned anyway.
The signature of preg_match is
int preg_match ( string pattern, string subject [, array matches [, int flags]]), so the return value is only the number of matches (0 or 1); therefore $urls should be the third parameter to receive the captured matches.
try
Code:
if(preg_match('/<a href="([^"]+)">[^<]+</a>/i', $news, $urls))
{
    echo '<pre>';
    print_r($urls);
    echo '</pre>';
}
Posted: Tue Mar 11, 2003 9:44 am
by twigletmac
One teeny correction to volka's code - you're going to have to escape any forward slashes within the pattern otherwise you'll get errors, so instead of </a> you'd put <\/a>.
Mac
Posted: Tue Mar 11, 2003 9:49 am
by Skittlewidth
Thanks I'll try that now!

Posted: Tue Mar 11, 2003 9:52 am
by volka
"you're going to have to escape any forward slashes within the pattern otherwise you'll get errors"
*ouch*
or you choose another delimiter (one that does not appear within the pattern; /me slaps volka around a bit with a large trout)
Code:
preg_match('!<a href="([^"]+)">[^<]+</a>!i', $news, $urls)
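For comparison, here is a small sketch of the two fixes mentioned in this thread: escaping the slash while keeping / as the delimiter, or switching to a delimiter that does not occur in the pattern. The sample string in $news is made up for illustration; the real $news comes from the poster's own page.

```php
<?php
// Hypothetical sample input, not the poster's actual data.
$news = '<p><a href="http://www.example.com/">Example</a></p>';

// Option 1: keep / as the delimiter and escape the slash in </a>.
$withSlash = '/<a href="([^"]+)">[^<]+<\/a>/i';

// Option 2: use a delimiter that does not occur in the pattern.
$withBang = '!<a href="([^"]+)">[^<]+</a>!i';

preg_match($withSlash, $news, $a);
preg_match($withBang, $news, $b);

// Both patterns capture the same URL in group 1.
echo $a[1], "\n";
echo $b[1], "\n";
```

Which option reads better is mostly a matter of taste; with HTML patterns a non-slash delimiter tends to need fewer backslashes.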
Posted: Tue Mar 11, 2003 10:10 am
by Skittlewidth
It doesn't appear to be returning or displaying anything, which might suggest it hasn't found a match?
It's not come up with any errors, so apart from that it must be right! I guess the expression needs to be refined a bit more?
I've checked that the PCRE module is enabled.

Posted: Tue Mar 11, 2003 10:59 am
by volka
I just tested this script
Code:
<html><body>
<table border="1">
<?php
// fetch the page over HTTP (needs allow_url_fopen)
$fd = fopen('http://www.php.net', 'rb');
$phpnet = '';
while ($part = fread($fd, 4096)) {
    $phpnet .= $part;
}
fclose($fd);

// $urls[0] holds the complete matches, $urls[1] the captured href values
if (preg_match_all('!<a href="([^"]+)">[^<]+</a>!i', $phpnet, $urls)) {
    for ($i = 0; $i != count($urls[0]); $i++) {
        echo '<tr><td>', $i, '</td><td>', htmlentities($urls[0][$i]), '</td><td>', htmlentities($urls[1][$i]), '</td></tr>';
    }
} else {
    echo 'no matches';
}
?>
</table>
</body></html>
preg_match only returns the first match and then stops; preg_match_all returns all matches.
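To make the difference concrete, a minimal sketch that runs both functions over the same input. The markup in $html is invented for the example; only the pattern is taken from the thread.

```php
<?php
// Made-up sample markup with two links.
$html = '<a href="one.html">One</a> <a href="two.html">Two</a>';
$pattern = '!<a href="([^"]+)">[^<]+</a>!i';

// preg_match stops after the first match; $first[1] is the first captured URL.
$n1 = preg_match($pattern, $html, $first);

// preg_match_all collects every match; $all[1] lists all captured URLs.
$n2 = preg_match_all($pattern, $html, $all);

echo $n1, ' ', $first[1], "\n";              // 1 one.html
echo $n2, ' ', implode(', ', $all[1]), "\n"; // 2 one.html, two.html
```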
Posted: Wed Mar 12, 2003 3:57 am
by Skittlewidth
I've run your script and it works just great

and in theory that's just what I want, however just my luck, when I replace http://www.php.net with the url I need to use - http://www.guardian.co.uk/syndication/s ... l?U1271588 then it returns no matches!
I've checked other urls and they've worked fine, so I looked at the source code for this page to try and find a reason. The file I need to search is so basic I'm surprised there's any trouble finding the links!
Thanks for your help so far, it's really been appreciated!
Posted: Wed Mar 12, 2003 6:47 am
by volka
e.g. the url is enclosed in ' instead of " and the whole thing is spread over more than one line.
Code:
if(preg_match_all('!<a href=[\'"]([^\'"]+)["\']\s*>[^<]+</a>!mi', $phpnet, $urls))
should do the trick.
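A side note on that pattern: the m modifier only changes how ^ and $ behave, so the part that copes with line breaks is \s* together with the negated character classes, which match newlines anyway. Below is a slightly more tolerant sketch that also allows whitespace between <a and href and uses a backreference so the closing quote must match the opening one. The markup in $page is invented, since the actual Guardian page is not shown in this thread, so treat it as an assumption about what such a page might look like.

```php
<?php
// Invented sample: single-quoted URL, tag broken across several lines.
$page = "<a\n href='story1.html'\n>First\nheadline</a>\n"
      . "<A HREF=\"story2.html\">Second</A>";

// \s+ and \s* allow line breaks inside the tag; \1 refers back to the
// opening quote character captured by group 1.
$pattern = '!<a\s+href\s*=\s*(["\'])(.*?)\1\s*>([^<]+)</a>!is';

$count = preg_match_all($pattern, $page, $m);

echo $count, "\n";
print_r($m[2]);   // the URLs; $m[3] holds the link texts
```

This still misses links with extra attributes before href or with nested tags in the link text, so it is a refinement of the thread's approach rather than a general HTML parser.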
But The Guardian provides its headlines as an RSS feed, too: http://www.guardian.co.uk/rss/ (sometimes Internet Explorer cannot retrieve the data; then try http://www.guardian.co.uk/rss/1,,,00.xml)
Posted: Wed Mar 12, 2003 7:46 am
by superwormy
Just as an added tip, you don't need Regex at all to do this, you can USUALLY do this much more efficiently with basic string matching.
Unfortunately I don't have my script here that I wrote or I'd post it for you, but the basic idea is this:
$atags = array();
// stristr() takes the haystack first and returns everything from the match onward
while (($contents = stristr($contents, '<a ')) !== false) {
    $contents = substr($contents, 3);   // drop the leading '<a '
    $end = strpos($contents, '>');      // end of the opening tag
    if ($end === false) break;
    $atags[] = substr($contents, 0, $end);
    // each entry is now something like this:
    // href="http://www.ur.com" name="whatever"
}
Then just add to this to take everything that's after 'href=\"' next, and it'll usually be MUCH faster than a regular expression.
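One possible way to do that last step with plain string functions is sketched below. The function name extract_hrefs is hypothetical and this is not superwormy's original script; it assumes double-quoted attributes and a PHP version that has stripos(). It also picks up href attributes on any tag, not just <a>. Whether it is actually faster than the regex is something to measure on your own data.

```php
<?php
// Hypothetical helper: collects the values of all href="..." attributes.
function extract_hrefs($contents)
{
    $urls = array();
    $offset = 0;
    // stripos() finds the next 'href="' case-insensitively from $offset on.
    while (($start = stripos($contents, 'href="', $offset)) !== false) {
        $start += 6;                            // skip past href="
        $end = strpos($contents, '"', $start);  // position of the closing quote
        if ($end === false) {
            break;                              // unterminated attribute
        }
        $urls[] = substr($contents, $start, $end - $start);
        $offset = $end + 1;                     // continue after this attribute
    }
    return $urls;
}

// Made-up sample input.
$sample = '<a href="http://www.example.com/" name="x">Ex</a> <A HREF="two.html">Two</A>';
print_r(extract_hrefs($sample));
```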
Posted: Wed Mar 12, 2003 7:57 am
by Skittlewidth
That's fantastic Volka, thank you
I'm going to continue to try and get my head round these regular expressions and will hopefully be able to work them out myself someday!
RSS feeds are something I have only heard about in the last few days since trying to do this, and XML is something I'd like to look into at some point. What's the idea behind RSS and how is it better than the other news feed they offer?