Extracting URL from a string
Posted: Thu Oct 17, 2002 1:35 pm
by f1nutter
Hi,
As part of a project I need to extract a URL from a random string.
So my string might look like:
Code: Select all
<td><a href="/home.php?ID=1">Link</a></td>
and I would like to just keep
/home.php?ID=1 and lose the rest.
The string could start with spaces or anything else not needed. I just need what's in between the quotes.
I've had a look at the regular expression functions (OK), but it is the expression itself that escapes me.
Cheers.
Re: Extracting URL from a string
Posted: Thu Oct 17, 2002 4:16 pm
by rev
f1nutter wrote:
I've had a look at the regular expression functions (OK), but it is the expression itself that escapes me.
How hard did you look?
Pulled from example 3 of preg_match off of php.net...
Code: Select all
<?php
// get host name from URL
preg_match("/^(http:\/\/)?([^\/]+)/i",
"http://www.php.net/index.html", $matches);
$host = $matches[2];
// get last two segments of host name
preg_match("/[^\.\/]+\.[^\.\/]+$/",$host,$matches);
echo "domain name is: ".$matches[0]."\n";
?>
http://www.php.net/manual/en/function.preg-match.php
It does almost what you are looking for.
Posted: Fri Oct 18, 2002 8:17 am
by swush
Well, I had (and still have) that problem too. Of course you can very easily filter anything that starts with http://. The problem, however, is if you only want to pick out links.
it might look like this:
or like that
Code: Select all
<a id="12" href="index.php">link</a>
so
Code: Select all
preg_replace_callback('/<a.*?href="(.*)".*?>.*?<\/a>/si',"fkt",$string)
doesn't work, because of the greedy behavior of (.*). I was unable to figure it out with one regular expression; I finally got something that worked, but only with two regular expressions. I'd be interested in the outcome of your efforts.
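A minimal sketch of the greedy issue described above, and one possible single-expression workaround. The sample input in $html is made up for illustration, and the second pattern is only one way to do it, not necessarily what anyone in this thread ended up using. It assumes double-quoted href values.

```php
<?php
// Hypothetical sample input: two links, with attributes in varying order.
$html = '<a id="12" href="index.php">link</a> <a href="/home.php?ID=1" class="x">Link</a>';

// Greedy version: (.*) runs on to the last double quote from which the
// rest of the pattern can still match, so it swallows far too much.
preg_match('/<a.*?href="(.*)".*?>.*?<\/a>/si', $html, $greedy);
$greedyCapture = $greedy[1];

// Negated character class: [^"]* cannot cross the closing quote of the
// href value, and [^>]* keeps the match inside a single tag.
preg_match_all('/<a\b[^>]*\bhref="([^"]*)"[^>]*>/i', $html, $fixed);
$links = $fixed[1];

var_dump($greedyCapture);
print_r($links);
```

Making the group lazy, (.*?), would also stop at the first closing quote; the negated class simply states the intent more directly.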

Posted: Fri Oct 18, 2002 9:25 am
by ReDucTor
/<a.+href=("([^"]*)"|[^\s]*).*>/ I think that should give you the URL part
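A quick way to check where the URL lands with the pattern above, sketched against the string from the first post. The variable names are only for illustration. For a double-quoted attribute, group 1 still includes the quotes and group 2 holds the bare URL.

```php
<?php
// The sample string from the original question.
$string = '<td><a href="/home.php?ID=1">Link</a></td>';

// Pattern as suggested above, written as a single-quoted PHP string.
$pattern = '/<a.+href=("([^"]*)"|[^\s]*).*>/';

$found = preg_match($pattern, $string, $m);

$withQuotes = $m[1]; // the attribute value including its double quotes
$url        = $m[2]; // the attribute value without the quotes

echo $url, "\n";
```

Note that .+ is greedy, so with several links on one line this would pick up the last href rather than the first.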

Posted: Sun Oct 20, 2002 2:09 pm
by f1nutter
I had a good read of the manuals etc. and the domain name example is not sufficient. Basically I am trying to find anything in between the quotes that follow an href.
This seems to work
Code: Select all
$pattern = "/(?<=href=\")[^\"]*(?=\")/";
// int preg_match_all(string pattern, string subject, array matches)
$success = preg_match_all($pattern, $string, $matches);
$matches = $matches[0];
So what's going on here?
Code: Select all
$pattern =
"/       // start of expression
(?<=     // find a string that starts with ..
href=\"  // .. href then open double quotes (escaped)
)        // close 'start with' directive
[^\"]    // match any character except a double quote
*        // any number of times
(?=      // ending with
\"       // close double quotes (escaped)
)        // close 'ending with' directive
/"       // close expression
preg_match_all finds every match of the pattern in the given string and stores them in the array $matches.
$matches will hold an array of arrays (depending on options), but only $matches[0] will hold what we want, so reassign it to $matches.
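To make the shape of the result concrete, here is a small self-contained run of the same approach. The input in $string is a made-up example, and the pattern is written in single quotes so the double quotes need no escaping; otherwise it is the expression from above.

```php
<?php
// Hypothetical input containing two double-quoted href attributes.
$string = '<td><a href="/home.php?ID=1">Link</a></td> <a href="index.php">x</a>';

// Lookbehind for href=" and lookahead for the closing quote, so the
// quotes themselves are not part of the match.
$pattern = '/(?<=href=")[^"]*(?=")/';

// preg_match_all returns the number of full matches found.
$count = preg_match_all($pattern, $string, $matches);

// With the default flags, $matches[0] is the list of full matches.
$urls = $matches[0];

print_r($urls);
```

Because the pattern has no capturing groups, $matches contains only index 0 here.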
Disadvantages:
Matches all href attributes, including style sheets.
Only matches attributes enclosed in double quotes. The pattern can be changed to single quotes, but as written it does not handle both.
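One possible way around both disadvantages, offered as a sketch rather than a drop-in fix: anchor the match to an a tag, and use a back-reference so the closing quote must be the same character as the opening one. The sample input is invented, and the pattern assumes quoted attribute values and reasonably well-formed tags. A regular expression is not a full HTML parser.

```php
<?php
// Hypothetical input: a stylesheet link plus two anchors, one with
// single quotes and one with double quotes.
$string = '<link rel="stylesheet" href="style.css">'
        . "<a href='/home.php?ID=1'>A</a>"
        . '<a id="12" href="index.php">B</a>';

// <a\b       only tags named a (so <link ...> is skipped)
// [^>]*      any other attributes, without leaving the tag
// (["\'])    group 1: the opening quote, single or double
// (.*?)      group 2: the URL, as short as possible
// \1         the same quote character that opened the value
$pattern = '/<a\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is';

preg_match_all($pattern, $string, $matches);

// Group 2 holds the URLs.
$urls = $matches[2];

print_r($urls);
```

This uses a capturing group instead of lookarounds, so the URLs are in $matches[2] rather than $matches[0].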
Hope this will be of help to someone, or can be improved upon.
Thanks.