Extracting URL from a string

f1nutter
Forum Contributor
Posts: 125
Joined: Wed Jun 05, 2002 12:08 pm
Location: London

Extracting URL from a string

Post by f1nutter »

Hi,

As part of a project I need to extract a URL from an arbitrary string.

So my string might look like:

Code:

<td><a href="/home.php?ID=1">Link</a></td>
and I would like to keep just /home.php?ID=1 and lose the rest.

The string could start with spaces or anything else that isn't needed. I just need what's in between the quotes.

I've had a look at the regular expression functions (OK), but it is the expression itself that escapes me.

Cheers.
rev
Forum Commoner
Posts: 52
Joined: Wed Oct 02, 2002 3:58 pm
Location: Atlanta, GA

Re: Extracting URL from a string

Post by rev »

f1nutter wrote: I've had a look at the regular expression functions (OK), but it is the expression itself that escapes me.
How hard did you look? :)

Pulled from example 3 of preg_match off of php.net...

Code:

<?php
// get host name from URL
preg_match("/^(http:\/\/)?([^\/]+)/i",
    "http://www.php.net/index.html", $matches);
$host = $matches[2];
// get last two segments of host name
preg_match("/[^\.\/]+\.[^\.\/]+$/", $host, $matches);
echo "domain name is: " . $matches[0] . "\n";
?>
http://www.php.net/manual/en/function.preg-match.php

It does almost what you are looking for.
swush
Forum Newbie
Posts: 2
Joined: Fri Oct 18, 2002 7:55 am

Post by swush »

Well, I had (and still have) that problem too. Of course you can filter anything that starts with http:// very easily. The problem, however, is when you only want to pick out links.
A link might look like this:

Code:

<a href="index.php">link</a>
or like this:

Code:

<a id="12" href="index.php">link</a>
so

Code:

preg_replace_callback('/<a.*?href="(.*)".*?>.*?<\/a>/si', "fkt", $string)
doesn't work, because (.*) is greedy and runs on to the last quote it can find instead of stopping at the one that closes the href value. I was unable to figure it out with one regular expression, but I finally got something that worked using two. I'd be interested in the outcome of your efforts :)
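
To illustrate the greedy behaviour described above, here is a minimal sketch. The sample string and variable names are made up for the example and are not swush's code. It suggests that a negated character class, [^"]*, can do the job in one expression for double-quoted values, because it cannot run past the closing quote.

```php
<?php
// Hypothetical sample: two links, the first with an extra attribute AFTER href.
$html = '<a href="index.php" class="nav">link</a> <a id="12" href="about.php">about</a>';

// Greedy (.*) keeps going to the last quote that still lets the rest match.
preg_match('/<a.*?href="(.*)".*?>/si', $html, $greedy);

// [^"]* stops at the first closing quote; [^>]*? stays inside one tag.
preg_match_all('/<a\b[^>]*?\bhref="([^"]*)"/si', $html, $fixed);

echo $greedy[1], "\n"; // far more than just index.php
print_r($fixed[1]);    // index.php and about.php
```

The second pattern still only handles double-quoted values, so treat it as a starting point rather than a complete answer.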
ReDucTor
Forum Commoner
Posts: 90
Joined: Thu Aug 15, 2002 6:13 am

Post by ReDucTor »

/<a.+href=("([^"]*)"|[^\s]*).*>/ I think that should give you the URL part :)
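
As a rough sketch of how a pattern along these lines could be put to use: the sample string and the tightened pattern below are assumptions for illustration, not the exact expression above, and the ?? operator assumes PHP 7 or later. The changes are that [^>]* keeps the match inside a single tag, and the unquoted branch stops before the closing >.

```php
<?php
// Hypothetical sample with one quoted and one unquoted href value.
$html = '<a href="/home.php?ID=1">Link</a> <a href=/about.php>About</a>';

// Group 1 captures a double-quoted value, group 2 an unquoted one.
$pattern = '/<a\b[^>]*\bhref=(?:"([^"]*)"|([^\s>"]*))/i';

preg_match_all($pattern, $html, $m, PREG_SET_ORDER);

$urls = [];
foreach ($m as $set) {
    // With PREG_SET_ORDER a trailing unmatched group is simply absent,
    // so fall back to an empty string if group 2 is not set.
    $urls[] = ($set[1] !== '') ? $set[1] : ($set[2] ?? '');
}
print_r($urls);
```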
f1nutter
Forum Contributor
Posts: 125
Joined: Wed Jun 05, 2002 12:08 pm
Location: London

Post by f1nutter »

I had a good read of the manuals etc. and the domain name example is not sufficient. Basically I am trying to find anything in between the quotes that follow an href.

This seems to work :D

Code:

$pattern = "/(?<=href=\")[^\"]*(?=\")/";

// int preg_match_all(string pattern, string subject, array matches)
$success = preg_match_all($pattern, $string, $matches);

// keep only the full-pattern matches
$matches = $matches[0];
So what's going on here?

Code:

$pattern =
"/              # start of expression
(?<=            # find a string that is preceded by ..
  href=\"       # .. href= then an opening double quote (escaped)
)               # close the 'preceded by' (lookbehind) directive

[^\"]*          # match any character except a double quote, any number of times
(?=             # followed by ..
  \"            # .. a closing double quote
)               # close the 'followed by' (lookahead) directive
/x";            // close expression; the x modifier ignores the whitespace and # comments
preg_match_all finds every match of the pattern in the given string and stores the results in the array $matches.

$matches will hold an array of arrays (the layout depends on the flags), but only $matches[0], the list of full-pattern matches, holds what we want, so reassign it to $matches.
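
To make the shape of $matches concrete, here is a small sketch. The sample string and the names $look and $count are made up for the example. With the default flag, PREG_PATTERN_ORDER, index 0 holds the full matches and each further index holds one capturing group; the lookaround pattern has no groups, so it only fills index 0.

```php
<?php
// Hypothetical sample input with two links.
$string = '<td><a href="/home.php?ID=1">Link</a></td><td><a href="/news.php">News</a></td>';

// Lookaround version: no capturing groups, so only $look[0] exists.
preg_match_all('/(?<=href=")[^"]*(?=")/', $string, $look);

// Capturing-group version: $matches[0] is the full match, $matches[1] the group.
$count = preg_match_all('/href="([^"]*)"/', $string, $matches);

print_r($look[0]);    // the two URLs
print_r($matches[0]); // the two href="..." attributes, quotes included
print_r($matches[1]); // the two URLs again
echo $count, "\n";    // number of full matches found
```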

Disadvantages:
Matches all href attributes, including those on <link> tags for style sheets.
Only matches attributes enclosed in double quotes. The pattern can be changed to use single quotes, but as written it cannot handle both at once.
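
A minimal sketch of one way both points might be addressed, assuming the links are always in <a> tags. The sample string and the name $urls are made up for the example. It drops the lookarounds in favour of a capturing group for the opening quote and a backreference (\1) for the matching closing quote.

```php
<?php
// Hypothetical sample: a style sheet link, a double-quoted and a single-quoted anchor.
$html = '<link rel="stylesheet" href="style.css">'
      . '<a href="/home.php?ID=1">Home</a>'
      . "<a class='x' href='/news.php'>News</a>";

// <a\b      only anchor tags, so <link> is skipped
// [^>]*?    any other attributes inside the same tag
// (["\'])   the opening quote, remembered as group 1
// (.*?)\1   the value, up to the same kind of quote
$pattern = '/<a\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is';

preg_match_all($pattern, $html, $matches);
$urls = $matches[2]; // group 2 holds the URLs
print_r($urls);
```

This still ignores unquoted values and anything unusual in the markup, so for real-world HTML a DOM parser such as PHP's DOMDocument is likely the more robust route.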

Hope this will be of help to someone, or can be improved upon.

Thanks.