URL extraction

Posted: Fri Nov 05, 2010 6:17 am
by spacebiscuit
Hi,

I am trying to write a script which reads the HTML source of a page and extracts any URLs that end in a given extension, in this example jpg.

This is what I have so far:

Code: Select all

$handle = fopen('http://www.awebsite', 'r');

// Start with an empty string so the .= below does not hit an undefined variable.
$contents = '';
while (!feof($handle)) {
    $contents .= fgets($handle);
}

preg_match_all('!http://[\S]+jpg!', $contents, $matches);

for ($x = 0; $x <= 100; $x++) {
    print_r($matches[0][$x].'<br>');
}
This seems to work fine, but for some reason the results are written into a two-dimensional array. Does anyone know how I can get the number of indexes used? I tried count() but this just gave me 1, which is the number of indexes for the first dimension. To test, I am using an arbitrary number of 100 in my for loop.

Thanks,

Rob.

Re: URL extraction

Posted: Fri Nov 05, 2010 9:49 am
by AbraCadaver
Not sure how you want it output, but this is more concise and looks like it would do what you want:

Code: Select all

$contents = file_get_contents('http://extreme-board.com/showthread.php?t=356361&page=5');
preg_match_all('!http://[\S]+jpg!', $contents, $matches);

foreach($matches[0] as $url){
    echo "$url<br />\n";
}
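
On the original question about counting: preg_match_all() fills $matches with one sub-array per group, so the full matches sit in $matches[0] and can be counted there. A minimal sketch, assuming a small inline HTML string in place of a fetched page (the $html sample and the example.com URLs are made up for illustration, and the pattern already excludes quote characters rather than using \S):

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<img src="http://example.com/a.jpg"> '
      . '<img src="http://example.com/b.jpg"> '
      . '<a href="http://example.com/page.html">x</a>';

// preg_match_all() returns the number of full-pattern matches.
$found = preg_match_all('!http://[^"\']+jpg!', $html, $matches);

// $matches[0] holds the full matches, so count() on it gives the same number.
$total = count($matches[0]);

foreach ($matches[0] as $url) {
    echo $url, "\n";
}
echo "matches: $total\n";
```

count($matches) on its own counts the sub-arrays (one for the whole pattern plus one per capture group), which is why it reported 1 in the first post.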

URL Reg Ex

Posted: Fri Nov 05, 2010 9:53 am
by spacebiscuit
I am trying to extract a URL from some HTML source. I am using the following:

!http://[\S]+!

This seems to work but it is also matching the end quotation marks of a url, for example:

<a href="http://www.google.co.uk">

http://www.google.co.uk"

How can I not pick up the quotations - or is there a more efficient way of doing what I'm trying to do?

Thanks in advance,

Rob.

Re: URL Reg Ex

Posted: Fri Nov 05, 2010 10:10 am
by AbraCadaver
How about:

!http://[^"]+!
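
To see the difference between the two patterns side by side, here is a small sketch. The anchor tag is the one from the question with a hypothetical class attribute added, so that there is whitespace after the closing quote:

```php
<?php
// Sample tag; the class attribute is an assumption added for illustration.
$html = '<a href="http://www.google.co.uk" class="ext">';

// \S+ runs until the next whitespace, so it swallows the closing quote.
preg_match('!http://\S+!', $html, $untilSpace);

// [^"]+ stops at the first double quote instead.
preg_match('!http://[^"]+!', $html, $untilQuote);

echo $untilSpace[0], "\n";
echo $untilQuote[0], "\n";
```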

Re: URL Reg Ex

Posted: Fri Nov 05, 2010 10:22 am
by spacebiscuit
Thanks - so I used:

!http://[\S][^"]+!

Which seems to work, but how do I also ignore a single quote? Adding it to the expression causes the PHP to not execute, as it is read as the end of the parameter!

Rob.

Re: URL extraction

Posted: Fri Nov 05, 2010 11:04 am
by Weirdan
Merged topics as I don't think they were so much different as to keep them separate.

Re: URL Reg Ex

Posted: Fri Nov 05, 2010 11:13 am
by AbraCadaver
robburne wrote: Thanks - so I used:

!http://[\S][^"]+!

Which seems to work, but how do I also ignore a single quote? Adding it to the expression causes the PHP to not execute, as it is read as the end of the parameter!

Rob.
I don't know why you are using the \S. Do you have a lot of URLs like http://[space]example.com that you don't want?

Code: Select all

preg_match_all('!http://[^"\']+jpg!', $contents, $matches);
Read: http://us.php.net/manual/en/language.types.string.php
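
The quoting issue is about PHP string literals rather than the regex itself. A minimal sketch of two ways to write the same pattern, assuming made-up example.com markup that uses both quote styles:

```php
<?php
// Single-quoted PHP string: escape the single quote with a backslash.
$single = '!http://[^"\']+jpg!';

// Double-quoted PHP string: escape the double quote instead.
$double = "!http://[^\"']+jpg!";

// Both literals produce the same pattern text: !http://[^"']+jpg!
var_dump($single === $double);

// Hypothetical markup with one single-quoted and one double-quoted attribute.
$html = "<img src='http://example.com/a.jpg'> <img src=\"http://example.com/b.jpg\">";

preg_match_all($single, $html, $matches);
print_r($matches[0]);
```

The single-quoted form is usually the safer choice for patterns, since double-quoted strings also interpret $ and several backslash sequences.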

Re: URL extraction

Posted: Fri Nov 05, 2010 11:21 am
by spacebiscuit
Apologies, I realised after my 2nd post that I should have posted in the regex forum, perhaps it would be better over there?

OK, I have almost nailed this; I'm just stuck on one small detail. Let's say I want all URLs from one of two domains:

!http://w{0,3}.{0,1}[microsoft|google]{1}[\S][^"|^\']+!

So the string must include:

http://
none or 3 occurrences of 'w'
none or 1 occurrence of '.'
followed by 'microsoft' or 'google'
followed by any single non-whitespace character, then any characters other than ' and "

This, however, does not work; the following is a match:

http://forums.devnetwork.net/

How do I force the domain part?

Thanks,

Rob.

Re: URL extraction

Posted: Fri Nov 05, 2010 11:38 am
by AbraCadaver
I didn't test this, just off the top of my head:
!http://(www\.)?(microsoft|google)[^"\']+!

Because your regex is whacked:

This says 0 to 3 w and then 0 or 1 of ANY SINGLE character (that's what . matches, you need to escape it): w{0,3}.{0,1}
This is a character class [] so it matches any ONE of the following SINGLE characters, 1 time: microsoft|google
Any single non-whitespace character, 1 time: [\S]
Any characters except the following, 1 or more times: "|^'
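
A quick way to check the breakdown above is to run both patterns against a few URLs. This is only a sketch: the two patterns are copied from the thread, and the microsoft.com URL is made up for illustration.

```php
<?php
// The pattern from the question and the suggested replacement.
$original  = '!http://w{0,3}.{0,1}[microsoft|google]{1}[\S][^"|^\']+!';
$corrected = '!http://(www\.)?(microsoft|google)[^"\']+!';

// Test URLs; the last one is a hypothetical example.
$urls = [
    'http://forums.devnetwork.net/',
    'http://www.google.com',
    'http://microsoft.com/en',
];

$results = [];
foreach ($urls as $url) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $results[$url] = [
        'original'  => preg_match($original, $url),
        'corrected' => preg_match($corrected, $url),
    ];
    printf(
        "%-30s original=%d corrected=%d\n",
        $url,
        $results[$url]['original'],
        $results[$url]['corrected']
    );
}
```

The forum URL slips through the original pattern because .{0,1} takes the f, the character class takes the single o, [\S] takes the r, and the negated class takes everything after that.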

Re: URL extraction

Posted: Fri Nov 05, 2010 12:29 pm
by spacebiscuit
That works perfectly:

!http://(www\.)?(google|microsoft)[^"\']+!

I'm concerned though because I do not understand why it works (no matter how much reading I do).

I can't find anything in the tutorials which explains this:

(www\.)?

I can't see what catches the end of the URL. All the expression says as far as I can see is that the url must contain one of the two domains. What is stopping the expression from matching the rest of the document? I just don't get it.

Finally, why is the preg_match_all function breaking the match down into three parts:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => http://www.google.com
        )

    [1] => Array
        (
            [0] => www.
        )

    [2] => Array
        (
            [0] => google
        )

)

Thanks,

Rob.

Re: URL extraction

Posted: Fri Nov 05, 2010 2:17 pm
by AbraCadaver
robburne wrote: That works perfectly:

!http://(www\.)?(google|microsoft)[^"\']+!

I'm concerned though because I do not understand why it works (no matter how much reading I do).

I can't find anything in the tutorials which explains this:

(www\.)?
That says match www. and the () groups it so you can use the ? which means that it is OPTIONAL. It can match but doesn't have to.
I can't see what catches the end of the URL. All the expression says as far as I can see is that the url must contain one of the two domains. What is stopping the expression from matching the rest of the document? I just don't get it.
Inside [], a leading ^ negates the class, so [^"\']+ matches any character that is NOT ' and NOT ", 1 or more times. As soon as it encounters one of these, the match ends.
Finally, why is the preg_match_all function breaking the match down into three parts:
$matches[0] is an array of complete matches (the entire pattern, probably what you want), and after it comes one array for each of the capture groups (). You can try the PREG_SET_ORDER flag, or put ?: at the start of each group to make it non-capturing, so the groups are not included in the result:
!http://(?:www\.)?(?:google|microsoft)[^"\']+!
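
The effect of both options on the result array can be sketched like this, assuming a made-up $html string with one link for each domain:

```php
<?php
// Hypothetical sample markup.
$html = '<a href="http://www.google.com">g</a> <a href="http://microsoft.com">m</a>';

// Capturing groups: sub-arrays 1 and 2 hold the group contents.
preg_match_all('!http://(www\.)?(google|microsoft)[^"\']+!', $html, $capturing);

// Non-capturing groups (?:...): only sub-array 0 (the full matches) is filled.
preg_match_all('!http://(?:www\.)?(?:google|microsoft)[^"\']+!', $html, $plain);

// PREG_SET_ORDER: one sub-array per match instead of one per group.
preg_match_all('!http://(www\.)?(google|microsoft)[^"\']+!', $html, $sets, PREG_SET_ORDER);

echo count($capturing), " sub-arrays with capturing groups\n";
echo count($plain), " sub-array with non-capturing groups\n";
print_r($sets);
```

With PREG_SET_ORDER the groups are still captured; they are just arranged so that each entry of $sets describes one complete match.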
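Keep the () around the domains, though. The | has the lowest precedence, so without the group it splits the whole pattern in two, and the following would match either http:// plus optional www. plus google, or microsoft followed by the non-quote characters:
!http://(?:www\.)?google|microsoft[^"\']+!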
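
Testing the version without parentheses around the domains shows that the grouping does matter: alternation binds more loosely than everything else, so the ungrouped pattern is read as two separate alternatives. A small sketch, assuming a made-up $html string that contains a Google link and the bare word microsoft in plain text:

```php
<?php
$grouped   = '!http://(?:www\.)?(?:google|microsoft)[^"\']+!';
$ungrouped = '!http://(?:www\.)?google|microsoft[^"\']+!';

// Hypothetical sample markup.
$html = '<a href="http://www.google.com/search">g</a> <p>microsoft word</p>';

preg_match_all($grouped, $html, $a);
preg_match_all($ungrouped, $html, $b);

// Grouped: only the real URL, matched up to the closing quote.
print_r($a[0]);

// Ungrouped: the URL is cut off after "google", and the plain-text
// "microsoft ..." run is picked up as a second match.
print_r($b[0]);
```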
Check this:
http://www.regular-expressions.info/reference.html
and this:
http://www.regular-expressions.info/refadv.html