URL extraction
Posted: Fri Nov 05, 2010 6:17 am
by spacebiscuit
Hi,
I am trying to write a script which reads the HTML source of a page and extracts any URLs that end in a given extension, in this example jpg.
This is what I have so far:
Code: Select all
$handle = fopen('http://www.awebsite', 'r');
while (!feof($handle)) {
$contents .= fgets($handle);
}
preg_match_all('!http://[\S]+jpg!',"$contents", $matches);
for($x=0; $x<=100; $x++){
print_r($matches[0][$x].'<br>');
}
This seems to work fine, but for some reason the results are written into a two-dimensional array. Does anyone know how I can get the number of indexes used? I tried count() but this just gave me 1, which is the number of indexes for the first dimension. To test, I am using an arbitrary number of 100 in my for loop.
Thanks,
Rob.
Re: URL extraction
Posted: Fri Nov 05, 2010 9:49 am
by AbraCadaver
Not sure how you want it outputted, but this seems more concise and like it would do what you want:
Code: Select all
$contents = file_get_contents('http://extreme-board.com/showthread.php?t=356361&page=5');
preg_match_all('!http://[\S]+jpg!', $contents, $matches);
foreach($matches[0] as $url){
echo "$url<br />\n";
}
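On the count() question: $matches[0] holds the full matches, so counting that sub-array (or using the return value of preg_match_all) gives the number of URLs found. A minimal sketch, assuming a made-up HTML string in place of the fetched page ($html and the example.com URLs are placeholders):

```php
<?php
// Sample HTML standing in for the fetched page (hypothetical data).
$html = '<img src="http://example.com/a.jpg"> <img src="http://example.com/b.jpg">';

// preg_match_all returns the number of full-pattern matches.
$total = preg_match_all('!http://[^"\']+jpg!', $html, $matches);

// $matches[0] is the list of full matches, so count() on it gives the same number.
$urls = $matches[0];
echo count($urls), " URLs found\n";
```

Calling count($matches) instead counts the outer array, which has one entry for the full pattern plus one per capture group.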
URL Reg Ex
Posted: Fri Nov 05, 2010 9:53 am
by spacebiscuit
I am trying to extract a URL from some HTML source, I am using the following:
!http://[\S]+!
This seems to work but it is also matching the end quotation marks of a url, for example:
<a href="http://www.google.co.uk">
http://www.google.co.uk"
How can I not pick up the quotations - or is there a more efficient way of doing what I'm trying to do?
Thanks in advance,
Rob.
Re: URL Reg Ex
Posted: Fri Nov 05, 2010 10:10 am
by AbraCadaver
How about:
[text]!http://[^"]+![/text]
Re: URL Reg Ex
Posted: Fri Nov 05, 2010 10:22 am
by spacebiscuit
Thanks - so I used:
!http://[\S][^"]+!
This seems to work, but how do I also exclude a single quote? Adding it to the expression causes the PHP to not execute, as it is read as the end of the parameter!
Rob.
Re: URL extraction
Posted: Fri Nov 05, 2010 11:04 am
by Weirdan
Merged topics as I don't think they were so much different as to keep them separate.
Re: URL Reg Ex
Posted: Fri Nov 05, 2010 11:13 am
by AbraCadaver
robburne wrote:Thanks - so I used:
!http://[\S][^"]+!
This seems to work, but how do I also exclude a single quote? Adding it to the expression causes the PHP to not execute, as it is read as the end of the parameter!
Rob.
I don't know why you are using the \S. Do you have a lot of URLs like http://[space]example.com that you don't want?
Code: Select all
preg_match_all('!http://[^"\']+jpg!', $contents, $matches);
Read:
http://us.php.net/manual/en/language.types.string.php
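As a rough sketch of the quoting rule from that page: inside a single-quoted PHP string, \' produces a literal single quote, so the pattern can exclude both quote styles. The $html string here is made-up sample data:

```php
<?php
// Hypothetical sample markup using both quote styles around the URLs.
$html = "<a href='http://example.com/one.jpg'> <a href=\"http://example.com/two.jpg\">";

// \' is a literal single quote inside the single-quoted pattern string,
// so the regex engine sees [^"']+ as the character class.
$pattern = '!http://[^"\']+jpg!';

preg_match_all($pattern, $html, $matches);
foreach ($matches[0] as $url) {
    echo $url, "\n";
}
```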
Re: URL extraction
Posted: Fri Nov 05, 2010 11:21 am
by spacebiscuit
Apologies, I realised after my 2nd post that I should have posted in the regex forum. Perhaps it would be better over there?
OK, I have almost nailed this, just stuck on one small detail. Let's say I want all URLs from one of two domains:
!http://w{0,3}.{0,1}[microsoft|google]{1}[\S][^"|^\']+!
So the string must include:
http://
none or 3 occurrences of 'w'
none or 1 occurrence of '.'
followed by 'microsoft' or 'google'
followed by a single non-whitespace character, then any characters except ' and "
This however does not work, the following is a match:
http://forums.devnetwork.net/
How do I force the domain part?
Thanks,
Rob.
Re: URL extraction
Posted: Fri Nov 05, 2010 11:38 am
by AbraCadaver
I didn't test this, just off the top of my head:
[text]!http://(www\.)?(microsoft|google)[^"\']+![/text]
Because your regex is whacked:
This says 0 to 3 w and then 0 or 1 of ANY SINGLE character (that's what . matches, you need to escape it): w{0,3}.{0,1}
This is a character class [] so any ONE of the following SINGLE characters 1 time: microsoft|google
Any single non-whitespace character 1 time: [\S]
Any characters except the following 1 or more times: "|^'
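A quick way to see the difference is to run both patterns against a few URLs. This is only a sketch; the sample URLs and variable names are placeholders:

```php
<?php
// The original pattern (character class) and the grouped alternation.
$broken = '!http://w{0,3}.{0,1}[microsoft|google]{1}[\S][^"|^\']+!';
$fixed  = '!http://(www\.)?(microsoft|google)[^"\']+!';

$samples = [
    'http://forums.devnetwork.net/',
    'http://www.google.com/',
    'http://microsoft.com/',
];

// preg_match returns 1 on a match and 0 otherwise.
foreach ($samples as $url) {
    printf("%-32s broken:%d fixed:%d\n", $url, preg_match($broken, $url), preg_match($fixed, $url));
}
```

The character class only asks for one letter out of the set, which is why the forums URL gets through the first pattern.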
Re: URL extraction
Posted: Fri Nov 05, 2010 12:29 pm
by spacebiscuit
That works perfectly:
!http://(www\.)?(google|microsoft)[^"\']+!
I'm concerned though because I do not understand why it works (no matter how much reading I do).
I can't find anything in the tutorials which explains this:
(www\.)?
I can't see what catches the end of the URL. All the expression says as far as I can see is that the url must contain one of the two domains. What is stopping the expression from matching the rest of the document? I just don't get it.
Finally, why is the preg_match_all function breaking the match down into three parts:
Code: Select all
Array
(
[0] => Array
(
[0] => http://www.google.com
)
[1] => Array
(
[0] => www.
)
[2] => Array
(
[0] => google
)
)
Thanks,
Rob.
Re: URL extraction
Posted: Fri Nov 05, 2010 2:17 pm
by AbraCadaver
robburne wrote:That works perfectly:
!http://(www\.)?(google|microsoft)[^"\']+!
I'm concerned though because I do not understand why it works (no matter how much reading I do).
I can't find anything in the tutorials which explains this:
(www\.)?
That says match www. and the () groups it so you can use the ? which means that it is OPTIONAL. It can match but doesn't have to.
I can't see what catches the end of the URL. All the expression says as far as I can see is that the url must contain one of the two domains. What is stopping the expression from matching the rest of the document? I just don't get it.
At the start of a character class, ^ means "not", so [^"\']+ matches any character that is NOT ' and NOT ", 1 or more times. As soon as it encounters one of these, it is finished matching.
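For instance, in this small sketch (the $html string is made up for illustration) the match stops right before the closing quote of the attribute, even though the document continues:

```php
<?php
// Made-up sample: a URL inside an attribute, followed by more document text.
$html = '<a href="http://www.google.com/maps">Maps</a> and the rest of the page';

preg_match('!http://(www\.)?(google|microsoft)[^"\']+!', $html, $m);

// $m[0] is the full match; it ends just before the closing double quote.
echo $m[0], "\n";
```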
Finally, why is the preg_match_all function breaking the match down into three parts:
$matches[0] is an array of complete matches (the entire pattern, probably what you want) and after it comes one array for each of the capture groups (). You can try the PREG_SET_ORDER flag, or put ?: at the start of each group so that it still groups but is not included as a capture group:
[text]!http://(?:www\.)?(?:google|microsoft)[^"\']+![/text]
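Here is a rough comparison of the three options, assuming a made-up $html string (only the patterns come from this thread):

```php
<?php
// Made-up sample markup; only the patterns come from the thread.
$html = '<a href="http://www.google.com/search">x</a> <a href="http://microsoft.com/en">y</a>';

// Capturing groups: $withGroups gets one extra sub-array per ().
preg_match_all('!http://(www\.)?(google|microsoft)[^"\']+!', $html, $withGroups);

// Non-capturing groups: only the full matches, in $fullOnly[0].
preg_match_all('!http://(?:www\.)?(?:google|microsoft)[^"\']+!', $html, $fullOnly);

// PREG_SET_ORDER: one sub-array per match instead of one per group.
preg_match_all('!http://(www\.)?(google|microsoft)[^"\']+!', $html, $sets, PREG_SET_ORDER);

echo count($withGroups), ' ', count($fullOnly), ' ', count($sets), "\n";
print_r($sets[0]);
```

With PREG_SET_ORDER each entry of $sets holds the full match followed by its groups, which is handy when you loop over matches one at a time.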
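Keep the () around the domains, though. Without them the | splits the whole pattern in two, so this matches either http://(www.)google on its own or a bare microsoft followed by the rest, which is not what you want:
[text]!http://(?:www\.)?google|microsoft[^"\']+![/text]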
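A quick way to test the ungrouped version, as a sketch on made-up sample markup ($html and the variable names are placeholders). Because | has the lowest precedence, the pattern reads as http://(?:www\.)?google OR microsoft[^"\']+:

```php
<?php
// Made-up sample markup with one URL per domain.
$html = '<a href="http://www.google.com/search">x</a> <a href="http://microsoft.com/en">y</a>';

// Alternation without a group: the | divides the entire pattern.
preg_match_all('!http://(?:www\.)?google|microsoft[^"\']+!', $html, $ungrouped);

// Alternation inside a group: the | only divides the two domain names.
preg_match_all('!http://(?:www\.)?(?:google|microsoft)[^"\']+!', $html, $grouped);

print_r($ungrouped[0]);
print_r($grouped[0]);
```

The ungrouped pattern cuts the google URL short and drops the http:// from the microsoft one, while the grouped pattern returns both URLs whole.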
Check this:
http://www.regular-expressions.info/reference.html
and this:
http://www.regular-expressions.info/refadv.html