
preg_match_all problem.

Posted: Fri Jul 01, 2011 9:17 am
by philthemn
Hey, I'm trying to extract certain links from the Telegraph news website. I'm using preg_match_all() because the links I want to extract follow a consistent pattern.

Here is a sample of the source I want to extract the link from:
<h3>
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">Dominique Strauss-Kahn 'could still enter French presidential race'</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html"><img src="http://i.telegraph.co.uk/multimedia/arc ... 01927g.jpg" alt="Dominique Strauss-Kahn at Manhattan Criminal Court " border="0" width="140" height="87" />
<span class="cornerimageleft">&nbsp;</span></a>
</div>
As you can see, the links have a 7-digit identifier, so my code so far goes like this:

Code: Select all

$html = file_get_contents("http://www.telegraph.co.uk");

preg_match_all("#href=\"[a-z|A-Z|0-9|\/|\.\-]+[0-9]{7}.+a>$#", $html, $link);

print_r($link);

But for some reason the output is just 'Array ( [0] => Array ( ) )'. I've even tested my expression in an online regular expression tester, and there it picks up the link.

Does anyone have any idea why my expression will not pick out the links in the above page source?

Thanks a lot,
Phil
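
One likely cause, offered as a sketch rather than a confirmed diagnosis: without the m modifier, $ only matches at the very end of the subject (or just before a final newline), and . does not match newlines. On a whole page the pattern can therefore only succeed if the document ends in a>. An online tester fed a single line would not show this. The sample markup below is a shortened, hypothetical stand-in for the real page.

```php
<?php
// Shortened sample modelled on the markup in the post (the path is made up).
$html = "<h3>\n<a href=\"/finance/example/8610673/Some-story.html\">Some story</a>\n</h3>\n";

// The pattern from the post: $ is tied to the end of the whole subject.
$original = "#href=\"[a-z|A-Z|0-9|\/|\.\-]+[0-9]{7}.+a>$#";

// Same pattern plus the m (multiline) modifier, so $ also matches before each newline.
$multiline = $original . 'm';

$countOriginal  = preg_match_all($original, $html, $a);
$countMultiline = preg_match_all($multiline, $html, $b);

var_dump($countOriginal);   // int(0)
var_dump($countMultiline);  // int(1)
```

Dropping the $ anchor altogether is another option, though the greedy .+ would then run to the last a> on each line.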

Re: preg_match_all problem.

Posted: Fri Jul 01, 2011 11:43 am
by social_experiment

Code: Select all

$pattern = '#href=\"[\/|\D]{7}.+\">#';
This picks out the following:

Code: Select all

<?php
Array
(
    [0] => Array
        (
            [0] => href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">
            [1] => href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">
        )

)
?>
I tried to keep the trailing greater-than sign out of the match, but if I remove it from the pattern it doesn't work. After running preg_match_all(), you can use substr($link[0][0], 0, -1); to retrieve the string without the trailing > sign.
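
An alternative to trimming with substr() is a capture group, so that only the URI lands in its own sub-array. This is a minimal sketch, not the pattern from the post: the pattern and the shortened sample markup are assumptions made for illustration.

```php
<?php
// Sample anchor modelled on the markup quoted in the thread (path shortened).
$html = '<a href="/finance/example/8610673/Some-story.html">Some story</a>';

// The parentheses capture only the URI; href=" and the closing quote
// are matched but stay outside the group.
$pattern = '#href="([^"]*/\d{7}/[^"]*)"#';

preg_match_all($pattern, $html, $link);

// $link[0] holds the full matches, $link[1] holds the captured URIs.
print_r($link[1]);
```

With this approach no post-processing of the match is needed; the URI is read from $link[1] instead of $link[0].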

Re: preg_match_all problem.

Posted: Sat Jul 02, 2011 6:51 pm
by McInfo
I recommend this pattern:

Code: Select all

$pattern = <<<PATTERN
    ~      # start pattern: #(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i
    (?<=   #   start sub-pattern (non-capturing look-behind assertion)
    \s     #     any white-space character
    href=" #     a literal character sequence
    )      #   end sub-pattern
    [^"]   #   character class: match anything but quote character
    *      #   repetition: match previous character class zero or more times
    /      #   a literal slash character
    \d     #   any decimal digit
    {7}    #   repetition: match seven decimal digits (blog links have nine)
    /      #   a literal slash character
    [^"]*  #   character class; same as before
    (?=    #   start sub-pattern (non-capturing look-ahead assertion)
    "      #     a literal quote (matched but not captured)
    )      #   end sub-pattern
           # (below) end pattern and set modifiers:
           # i: case-insensitive matching
           # x: ignore white-space in pattern and allow these comments
    ~ix
PATTERN;
If you also want to match the blog links, which have nine digits instead of seven, add a comma after the 7. The quantifier {7,} then matches seven or more digits.
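
To see the effect of the comma, here is a small sketch using the compact one-line form of the pattern. Both paths in the sample are hypothetical, one with a seven-digit identifier and one with nine digits.

```php
<?php
// Hypothetical links: a seven-digit id and a nine-digit id.
$html = '<a href="/news/8610673/story.html">A</a> <a href="/blogs/100094532/post.html">B</a>';

$exactlySeven = '~(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")~i';
$sevenOrMore  = '~(?<=\shref=")[^"]*/\d{7,}/[^"]*(?=")~i';

preg_match_all($exactlySeven, $html, $seven);
preg_match_all($sevenOrMore, $html, $more);

print_r($seven[0]); // only the /news/ link
print_r($more[0]);  // both links
```

If eight-digit or longer identifiers should be excluded selectively, a bounded quantifier such as {7,9} limits the match to between seven and nine digits.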

Pass the matched URIs through array_unique() to filter out the duplicates.
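
Putting it together, a minimal sketch under the assumption that the page repeats each URI in both the headline and the image link, as in the sample from the first post (the path and image name here are shortened placeholders):

```php
<?php
// Sample markup modelled on the thread: the same URI appears twice.
$html = <<<HTML
<h3>
<a href="/finance/example/8610673/Some-story.html">Some story</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/example/8610673/Some-story.html"><img src="x.jpg" alt="" /></a>
</div>
HTML;

// Compact one-line form of the recommended pattern.
$pattern = '~(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")~i';

preg_match_all($pattern, $html, $matches);

// array_unique() keeps the first occurrence and preserves keys;
// array_values() reindexes the result from zero.
$links = array_values(array_unique($matches[0]));

print_r($links);
```

The array_values() call is optional; it only matters if the code later relies on consecutive numeric keys.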