preg_match_all problem.

philthemn
Forum Newbie
Posts: 1
Joined: Fri Jul 01, 2011 9:08 am

preg_match_all problem.

Post by philthemn »

Hey, I'm trying to extract certain links from the Telegraph news website. I'm using preg_match_all() because the links I want to extract follow a consistent pattern.

Here is a sample of the source I want to extract the link from:
<h3>
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">Dominique Strauss-Kahn 'could still enter French presidential race'</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html"><img src="http://i.telegraph.co.uk/multimedia/arc ... 01927g.jpg" alt="Dominique Strauss-Kahn at Manhattan Criminal Court " border="0" width="140" height="87" />
<span class="cornerimageleft">&nbsp;</span></a>
</div>
As you can see, the links have a 7-digit identifier, so my code so far looks like this:

Code: Select all

$html = file_get_contents("http://www.telegraph.co.uk");

preg_match_all("#href=\"[a-z|A-Z|0-9|\/|\.\-]+[0-9]{7}.+a>$#", $html, $link);

print_r($link);

But for some reason the output is just 'Array ( [0] => Array ( ) )'. I've even tested my expression in an online regular expression tester, and there it picks up the link.

Does anyone have any idea why my expression will not pick out the links in the above page source?

Thanks a lot,
Phil
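
One possible explanation for the empty result, offered as a sketch rather than a confirmed diagnosis: without the m modifier, $ only matches at the very end of the subject (or just before a final newline), and . does not match newlines, so ".+a>$" can only succeed if the link sits on the last line of the page. An online tester fed just the one link line would not show this. The $html string below is a hypothetical stand-in for the downloaded page, and "story.html" is a made-up file name.

```php
<?php
// Hypothetical stand-in for the page: the link sits on its own line,
// followed by more markup, as in the sample above.
$html = "<h3>\n"
      . "<a href=\"/finance/dominique-strauss-kahn/8610673/story.html\">Headline</a>\n"
      . "</h3>\n";

// The pattern from the post, unchanged apart from using single quotes.
$original = '#href="[a-z|A-Z|0-9|\/|\.\-]+[0-9]{7}.+a>$#';

// Same pattern with the m (multiline) modifier, so $ also matches before any newline.
$multiline = $original . 'm';

$countOriginal  = preg_match_all($original, $html, $link);
$countMultiline = preg_match_all($multiline, $html, $linkM);

echo $countOriginal, "\n";   // 0
echo $countMultiline, "\n";  // 1
echo $linkM[0][0], "\n";     // the whole tag from href= through </a>
```

Whether the m modifier alone is enough on the real page depends on how its markup is laid out, which this sketch does not check.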
social_experiment
DevNet Master
Posts: 2793
Joined: Sun Feb 15, 2009 11:08 am
Location: .za

Re: preg_match_all problem.

Post by social_experiment »

Code: Select all

$pattern = '#href=\"[\/|\D]{7}.+\">#';
This picks out the following:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">
            [1] => href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">
        )

)
I've tried to stop the greater-than sign from being included, but if I remove it from the pattern it doesn't work. After running preg_match_all(), you can use substr($link[0][0], 0, -1); to retrieve the string without the trailing > sign.
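
A minimal sketch of that substr() clean-up, using one of the matches printed above as input. The second call is an assumption about what is ultimately wanted (the bare URI): it also drops the leading href=" and the closing quote. The variable names are only for illustration.

```php
<?php
// One match exactly as printed by print_r() above.
$match = 'href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">';

// Drop only the final ">" character; the closing quote stays.
$withoutBracket = substr($match, 0, -1);

// Drop the leading 'href="' (6 characters) and the trailing '">' (2 characters).
$uri = substr($match, 6, -2);

echo $withoutBracket, "\n";
echo $uri, "\n";
```

The fixed offsets only hold while every match starts with href=" and ends with ">, which is what this pattern produces.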
McInfo
DevNet Resident
Posts: 1532
Joined: Wed Apr 01, 2009 1:31 pm

Re: preg_match_all problem.

Post by McInfo »

I recommend this pattern:

Code: Select all

$pattern = <<<PATTERN
    ~      # start pattern: #(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i
    (?<=   #   start sub-pattern (non-capturing look-behind assertion)
    \s     #     any white-space character
    href=" #     a literal character sequence
    )      #   end sub-pattern
    [^"]   #   character class: match anything but quote character
    *      #   repetition: match previous character class zero or more times
    /      #   a literal slash character
    \d     #   any decimal digit
    {7}    #   repetition: match seven decimal digits (blog links have nine)
    /      #   a literal slash character
    [^"]*  #   character class; same as before
    (?=    #   start sub-pattern (non-capturing look-ahead assertion)
    "      #     a literal quote (matched but not captured)
    )      #   end sub-pattern
           # (below) end pattern and set modifiers:
           # i: case-insensitive matching
           # x: ignore white-space in pattern and allow these comments
    ~ix
PATTERN;
If you also want to match the blog links, which have nine digits instead of seven, add a comma after the 7: {7,} matches seven or more digits.
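
To illustrate the difference the comma makes, here is a small sketch using the compact form of the pattern from its first comment line. Both tags are hypothetical examples; in particular, the nine-digit blog path is invented for the test.

```php
<?php
// Compact form of the pattern, as given in its first comment line.
$seven       = '#(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i';
// Same pattern with the comma added after the 7.
$sevenOrMore = '#(?<=\shref=")[^"]*/\d{7,}/[^"]*(?=")#i';

// Hypothetical tags: an article link with a 7-digit id, a blog link with a 9-digit id.
$article = '<a href="/finance/8610673/story.html">';
$blog    = '<a href="/blogs/100012345/post.html">';

var_dump(preg_match($seven, $article));        // int(1)
var_dump(preg_match($seven, $blog));           // int(0)
var_dump(preg_match($sevenOrMore, $article));  // int(1)
var_dump(preg_match($sevenOrMore, $blog));     // int(1)
```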

Pass the matched URIs through array_unique() to filter out the duplicates.
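
Putting the pieces together might look like the sketch below. It uses the compact form of the pattern, and $html is a hypothetical stand-in for the downloaded page in which the same story is linked twice ("story.html" and "thumb.jpg" are made-up names). array_values() is added only to renumber the keys after array_unique().

```php
<?php
// Hypothetical stand-in for the page: the same story linked twice.
$html = <<<HTML
<h3>
<a href="/finance/dominique-strauss-kahn/8610673/story.html">Headline</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/dominique-strauss-kahn/8610673/story.html"><img src="thumb.jpg" alt="" /></a>
</div>
HTML;

// Compact form of the recommended pattern.
$pattern = '#(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i';

// $matches[0] holds the full matches, i.e. the bare URIs.
preg_match_all($pattern, $html, $matches);

// Remove duplicates, then renumber the keys from zero.
$uris = array_values(array_unique($matches[0]));

print_r($uris);
```

Because the look-behind and look-ahead assert the surrounding href=" and " without consuming them, no substr() clean-up should be needed on the results.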