Hey, I'm trying to extract certain links from the Telegraph news website. I'm using preg_match_all() because the links I want to extract follow a consistent pattern.
Here is a sample of the source I want to extract the link from:
<h3>
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">Dominique Strauss-Kahn 'could still enter French presidential race'</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html"><img src="http://i.telegraph.co.uk/multimedia/arc ... 01927g.jpg" alt="Dominique Strauss-Kahn at Manhattan Criminal Court " border="0" width="140" height="87" />
<span class="cornerimageleft"> </span></a>
</div>
As you can see the links have a 7-digit identifier, so my code so far goes like this:
But for some reason the output is just: 'Array ( [0] => Array ( ) ) '. I've even tested my expression in an online regular expression tester, and there it picks up the link.
Does anyone have any idea why my expression will not pick out the links in the above page source?
I've tried to keep the less-than sign out of the match, but if I remove it from the pattern, the pattern stops working. As a workaround, after running preg_match_all() you can use substr($link[0][0], 0, -1); to retrieve the string minus the < sign.
$pattern = <<<PATTERN
~           # start pattern: #(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i
  (?<=      # start sub-pattern (non-capturing look-behind assertion)
    \s      # any white-space character
    href="  # a literal character sequence
  )         # end sub-pattern
  [^"]      # character class: match anything but quote character
  *         # repetition: match previous character class zero or more times
  /         # a literal slash character
  \d        # any decimal digit
  {7}       # repetition: match seven decimal digits (blog links have nine)
  /         # a literal slash character
  [^"]*     # character class; same as before
  (?=       # start sub-pattern (non-capturing look-ahead assertion)
    "       # a literal quote (matched but not captured)
  )         # end sub-pattern
            # (below) end pattern and set modifiers:
            # i: case-insensitive matching
            # x: ignore white-space in pattern and allow these comments
~ix
PATTERN;
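To see the pattern in action, here is a minimal sketch that runs its compact one-line form (the one quoted in the first comment above) against the sample markup from the question. The image tag is trimmed to a stand-in, and the variable names `$html`, `$count` and `$matches` are just illustrative choices.

```php
<?php
// Sample markup from the question; the <img> tag is trimmed to a stand-in.
$html = <<<'HTML'
<h3>
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html">Dominique Strauss-Kahn 'could still enter French presidential race'</a>
</h3>
<div class="picleft containerdiv ">
<a href="/finance/dominique-strauss-kahn/8610673/Dominique-Strauss-Kahn-sexual-assault-case-on-verge-of-collapse-amid-doubts-over-maid.html"><img src="stand-in.jpg" alt="" />
<span class="cornerimageleft"> </span></a>
</div>
HTML;

// Compact form of the commented pattern: no x modifier, so no inline comments.
$pattern = '#(?<=\shref=")[^"]*/\d{7}/[^"]*(?=")#i';

// preg_match_all() returns the number of full matches;
// $matches[0] holds the matched strings themselves.
$count = preg_match_all($pattern, $html, $matches);

print_r($matches[0]);
```

Because the look-behind and look-ahead only assert the surrounding `href="` and closing quote without consuming them, each match is the bare URI and needs no trimming afterwards. The same link appears twice here, once for the headline and once for the thumbnail.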
If you also want to match the blog links, which have nine digits instead of seven, add a comma after the 7: {7,} matches seven or more digits.
Pass the matched URIs through array_unique() to filter out the duplicates.
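Both suggestions can be combined in a rough sketch like the one below. The three links in `$html` are made-up stand-ins (two copies of a seven-digit article link and one nine-digit blog link), not real Telegraph URLs, and `$links` is just an illustrative name.

```php
<?php
// Hypothetical markup: a duplicated 7-digit article link and a 9-digit blog link.
$html = '<a href="/finance/a/1234567/story.html">A</a>'
      . '<a href="/finance/a/1234567/story.html"><img src="x.jpg" /></a>'
      . '<a href="/blogs/b/123456789/post.html">B</a>';

// {7,} means "seven or more digits", so the nine-digit blog ID matches as well.
$pattern = '#(?<=\shref=")[^"]*/\d{7,}/[^"]*(?=")#i';

preg_match_all($pattern, $html, $matches);

// array_unique() drops repeated values but keeps the original keys,
// so array_values() is used to reindex the result from 0.
$links = array_values(array_unique($matches[0]));

print_r($links);
```

The array_values() call is optional; it only matters if you later rely on consecutive numeric keys, for example in a for loop.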