my first try at regex, and I'm stuck :-P
Posted: Thu Dec 14, 2006 1:59 am
by afbase
OK, I'm cURLing MSN Money pages and using regex to collect data off the pages. An example of the HTML source I'm selecting from:
I want my regex to pick the number "6.20" out of the string. The proper syntax for this would be (I think):
My regex code to select this from the MSN page is:
Code:
$result = curl_exec($ch);
curl_close($ch);
$pattern='#\w\W\w\W+\w+\W+\w+\s\w+\W+\w+\W+(\d+\W\d*)#i';
preg_match($pattern,$result,$match);
print_r($match);
oops my returned code
Posted: Thu Dec 14, 2006 2:01 am
by afbase
This is what the code spits out
Code:
Array ( [0] => 0 s 0 v 0 l 0) [1] => 0) )
Posted: Thu Dec 14, 2006 4:06 am
by volka
\w\W\w\W+\w+\W+\why are there so many \w\W in your pattern?
Code:
$subject = 'P/E</td><td class="cl1">6.20';
$pattern = '/<td class="cl1">([\d+.]+)/';
preg_match($pattern, $subject, $matches);
echo $matches[1];
Posted: Thu Dec 14, 2006 6:02 am
by kaisellgren
Not tested...
Code:
preg_match("/>(\d+(\.\d+)?)/i",$str,$matches);
echo $matches[1];
P/E
Posted: Thu Dec 14, 2006 3:11 pm
by afbase
The "P/E" is the critical identifier, not so much the "<td class='cl1'>". If I try either snippet, it just returns incorrect data: instead of the P/E ratio, it returns the previous day's close.
I just realized that some stocks do not have a number. How can I get preg_match() to select the P/E ratio either as a number or as "NA"?
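One way to accept either form is an alternation inside the capture group. The sketch below is not from the thread: the function name pe_or_na and the sample rows are made up for illustration, and it assumes the value cell holds either a plain decimal such as 6.20 or the literal text NA, directly after the P/E label cell.

```php
<?php
// Hypothetical helper (not from the thread): returns the P/E cell as a
// numeric string, the string "NA", or null when the label is not found.
function pe_or_na($html)
{
    // Anchor on the "P/E" label cell, then accept a decimal number or NA.
    $pattern = '#P/E</td><td class="cl1">(\d+(?:\.\d+)?|NA)#';
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }
    return null;
}

// Made-up sample rows in the shape quoted in this thread.
$rows = array(
    '<tr><td>Previous Close</td><td class="cl1">31.05</td></tr>'
        . '<tr><td>P/E</td><td class="cl1">6.20</td></tr>',
    '<tr><td>P/E</td><td class="cl1">NA</td></tr>',
);
foreach ($rows as $row) {
    var_dump(pe_or_na($row));
}
```

Because the pattern starts at the P/E label, an earlier cell with the same class (such as the previous close) is skipped. Values with thousands separators or a minus sign would need a wider pattern.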
Almost Solved!!!!
Posted: Thu Dec 14, 2006 3:33 pm
by afbase
feyd | Please use [code] and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171) to learn how to do it too.
here is my code now:
Code:
$result = curl_exec($ch);
curl_close($ch);
$pattern='#[P\E]</td><td class="cl1">([\d+.]+|\w+)#';
preg_match($pattern,$result,$match);
print_r($match);
It properly selects the P/E ratio, but it isn't a very clean selection. This is what it returns for an "NA" value:
and for a number:
Code:
Array ( [0] => E28.60 [1] => 28.60 )
These outputs are just examples. I could use this regex, but I really, really don't want $match[0]!!! I'm going to put this into a function that will loop over 1000 times, and carrying extra data through that many iterations isn't healthy coding.
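If the goal is to end up with only the value, one option is a fixed-length lookbehind: the label is asserted but not consumed, so the whole match is the value itself and no capture group is needed. This is only a sketch; the function name pe_value_only is hypothetical and the sample rows are made up. The cost of the extra array element over a thousand iterations is likely negligible either way.

```php
<?php
// Sketch: the lookbehind (?<=...) requires the label cell to sit directly
// before the match without becoming part of it, so $match[0] is the value.
function pe_value_only($html)
{
    $pattern = '#(?<=P/E</td><td class="cl1">)[^<]+#';
    return preg_match($pattern, $html, $match) === 1 ? $match[0] : null;
}

// Made-up sample rows for illustration.
var_dump(pe_value_only('<tr><td>P/E</td><td class="cl1">28.60</td></tr>'));
var_dump(pe_value_only('<tr><td>P/E</td><td class="cl1">NA</td></tr>'));
```

PCRE lookbehinds must have a fixed length, which holds here because the label markup is a literal string. A plain wrapper function that returns $match[1] would work just as well.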
Re: P/E
Posted: Thu Dec 14, 2006 4:11 pm
by volka
afbase wrote: The "P/E" is the critical identifier, not so much the "<td class='cl1'>". If I try either snippet, it just returns incorrect data: instead of the P/E ratio, it returns the previous day's close.
Then give us more data. I don't know "MSN money pages"; do they have e.g. a URL? Something to test on?
MSN Money
Posted: Thu Dec 14, 2006 5:12 pm
by afbase
The following two links are examples of the pages that I'm cURLing:
http://moneycentral.msn.com/detail/stoc ... Symbol=wgo
http://moneycentral.msn.com/detail/stoc ... ymbol=zoom
I am trying to capture the P/E ratio displayed on the far right of the main table of information. It will either display NA or a number and I've given you links to these two types of examples.
The specific piece that I'm looking for is on line 86, column 3918 of the source code.
this is my code so far
Code:
$url = "http://moneycentral.msn.com/detail/stock_quote?ipage=qd&Symbol=US%3A".$_GET['ticker'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
$result = curl_exec($ch);
curl_close($ch);
$pattern='#[P\][E]</td><td class="cl1">([\d+.]+|\w+)#i';
preg_match($pattern,$result,$match);
print_r($match);
Posted: Fri Dec 15, 2006 2:46 am
by volka
Code:
$testdata = array(
'http://moneycentral.msn.com/detail/stock_quote?Symbol=wgo',
'http://moneycentral.msn.com/detail/stock_quote?Symbol=zoom'
);
$pattern = '!<tr><td>P/E</td><td class="cl1">([^<]*)</td></tr>!';
foreach($testdata as $url) {
$subject = file_get_contents($url);
preg_match($pattern, $subject, $matches);
$pe = $matches[1];
echo $url, ' -> ', $pe, "<br />\n";
}
Works fine for me.
Whether you use URL wrappers or cURL doesn't matter; they both return a string.
volka thanks!!
Posted: Fri Dec 15, 2006 3:54 am
by afbase
Thanks for your help!!!!! That pattern you gave me actually shed some light on how to write regex patterns better. I had to modify your pattern a little bit, though: I forgot that special stocks that have gone through bankruptcies or have pro forma earnings, like Lockheed Martin (LMT), have FYI/pseudo P/E ratios displayed on MSN. So here is the final script, if you are curious:
Code:
<?php
$url = "http://moneycentral.msn.com/detail/stock_quote?ipage=qd&Symbol=US%3A".$_GET['ticker'];
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
$result = curl_exec($ch);
curl_close($ch);
$pattern='!P/E</td><td class="cl1">([^<]*)</td></tr>!';
preg_match($pattern,$result,$match);
print_r($match);
?>
I'm going to stick with the cURL script and borrow your pattern. It loops/retrieves pages faster for some reason (according to some posts on php.net); I'm not exactly sure why, though. Eventually I'm going to loop this script
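For the looping step, a rough sketch of how the final script might be split so that fetching and extraction are separate and failures are handled. None of the function names below (extract_pe, fetch_page, pe_for_tickers) come from the thread; they are illustrative. The cURL options and the pattern are the ones from the final script, and urlencode() on the ticker is an added assumption.

```php
<?php
// Hypothetical helper: pull the P/E cell out of an already-fetched page.
function extract_pe($html)
{
    $pattern = '!P/E</td><td class="cl1">([^<]*)</td></tr>!';
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    if (is_string($html) && preg_match($pattern, $html, $match) === 1) {
        return trim($match[1]);
    }
    return null;
}

// Hypothetical helper: fetch one page with cURL; returns false on failure.
function fetch_page($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 3);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

// Loop over tickers. $fetch is any callable mapping a URL to a string or
// false, so cURL or file_get_contents can be swapped in.
function pe_for_tickers($tickers, $fetch)
{
    $base = 'http://moneycentral.msn.com/detail/stock_quote?ipage=qd&Symbol=US%3A';
    $out = array();
    foreach ($tickers as $ticker) {
        $html = call_user_func($fetch, $base . urlencode($ticker));
        $out[$ticker] = extract_pe($html); // null when fetch or match fails
    }
    return $out;
}
```

Calling pe_for_tickers(array('wgo', 'zoom'), 'fetch_page') would run it against the live pages. Passing 'file_get_contents' instead should behave the same way where URL wrappers are enabled, since both return a string.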