my first try at regex, and I'm stuck :-P

Posted: Thu Dec 14, 2006 1:59 am
by afbase
OK, I'm cURL'ing MSN Money pages and using regex to collect data off them. Here's an example of the HTML source I'm selecting from:

Code: Select all

P/E</td><td class="cl1">6.20
I want my regex to select the number "6.20" from the string... the proper syntax for this would be (I think):

Code: Select all

\d+\.\d*
My code to select this from the MSN page is:

Code: Select all

$result = curl_exec($ch);
curl_close($ch);
$pattern='#\w\W\w\W+\w+\W+\w+\s\w+\W+\w+\W+(\d+\W\d*)#i';
preg_match($pattern,$result,$match);
print_r($match);

oops my returned code

Posted: Thu Dec 14, 2006 2:01 am
by afbase
This is what the code spits out

Code: Select all

Array ( [0] => 0 s 0 v 0 l 0) [1] => 0) )

Posted: Thu Dec 14, 2006 4:06 am
by volka
\w\W\w\W+\w+\W+\why are there so many \w\W in your pattern?

Code: Select all

$subject = 'P/E</td><td class="cl1">6.20';
$pattern = '/<td class="cl1">([\d+.]+)/';

preg_match($pattern, $subject, $matches);
echo $matches[1];
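
A note on why the first attempt returned "0)": in the capture group `(\d+\W\d*)`, `\W` matches any non-word character, so a closing parenthesis satisfies it just as well as a dot. The sketch below compares `\W` with an escaped dot `\.` on two hand-written strings; the helper `first_match` and both sample strings are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical helper: returns the first substring matched by $pattern,
// or null when nothing matches.
function first_match($pattern, $text)
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

$noise = 'l 0) x';            // shaped like the unexpected output above
$cell  = 'class="cl1">6.20';  // the value we actually want

var_dump(first_match('/\d+\W\d*/', $noise)); // \W accepts the ")"
var_dump(first_match('/\d+\.\d*/', $noise)); // \. requires a literal dot
var_dump(first_match('/\d+\.\d*/', $cell));  // skips the "1" in cl1, finds the decimal
```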

Posted: Thu Dec 14, 2006 6:02 am
by kaisellgren
Not tested...

Code: Select all

preg_match("/>(\d+(\.\d+)?)/i",$str,$matches);
echo $matches[1];

P/E

Posted: Thu Dec 14, 2006 3:11 pm
by afbase
The "P/E" is the critical identifier, not so much the "<td class='cl1'>". If I try either code, it just returns incorrect data: instead of the P/E ratio, it returns the previous day's close.

I just realized that some stocks do not have a number. How can I get preg_match() to select the P/E ratio as either a number or "NA"?
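
One possible way to do both things asked here (anchor on the "P/E" label and accept either a number or "NA") is an alternation inside the capture group. This is only a sketch, assuming the markup always looks like the sample earlier in the thread (`P/E</td><td class="cl1">VALUE`); the function name `extract_pe` is invented for the example.

```php
<?php
// Hypothetical helper: returns the P/E text ("6.20" or "NA"),
// or null when the label is not found.
function extract_pe($html)
{
    // Anchor on the literal "P/E" label, then capture a decimal number or "NA".
    // '#' is the delimiter, so the '/' in "P/E" needs no escaping.
    $pattern = '#P/E</td><td class="cl1">(\d+(?:\.\d+)?|NA)#';
    if (preg_match($pattern, $html, $match) === 1) {
        return $match[1];
    }
    return null;
}

var_dump(extract_pe('P/E</td><td class="cl1">6.20'));    // a numeric ratio
var_dump(extract_pe('P/E</td><td class="cl1">NA'));      // no ratio available
var_dump(extract_pe('Close</td><td class="cl1">31.05')); // different label, so no match
```

Because the label is part of the pattern, a cell belonging to another row (such as the previous close) is not picked up by mistake.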

Almost Solved!!!!

Posted: Thu Dec 14, 2006 3:33 pm
by afbase
feyd | Please use code and [syntax="..."] tags where appropriate when posting code. Your post has been edited to reflect how we'd like it posted. Please read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171) to learn how to do it too.


here is my code now:

Code: Select all

$result = curl_exec($ch);
curl_close($ch);
$pattern='#[P\E]</td><td class="cl1">([\d+.]+|\w+)#';
preg_match($pattern,$result,$match);
print_r($match);


It properly selects the P/E ratio, but it isn't a very clean selection. This is what it returns for an "NA" value:

Code: Select all

Array ( [0] => ENA [1] => NA )
and for a number:

Code: Select all

Array ( [0] => E28.60 [1] => 28.60 )


These outputs are just examples. I could use regex, but I really, really don't want $match[0]!!! I'm going to put this into a function that will loop over 1000 times, and carrying extra data through that many iterations isn't healthy coding.
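
For what it's worth, preg_match() always fills index 0 with the full match, so it cannot simply be switched off, and one extra short string per call should be negligible even over 1000 iterations. If a clean index 0 is still wanted, one option is `\K`, which discards everything matched so far. The sketch below assumes the same markup as the samples in this thread; the function name `pe_only` is made up for illustration.

```php
<?php
// Hypothetical helper: with \K the label and tags are dropped from the
// reported match, so $match[0] holds only the cell value.
function pe_only($html)
{
    $pattern = '#P/E</td><td class="cl1">\K[^<]*#';
    return preg_match($pattern, $html, $match) === 1 ? $match[0] : null;
}

$samples = array(
    'P/E</td><td class="cl1">28.60</td>',
    'P/E</td><td class="cl1">NA</td>',
);

foreach ($samples as $html) {
    echo pe_only($html), "\n";
}
```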



Re: P/E

Posted: Thu Dec 14, 2006 4:11 pm
by volka
afbase wrote: The "P/E" is the critical identifier, not so much the "<td class='cl1'>". If I try either code, it just returns incorrect data: instead of the P/E ratio, it returns the previous day's close.
Then give us more data. I don't know "MSN Money pages"; do they have, e.g., a URL? Something to test on?

MSN Money

Posted: Thu Dec 14, 2006 5:12 pm
by afbase
The following two links are examples of the pages that I'm cURL'ing:
http://moneycentral.msn.com/detail/stoc ... Symbol=wgo
http://moneycentral.msn.com/detail/stoc ... ymbol=zoom

I am trying to capture the P/E ratio displayed on the far right of the main table of information. It will either display NA or a number and I've given you links to these two types of examples.

The specific piece that I'm looking for is on line 86, column 3918 of the source code.

This is my code so far:

Code: Select all

$url = "http://moneycentral.msn.com/detail/stock_quote?ipage=qd&Symbol=US%3A".urlencode($_GET['ticker']); // encode user input before putting it in the URL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
$result = curl_exec($ch);
curl_close($ch);
$pattern='#[P\][E]</td><td class="cl1">([\d+.]+|\w+)#i';
preg_match($pattern,$result,$match);
print_r($match);

Posted: Fri Dec 15, 2006 2:46 am
by volka

Code: Select all

$testdata = array(
		'http://moneycentral.msn.com/detail/stock_quote?Symbol=wgo',
		'http://moneycentral.msn.com/detail/stock_quote?Symbol=zoom'
	);
$pattern = '!<tr><td>P/E</td><td class="cl1">([^<]*)</td></tr>!';


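foreach($testdata as $url) {
	$subject = file_get_contents($url);
	// Skip pages that could not be fetched or that lack a P/E row.
	if ($subject === false || !preg_match($pattern, $subject, $matches)) {
		echo $url, " -> (no P/E found)<br />\n";
		continue;
	}
	$pe = $matches[1];
	echo $url, ' -> ', $pe, "<br />\n";
}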
Works fine for me.
Whether you use URL wrappers or cURL doesn't matter; they both return a string.
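
To illustrate that point, here is a rough sketch in which the extraction step only ever receives a string, so either fetcher can feed it. The function names (`fetch_with_wrapper`, `fetch_with_curl`, `pe_from_html`) and the sample row are assumptions made for the example; the fetchers are defined but not called, so the snippet runs offline.

```php
<?php
// Both fetchers return the page body as a string, or false on failure.
function fetch_with_wrapper($url)
{
    return file_get_contents($url); // needs allow_url_fopen enabled
}

function fetch_with_curl($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return $body;
}

// The extraction step only sees a string, so it does not care which fetcher ran.
function pe_from_html($html)
{
    $pattern = '!<tr><td>P/E</td><td class="cl1">([^<]*)</td></tr>!';
    if (!is_string($html) || preg_match($pattern, $html, $matches) !== 1) {
        return null; // fetch failed or the row is missing
    }
    return $matches[1];
}

// Offline check with a hand-written row modelled on the thread's sample markup.
$sample = '<table><tr><td>P/E</td><td class="cl1">6.20</td></tr></table>';
echo pe_from_html($sample), "\n";
```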

volka thanks!!

Posted: Fri Dec 15, 2006 3:54 am
by afbase
Thanks for your help!!!!! That pattern you gave me actually shed some light on how to write regex patterns better. I had to modify your pattern a little bit, though: I forgot that special stocks that have gone through bankruptcies or have pro forma earnings, like Lockheed Martin (LMT), have FYI/pseudo P/E ratios displayed on MSN. So here is the final script if you are curious:

Code: Select all

<?php
$url = "http://moneycentral.msn.com/detail/stock_quote?ipage=qd&Symbol=US%3A".urlencode($_GET['ticker']); // encode user input before putting it in the URL
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT, 3);
$result = curl_exec($ch);
curl_close($ch);
$pattern='!P/E</td><td class="cl1">([^<]*)</td></tr>!';
preg_match($pattern,$result,$match);
print_r($match);
?>

I'm going to stick with the cURL script and borrow your pattern. cURL retrieves pages faster for some reason (according to some posts on php.net); I'm not exactly sure why, though. Eventually I'm going to loop this script