Please can somebody help with this regex?
Posted: Thu Mar 29, 2012 6:07 am
by Noodleyman
Afternoon everybody,
I really suck with regex, and as luck would have it I need to use it today to solve a problem. I have to scrape an HTML page and extract some specific information.
I am looking for this line in the HTML.
<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + £5.40 shipping</span>
The span will remain static as will the class names.
I need to extract "5.40".
However, as this value will change from page to page it needs to be able to cope with the following scenarios.
I don't know how many digits will be in the price, for example:
£1.99
£11.99
£111.99
When the price is over £1000 the formatting changes to include a thousands separator:
1,297.95
As with the earlier examples, a price in the thousands can also vary in length:
1,297.95
11,297.95
111,297.95
Another possible value instead of a numeric value is the wording "FREE SHIPPING". I will need to change a returned value of "FREE SHIPPING" to be 0.00 later in the script.
I am not sure how to go about extracting the data per my examples/specification above. If anybody would be kind enough to help me out it would be greatly appreciated.
Thanks in advance,
Noodle
Re: Please can somebody help with this regex?
Posted: Thu Mar 29, 2012 11:58 am
by requinix
A regex isn't the best way to do this. Try something that parses HTML, like DOM's DOMDocument.
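A rough sketch of the DOMDocument route, not taken from the thread: the helper name shipping_text_from_html is made up for illustration, and the sample markup is the line from the original question with the outer span closed. The XML declaration prefix is a common workaround to make loadHTML() read the input as UTF-8, and libxml_use_internal_errors() keeps warnings about sloppy markup quiet.

```php
<?php
// Hypothetical helper: returns the text of the first span whose class is
// "plusShippingText", or null if the page has no such span.
function shipping_text_from_html($html) {
    // Collect libxml parse warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Prefix is an assumption-based workaround so "£" is decoded as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@class="plusShippingText"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$html = '<span id="pricePlusShippingQty"><b class="price">£29.99</b>'
      . '<span class="plusShippingText"> + £5.40 shipping</span></span>';
echo shipping_text_from_html($html), "\n";
```

This only isolates the span's text. The amount (or "FREE SHIPPING") still has to be picked out of that text afterwards.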
Re: Please can somebody help with this regex?
Posted: Thu Mar 29, 2012 2:07 pm
by tr0gd0rr
For something this simple I wouldn't be opposed to a regex, especially since you may need to go to the trouble of pre-processing the HTML (e.g. with HTML Tidy) before DOMDocument, since scraped sites often have malformed HTML. For example, many sites have end tags inside of JavaScript: browsers are tolerant of this but DOMDocument is rightfully intolerant.
Assuming the span is called plusShippingText, this would work:
/\bplusShippingText\b.+?(FREE SHIPPING|[\d,.]+)/i
But remember that the site's page could change at any time! That regex says: find "plusShippingText" at word boundaries, followed by as few characters as possible, followed by either "FREE SHIPPING" or a run of digits, commas, and decimal points. The "i" means case insensitive.
The first capture group will contain either "FREE SHIPPING" or a number. If it is a number, replace commas with empty strings.
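Here is a minimal sketch of how that regex and the comma clean-up could be put together. The function name parse_shipping and the null return for "no match" are illustrative choices, and the markup of the free-shipping sample is a guess, since the thread never shows what that page looks like.

```php
<?php
// Hypothetical helper (not from the thread): returns the shipping cost as a
// float, or null when the pattern does not match at all.
function parse_shipping($html) {
    $regex = '/\bplusShippingText\b.+?(FREE SHIPPING|[\d,.]+)/i';
    if (!preg_match($regex, $html, $m)) {
        return null;
    }
    // The i flag allows any letter case, so compare case-insensitively
    if (strcasecmp($m[1], 'FREE SHIPPING') === 0) {
        return 0.0;
    }
    // Strip thousands separators before converting: "2,005.40" -> "2005.40"
    return (float) str_replace(',', '', $m[1]);
}

$samples = array(
    '<span class="plusShippingText"> + £5.40 shipping</span>',
    '<span class="plusShippingText"> + £2,005.40 shipping</span>',
    // Assumed markup for the free-shipping case
    '<span class="plusShippingText"> FREE SHIPPING</span>',
);
foreach ($samples as $sample) {
    var_dump(parse_shipping($sample));
}
```

Returning null separately from 0.0 lets the caller tell "free shipping" apart from "the page layout changed and nothing matched".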
Re: Please can somebody help with this regex?
Posted: Thu Mar 29, 2012 2:50 pm
by ragax
Hi Noodleyman,
Here's a regex solution. No opinion on whether a regex is desirable; I just enjoy replying to regex questions.
I'm running this on an array of three strings to show you how it performs. Just add strings to the array to test more.
Input:
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + £5.40 shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + £2,005.40 shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + 10.10 shipping</span>'
Code:
<?php
// Match the amount that follows the plusShippingText span's opening tag.
// \K drops everything matched so far, so $m[0] holds only the number.
$regex = '~"plusShippingText">[^\d]+\K[\d,.]+\d~';
$strings = array(
    '<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + £5.40 shipping</span>',
    '<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + £2,005.40 shipping</span>',
    '<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText"> + 10.10 shipping</span>'
);
foreach ($strings as $string) {
    if (preg_match($regex, $string, $m)) {
        echo $m[0] . '<br />';
    }
}
?>
Output:
5.40
2,005.40
10.10
Let me know if this works for you or if you have any questions.

Re: Please can somebody help with this regex?
Posted: Fri Mar 30, 2012 2:15 am
by Noodleyman
Thank you for all the detailed replies. I really appreciate you taking the time. This is now resolved. Thank you for the great examples, and advice about alternative methods.
I actually ended up using a basic regex to get the content of the parent div (as it is not malformed), then used a combination of the str_replace, strstr and strpos functions to strip out exactly what I needed. Here is the function I put together. I had to modify it a little after realising I was getting varied results on the target pages. I will be enhancing this further today to allow it to use multi_curl to load all target pages (up to 50) at the same time to improve performance.
Code:
function get_shipping($link) {
    // Get the HTML; file_get_contents() returns false if the page cannot be loaded
    $html = file_get_contents($link);
    $first = array();
    if ($html !== false) {
        preg_match(
            '/<div[^>]*id=\"BBPricePlusShipID\">(.*?)<\\/div>/si',
            $html,
            $first
        );
    }
    // Filter out results where we get an unexpected HTML string and assume free delivery
    if (array_key_exists(0, $first)) {
        // Remove part of string we are not interested in
        $string = strstr($first[0], '<span class="plusShippingText">');
        // Check if we have the shipping span and a currency symbol
        $pos = ($string === false) ? false : strpos($string, "£");
        if ($pos === false) {
            // Free shipping
            $shipPrice = "0.00";
        } else {
            // Get the string value after the currency symbol; strlen() is used
            // because "£" is two bytes when this file is saved as UTF-8
            $str = substr($string, $pos + strlen("£"));
            // Find the position of the space after the numbers
            $end = strpos($str, " ");
            // Remove characters after the space, if there is one
            if ($end !== false) {
                $str = substr($str, 0, $end);
            }
            // Remove any commas from number formatting
            $shipPrice = str_replace(",", "", $str);
        }
    } else {
        $shipPrice = "0.00";
    }
    return (float)$shipPrice;
}