PHP Developers Network

A community of PHP developers offering assistance, advice, discussion, and friendship.
 

All times are UTC - 5 hours




PostPosted: Thu Mar 29, 2012 6:07 am 
Forum Newbie

Joined: Thu Jun 23, 2011 4:49 pm
Posts: 23
Afternoon everybody,

I really suck with regex, and as luck would have it I need to use it today to solve a problem. I have to scrape an HTML page and extract some specific information.

I am looking for this line in the HTML.

Quote:
<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;£5.40&nbsp;shipping</span>


The span will remain static as will the class names.

I need to extract "5.40".

However, as this value will change from page to page it needs to be able to cope with the following scenarios.

The number of digits in the price is not known in advance, for example:
£1.99
£11.99
£111.99

When the price is £1,000 or more, the formatting changes to include a thousands separator, again with a varying number of digits:
1,297.95
11,297.95
111,297.95

Another possible value instead of a numeric value is the wording "FREE SHIPPING". I will need to change a returned value of "FREE SHIPPING" to be 0.00 later in the script.

I am not sure how to go about extracting the data per my examples/specification above. If anybody would be kind enough to help me out it will be greatly appreciated :)

Thanks in advance,

Noodle


PostPosted: Thu Mar 29, 2012 11:58 am 
Spammer :|

Joined: Wed Oct 15, 2008 2:35 am
Posts: 6617
Location: WA, USA
A regex isn't the best way to do this. Try something that parses HTML like DOM's DOMDocument.
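
For anyone wanting to try the DOMDocument route, here is a minimal sketch. The function name shipping_via_dom is made up for illustration, and treating a span with no number in it as free shipping is an assumption, since the thread never shows what the free shipping markup looks like. Parse warnings from sloppy markup are collected silently rather than printed.

```php
<?php
// Hypothetical helper: returns the shipping cost as a float,
// or null when the plusShippingText span cannot be found.
function shipping_via_dom(string $html): ?float
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//span[@class="plusShippingText"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $text = $nodes->item(0)->textContent;

    // Digits, commas and dots are plain ASCII, so this match does not
    // depend on how the pound sign or &nbsp; were decoded.
    if (preg_match('/\d[\d,]*(?:\.\d+)?/', $text, $m)) {
        return (float) str_replace(',', '', $m[0]);
    }

    // Assumption: a span without any number means free shipping
    return 0.0;
}

$html = '<span id="pricePlusShippingQty"><b class="price">£29.99</b>'
      . '<span class="plusShippingText">&nbsp;+&nbsp;£5.40&nbsp;shipping</span>';
var_dump(shipping_via_dom($html));
```

The XPath expression matches the class attribute exactly, so it would need adjusting if the site ever adds a second class to that span.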


PostPosted: Thu Mar 29, 2012 2:07 pm 
Forum Contributor

Joined: Thu May 11, 2006 8:58 pm
Posts: 305
Location: Utah, USA
For something this simple I wouldn't be opposed to a regex, especially since you may need to pre-process the HTML (e.g. with HTML Tidy) before handing it to DOMDocument, because scraped sites often have malformed HTML. For example, many sites have end tags inside JavaScript: browsers are tolerant of this but DOMDocument is rightfully intolerant.

Assuming the span's class is plusShippingText, this would work:

/\bplusShippingText\b.+?(FREE SHIPPING|[\d,.]+)/i

But remember that the site's page could change at any time! That regex says: find "plusShippingText" at word boundaries, followed by as few characters as possible, followed by either "FREE SHIPPING" or a run of digits, commas, and decimal points. The "i" means case insensitive. Note that "." does not match newlines unless you add the "s" modifier, so this expects the class name and the price to be on the same line.

The first capture group will be either "FREE SHIPPING" or a number. If it is a number, replace commas with empty strings.
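
To make that concrete, here is a rough sketch of the regex in use. The function name shipping_via_regex and the free shipping sample markup in the usage lines are assumptions for illustration, since the thread only shows the priced version of the span.

```php
<?php
// Hypothetical helper: applies the regex above and normalises the result.
// Returns null when the pattern does not match at all.
function shipping_via_regex(string $html): ?float
{
    $regex = '/\bplusShippingText\b.+?(FREE SHIPPING|[\d,.]+)/i';
    if (!preg_match($regex, $html, $m)) {
        return null;
    }

    // The first capture group holds either the wording or the raw number
    if (strcasecmp($m[1], 'FREE SHIPPING') === 0) {
        return 0.0;
    }

    // Strip thousands separators before converting: "2,005.40" becomes 2005.4
    return (float) str_replace(',', '', $m[1]);
}

$priced = '<span class="plusShippingText">&nbsp;+&nbsp;£2,005.40&nbsp;shipping</span>';
$free   = '<span class="plusShippingText">FREE SHIPPING</span>'; // assumed markup
var_dump(shipping_via_regex($priced), shipping_via_regex($free));
```

Returning a float for both cases means the caller never has to special-case the "FREE SHIPPING" wording later in the script.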


PostPosted: Thu Mar 29, 2012 2:50 pm 
Forum Commoner

Joined: Thu Dec 15, 2011 2:40 pm
Posts: 85
Location: Nelson, NZ
Hi Noodleyman,

Here's a regex solution. I have no opinion on whether a regex is desirable; I just enjoy replying to regex questions ;)
I'm running it on an array of three strings to show how it performs. Just add strings to the array to test more.

Input:

'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;£5.40&nbsp;shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;£2,005.40&nbsp;shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;10.10&nbsp;shipping</span>'


Code:
<?php
$regex='~"plusShippingText">[^\d]+\K[\d,.]+\d~';
$strings=array(
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;£5.40&nbsp;shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;£2,005.40&nbsp;shipping</span>',
'<span id="pricePlusShippingQty"><b class="price">£29.99</b><span class="plusShippingText">&nbsp;+&nbsp;10.10&nbsp;shipping</span>'
);
foreach ($strings as $string) {
        if (preg_match($regex, $string, $m)) {
                echo $m[0].'<br />';
        }
}
?>
 


Output:

5.40
2,005.40
10.10


Let me know if this works for you or if you have any questions. :)
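
A short note on the \K in that pattern, as a sketch: \K tells PCRE to drop everything matched so far from the reported match, which is why $m[0] contains only the price. The second pattern below is an equivalent written with a capture group instead, and \D is simply shorthand for [^\d]. The variable names are only for illustration.

```php
<?php
$html = '<span class="plusShippingText">&nbsp;+&nbsp;£2,005.40&nbsp;shipping</span>';

// With \K: the text before \K must match but is not part of the result
preg_match('~"plusShippingText">\D+\K[\d,.]+\d~', $html, $withK);

// Without \K: the same text is picked out by a capture group
preg_match('~"plusShippingText">\D+([\d,.]+\d)~', $html, $withGroup);

// Both hold the raw price text, thousands separator included
var_dump($withK[0] === $withGroup[1]);

// Strip the comma before converting to a number
$amount = (float) str_replace(',', '', $withK[0]);
var_dump($amount);
```

Neither pattern handles the "FREE SHIPPING" wording, so a failed match would still need to be mapped to 0.00 separately.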


PostPosted: Fri Mar 30, 2012 2:15 am 
Forum Newbie

Joined: Thu Jun 23, 2011 4:49 pm
Posts: 23
Thank you for all the detailed replies :)

I really appreciate you taking the time. This is now resolved. Thank you for the great examples, and advice about alternate methods.

I actually ended up using a basic regex to get the content of the parent div (as it is not malformed). Then I used a combination of the str_replace, strstr, and strpos functions to strip out exactly what I needed. Here is the function I put together. I had to modify it a little after realising I was getting varied results on the target pages. I will be enhancing this further today to use curl_multi to load all target pages (up to 50) at the same time to improve performance.

Code:
function get_shipping($link) {
        // Get the HTML; a failed download is treated like a page with no price block
        $html = file_get_contents($link);
        if ($html === false) {
                return 0.0;
        }

        // Grab the parent div
        if (!preg_match('/<div[^>]*id="BBPricePlusShipID">(.*?)<\/div>/si', $html, $first)) {
                // Unexpected HTML string: assume free delivery
                return 0.0;
        }

        // Remove the part of the string we are not interested in
        $string = strstr($first[0], '<span class="plusShippingText">');

        // No shipping span or no currency symbol means free shipping
        $pos = ($string === false) ? false : strpos($string, "£");
        if ($pos === false) {
                return 0.0;
        }

        // Get the string value after the currency symbol.
        // strlen() is used because "£" is two bytes when the script is saved as UTF-8.
        $str = substr($string, $pos + strlen("£"));

        // Cut at the &nbsp; that follows the number, if there is one
        $end = strpos($str, "&nbsp;");
        if ($end !== false) {
                $str = substr($str, 0, $end);
        }

        // Remove any commas from number formatting
        return (float) str_replace(",", "", $str);
}
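
On the plan to load up to 50 pages at the same time: PHP's curl extension exposes that through the curl_multi_* functions. The following is only a sketch of how it might look. The function name fetch_pages, the 15 second timeout, and the choice to report an empty body as null are all assumptions, not something taken from the posts above.

```php
<?php
// Hypothetical helper: downloads all URLs concurrently.
// Returns an array with the same keys as $urls; each value is the page
// body, or null when nothing could be fetched for that URL.
function fetch_pages(array $urls, int $timeoutSeconds = 15): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $pages = [];
    foreach ($handles as $key => $ch) {
        $body = curl_multi_getcontent($ch);
        // Assumption: an empty body is reported as a failure
        $pages[$key] = (is_string($body) && $body !== '') ? $body : null;
        curl_multi_remove_handle($multi, $ch);
    }

    return $pages;
}
```

To use it with get_shipping(), the download step would have to be split out so that the parsing part accepts the HTML string directly rather than a link.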

