preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 5:44 am
by gu35t
Hi,
I have a string. It is a JavaScript variable:
[text]var var0 = [ "11296710","na","21,010,200,000","20101121","20100415","20100209","X","rozne dane","na","koty","12:40 am","00:00","2910416169","Nov. 21, 2010","4.2","kol","4.2700","4.2200","251698248","0.0320","d","4.2500","4.2600","4.3000","4.2600","3.1100","5.0700","Feb. 9, 2010","Apr. 15, 2010","na","na","2.56","-0.02","-172.03","0.00","0","0.744 %","4.2","4.3000","20101119","6712","20101119","4.2600","4.2700","4.2200","0.0000","dane firmowe","na","na" ];[/text]

PHP code:

<?php
#
$s = file_get_contents("./var.txt");
$s = trim($s);
$s = explode("\n",$s);

$patt = "/^var var([0-9]){1,2} = \[ "; // Begin of pattern
for($i=0; $i<=47; $i++){ $patt .= "\"(.*?)\","; } // Repeat pattern
$patt .= "\"(.*?)\" \]\;$/"; // The end of pattern 
echo $patt;

# 1 . not good
#
preg_match_All("/^var var([0-9]){1,2} = \[ (\"(.*?)\",){48}\"(.*?)\" \]\;$/ ", $s[0], $matches);
# 2 .  good
#
preg_match_all($patt, $s[0], $matches1);

echo "<pre>";
var_dump($s);
print_r($matches);
print_r($matches1);
echo "</pre>";
show_source(__FILE__);
#
?>
The first preg_match_all, with the pattern /^var var([0-9]){1,2} = \[ (\"(.*?)\",){48}\"(.*?)\" \]\;$/, returns TRUE but gives the wrong values:
[text]Array
(
[0] => Array
(
[0] => var var0 = [ ...... ];
)

[1] => Array
(
[0] => 0
)

[2] => Array
(
[0] => "na",
)

[3] => Array
(
[0] => na
)

[4] => Array
(
[0] => na
)

)[/text]

The second preg_match_all, with the pattern $patt, works fine: it returns TRUE and all the values between the " ".
What is wrong with the first pattern?

Thanks

Re: preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 9:11 am
by ridgerunner
When you place a capture group inside a repeating expression (i.e. '(\"(.*?)\",){48}'), the same capture group is re-used over and over again, and when the match is finally completed, the last value that was captured is the one that is retained. The first regex only has four capture groups, so it only captures four values.

The first capture group in your regex also suffers this problem if the number of vars exceeds nine (as-written, the capture group only gets the last digit). This expression should be written like so: 'var([0-9]{1,2})'.
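
To make both points concrete, here is a minimal sketch. It assumes a made-up three-element line (not your real data) and shortens the repeat count to match it:

```php
<?php
// Hypothetical three-element sample, much shorter than the real data.
$line = 'var var12 = [ "a","b","c" ];';

// Quantified groups: groups 1, 2 and 3 are overwritten on every repetition,
// so only the value from the last repetition survives.
preg_match('/^var var([0-9]){1,2} = \[ ("(.*?)",){2}"(.*?)" \];$/', $line, $m);
print_r($m);

// One group per element, and the quantifier moved inside the first group.
preg_match('/^var var([0-9]{1,2}) = \[ "(.*?)","(.*?)","(.*?)" \];$/', $line, $fixed);
print_r($fixed);
```

In the first result, group 1 holds only the last digit ("2") and group 3 holds only "b"; the value "a" is lost. The second result holds "12" followed by all three elements.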

Re: preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 9:59 am
by gu35t
[text]the same capture group is re-used over and over again and when the match is finally completed,[...] so it only captures four values.[/text]
OK, I understand.

Is there any way to write this expression in a simpler (shorter) way than with the PHP $patt variable?
I have no idea what a simpler expression would look like, if one is possible.
Thanks

Re: preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 11:29 am
by ridgerunner
If you wish to capture all the array elements in one single operation, then your current regex is pretty good.
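
The regex itself cannot get much shorter, but the PHP that builds it can. A minimal sketch, assuming the element count is known in advance and using a made-up four-element line (str_repeat() simply replaces your for loop):

```php
<?php
// Short sample with 4 elements; the real line has 49 (48 repeats plus the last one).
$line  = 'var var0 = [ "11296710","na","21,010,200,000","20101121" ];';
$count = 4; // assumption: the number of elements is known in advance

$patt = '/^var var([0-9]{1,2}) = \[ '
      . str_repeat('"(.*?)",', $count - 1) // one capture group per element
      . '"(.*?)" \];$/';

preg_match($patt, $line, $m);
print_r(array_slice($m, 2)); // the captured element values
```

The generated pattern is the same as the one your loop produces, so it has the same limitations.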

Re: preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 12:17 pm
by ridgerunner
One other point is that your regex does not allow for valid strings that contain escaped double quotes. For example:
var var0 = ["He said \"WOW!\"."];

Here is your script with a better regex:

<?php
$s = file_get_contents("./var.txt");
$s = trim($s);
$s = explode("\n",$s);

// Old code
$patt = "/^var var([0-9]){1,2} = \[ ";            // Begin of pattern
for($i=0; $i<=47; $i++){ $patt .= "\"(.*?)\","; } // Repeat pattern
$patt .= "\"(.*?)\" \]\;$/";                      // The end of pattern

// New code
$patt = '/^var\s+var(\d+)\s*=\s*\[\s*';                   // Begin pattern
for($i=0; $i<=47; $i++) {
  $patt .= '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"\s*,\s*';     // Repeat pattern
}
$patt   .= '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"\s*\]\s*;$/'; // The end of pattern
echo $patt;

preg_match_all($patt, $s[0], $matches, PREG_SET_ORDER);

echo "<pre>";
var_dump($s);
print_r($matches);
echo "</pre>";
show_source(__FILE__);
?>
Changes:
  • Added \s* to permit variable whitespace wherever it is allowed.
  • Used 'single' quotes for the regex strings instead of "double" quotes.
  • Changed "(.*?)" to "([^"\\]*(?:\\.[^"\\]*)*)" for efficiency and to allow for escaped chars.
  • Added the PREG_SET_ORDER flag to the preg_match_all() call so that $matches[0] holds the full match and all captured elements of one record.
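
To see the new string sub-pattern on its own, here is a minimal sketch run against a made-up two-element line. The sample data and the $str variable are my assumptions, not part of your script:

```php
<?php
// Regex as the engine sees it: "([^"\\]*(?:\\.[^"\\]*)*)"
$str = '"([^"\\\\]*(?:\\\\.[^"\\\\]*)*)"';

// Hypothetical two-element line; the first element contains escaped quotes.
$line = 'var var0 = ["He said \"WOW!\".", "na"];';

$patt = '/^var\s+var(\d+)\s*=\s*\[\s*' . $str . '\s*,\s*' . $str . '\s*\]\s*;$/';
preg_match_all($patt, $line, $matches, PREG_SET_ORDER);

print_r($matches[0]); // full match, var number, then both elements
```

The first element is captured whole, with its backslashes still in place, so you would need to unescape it yourself if you want the plain text.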
Hope this helps! :)

Re: preg_match_all returns TRUE but bad values

Posted: Mon Nov 22, 2010 12:30 pm
by gu35t
Damn, what a pattern! :D

I was asking about a simpler way because the next string I am going to preg_match looks like this:
[text]var type = [["xxx","Sex Sex Sex","US","/sex/p0rn.html","MY"],["blah1","blah blah","DE","/blah/A0rn.html","US"], ........ ]] ; [/text]
I will try to figure it out [-;

Thanks for the helpful advice, ridgerunner!

Re: preg_match_all returns TRUE but bad values

Posted: Tue Nov 23, 2010 9:12 am
by ridgerunner
When asking for help, it is best to provide a complete, representative example of the sample data you are working with, including all of the variations that may be encountered. For example: Does each array member always have the same number of elements? Do some records have variations in whitespace? Are the strings always double-quoted, or are some single-quoted as well? The sample data needs to be truly representative if you want a really good solution. If the real data has many records, please provide more than one record in your example data. In addition to representative sample data, you need to describe the final form of data you wish to extract. Arrays? Strings? Providing detailed descriptions of your input and your output helps us provide you with the help you need.

That said, it looks like your problem would be best solved using several nested loops, each with a simple regex; outer loop matches a full record, mid level loop matches outer array members which are themselves arrays, and finally an inner loop which extracts the string members of the inner arrays.
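
A rough sketch of that idea follows. The input is made up to have the same shape as your sample, and the sketch assumes that the strings contain neither "]" nor escaped quotes:

```php
<?php
// Hypothetical input with the same shape as the posted sample.
$text = 'var type = [["AA","first item","US","/a/one.html","MY"],'
      . '["BB","second item","DE","/b/two.html","US"]] ;';

$result = [];

// Outer loop: one full record, i.e. one "var name = [ ... ] ;" per line.
preg_match_all('/^var\s+(\w+)\s*=\s*\[(.*)\]\s*;\s*$/m', $text, $records, PREG_SET_ORDER);
foreach ($records as $record) {
    // Middle loop: each inner array [ ... ] (assumes no "]" inside the strings).
    preg_match_all('/\[([^\]]*)\]/', $record[2], $rows);
    foreach ($rows[1] as $row) {
        // Inner loop: each double-quoted string (assumes no escaped quotes).
        preg_match_all('/"([^"]*)"/', $row, $fields);
        $result[$record[1]][] = $fields[1];
    }
}

print_r($result);
```

Each regex stays small, and the number of inner arrays or of strings per array no longer has to be known in advance.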

Re: preg_match_all returns TRUE but bad values

Posted: Tue Nov 23, 2010 10:23 am
by McInfo
In reply to the first post:

Is the goal to write a regular expression or to turn a JavaScript array string into a PHP array? If it's the latter, you might be interested in json_decode(). See also, substr(), strpos(), and strrpos().
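
A minimal sketch of that approach, assuming a shortened, made-up version of the line from the first post. It only works when the array literal happens to be valid JSON (double-quoted strings, no trailing comma):

```php
<?php
// Shortened, hypothetical version of the line from the first post.
$line = 'var var0 = [ "11296710","na","21,010,200,000","Nov. 21, 2010" ];';

// Cut out everything from the first "[" to the last "]".
$start = strpos($line, '[');
$end   = strrpos($line, ']');
$json  = substr($line, $start, $end - $start + 1);

// json_decode() returns NULL if the text is not valid JSON.
$values = json_decode($json, true);
print_r($values);
```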

Re: preg_match_all returns TRUE but bad values

Posted: Tue Nov 23, 2010 11:27 am
by gu35t
If it's the latter, you might be interested in json_decode(). See also, substr(), strpos(), and strrpos()
I know these functions. I also know that there are easier ways to parse these strings, but my goal is to understand regexp.

[text]var colorstype = [["RGB","Red/Green/Black","MY","/rgb.php","DATA"],["BLUE","BLUE COLOR","MY","/colors/xxx/blue.html","ANOTHER"],["RED","Red Color","MY","/colors/xxx/red.html","ANOTHER"]] ;[/text]
The inner arrays inside the main brackets can repeat 1,000+ times.

[text]["RGB","Red Green Black","MY","/colors/xxx/rgb.html","DATA"],[/text]
[text]
First field is always an uppercase string, sometimes with a dot - ([A-Z\.]{1,5}) -> "RGB"
Second field is a string - [a-zA-Z0-9\./] without \" -> "Red/Green Black"
Third field is an uppercase string ([A-Z]{1,3}) -> "MY"
Fourth field is a string [a-z/\.] -> "/rgb.php"
Fifth field is an uppercase string ([A-Z]{1,5}) -> "DATA"
The end of the string: ]\s*;$
[/text]
you need to describe the final form of data you wish to extract.
[text]
array ( 0 => [0]RGB,
[1]Red/Green Black,
[2]MY,
[3]/rgb.php,
[4]DATA
1=> [0]RED,
[1]Red Color,
[2]MY,
[3]/colors/xxx/red.html,
[4]Another
);
[/text]

greetings