Tricky Extraction of Attributes, Valid and Otherwise

Ollie Saunders
DevNet Master
Posts: 3179
Joined: Tue May 24, 2005 6:01 pm
Location: UK

Post by Ollie Saunders »

I need to extract data from a string such as this:

Code: Select all

{ tak foo="1" gaz foo="2" # bar="zim" gir='dib'
Into a format like this:

Listing 1:

Code: Select all

$result = array(
    '{'   => '',
    'tak' => '',
    'foo' => array(1, 2), // there are two foo attributes ^^
    'gaz' => '',
    '#'   => '',
    'bar' => 'zim',
    'gir' => 'dib'
);
Or it may be easier to extract it like this:

Listing 2:

Code: Select all

$valid = array(
    'foo' => array(1, 2),
    'bar' => 'zim',
    'gir' => 'dib'
);
$invalid = array('{', 'tak', 'gaz', '#');
This is a regex I have that will do the $valid part of listing 2:

Code: Select all

~  (?P<name>[a-z][a-z0-9]*?)=(?:  "(?P<doubleValue>[^"]*?)"   |   '(?P<singleValue>[^']*?)'   )  ~ix
And this was my attempt at doing the invalid part but it seems to capture the valid stuff again too:

Code: Select all

~([\S]+)(?!=")~
(?!...) is supposed to be a negative lookahead.
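
One reading of why the attempt above also returns the valid attributes: [\S]+ is greedy and happily consumes "=" and the quotes, so foo="1" is swallowed as a single token, and the lookahead is only tested at the whitespace after it, where it trivially succeeds. A minimal sketch of one way around that, assuming two passes are acceptable: blank out the valid attributes first, then split what remains on whitespace. The variable names are illustrative and not taken from the original code.

```php
<?php
$input = '{ tak foo="1" gaz foo="2" # bar="zim" gir=\'dib\'';

// Same shape as the valid-attribute pattern in the post, without named groups.
$validPattern = '~[a-z][a-z0-9]*=(?:"[^"]*"|\'[^\']*\')~i';

// Replace each valid attribute with a space so its neighbours stay separated.
$remainder = preg_replace($validPattern, ' ', $input);

// Whatever is left, split on whitespace, is the invalid part.
$invalid = preg_split('~\s+~', $remainder, -1, PREG_SPLIT_NO_EMPTY);

print_r($invalid); // array('{', 'tak', 'gaz', '#')
```

Replacing with a space rather than an empty string is a deliberate choice here: it keeps tokens apart even when a valid attribute sits directly against its neighbour.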

Here's my complete code so far:

Code: Select all

/**
 * Parse a tag for its attribute names and values
 * Can handle multiple attributes with same name, stacks up the values
 *
 * @param string $toMatch
 * @return array keys as names values as values
 */
private function _attribute($toMatch)
{
    $match = array();
    $pattern = <<< PCRE
~  (?P<name>[a-z][a-z0-9]*?)=(?:  "(?P<doubleValue>[^"]*?)"   |   '(?P<singleValue>[^']*?)'   )  ~ix
PCRE; // fixes devnet's bad heredoc highlighting  >> "
    preg_match_all($pattern, $toMatch, $match);
    $attributes = array();
    // This loop builds: array('name' => 'value') structure
    for ($i = 0, $j = count($match['name']); $i < $j; ++$i) {
        $name = $match['name'][$i];
        if ($match['doubleValue'][$i] !== '') { // which quote type; empty() would treat "0" as missing
            $value = trim($match['doubleValue'][$i]);
        } else {
            $value = trim($match['singleValue'][$i]);
        }
        if (isset($attributes[$name])) { // duplicate attribute
            if (is_array($attributes[$name])) { // array the values up
                $attributes[$name][] = htmlspecialchars($value, ENT_QUOTES);
            } else {
                $attributes[$name] = array($attributes[$name], htmlspecialchars($value, ENT_QUOTES));
            }
        } else {
            $attributes[$name] = htmlspecialchars($value, ENT_QUOTES);
        }
    }

    // Invalid attributes, (?!) is negative look ahead
    $pattern = '~([\S]+)(?!=")~';
    $match = array();
    preg_match_all($pattern, $toMatch, $match);
    echo '<pre>';
    print_r($match); // in process of debugging
    echo '</pre>';
    /**
     * @todo add invalid matches to $attributes
     */

    return $attributes;
}
If someone could explain how this works, in particular how the condition is evaluated:

Code: Select all

(?(condition)yes-regex|no-regex)
That may help.
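
On the conditional syntax asked about above: in PCRE the condition is most commonly a group number (or name), and it evaluates to true if that group took part in the match so far. It can also be a lookahead or lookbehind assertion. A small sketch, using a made-up pattern that requires a closing quote only when an opening quote was captured:

```php
<?php
// Group 1 optionally captures an opening double quote.
// (?(1)") means: if group 1 took part in the match, require a closing quote;
// otherwise the (omitted) "no" branch matches nothing.
$pattern = '~^(")?[a-z]+(?(1)")$~';

$results = array();
foreach (array('"zim"', 'zim', '"zim', 'zim"') as $candidate) {
    $results[$candidate] = preg_match($pattern, $candidate);
}

print_r($results); // '"zim"' => 1, 'zim' => 1, '"zim' => 0, 'zim"' => 0
```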
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada

Post by Kieran Huggins »

This sounds like a job for array_merge_recursive()!

You'll still need to parse (split?) the string into one or more arrays (which I feel confident you can), but then you can achieve the form you have in "Listing 1" using array_merge_recursive().

Does that make sense?
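
To illustrate the suggestion, here is a minimal sketch of how array_merge_recursive() stacks values that share a string key. It assumes the string has already been parsed into one small array per attribute; the $pairs data below is hypothetical.

```php
<?php
// Hypothetical pre-parsed pairs, one small array per attribute occurrence.
$pairs = array(
    array('foo' => '1'),
    array('foo' => '2'),
    array('bar' => 'zim'),
);

$merged = array();
foreach ($pairs as $pair) {
    // Values under the same string key are collected into an array.
    $merged = array_merge_recursive($merged, $pair);
}

print_r($merged); // 'foo' => array('1', '2'), 'bar' => 'zim'
```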

Post by Ollie Saunders »

Sorry I should adjust the input string a bit:

Code: Select all

{ tak foo="1" gaz foo="2" # bar="zim"gir='dib'
Valid attributes may not necessarily be separated by whitespace. So I put it to you: what do I split by? :D

Post by Kieran Huggins »

maybe preg_replace_callback() ?

Post by Ollie Saunders »

errr....lol, kieran are you just going to name all the preg_* functions until we hit the right one?

Post by Kieran Huggins »

You've uncovered my dirty little secret! No worries... preg_kieran() has a solution for you :wink:

Code: Select all

$str = '{ tak foo="1" gaz foo="2" # bar="zim"'."gir='dib'";

$array = array();

function callback($matches){
	global $array;
	if(isset($matches[4])){ // there is an attribute value
		$array = array_merge_recursive($array,array($matches[2]=>$matches[4]));
	}else{ // there is NO attribute value
		$array = array_merge_recursive($array,array($matches[2]));
	}
}

preg_replace_callback('#((\w+)(=[\'"](\w+)[\'"])?)#','callback',$str);

print_r($array);
Needs to be cleaned up, but it works!
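
One possible cleanup of the snippet above, sketched under the assumption that PHP 5.3+ closures are available: the closure captures the result array by reference, which removes the global. The pattern is kept essentially as posted, so \w+ still skips "{" and "#" and only accepts word characters as values. $collected is an illustrative name.

```php
<?php
$str = '{ tak foo="1" gaz foo="2" # bar="zim"' . "gir='dib'";

$collected = array();

// The closure captures $collected by reference, so no global is needed.
preg_replace_callback(
    '#(\w+)(?:=[\'"](\w+)[\'"])?#',
    function ($matches) use (&$collected) {
        if (isset($matches[2])) { // name="value" or name='value'
            $collected = array_merge_recursive($collected, array($matches[1] => $matches[2]));
        } else { // bare word without a value
            $collected[] = $matches[1];
        }
        return $matches[0]; // leave the subject text unchanged
    },
    $str
);

print_r($collected);
```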

Post by Ollie Saunders »

OK I'm going to look into this.....if it works I'm seriously loving you Kieran and there won't be any escape! mahaha I'm not gay :P

Post by Kieran Huggins »

ole wrote:if it works I'm seriously loving you Kieran and there won't be any escape! mahaha
8O ...realizes that there's an ocean between us... 8)

No problemo. Was kind of fun actually!

Post by Ollie Saunders »

Bah ocean smocean!

It never occurred to me that preg_replace_callback() could be used for its callback goodness without making a replacement. So I apologise for that remark about all the preg_* functions; I wasn't thinking outside of the box. For the purpose of visualization here is a box with me in it:

Code: Select all

+-----+
|\ o /|
|  |  |
| / \ |
|/   \|
+o-l-e+
Anyway here is the finished code. Believe it or not all of this did actually come about through modification of your code:

Code: Select all

/**
 * Temporary variable used in preg_replace_callback()
 *
 * @var array
 */
private $_callbackData;
/**
 * These relate to the capture group indexes of the $pattern in _attribute();
 * unfortunately (?P<name>) syntax doesn't work for preg_replace_callback()
 */
const ATTRIB_ATTRIBUTE     = 1;
const ATTRIB_QUOTED_SINGLE = 2; // the pattern lists the single-quoted branch first
const ATTRIB_QUOTED_DOUBLE = 3;
/**
 * Add attribute matches from regex into array structure
 *
 * @param array $matches
 * @return void
 */
private function _attributesCallback($matches)
{
    $attribute = $matches[self::ATTRIB_ATTRIBUTE];
    if (count($matches) <= 2) { // No value captured
        $this->_callbackData[] = $attribute;
        return;
    }
    // Values captured, but with what type of quote?
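    $double = isset($matches[self::ATTRIB_QUOTED_DOUBLE]) ? $matches[self::ATTRIB_QUOTED_DOUBLE] : '';
    $single = isset($matches[self::ATTRIB_QUOTED_SINGLE]) ? $matches[self::ATTRIB_QUOTED_SINGLE] : '';
    $value  = ($double !== '') ? $double : $single; // empty() would wrongly treat "0" as no value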
    // Attribute name becomes key and values are arrayed up
    if (isset($this->_callbackData[$attribute])) {
        $this->_callbackData[$attribute][] = $value;
    } else {
        $this->_callbackData[$attribute] = array($value);
    }
}
/**
 * Parse a tag for its attribute names and values. Can handle multiple
 * attributes with same name, stacks up the values and also non-attribute
 * data.
 *
 * @param string $toMatch
 * @return array keys as names values as values
 */
private function _attribute($toMatch)
{
    $this->_callbackData = array();
    $pattern = '~  ([^\s=]+)  (?:=(?:  \'([^\']*?)\'  |  "([^"]*?)"  ))?  ~x';
    preg_replace_callback($pattern, array($this, '_attributesCallback'), $toMatch);
    return $this->_callbackData;
}
I've just unit tested this and it's working fine... Thanks again!
I could send you a can of spam in the post if you like. Would you?
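
For comparison, a sketch of the same extraction without a callback: preg_match_all() with PREG_SET_ORDER hands back one array per match, and named groups are usable there (the earlier draft in this thread already relied on them with preg_match_all()). The function name parseAttributes and the group names are hypothetical, and the output follows the shape of Listing 1, with values kept as strings.

```php
<?php
/**
 * Hypothetical standalone helper: bare tokens get '' and repeated
 * names are stacked into arrays, as in Listing 1.
 */
function parseAttributes($toMatch)
{
    $pattern = '~(?P<name>[^\s=]+)(?:=(?:\'(?P<single>[^\']*)\'|"(?P<double>[^"]*)"))?~';
    preg_match_all($pattern, $toMatch, $sets, PREG_SET_ORDER);

    $result = array();
    foreach ($sets as $set) {
        // Groups that did not take part may be missing or '', so default both.
        $single = isset($set['single']) ? $set['single'] : '';
        $double = isset($set['double']) ? $set['double'] : '';
        $value  = ($double !== '') ? $double : $single;

        $name = $set['name'];
        if (!array_key_exists($name, $result)) {
            $result[$name] = $value;                        // first occurrence
        } elseif (is_array($result[$name])) {
            $result[$name][] = $value;                      // third or later occurrence
        } else {
            $result[$name] = array($result[$name], $value); // second occurrence
        }
    }
    return $result;
}

print_r(parseAttributes('{ tak foo="1" gaz foo="2" # bar="zim"gir=\'dib\''));
```

Unlike the finished class method above, this variant stores a lone value as a plain string and only switches to an array on the second occurrence, which is a design choice made to mirror Listing 1.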

Post by Kieran Huggins »

8O

I had to read that code a few times before I saw the relation ;)

Save the postage - the next time I visit London you can take me out for some spam, spam, spam, eggs & spam.