Tricky Extraction of Attributes, Valid and Otherwise

Posted: Thu Jan 11, 2007 11:47 am
by Ollie Saunders
I need to extract data from a string such as this:

Code: Select all

{ tak foo="1" gaz foo="2" # bar="zim" gir='dib'
Into a format like this:

Listing 1:

Code: Select all

$result = array(
    '{'   => '',
    'tak' => '',
    'foo' => array(1, 2), // there are two foo attributes ^^
    'gaz' => '',
    '#'   => '',
    'bar' => 'zim',
    'gir' => 'dib'
);
Or it may be easier to extract it like this:

Listing 2:

Code: Select all

$valid = array(
    'foo' => array(1, 2),
    'bar' => 'zim',
    'gir' => 'dib'
);
$invalid = array('{', 'tak', 'gaz', '#');
This is a regex I have that will do the $valid part of listing 2:

Code: Select all

~  (?P<name>[a-z][a-z0-9]*?)=(?:  "(?P<doubleValue>[^"]*?)"   |   '(?P<singleValue>[^']*?)'   )  ~ix
And this was my attempt at doing the invalid part but it seems to capture the valid stuff again too:

Code: Select all

~([\S]+)(?!=")~
(?!) is supposed to be a negative look ahead.
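
A note on why this attempt picks up the valid attributes as well: \S+ is greedy, and both = and " are non-whitespace, so it swallows foo="1" in one go. The lookahead is only tested after the closing quote, where it trivially succeeds. The sketch below contrasts that with one possible alternative (an assumption, not something from the thread): forbid = inside the token and require whitespace or a string boundary on both sides. It assumes bare tokens are always whitespace-delimited and that quoted values contain no whitespace.

```php
<?php
$input = '{ tak foo="1" gaz foo="2" # bar="zim" gir=\'dib\'';

// Original attempt: \S+ consumes name, "=" and the quoted value together,
// so the negative lookahead is checked too late to reject anything.
preg_match_all('~(\S+)(?!=")~', $input, $greedy);

// Sketch of an alternative: a run without whitespace or "=" that is
// neither preceded nor followed by a non-whitespace character.
preg_match_all('~(?<!\S)([^\s=]+)(?!\S)~', $input, $bare);

print_r($greedy[1]); // includes foo="1", foo="2", bar="zim" and gir='dib'
print_r($bare[1]);   // {, tak, gaz, #
```

With the later input where gir='dib' follows bar="zim" without a space, the second pattern still skips both attributes, but it would also miss a bare token glued directly onto a closing quote.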

Here's my complete code so far:

Code: Select all

/**
 * Parse a tag for its attribute names and values
 * Can handle multiple attributes with same name, stacks up the values
 *
 * @param string $toMatch
 * @return array keys as names values as values
 */
private function _attribute($toMatch)
{
    $match = array();
    $pattern = <<< PCRE
~  (?P<name>[a-z][a-z0-9]*?)=(?:  "(?P<doubleValue>[^"]*?)"   |   '(?P<singleValue>[^']*?)'   )  ~ix
PCRE; // fixes devnet's bad heredoc highlighting  >> "
    preg_match_all($pattern, $toMatch, $match);
    $attributes = array();
    // This loop builds: array('name' => 'value') structure
    for ($i = 0, $j = count($match['name']); $i < $j; ++$i) {
        $name = $match['name'][$i];
        if ($match['doubleValue'][$i] === '') { // which quote type; empty() would misread "0"
            $value = trim($match['singleValue'][$i]);
        } else {
            $value = trim($match['doubleValue'][$i]);
        }
        if (isset($attributes[$name])) { // duplicate attribute
            if (is_array($attributes[$name])) { // array the values up
                $attributes[$name][] = htmlspecialchars($value, ENT_QUOTES);
            } else {
                $attributes[$name] = array($attributes[$name], htmlspecialchars($value, ENT_QUOTES));
            }
        } else {
            $attributes[$name] = htmlspecialchars($value, ENT_QUOTES);
        }
    }

    // Invalid attributes, (?!) is negative look ahead
    $pattern = '~([\S]+)(?!=")~';
    $match = array();
    preg_match_all($pattern, $toMatch, $match);
    echo '<pre>';
    print_r($match); // in process of debugging
    echo '</pre>';
    /**
     * @todo add invalid matches to $attributes
     */

    return $attributes;
}
If someone could explain how this works, in particular how the condition is evaluated:

Code: Select all

(?(condition)yes-regex|no-regex)
That may help
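
For reference, a minimal sketch of how PCRE evaluates that construct when the condition is a group number: (?(2)yes|no) uses the yes-branch if capture group 2 has taken part in the match so far, and the no-branch otherwise. The pattern and the input below are illustrative assumptions, not code from the thread. Group 2 is set only when the opening quote is a double quote, and the conditional then demands the matching closing quote.

```php
<?php
$input = <<<'TXT'
a="x'y" b='p"q' c="oops'
TXT;

// Group 2 captures the opening double quote, if there is one.
// (?(2)"|') then requires a closing " when group 2 matched, else a closing '.
$pattern = '~(\w+)=(?:(")|\')(.*?)(?(2)"|\')~';

preg_match_all($pattern, $input, $sets, PREG_SET_ORDER);

foreach ($sets as $set) {
    echo $set[1], ' => ', $set[3], "\n";
}
// a => x'y
// b => p"q
// c is skipped: it opens with " but never closes with "
```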

Posted: Thu Jan 11, 2007 11:58 am
by Kieran Huggins
This sounds like a job for array_merge_recursive() !

You'll still need to parse (split?) the string into one or more arrays (which I feel confident you can), but then you can achieve the form you have in "Listing 1" using array_merge_recursive()

Does that make sense?
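
A minimal sketch of the behaviour being suggested, assuming the string has already been parsed into name => value pairs (the $pairs array is made up for illustration): array_merge_recursive() stacks values that share a string key into an array and leaves unique keys as scalars, which is the shape of Listing 1.

```php
<?php
// Hypothetical pre-parsed pairs; the parsing step is not shown here.
$pairs = array(
    array('foo' => '1'),
    array('foo' => '2'),
    array('bar' => 'zim'),
);

$merged = array();
foreach ($pairs as $pair) {
    // Values under the same string key are collected into one array.
    $merged = array_merge_recursive($merged, $pair);
}

print_r($merged); // foo => array('1', '2'), bar => 'zim'
```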

Posted: Thu Jan 11, 2007 12:07 pm
by Ollie Saunders
Sorry I should adjust the input string a bit:

Code: Select all

{ tak foo="1" gaz foo="2" # bar="zim"gir='dib'
Valid attributes may not necessarily be separated by whitespace. So I put it to you, what do I split by? :D

Posted: Thu Jan 11, 2007 12:15 pm
by Kieran Huggins
Maybe preg_replace_callback()?

Posted: Thu Jan 11, 2007 12:22 pm
by Ollie Saunders
errr....lol, kieran are you just going to name all the preg_* functions until we hit the right one?

Posted: Thu Jan 11, 2007 12:51 pm
by Kieran Huggins
You've uncovered my dirty little secret! No worries... preg_kieran() has a solution for you :wink:

Code: Select all

$str = '{ tak foo="1" gaz foo="2" # bar="zim"'."gir='dib'";

$array = array();

function callback($matches){
	global $array;
	if(isset($matches[4])){ // there is an attribute value
		$array = array_merge_recursive($array,array($matches[2]=>$matches[4]));
	}else{ // there is NO attribute value
		$array = array_merge_recursive($array,array($matches[2]));
	}
}

preg_replace_callback('#((\w+)(=[\'"](\w+)[\'"])?)#','callback',$str);

print_r($array);
Needs to be cleaned up, but it works!

Posted: Thu Jan 11, 2007 1:06 pm
by Ollie Saunders
OK, I'm going to look into this... if it works I'm seriously loving you Kieran and there won't be any escape! mahaha I'm not gay :P

Posted: Thu Jan 11, 2007 1:29 pm
by Kieran Huggins
ole wrote: if it works I'm seriously loving you Kieran and there won't be any escape! mahaha
8O ...realizes that there's an ocean between us... 8)

No problemo. Was kind of fun actually!

Posted: Thu Jan 11, 2007 3:58 pm
by Ollie Saunders
Bah ocean smocean!

It never occurred to me that preg_replace_callback() could be used for its callback goodness without making a replacement. So I apologise for that remark about all the preg_* functions; I wasn't thinking outside of the box. For the purpose of visualization here is a box with me in it:

Code: Select all

+-----+
|\ o /|
|  |  |
| / \ |
|/   \|
+o-l-e+
Anyway here is the finished code. Believe it or not all of this did actually come about through modification of your code:

Code: Select all

/**
 * Temporary variable used in preg_replace_callback()
 *
 * @var array
 */
private $_callbackData;
/**
 * These relate to the capture-group indexes of the $pattern in _attribute();
 * unfortunately (?P<name>) syntax doesn't work for preg_replace_callback().
 * In $pattern the single-quoted value is group 2 and the double-quoted
 * value is group 3.
 */
const ATTRIB_ATTRIBUTE     = 1;
const ATTRIB_QUOTED_SINGLE = 2;
const ATTRIB_QUOTED_DOUBLE = 3;
/**
 * Add attribute matches from regex into array structure
 *
 * @param array $matches
 * @return void
 */
private function _attributesCallback($matches)
{
    $attribute = $matches[self::ATTRIB_ATTRIBUTE];
    if (count($matches) <= 2) { // No value captured
        $this->_callbackData[] = $attribute;
        return;
    }
    // Values captured, but with what type of quote? Trailing unmatched groups
    // are dropped from $matches, so group 3 is only set for double quotes.
    // isset() (rather than empty()) keeps values such as "" and "0" intact.
    if (isset($matches[self::ATTRIB_QUOTED_DOUBLE])) {
        $value = $matches[self::ATTRIB_QUOTED_DOUBLE];
    } else {
        $value = $matches[self::ATTRIB_QUOTED_SINGLE];
    }
    // Attribute name becomes key and values are arrayed up
    if (isset($this->_callbackData[$attribute])) {
        $this->_callbackData[$attribute][] = $value;
    } else {
        $this->_callbackData[$attribute] = array($value);
    }
}
/**
 * Parse a tag for its attribute names and values. Can handle multiple
 * attributes with same name, stacks up the values and also non-attribute
 * data.
 *
 * @param string $toMatch
 * @return array keys as names values as values
 */
private function _attribute($toMatch)
{
    $this->_callbackData = array();
    $pattern = '~  ([^\s=]+)  (?:=(?:  \'([^\']*?)\'  |  "([^"]*?)"  ))?  ~x';
    preg_replace_callback($pattern, array($this, '_attributesCallback'), $toMatch);
    return $this->_callbackData;
}
I've just unit tested this and it's working fine... Thanks again!
I could send you a can of spam in the post if you like. Would you?
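
Since the callback in the finished code is used only for its side effects, the same walk can also be written with preg_match_all() and PREG_SET_ORDER. The following is a minimal sketch under stated assumptions, not the code from the thread: splitAttributes() is a hypothetical name, it returns the Listing 2 shape ($valid and $invalid), and it relies on PCRE's branch reset group (?|...), which as far as I know requires PCRE 7.2 or later.

```php
<?php
/**
 * Hypothetical helper, not from the thread: split a tag body into
 * valued attributes and bare tokens in a single preg_match_all() pass.
 *
 * @param string $toMatch
 * @return array array($valid, $invalid)
 */
function splitAttributes($toMatch)
{
    // (?| ... ) is a branch reset group: both quote styles capture into group 2.
    $pattern = '~ ([^\s=]+) (?: = (?| \'([^\']*)\' | "([^"]*)" ) )? ~x';
    preg_match_all($pattern, $toMatch, $sets, PREG_SET_ORDER);

    $valid   = array();
    $invalid = array();
    foreach ($sets as $set) {
        if ($set[0] === $set[1]) {
            // The whole match is just the name, so no ="..." part followed.
            $invalid[] = $set[1];
            continue;
        }
        // Stack values per name, as the class above does.
        $valid[$set[1]][] = isset($set[2]) ? $set[2] : '';
    }
    return array($valid, $invalid);
}

list($valid, $invalid) = splitAttributes('{ tak foo="1" gaz foo="2" # bar="zim"gir=\'dib\'');
print_r($valid);   // foo => array('1', '2'), bar => array('zim'), gir => array('dib')
print_r($invalid); // {, tak, gaz, #
```

Unlike Listing 2, every valid value ends up in an array, even when a name occurs only once. That matches what the class above does and saves the caller an is_array() check.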

Posted: Thu Jan 11, 2007 4:14 pm
by Kieran Huggins
8O

I had to read that code a few times before I saw the relation ;)

Save the postage - the next time I visit London you can take me out for some spam, spam, spam, eggs & spam.