parsing the input

umbra
Forum Newbie
Posts: 21
Joined: Tue Sep 13, 2005 4:20 am

Post by umbra »

Hi,

I want to add some records to my database by reading it from a file. The file content looks like this:

Code: Select all

@article{Gettys90,
   author = {Jim Gettys and Phil Karlton and Scott McGregor},
   title = {The {X} Window System, Version 11},
   journal = {Software Practice and Experience},
   volume = {20},
   number = {S2},
   year = {1990},
   abstract = {A technical overview of the X11 functionality.  This is an update
of the X10 TOG paper by Scheifler \& Gettys.}
}
and there are many of them in a file.

The word that comes after @ is the entry type; author, title, ..., abstract are field names, and the parts after '=' are their values. How can I parse this input into variables so that I can add them to my database?

I have written some code, but it doesn't work when there is more than one '=' or '@' in a line, so I need something else. Any help is appreciated.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Regular Expressions be yer friends, matey. Have a read through our board on them... yarr...
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

arrr, that was tougher than I thought... I can respect pointing to the regex manual and being done with it, but I was bored...

I broke it up into two regexes. It may not be the most efficient, but it seems to work.

Code: Select all

<?php

$string = '@article{Gettys90, 
author = {Jim Gettys and Phil Karlton and Scott McGregor}, 
title = {The {X} Window System, Version 11}, 
journal = {Software Practice and Experience}, 
volume = {20}, 
number = {S2}, 
year = {1990}, 
abstract = {A technical overview of the X11 functionality. This is an update 
of the X10 TOG paper by Scheifler \& Gettys.} 
}

@article{Gettys07, 
author = {Jim Gettys and Phil Karlton and Scott McGregor}, 
title = {The {X} Window System, Version 12}, 
journal = {Software Practice and Experience}, 
volume = {21}, 
number = {S3}, 
year = {2007}, 
abstract = {A technical overview of the X12 functionality. This is an update 
of the X11 TOG paper by Scheifler \& Gettys.} 
}';

// Matches each @entry: type in group 1, id in group 2, and all of the fields
// (including the closing } of the last field) in group 3.
$pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";
preg_match_all($pattern, $string, $matches);

// Matches each field: name in group 1, value in group 2.
$pattern2 = "/\s*([0-9a-z]*)\s*=\s*{(.*)},/siU";

$output = array();
for ($i = 0; $i < count($matches[0]); $i++) {
	$output[$i]["type"] = $matches[1][$i];
	$output[$i]["id"] = $matches[2][$i];

	// Append a comma so that every field, including the last, ends with "},".
	// This relies on group 3 above capturing the last field's closing }.
	preg_match_all($pattern2, $matches[3][$i] . ",", $matches2);

	for ($j = 0; $j < count($matches2[0]); $j++) {
		$output[$i][$matches2[1][$j]] = $matches2[2][$j];
	}
}

print_r($output);

?>
Outputs:

Code: Select all

Array
(
    [0] => Array
        (
            [type] => article
            [id] => Gettys90
            [author] => Jim Gettys and Phil Karlton and Scott McGregor
            [title] => The {X} Window System, Version 11
            [journal] => Software Practice and Experience
            [volume] => 20
            [number] => S2
            [year] => 1990
            [abstract] => A technical overview of the X11 functionality. This is an update 
of the X10 TOG paper by Scheifler \& Gettys.
        )

    [1] => Array
        (
            [type] => article
            [id] => Gettys07
            [author] => Jim Gettys and Phil Karlton and Scott McGregor
            [title] => The {X} Window System, Version 12
            [journal] => Software Practice and Experience
            [volume] => 21
            [number] => S3
            [year] => 2007
            [abstract] => A technical overview of the X12 functionality. This is an update 
of the X11 TOG paper by Scheifler \& Gettys.
        )

)
May need some tweaking if more than just numbers and letters can be in the id and attribute names.
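
One way that tweaking might look, as a rough sketch rather than a tested BibTeX parser: the character classes are widened so that the type and field names may contain hyphens and underscores, and the id may contain anything except whitespace, commas, and braces. The sample entry and the names $entryPattern and $fieldPattern are made up for illustration.

```php
<?php
// Hypothetical input: the id contains ':' and '-', one field name contains '-'.
$string = '@in-proceedings{gettys:1990-x11,
   author = {Jim Gettys},
   short-title = {The {X} Window System}
}';

// Type and field names: letters, digits, underscore, hyphen.
// Id: anything except whitespace, commas, and braces.
$entryPattern = '/@([\w-]+)\s*\{\s*([^\s,{}]+)\s*,(.*\})\s*\}/siU';
$fieldPattern = '/\s*([\w-]+)\s*=\s*\{(.*)\},/siU';

$output = array();
preg_match_all($entryPattern, $string, $matches, PREG_SET_ORDER);
foreach ($matches as $m) {
    $entry = array('type' => $m[1], 'id' => $m[2]);
    // Same trick as above: append a comma so the last field also ends with "},".
    preg_match_all($fieldPattern, $m[3] . ',', $fields, PREG_SET_ORDER);
    foreach ($fields as $f) {
        $entry[$f[1]] = $f[2];
    }
    $output[] = $entry;
}

print_r($output);
```

Like the original patterns, this still assumes every value is wrapped in braces and that no value ends in a nested brace directly before its own closing brace.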
umbra
Forum Newbie
Posts: 21
Joined: Tue Sep 13, 2005 4:20 am

Post by umbra »

$pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";
What does /siU mean in that search pattern?
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

They are modifier flags:

s = ignore whitespace
i = case insensitive
U = ungreedy (flips the greediness of quantifiers)

The only thing I can say about the U flag is that I use it when a .* eats too much. Normally a .* consumes everything up to the last occurrence of whatever follows it. With U on, it says don't be greedy, so it only consumes up to the first occurrence of whatever comes after.

So in $pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";

Without the flag the (.*}) would match everything until the last occurrence of }\s*} (a }, any whitespace, and another }), which would eat all of the @article entries. With the flag it only goes until the first occurrence of }\s*}, which is the end of each @article.
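
A small sketch of that difference on a made-up string (the variable names are illustrative only). It also shows the per-quantifier form .*?, which makes a single quantifier lazy without the U flag:

```php
<?php
$text = '{one} {two}';

preg_match('/\{(.*)\}/', $text, $greedy);    // default: .* is greedy
preg_match('/\{(.*)\}/U', $text, $ungreedy); // U flips quantifiers to lazy
preg_match('/\{(.*?)\}/', $text, $lazy);     // lazy for this quantifier only

echo $greedy[1], "\n";   // one} {two
echo $ungreedy[1], "\n"; // one
echo $lazy[1], "\n";     // one
```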
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

s = single string, not ignore whitespace. ;)
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

Not sure where I picked up the ignore whitespace idea, but it's kinda funny that I thought it meant that for so long... Not so far off though

http://us2.php.net/manual/en/reference. ... ifiers.php

Still need it I reckon for the .* to span the newlines.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

yeah, that's exactly what it basically does sorta.. s makes the . match newlines too, so a match can span multiple lines. That's different from multi-line mode (m), which only changes where ^ and $ can match.
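
A quick sketch of the two modifiers side by side, using a made-up two-line string (variable names are illustrative):

```php
<?php
$text = "first line\nsecond line";

// Without s, the dot stops at a newline, so the match cannot cross lines.
$dotPlain = preg_match('/first.*second/', $text);   // 0
// With s, the dot also matches "\n".
$dotAll   = preg_match('/first.*second/s', $text);  // 1

// m does not affect the dot; it lets ^ and $ match at line boundaries.
$anchorPlain = preg_match('/^second/', $text);      // 0
$anchorMulti = preg_match('/^second/m', $text);     // 1

var_dump($dotPlain, $dotAll, $anchorPlain, $anchorMulti);
```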
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

Exactly, basically, sorta... Anyhoo, you prolly saved me a few hours scratching my head down the road assuming the . matches a newline...

I guess I always wondered why the s modifier didn't ignore whitespace and I had to always put the \s* in there...
umbra
Forum Newbie
Posts: 21
Joined: Tue Sep 13, 2005 4:20 am

Post by umbra »

Why when I write:

Code: Select all

echo $output[0]['type'];
I get the result of the array and when I write:

Code: Select all

echo "$output[0]['type']";
it outputs Array['type'] ?

I need to get the result of the array in quotes.
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

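when a variable sits in a double-quoted string on its own, php only parses the first array index you refer to.. so $output[0] gets printed as Array and ['type'] is left as plain text. your first version, without the quotes, is preferred however.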
shoebappa
Forum Contributor
Posts: 158
Joined: Mon Jul 11, 2005 9:14 pm
Location: Norfolk, VA

Post by shoebappa »

I thought for php to process an array in quotes you had to enclose it in {} so:

Code: Select all

echo "{$output[0]['type']}";
You can also use the . thing; <-- This is funny, I've been coding too much today... I ended a sentence with a ; baha

Code: Select all

echo "whatever" . $output[0]['type'] . "whatever";
And it seems like I read in an article somewhere that using a comma instead of . is a little faster, for whatever reason. I use a . ...

Code: Select all

echo "whatever" , $output[0]['type'] , "whatever";
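
To put the variants from this exchange side by side, here is a small sketch with a made-up one-entry $output. The @ operator is there only to silence the "Array to string conversion" warning in the first case; the variable names $simple, $curly, and $concat are illustrative.

```php
<?php
$output = array(array('type' => 'article'));

// Simple syntax stops after the first index: $output[0] is converted to the
// string "Array" and ['type'] is left as literal text.
$simple = @"$output[0]['type']";

// Curly-brace syntax parses the whole expression.
$curly = "{$output[0]['type']}";

// Concatenation avoids interpolation entirely.
$concat = 'type: ' . $output[0]['type'];

echo $simple, "\n"; // Array['type']
echo $curly, "\n";  // article
echo $concat, "\n"; // type: article
```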