parsing the input
Posted: Thu Sep 15, 2005 5:07 pm
by umbra
Hi,
I want to add some records to my database by reading them from a file. The file content looks like this:
Code:
@article{Gettys90,
author = {Jim Gettys and Phil Karlton and Scott McGregor},
title = {The {X} Window System, Version 11},
journal = {Software Practice and Experience},
volume = {20},
number = {S2},
year = {1990},
abstract = {A technical overview of the X11 functionality. This is an update
of the X10 TOG paper by Scheifler \& Gettys.}
}
There are many such entries in the file.
The word that comes after @ is the entry type; author, title, ..., abstract are field names, and the parts after '=' are their values. How can I parse this input into variables so that I can add them to my database?
I have written some code, but it doesn't work when there is more than one '=' or '@' in a line, so I need something else. Any help is appreciated.
Posted: Thu Sep 15, 2005 6:11 pm
by feyd
Regular Expressions be yer friends, matey. Have a read through our board on them... yarr...
Posted: Thu Sep 15, 2005 10:43 pm
by shoebappa
arrr, that was tougher than I thought... I can respect pointing to the regex manual and being done with it, but I was bored...
I broke it up into two regexes; it may not be the most efficient, but it seems to work.
Code:
<?php
$string = '@article{Gettys90,
author = {Jim Gettys and Phil Karlton and Scott McGregor},
title = {The {X} Window System, Version 11},
journal = {Software Practice and Experience},
volume = {20},
number = {S2},
year = {1990},
abstract = {A technical overview of the X11 functionality. This is an update
of the X10 TOG paper by Scheifler \& Gettys.}
}
@article{Gettys07,
author = {Jim Gettys and Phil Karlton and Scott McGregor},
title = {The {X} Window System, Version 12},
journal = {Software Practice and Experience},
volume = {21},
number = {S3},
year = {2007},
abstract = {A technical overview of the X12 functionality. This is an update
of the X11 TOG paper by Scheifler \& Gettys.}
}';
// matches each @whatever putting the type in sub1, id in sub2, and all of the attributes including the trailing } in sub3
$pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";
preg_match_all($pattern, $string, $matches);
for ($i = 0; $i < count($matches[0]); $i++) {
$output[$i]["type"] = $matches[1][$i];
$output[$i]["id"] = $matches[2][$i];
// matches each attribute putting the name in sub1 and the value in sub2
$pattern2 = "/\s*([0-9a-z]*)\s*=\s*{(.*)},/siU";
// I add a comma to the end of the attributes so they all end with },
// Note I make sure to capture the second to last } in the previous regex.
preg_match_all($pattern2, $matches[3][$i] . ",", $matches2);
for ($j = 0; $j < count($matches2[0]); $j++) {
$output[$i][$matches2[1][$j]] = $matches2[2][$j];
}
}
print_r($output);
?>
Outputs:
Code:
Array
(
[0] => Array
(
[type] => article
[id] => Gettys90
[author] => Jim Gettys and Phil Karlton and Scott McGregor
[title] => The {X} Window System, Version 11
[journal] => Software Practice and Experience
[volume] => 20
[number] => S2
[year] => 1990
[abstract] => A technical overview of the X11 functionality. This is an update
of the X10 TOG paper by Scheifler \& Gettys.
)
[1] => Array
(
[type] => article
[id] => Gettys07
[author] => Jim Gettys and Phil Karlton and Scott McGregor
[title] => The {X} Window System, Version 12
[journal] => Software Practice and Experience
[volume] => 21
[number] => S3
[year] => 2007
[abstract] => A technical overview of the X12 functionality. This is an update
of the X11 TOG paper by Scheifler \& Gettys.
)
)
May need some tweaking if more than just numbers and letters can be in the id and attribute names.
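For what it's worth, here is a minimal sketch of that kind of tweak. It assumes the type and field names may also contain underscores and hyphens, and that the id can be anything up to the first comma; both are guesses, since the sample data only uses letters and digits. The entry and the variable names are made up for illustration.

```php
<?php
// Hypothetical entry with punctuation in the type, the id, and a field name.
$entry = '@in-proceedings{Gettys:90,
  author = {Jim Gettys},
  short-title = {X11}
}';

// Type and field names: word characters or hyphens.
// Id: anything except commas and whitespace.
$entryPattern = '/@([\w-]+)\{([^,\s]+)\s*,(.*\})\s*\}/siU';
$fieldPattern = '/\s*([\w-]+)\s*=\s*\{(.*)\},/siU';

$record = array();
if (preg_match($entryPattern, $entry, $m)) {
    $record['type'] = $m[1];
    $record['id']   = $m[2];
    // Same trick as before: append a comma so every field ends with "},"
    preg_match_all($fieldPattern, $m[3] . ',', $f);
    foreach ($f[1] as $k => $name) {
        $record[$name] = $f[2][$k];
    }
}
print_r($record);
```

A value that itself contains "}," would still cut the field short, so this is only as robust as the two-regex approach it is based on.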
Posted: Fri Sep 16, 2005 4:46 pm
by umbra
$pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";
What does /siU mean in that search pattern?
Posted: Fri Sep 16, 2005 5:10 pm
by shoebappa
They are modifier flags:
s = ignore whitespace
i = case insensitive
U = the ungreedy flag
The only thing I can say about the ungreedy flag is that I use it when a .* eats too much. Normally a .* consumes everything up to the last occurrence of whatever follows it. With the flag on, it says don't be greedy, so it only consumes up to the first occurrence of whatever comes after.
So in $pattern = "/@([0-9a-z]*){([0-9a-z]*)\s*,(.*})\s*}/siU";
Without the flag, the (.*}) would match everything up to the last occurrence of }\s*} (a }, any whitespace, and another }), which would eat all of the @article entries. With the flag, it only goes up to the first occurrence of }\s*}, which is the end of each @article.
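A small sketch of the difference, using a made-up test string rather than the BibTeX data. It compares the default behaviour, the U modifier, and the per-quantifier .*? form, which has the same effect for a single quantifier.

```php
<?php
$text = '{a} {b} {c}';

preg_match('/\{(.*)\}/', $text, $greedy);    // default: .* is greedy
preg_match('/\{(.*)\}/U', $text, $ungreedy); // U flips the default for the whole pattern
preg_match('/\{(.*?)\}/', $text, $lazy);     // ? makes just this one .* lazy

echo $greedy[1], "\n";   // a} {b} {c
echo $ungreedy[1], "\n"; // a
echo $lazy[1], "\n";     // a
```

Note that under U a trailing ? flips a quantifier back to greedy, so mixing the two styles in one pattern gets confusing quickly.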
Posted: Fri Sep 16, 2005 5:24 pm
by feyd
s = treat the subject as a single string (the dot matches newlines too), not ignore whitespace.

Posted: Fri Sep 16, 2005 5:49 pm
by shoebappa
Not sure where I picked up the ignore whitespace idea, but it's kinda funny that I thought it meant that for so long... Not so far off though
http://us2.php.net/manual/en/reference. ... ifiers.php
Still need it I reckon for the .* to span the newlines.
Posted: Fri Sep 16, 2005 5:50 pm
by feyd
yeah, that's exactly what it basically does sorta.. it allows the match to span multiple lines, versus the standard multiple-line mode (m)
Posted: Fri Sep 16, 2005 5:57 pm
by shoebappa
Exactly, basically, sorta... Anyhoo, you prolly saved me a few hours scratching my head down the road assuming the . matches a newline...
I guess I always wondered why the s modifier didn't ignore whitespace and I had to always put the \s* in there...
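To make the s versus m distinction concrete, here is a short sketch with made-up strings. The modifier that actually ignores whitespace is x, and it only ignores whitespace written in the pattern itself, not in the subject.

```php
<?php
$text = "x = {one\ntwo}";

// Without s, the dot stops at the newline, so nothing matches.
$withoutS = preg_match('/\{(.*)\}/', $text, $a);

// With s, the dot also matches "\n" and the value is captured whole.
$withS = preg_match('/\{(.*)\}/s', $text, $b);

// m is a different thing: it lets ^ and $ match at every line break.
$lineStarts = preg_match_all('/^\w+/m', "alpha\nbeta", $c);

var_dump($withoutS, $withS, $b[1], $lineStarts);
```

So the \s* pieces are still needed to skip whitespace in the input; s only changes what the dot is allowed to match.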
Posted: Fri Sep 16, 2005 7:59 pm
by umbra
Why is it that when I write the array element on its own I get its value, but when I write the same expression inside double quotes it outputs Array['type']?
I need to get the value of the array element inside quotes.
Posted: Fri Sep 16, 2005 10:02 pm
by feyd
Inside a string, PHP will only parse the first index you refer to. The first form is preferred, however.
Posted: Fri Sep 16, 2005 10:14 pm
by shoebappa
I thought for PHP to process an array element in quotes you had to enclose it in {}, i.e. write {$output[0]['type']} inside the string.
You can also use the . thing; <-- This is funny, I've been coding too much today... I ended a sentence with a ; baha
Code:
echo "whatever" . $output[0]['type'] . "whatever";
And it seems like I read in an article somewhere that using a comma instead of . is a little faster, for whatever reason. I use a . ...
Code:
echo "whatever" , $output[0]['type'] , "whatever";
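Putting the options side by side, as a rough sketch: the $output array below is a cut-down stand-in for the parser result, and the variable names are just for illustration.

```php
<?php
$output = array(array('type' => 'article', 'id' => 'Gettys90'));

// Simple syntax only reads the first index, so "$output[0]['type']"
// would print "Array['type']" (plus an array-to-string notice).
$braces = "type: {$output[0]['type']}";   // curly syntax handles nested indexes
$concat = 'type: ' . $output[0]['type'];  // concatenation, no parsing inside quotes

$row = $output[0];
$simple = "type: $row[type]";             // one level deep works with an unquoted key

echo $braces, "\n", $concat, "\n", $simple, "\n";
```

All three lines print "type: article". The commas in the final echo pass several arguments to echo instead of building one string first.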