Parsing a citation
Posted: Wed Sep 10, 2008 10:38 am
by wescrock
Hey all,
My department at a university library has taken on a project (passed to us by the Smithsonian Nat. Museum) in which I will be developing a front end for a database. We are currently rebuilding the DB, but would like to make a view where we take a citation field (example below) and parse it to separate each section of it (date, author, publisher...). They SHOULD all be in ALA format.
I am coming here with this, because I am very new to PHP and have never done something like this...
Citation Example:
Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.
My main concern with this kind of parsing is that the citations are not comma-delimited or anything of the sort.
I need to make a view that will show the citation broken into sections (author, article, date, publisher...). Eventually, this same piece of code will be used to insert into the new table that is being created at a later date.
If anyone could give me some guidance, that would be fantastic.
Thank you,
Wes Crockett
Re: Parsing a citation
Posted: Wed Sep 10, 2008 10:52 am
by Darkzaelus
If it is just separated with <i> and </i> then you can use preg_match:
Code: Select all
preg_match('/^(.*?)<i>(.*?)<\/i>(.*)$/is', $text, $matches);
echo $matches[1];//Ahlström, Göran.
echo $matches[2];// Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.
echo $matches[3];// Lund: Lund University Press, 1996.
The first group runs up to the opening tag, the second sits between the tags, and the third takes everything after the closing tag.
Tell me if it works!
Cheers, Darkzaelus
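For what it's worth, the regex idea above can be pushed a bit further so that one pattern pulls out every field. This is only a sketch: it assumes the layout "Author.<i> Title.</i> Place: Publisher, Year." from the example citation, the function name parse_book_citation is made up for illustration, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical helper: returns the named parts of a book citation, or null
// when the citation does not follow the assumed layout.
function parse_book_citation($citation)
{
    // author, <i>title</i>, place before the colon, publisher up to the
    // last ", YYYY" at the end of the string.
    $pattern = '/^(?<author>.*?)<i>(?<title>.*?)<\/i>\s*(?<place>[^:]+):\s*'
             . '(?<publisher>.*),\s*(?<year>\d{4})\.?\s*$/isu';
    if (!preg_match($pattern, $citation, $m)) {
        return null;
    }
    return array(
        'author'    => trim($m['author']),
        'title'     => trim($m['title']),
        'place'     => trim($m['place']),
        'publisher' => trim($m['publisher']),
        'year'      => $m['year'],
    );
}

$parts = parse_book_citation("Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.");
print_r($parts);
```

Because the publisher group is greedy and the year has to be four digits at the very end, a comma inside the publisher name should not break the split. Citations that don't fit the layout come back as null, so they can be listed for manual clean-up.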
Re: Parsing a citation
Posted: Wed Sep 10, 2008 10:56 am
by Dravos
There's a comma splitting the last 2 values, so it isn't just the italic tags, but that's an efficient approach.
I'd gone for a step-by-step approach, looking for each token in turn. Of course, the main problem here will be if a publisher has a comma in its name, or a title has an italic tag in it.
Code: Select all
<?php
$citation="Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.";
$temp=$citation;
$needle=strpos($temp, "<i>");
$author=substr($temp, 0, $needle);
$temp=substr($temp, $needle+3); // skip past "<i>"
$needle=strpos($temp, "</i>");
$title=trim(substr($temp, 0, $needle));
$temp=substr($temp, $needle+4); // skip past "</i>"
$needle=strpos($temp, ",");
$publisher=trim(substr($temp, 0, $needle));
$date=trim(substr($temp, $needle+1)); // skip past the comma
?>
Author: <?=$author?><br/>
Title: <?=$title?><br/>
Publisher: <?=$publisher?><br/>
Date: <?=$date?><br/>
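One way to soften the comma problem mentioned above is to split the tail of the citation at the LAST comma instead of the first. A rough sketch follows; the function name split_publisher_and_date and the sample publisher string are invented for illustration, and it assumes the date is always whatever follows the final comma.

```php
<?php
// Hypothetical helper: takes the text after the closing italic tag and
// returns array(publisher, date), splitting at the last comma.
function split_publisher_and_date($tail)
{
    $tail = trim($tail);
    $comma = strrpos($tail, ","); // position of the LAST comma, or false
    if ($comma === false) {
        return array($tail, ""); // no comma at all: treat everything as publisher
    }
    $publisher = trim(substr($tail, 0, $comma));
    $date = trim(substr($tail, $comma + 1), " ."); // drop spaces and the final period
    return array($publisher, $date);
}

// Made-up tail with a comma inside the publisher name.
list($publisher, $date) = split_publisher_and_date(" Chicago: Smith, Jones and Co., 1893.");
echo $publisher . "\n";
echo $date . "\n";
```

This still breaks if something other than the date follows the last comma, so it only narrows the problem.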
Re: Parsing a citation
Posted: Wed Sep 10, 2008 10:59 am
by Darkzaelus
Yeah, that's good stuff. But given this:
wescrock wrote:My main concern with this kind of parsing, is that the format of these citations are not comma delineated or anything of the sort.
don't we need to know WHAT the delimiter is?
Cheers, Darkzaelus
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:04 am
by Dravos
I remember doing something similar to this for some coursework at university. I think there is more than one delimiter because the storage format wasn't very well thought through. I'm assuming all other entries conform to this first one, and as long as they don't have <i>, </i> or , in any of the data then it should be fine :/
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:14 am
by wescrock
Thanks for your comments so far!
There are several different formats for the citations, but those are determined by a separate field (book/article, thesis, dissertation...) so, that won't be hard to do with a simple if/then sequence.
I am in the very early stages of this step, and I will be trying to actually do it sometime late today or tomorrow A.M. I will post once I get into it!
thanks again,
Wes
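The "one parser per format" idea described in this post could also be laid out as a lookup table rather than a chain of if/then. This is only a sketch: parse_book and parse_by_format are placeholder names, and 'Books/Monographs' is the only format value taken from this thread.

```php
<?php
// Placeholder parser for one format; a real one would do the actual splitting.
function parse_book($citation)
{
    return array('type' => 'book', 'raw' => $citation);
}

// Map each value of the format field to the function that parses it.
$parsers = array(
    'Books/Monographs' => 'parse_book',
    // other formats would be registered here as their parsers get written
);

// Hypothetical dispatcher: picks the parser by format, or returns null
// when no parser has been registered for that format yet.
function parse_by_format($format, $citation, $parsers)
{
    if (!isset($parsers[$format])) {
        return null;
    }
    return call_user_func($parsers[$format], $citation);
}

$parsed = parse_by_format('Books/Monographs', 'Some citation text', $parsers);
var_dump($parsed !== null);
```

Returning null for an unknown format makes it easy to list the rows that still need a parser.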
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:28 am
by wescrock
Worked like a charm! Now I just need to put it in a loop with echo statements, so I can pull from the SQL table all 'citations' where 'format' == "Books/Monographs".
thanks!
-Wes
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:31 am
by Darkzaelus
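Make sure the server has short tags enabled: on PHP versions before 5.4, <?=$author?> only works when short_open_tag is on.
It would be safer to use <?php echo $author; ?>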
Cheers, Darkzaelus
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:44 am
by wescrock
Sweet! I got it to work with the WHILE loop for the most part... I noticed now, though, that not all of them are formatted the same... so, for the strpos() command... can I put an or statement in it?
such as:
Code: Select all
$needle=strpos($temp, "<i>" or "<em>" or "“");
The person who originally made this DB had a very small view of databases it looks like...
thanks,
Wes
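As a quick illustration of why the "or" version above won't do what's wanted: PHP evaluates the "or" expression first, so strpos() never sees the three strings. A small sketch, with a made-up $temp value:

```php
<?php
// "or" is a logical operator: the three strings collapse into one boolean.
$needle_arg = ("<i>" or "<em>" or "“");
var_dump($needle_arg); // bool(true)

// So this call searches for the boolean, not for any of the delimiters.
$temp = "Some Author.<em> Some Title.</em>";
$needle = strpos($temp, "<i>" or "<em>" or "“");
var_dump($needle); // bool(false), even though <em> is in the string
```

Each delimiter has to be passed to strpos() in its own call.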
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:53 am
by Dravos
I'm not sure about a 1 line solution but you could do:
Code: Select all
if (strpos($temp, "<i>"))
$needle=strpos($temp, "</i>");
elseif (strpos($temp, "<em>"))
$needle=strpos($temp, "<em>");
elseif (strpos($temp, "“"))
$needle=strpos($temp, "“");
Re: Parsing a citation
Posted: Wed Sep 10, 2008 11:55 am
by Dravos
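Actually that doesn't seem to quite work: the first branch looks up "</i>" where it should look up "<i>", and a match at position 0 would count as false in the if.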
Re: Parsing a citation
Posted: Wed Sep 10, 2008 12:01 pm
by Darkzaelus
Code: Select all
if (($needle=strpos($temp, "<i>"))===false)
    if (($needle=strpos($temp, "<em>"))===false)
        $needle=strpos($temp, "“");
Does that work?
Re: Parsing a citation
Posted: Wed Sep 10, 2008 12:05 pm
by Dravos
Yep, that's good, although bear in mind that the start number in the substr will need to change based on the length of the delimiter. It was hardcoded to 3, based on <i>, so it would be 4 with <em>. You could set a variable for this in those if statements and plug it in.
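The variable-length offset described here might look something like the sketch below. The helper name find_opening_delimiter and the sample $temp string are made up for illustration; it tries the delimiters in the order given and remembers which one matched.

```php
<?php
// Hypothetical helper: returns array(position, delimiter) for the first
// delimiter in $delimiters that occurs in $text, or null if none does.
function find_opening_delimiter($text, $delimiters)
{
    foreach ($delimiters as $delimiter) {
        $pos = strpos($text, $delimiter);
        if ($pos !== false) { // !== so that a match at position 0 still counts
            return array($pos, $delimiter);
        }
    }
    return null;
}

$temp = "Some Author.<em> Some Title.</em> Place: Publisher, 2000.";
$author = "";
$rest = "";
$found = find_opening_delimiter($temp, array("<i>", "<em>", "“"));
if ($found !== null) {
    list($needle, $delimiter) = $found;
    $author = substr($temp, 0, $needle);
    // Skip exactly as many bytes as the matched delimiter has:
    // 3 for <i>, 4 for <em>, and 3 for a UTF-8 encoded “ as well.
    $rest = substr($temp, $needle + strlen($delimiter));
}
echo $author . "\n";
echo $rest . "\n";
```

strlen() counts bytes, which is what substr() works with too, so the two stay in step even for the multi-byte curly quote.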
Re: Parsing a citation
Posted: Wed Sep 10, 2008 12:13 pm
by Darkzaelus
Never used strpos, that's my excuse

GOT IT
Code: Select all
$end=str_ireplace(array('<i>','</i>','<em>','</em>','"'), chr(13), $text);
$array=explode(chr(13),$end);
Sorted

This replaces all the delimiters with a common character that won't appear in the data, a carriage return (chr(13)), then explodes the string into an array on it.
Cheers, Darkzaelus
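The same replace-then-explode idea can probably be done in one step with preg_split and an alternation of the delimiters. A sketch only, assuming the delimiters are the <i>/<em> tags and curly double quotes, and that the script file is saved as UTF-8:

```php
<?php
$text = "Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.";

// Split on <i>, </i>, <em>, </em> (in any letter case) or a curly double quote.
$parts = preg_split('/<\/?(?:i|em)>|[“”]/iu', $text);

// Trim the whitespace left around each piece.
$parts = array_map('trim', $parts);
print_r($parts);
```

For the example citation this gives three pieces (author, title, and the "place: publisher, year" tail), and no placeholder character is needed.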
Re: Parsing a citation
Posted: Wed Sep 10, 2008 12:42 pm
by wescrock
So... after playing around with it... and a little confusion... the best this is going to get (because the data is pretty messed up in a lot of ways) is this:
Code: Select all
<?php
include 'db.inc'; // expected to define $hostname, $username and $password
// mysqli_* is used here because the old mysql_* functions were removed in PHP 7.
$dbh = mysqli_connect($hostname, $username, $password)
    or die("Unable to connect to MySQL");
$query="select citation from wfbiblio.items WHERE format='Books/Monographs'";
$result = mysqli_query($dbh, $query);

// Trim a parsed field and flag it when nothing was found.
function check_field($value)
{
    $value=trim($value);
    return ($value==="") ? "WARNING: Null Value." : $value;
}

while($row = mysqli_fetch_array($result))
{
    $citation=$row[0];
    $author=$title=$publisher=$date="";
    // Try <i>...</i> first, then <em>...</em>; the first pair found is used.
    foreach(array("<i>"=>"</i>", "<em>"=>"</em>") as $open=>$close)
    {
        $start=strpos($citation, $open);
        if($start===false)
            continue;
        $end=strpos($citation, $close, $start);
        if($end===false)
            continue;
        $author=substr($citation, 0, $start);
        // The title sits between the tags, so skip the opening tag's length.
        $title=substr($citation, $start+strlen($open), $end-$start-strlen($open));
        $temp=substr($citation, $end+strlen($close));
        // Publisher and date are split on the first comma after the title.
        $needle=strpos($temp, ",");
        if($needle===false)
            $publisher=$temp;
        else
        {
            $publisher=substr($temp, 0, $needle);
            $date=substr($temp, $needle+1);
        }
        break;
    }
    echo "Author: " . check_field($author) . "<br/>";
    echo "Title:<i> " . check_field($title) . "</i><br/>";
    echo "Publisher: " . check_field($publisher) . "<br/>";
    echo "Date: " . check_field($date) . "<br/>";
    echo "<br />";
}
?>
Here is the web address to see the output:
http://labs.lib.csufresno.edu/world_fair/citation.php
thanks all,
Wes