Parsing a citation

wescrock
Forum Commoner
Posts: 31
Joined: Wed Sep 10, 2008 10:31 am
Location: Fresno, CA

Parsing a citation

Post by wescrock »

Hey all,

My department at a university library has taken on a project (passed to us by the Smithsonian Nat. Museum) in which I will be developing a front end for a database. We are currently rebuilding the DB, but would like to make a view where we take a citation field [example below] and parse it to separate each section of it [date, author, publisher... they SHOULD all be in ALA format].

I am coming here with this, because I am very new to PHP and have never done something like this...

Citation Example:
Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.
My main concern with this kind of parsing is that these citations are not comma-delimited or anything of the sort.

I need to make a view that will show the citation broken into sections (author, article, date, publisher...). Eventually, this same piece of code will be used to insert into the new table that is being created at a later date.

If anyone could give me some guidance, that would be fantastic.

Thank you,
Wes Crockett
Darkzaelus
Forum Commoner
Posts: 94
Joined: Tue Sep 09, 2008 7:02 am

Re: Parsing a citation

Post by Darkzaelus »

If it is just separated with <i> and </i>, then you can use preg_split:

Code: Select all

 
$parts=preg_split('#</?i>#i',$text); // split on every opening or closing <i> tag
echo $parts[0];//Ahlström, Göran.
echo $parts[1];// Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.
echo $parts[2];// Lund: Lund University Press, 1996.
 
It will keep going until the end, splitting at every start and end tag.

Tell me if it works!

Cheers, Darkzaelus
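
If the citations really do follow the layout "Author.<i> Title.</i> Place: Publisher, Year.", a single preg_match with named groups can pull all four fields out at once. This is only a sketch under that assumption: parse_book_citation() is a made-up helper name, and the pattern splits publisher from date at the last comma after the closing tag.

```php
<?php
// Hypothetical helper (not from the thread). Assumes the layout
// "Author.<i> Title.</i> Place: Publisher, Year." and UTF-8 text.
function parse_book_citation(string $text): ?array
{
    // author: up to the first <i>; title: up to the first </i>;
    // imprint: greedy, so the split happens at the LAST comma; date: the rest.
    $pattern = '#^(?<author>.*?)<i>(?<title>.*?)</i>(?<imprint>.*),\s*(?<date>[^,]*)$#isu';
    if (preg_match($pattern, $text, $m) !== 1) {
        return null; // the citation does not follow the assumed layout
    }
    return array(
        'author'    => trim($m['author']),
        'title'     => trim($m['title']),
        'publisher' => trim($m['imprint']),
        'date'      => trim($m['date']),
    );
}

$parts = parse_book_citation('Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.');
```

A null return marks a row that needs a different rule or a manual look, which may be more useful than a half-parsed result.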
Dravos
Forum Newbie
Posts: 15
Joined: Wed Sep 10, 2008 9:27 am
Location: London

Re: Parsing a citation

Post by Dravos »

There's a comma splitting the last 2 values, so it isn't just the italic tags, but that's an efficient approach.

I'd gone for a step-by-step approach, looking for each token in turn. Of course, the main problem here will be if a publisher has a comma in its name, or a title has an italic tag in it.

Code: Select all

<?php
 
$citation="Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.";
 
$temp=$citation;
 
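// Author: everything before the opening <i> tag.
$needle=strpos($temp, "<i>");
$author=substr($temp, 0, $needle);
$temp=substr($temp, $needle);
 
// Title: skip the 3-character "<i>" and stop just before "</i>".
$needle=strpos($temp, "</i>");
$title=substr($temp, 3, $needle - 3);
$temp=substr($temp, $needle + 4); // continue after the 4-character "</i>"
 
// Publisher: everything up to the next comma.
$needle=strpos($temp, ",");
$publisher=substr($temp, 0, $needle);
$temp=substr($temp, $needle);
 
// Date: whatever follows that comma.
$date=substr($temp, 1);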
 
?>
 
Author: <?=$author?><br/>
Title: <?=$title?><br/>
Publisher: <?=$publisher?><br/>
Date: <?=$date?><br/>
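
For the worry about a publisher with a comma in its name, one option is to split the "Place: Publisher, Year." part at the last comma instead of the first, using strrpos. A rough sketch follows; split_imprint() is a hypothetical helper and the sample string is invented for illustration.

```php
<?php
// Hypothetical helper (not from the thread): split "Place: Publisher, Year."
// at the LAST comma, so commas inside the publisher name are preserved.
function split_imprint(string $imprint): array
{
    $pos = strrpos($imprint, ',');
    if ($pos === false) {
        // No comma at all: treat everything as the publisher.
        return array('publisher' => trim($imprint), 'date' => '');
    }
    return array(
        'publisher' => trim(substr($imprint, 0, $pos)),
        'date'      => trim(substr($imprint, $pos + 1)),
    );
}

// Made-up sample with a comma inside the publisher name.
$result = split_imprint(' New York: Harper, Row and Sons, 1901.');
```

This still assumes the year is the final comma-separated piece, so a citation that ends with extra notes after the year would need its own handling.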
Darkzaelus
Forum Commoner
Posts: 94
Joined: Tue Sep 09, 2008 7:02 am

Re: Parsing a citation

Post by Darkzaelus »

Yeah, that's good stuff. But don't we need to know:
wescrock wrote: My main concern with this kind of parsing is that these citations are not comma-delimited or anything of the sort.
WHAT is the delimiter :P

Cheers, Darkzaelus
Dravos
Forum Newbie
Posts: 15
Joined: Wed Sep 10, 2008 9:27 am
Location: London

Re: Parsing a citation

Post by Dravos »

I remember doing something similar to this for some coursework at university. I think there is more than one delimiter due to a not very well thought through storage format. I'm assuming all other entries conform to this first one, and as long as they don't have <i>, </i> or , in any of the data then it should be fine :/
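
That assumption can be checked before parsing. The sketch below flags citations that do not have exactly one <i>...</i> pair followed by a comma; fits_simple_layout() is a hypothetical name, and the check is case-sensitive, so <I> tags would be reported as non-conforming.

```php
<?php
// Hypothetical check (not from the thread): exactly one <i>...</i> pair,
// opening tag first, and at least one comma after the closing tag.
function fits_simple_layout(string $citation): bool
{
    if (substr_count($citation, '<i>') !== 1 || substr_count($citation, '</i>') !== 1) {
        return false;
    }
    $open  = strpos($citation, '<i>');
    $close = strpos($citation, '</i>');
    if ($close < $open) {
        return false;
    }
    // A comma after the closing tag is needed to separate publisher and date.
    return strpos($citation, ',', $close) !== false;
}

$ok = fits_simple_layout('Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.');
```

Rows that fail the check can be listed separately for manual review instead of being parsed into misleading fields.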
wescrock
Forum Commoner
Posts: 31
Joined: Wed Sep 10, 2008 10:31 am
Location: Fresno, CA

Re: Parsing a citation

Post by wescrock »

Thanks for your comments so far!

There are several different formats for the citations, but those are determined by a separate field (book/article, thesis, dissertation...) so, that won't be hard to do with a simple if/then sequence.

I am in the very early stages of this step, and I will be trying to actually do it sometime late today or tomorrow A.M. I will post once I get into it!

thanks again,
Wes
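
The if/then sequence on the format field could look roughly like this. Only 'Books/Monographs' is a format value that actually appears in this thread; the 'Articles' value and all three function names are placeholders, and the parser bodies are stubs.

```php
<?php
// Sketch only: parse_book(), parse_article() and parse_by_format() are
// hypothetical names, and 'Articles' is an assumed format value.
function parse_book(string $citation): array
{
    // Stub: a real parser would split out author, title, publisher and date.
    return array('type' => 'book', 'raw' => $citation);
}

function parse_article(string $citation): array
{
    // Stub for a second citation layout.
    return array('type' => 'article', 'raw' => $citation);
}

function parse_by_format(string $format, string $citation): array
{
    if ($format == 'Books/Monographs') {
        return parse_book($citation);
    } elseif ($format == 'Articles') {
        return parse_article($citation);
    }
    // Unknown format: hand the citation back untouched so it can be reviewed.
    return array('type' => 'unknown', 'raw' => $citation);
}

$parsed = parse_by_format('Books/Monographs', 'Some citation text');
```

Keeping one parser per format means a fix for one layout cannot break the others.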
wescrock
Forum Commoner
Posts: 31
Joined: Wed Sep 10, 2008 10:31 am
Location: Fresno, CA

Re: Parsing a citation

Post by wescrock »

Worked like a charm! Now I just need to put it in a loop, with echo statements, so I can pull from the SQL table all 'citations' where 'format' == "Books/Monographs".

thanks!
-Wes
Darkzaelus
Forum Commoner
Posts: 94
Joined: Tue Sep 09, 2008 7:02 am

Re: Parsing a citation

Post by Darkzaelus »

make sure the client can use short tags:

Code: Select all

 
<?=$author?>
it would be safer to use

Code: Select all

<?php echo $author ?>
Cheers, Darkzaelus
wescrock
Forum Commoner
Posts: 31
Joined: Wed Sep 10, 2008 10:31 am
Location: Fresno, CA

Re: Parsing a citation

Post by wescrock »

Sweet! I got it to work with the WHILE loop for the most part... I noticed now, though, that not all of them are formatted the same... so, for the strpos() command... can I put an or statement with it?

such as:

Code: Select all

$needle=strpos($temp, "<i>" or "<em>" or "“");
The person who originally made this DB had a very small view of databases it looks like...

thanks,
Wes
Dravos
Forum Newbie
Posts: 15
Joined: Wed Sep 10, 2008 9:27 am
Location: London

Re: Parsing a citation

Post by Dravos »

I'm not sure about a 1 line solution but you could do:

Code: Select all

if (strpos($temp, "<i>"))
    $needle=strpos($temp, "</i>");
elseif (strpos($temp, "<em>"))
    $needle=strpos($temp, "<em>");
elseif (strpos($temp, "“"))
    $needle=strpos($temp, "“");
Dravos
Forum Newbie
Posts: 15
Joined: Wed Sep 10, 2008 9:27 am
Location: London

Re: Parsing a citation

Post by Dravos »

Actually that doesn't seem to quite work..
Darkzaelus
Forum Commoner
Posts: 94
Joined: Tue Sep 09, 2008 7:02 am

Re: Parsing a citation

Post by Darkzaelus »

Code: Select all

 
// Compare against false explicitly: a match at position 0 is also "falsy".
if (($needle=strpos($temp, "<i>")) === false)
    if (($needle=strpos($temp, "<em>")) === false)
        $needle=strpos($temp, "“");
 
Does that work?
Dravos
Forum Newbie
Posts: 15
Joined: Wed Sep 10, 2008 9:27 am
Location: London

Re: Parsing a citation

Post by Dravos »

Yep, that's good, although bear in mind that the start number in the substr will need to change based on the length of the delimiter. It was hardcoded to 3, based on <i>, so it would be 4 with <em>. You could set a variable for this in those if statements and plug it in.
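
One way to carry the delimiter length along is to return it together with the position. The following is a sketch, not code from the thread: find_first_delimiter() is a hypothetical helper and the sample citation is invented. strlen() counts bytes, which matches the byte offsets that strpos() and substr() use, so the 3-byte UTF-8 curly quote is handled consistently.

```php
<?php
// Hypothetical helper (not from the thread): return the position and length
// of whichever delimiter occurs first in $text, or null if none is found.
function find_first_delimiter(string $text, array $delimiters): ?array
{
    $best = null;
    foreach ($delimiters as $delimiter) {
        $pos = strpos($text, $delimiter);
        // Use !== false, because a match at position 0 would otherwise look like "not found".
        if ($pos !== false && ($best === null || $pos < $best['pos'])) {
            $best = array('pos' => $pos, 'len' => strlen($delimiter), 'delimiter' => $delimiter);
        }
    }
    return $best;
}

// Made-up sample citation that uses <em> instead of <i>.
$text   = 'Doe, Jane.<em> A Made-Up Title.</em> Somewhere: Nobody Press, 1900.';
$found  = find_first_delimiter($text, array('<i>', '<em>', '“'));
$author = substr($text, 0, $found['pos']);
$rest   = substr($text, $found['pos'] + $found['len']); // starts right after the delimiter
```

The matching closing delimiter (</i>, </em> or ”) would still have to be looked up from the opening one that was found.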
Darkzaelus
Forum Commoner
Posts: 94
Joined: Tue Sep 09, 2008 7:02 am

Re: Parsing a citation

Post by Darkzaelus »

Never used strpos, that's my excuse :P
GOT IT

Code: Select all

 
$end=str_ireplace(array('<i>','</i>','<em>','</em>','"'), chr(13), $text);
$array=explode(chr(13),$end);
 
Sorted :P
Replaces all delimiters with a common one that won't be used, a carriage return, then explodes them into an array.

Cheers, Darkzaelus
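
The same idea can be done in one step with preg_split, which avoids picking a placeholder character that must never occur in the data. A minimal sketch, assuming UTF-8 input (hence the /u modifier for the curly quotes); split_on_delimiters() is a hypothetical name.

```php
<?php
// Sketch (not from the thread): split on <i>, </i>, <em>, </em> and
// straight or curly double quotes in a single call.
function split_on_delimiters(string $text): array
{
    $pieces = preg_split('#</?(?:i|em)>|[“”"]#iu', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($pieces === false) {
        return array(); // preg_split failed, for example on invalid UTF-8
    }
    // Trim the whitespace left around each piece.
    return array_map('trim', $pieces);
}

$pieces = split_on_delimiters('Ahlström, Göran.<i> Technological Development and Industrial Expositions, 1850-1914: Sweden in an International Perspective.</i> Lund: Lund University Press, 1996.');
```

Like the str_ireplace version, this leaves publisher and date together in the last piece, so the comma split is still a separate step.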
wescrock
Forum Commoner
Posts: 31
Joined: Wed Sep 10, 2008 10:31 am
Location: Fresno, CA

Re: Parsing a citation

Post by wescrock »

So... after playing around with it... and a little confusion... the best this is going to get (because the data is pretty messed up in a lot of ways) is this:

Code: Select all

<?php
    include 'db.inc';
    // mysql_* was removed in PHP 7, so this uses the equivalent mysqli_* calls.
    $dbh = mysqli_connect($hostname, $username, $password)
        or die("Unable to connect to MySQL");

    $query = "select citation from wfbiblio.items WHERE format='Books/Monographs'";
    $result = mysqli_query($dbh, $query);

    while ($row = mysqli_fetch_array($result))
    {
        $temp = $row[0];

        // Every field starts as a warning and is overwritten once it is found.
        $author = $title = $publisher = $date = "WARNING: Null Value.";

        // The title is wrapped in <i>...</i> or, failing that, <em>...</em>.
        if (strpos($temp, "<i>") !== false)
            { $open = "<i>"; $close = "</i>"; }
        else
            { $open = "<em>"; $close = "</em>"; }

        $start = strpos($temp, $open);
        $end = strpos($temp, $close);

        if ($start !== false && $end !== false && $end > $start)
        {
            // Author: everything before the opening tag.
            if ($start > 0)
                { $author = substr($temp, 0, $start); }

            // Title: the text between the opening and closing tags.
            $titleStart = $start + strlen($open);
            if ($end > $titleStart)
                { $title = substr($temp, $titleStart, $end - $titleStart); }

            // Publisher and date: split what follows the closing tag at the first comma.
            $temp = substr($temp, $end + strlen($close));
            $needle = strpos($temp, ",");
            if ($needle !== false && $needle > 0)
                { $publisher = substr($temp, 0, $needle); }
            if ($needle !== false && strlen($temp) > $needle + 1)
                { $date = substr($temp, $needle + 1); }
        }

        echo "Author: " . $author . "<br/>";
        echo "Title:<i> " . $title . "</i><br/>";
        echo "Publisher: " . $publisher . "<br/>";
        echo "Date: " . $date . "<br/>";
        echo "<br />";
    }

?>
Here is the web address to see the output:
http://labs.lib.csufresno.edu/world_fair/citation.php

thanks all,
Wes