Retrieving URLs from <item><link> in an XML file

raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK

Post by raghavan20 »

I am trying to get the redirect URLs from each <item><link> tag in the XML file below.
Here is a sample of the XML file:

Code: Select all

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="0.91">
 <channel>
  <title>rediff News: International</title>
  <link>http://www.rediff.com/news/index.html</link>  
  <description>India's largest news and entertainment service online.</description>
  <language>en-us</language>
  <pubDate>Wed, 01 Feb 2006 12:06:04 GMT</pubDate>
  <copyright>Copyright: (C) 2006 Rediff.com India Limited. All Rights Reserved.</copyright>
 <image>
  <title>rediff.com</title>
  <url>http://www.rediff.com/uim/red_log.gif</url>
  <link>http://www.rediff.com/</link>
  <width>144</width>
  <height>28</height>
  <description>Visit rediff.com</description>
 </image>

<!--01-02-2006:15:46:26-->
<item>
  <title>My proposals on Kashmir are bold: Musharraf</title>
  <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm</link>
  <description>Musharraf said the proposals were bold and well considered in the context of finding a solution to the Kashmir issue, which would be acceptable to Pakistan, India and the people of Kashmir.</description>
</item>
<!--01-02-2006:13:16:52-->
<item>
  <title>Bush stresses on fear factor</title>
  <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01bush.htm</link>
  <description>The passion that he showed in speaking of these foreign policy issues, clearly seemed to dissipate when he got on to the domestic issues. </description>
</item>
<!--01-02-2006:09:57:51-->
<item>
  <title>Coretta Scott King, widow of Martin Luther, dies</title>
  <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01king.htm</link>
  <description>The 78-year-old widow of the celebrated champion of American Civil Rights movement passed away in Mexico overnight. </description>
</item>
<!--31-01-2006:18:00:12-->
<item>
  <title>Indian national arrested in Philippines </title>
  <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31arrest.htm</link>
  <description>Singh was allegedly one of the masterminds of a kidnapping syndicate that operated and victimised Indian nationals living in eastern Metropolitan Manila and Rizal province.</description>
</item>
When I run my regex it behaves greedily: the first match starts at the <link> in the channel and runs all the way to the <link> inside the first <item>, even though I have asked it to be ungreedy. The second array (the captured group) gives me the URLs I want; only the first element of the first array behaves differently.

Code: Select all

<pre>
<?php 
$input =  file_get_contents("international.xml");
echo preg_match_all("#<link>.*?url=(.*?)</link>#si", $input, $matches)."<br />"; 
print_r($matches); 

?> 
</pre>
Output:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => http://www.rediff.com/news/index.html  
  India's largest news and entertainment service online.
  en-us
  Wed, 01 Feb 2006 12:06:04 GMT
  Copyright: (C) 2006 Rediff.com India Limited. All Rights Reserved.
 
  
  http://www.rediff.com/uim/red_log.gif
  http://www.rediff.com/
  144
  28
  Visit rediff.com
 



  
  http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm
            [1] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01bush.htm
            [2] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01king.htm
            [3] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31arrest.htm
            [4] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31pak.htm
            [5] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31mush.htm
            [6] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31fulbright.htm
            [7] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31us.htm
            [8] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30iraq.htm
            [9] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30bhutto.htm
        )

    [1] => Array
        (
            [0] => http://www.rediff.com/news/2006/feb/01pak.htm
            [1] => http://www.rediff.com/news/2006/feb/01bush.htm
            [2] => http://www.rediff.com/news/2006/feb/01king.htm
            [3] => http://www.rediff.com/news/2006/jan/31arrest.htm
            [4] => http://www.rediff.com/news/2006/jan/31pak.htm
            [5] => http://www.rediff.com/news/2006/jan/31mush.htm
            [6] => http://www.rediff.com/news/2006/jan/31fulbright.htm
            [7] => http://www.rediff.com/news/2006/jan/31us.htm
            [8] => http://www.rediff.com/news/2006/jan/30iraq.htm
            [9] => http://www.rediff.com/news/2006/jan/30bhutto.htm
        )

)
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

At this point, you want to use the m modifier instead of s ;)
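
For anyone reading later, here is a rough sketch of why that swap helps. As far as this pattern goes, the change that matters is dropping s: with s, "." also matches newlines, so the match that begins at the channel <link> can run across lines until it reaches the first "url=". The m modifier only affects the ^ and $ anchors, which this pattern does not use. Note too that a lazy quantifier only makes a match end as early as possible; it does not move where the match starts. The two-link sample and the full_matches() helper are made up for illustration.

```php
<?php
// Hypothetical two-link sample standing in for international.xml.
$input = "<channel>\n"
       . "  <link>http://www.rediff.com/news/index.html</link>\n"
       . "  <item>\n"
       . "    <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm</link>\n"
       . "  </item>\n"
       . "</channel>\n";

// Illustrative helper: return only the full-match strings ($matches[0]).
function full_matches(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m);
    return $m[0];
}

// With s, "." matches newlines, so the match starting at the channel
// <link> stretches across lines until it finds "url=".
$withS = full_matches('#<link>.*?url=(.*?)</link>#si', $input);

// Without s, "." stops at a newline, so only a <link> that has "url="
// on the same line can match.
$withoutS = full_matches('#<link>.*?url=(.*?)</link>#i', $input);

var_dump(strpos($withS[0], "\n") !== false);    // does the first match span lines?
var_dump(strpos($withoutS[0], "\n") !== false); // same check without s
```

The captured group is the same in both runs, which is why the second array in the output above already looked right.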
neophyte
DevNet Resident
Posts: 1537
Joined: Tue Jan 20, 2004 4:58 pm
Location: Minnesota

Post by neophyte »

There must be some sort of weird cosmic thing going on with us folks at DevNet where we all have the same problem at the same time. :wink:

I'd say to heck with just getting the links. Put the whole file in an array so you can get all of it. I just solved the same problem myself using the MagpieRSS class, and I documented the usage in this thread:

viewtopic.php?t=43515

Hope that helps.
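
A minimal sketch of the "load the whole feed into an array" idea. This is not the MagpieRSS usage from the linked thread; it swaps in PHP's bundled SimpleXML extension, which I am assuming is enabled. The trimmed feed and the item_links() helper name are made up for illustration.

```php
<?php
// Hypothetical trimmed-down feed; in practice this string would come from
// file_get_contents("international.xml").
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="0.91">
 <channel>
  <title>rediff News: International</title>
  <link>http://www.rediff.com/news/index.html</link>
  <item>
   <title>Bush stresses on fear factor</title>
   <link>http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01bush.htm</link>
  </item>
 </channel>
</rss>
XML;

// Illustrative helper: return title and link for every <item> in the feed.
function item_links(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return []; // not well-formed XML
    }

    $items = [];
    // Only <item> children are visited, so the channel-level <link>
    // and the <image> <link> are never picked up.
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];
    }
    return $items;
}

print_r(item_links($xml));
```

Because the parser walks the element tree, there is no need to worry about which <link> a pattern happens to start from.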
raghavan20
DevNet Resident
Posts: 1451
Joined: Sat Jun 11, 2005 6:57 am
Location: London, UK

Post by raghavan20 »

Yes, the m modifier solves the problem, thanks. Here is another problem:
I am trying to retrieve the file names from the URLs in the XML file. It is the same XML file I posted in the first post.

Code: Select all

<pre>
<?php 
$input =  file_get_contents("international.xml");
echo preg_match_all("#http://.*?/(.*?)\.htm#mi", $input, $matches)."<br />"; 
print_r($matches); 

?> 
</pre>
I am trying to get the file names that have the .htm extension. This is the output I get:

Code: Select all

Array
(
    [0] => Array
        (
            [0] => http://www.rediff.com/news/index.htm
            [1] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm
            [2] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01bush.htm
            [3] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01king.htm
            [4] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31arrest.htm
            [5] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31pak.htm
            [6] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31mush.htm
            [7] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31fulbright.htm
            [8] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31us.htm
            [9] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30iraq.htm
            [10] => http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30bhutto.htm
            [11] => http://adworks.rediff.com/cgi-bin/AdWorks/click.cgi/www.rediff.com/textlinks.htm
        )

    [1] => Array
        (
            [0] => news/index
            [1] => rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak
            [2] => rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01bush
            [3] => rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01king
            [4] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31arrest
            [5] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31pak
            [6] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31mush
            [7] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31fulbright
            [8] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/31us
            [9] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30iraq
            [10] => rss/redirect.php?url=http://www.rediff.com/news/2006/jan/30bhutto
            [11] => cgi-bin/AdWorks/click.cgi/www.rediff.com/textlinks
        )

)
Jenk
DevNet Master
Posts: 3587
Joined: Mon Sep 19, 2005 6:24 am
Location: London

Post by Jenk »

Try this pattern:

Code: Select all

#http://.*?/([^/]*?)\.htm#mi
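
A quick sketch of how that pattern behaves on three of the URLs from the output above. The difference from the original is the capture group: [^/] cannot match a slash, so the group is confined to the last path segment before ".htm". The htm_names() helper name is made up for illustration.

```php
<?php
// URLs copied from the feed and output earlier in this thread.
$urls = [
    'http://www.rediff.com/news/index.html',
    'http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm',
    'http://adworks.rediff.com/cgi-bin/AdWorks/click.cgi/www.rediff.com/textlinks.htm',
];

// Illustrative helper: apply the pattern to each URL and collect group 1.
function htm_names(array $urls): array
{
    $names = [];
    foreach ($urls as $url) {
        // ".*?/" is forced to keep expanding until what follows the slash
        // is a slash-free run ending in ".htm".
        if (preg_match('#http://.*?/([^/]*?)\.htm#mi', $url, $m)) {
            $names[] = $m[1];
        }
    }
    return $names;
}

print_r(htm_names($urls)); // index, 01pak, textlinks
```

Note that "index.html" still matches, because the pattern only requires ".htm" and does not check what comes after it.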
feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

Once you have the full URL, use parse_url() to break it down.
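
A minimal sketch of that approach, assuming the link always carries its target in a "url" query parameter as in the feed above. The target_file_name() helper name is made up for illustration, and PATHINFO_FILENAME needs PHP 5.2 or later.

```php
<?php
// Illustrative helper: given a redirect link from the feed, return the
// target's file name without its extension, or null if it cannot be found.
function target_file_name(string $link): ?string
{
    // Query string of the redirect link, e.g. "url=http://...".
    $query = parse_url($link, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // Split the query string into its parameters.
    parse_str($query, $params);
    if (!isset($params['url']) || !is_string($params['url'])) {
        return null;
    }

    // Path of the target URL, e.g. "/news/2006/feb/01pak.htm".
    $path = parse_url($params['url'], PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }

    // File name with the directory part and the extension stripped.
    return pathinfo($path, PATHINFO_FILENAME);
}

echo target_file_name(
    'http://www.rediff.com/rss/redirect.php?url=http://www.rediff.com/news/2006/feb/01pak.htm'
), "\n"; // 01pak
```

This avoids a second regex altogether: the URL functions do the splitting, and a link without a "url" parameter simply yields null.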