[SOLVED] Tag compatibility issue

jmueller0823
Forum Commoner
Posts: 37
Joined: Tue Apr 20, 2004 9:06 pm

[SOLVED] Tag compatibility issue

Post by jmueller0823 »

feyd | Please use code tags when posting code. Read: Posting Code in the Forums (http://forums.devnetwork.net/viewtopic.php?t=21171)


This concerns a little script called RSSgenr8php.
It can be found here: http://www.xmlhub.com/rssgenr8.php

This is an HTML-to-RSS scraper. 

I'm hoping this is a "generic" question as they have no support forums...

The script works like this:

In the page that you'd like to create an RSS feed from, you enclose each RSS "item" with these tags:

<span class="rss:item"></span>
 
Then, when you execute the script, it creates a dynamic RSS based upon your tag placement.  The script works great, with one exception.

We have a page that is dynamically generated weekly by a cgi script. In that script, we embedded the above tags.

The cgi script requires that the tag looks like this (single quote):

<span class='rss:item'>

NOTE If double quotes are used in the above cgi script, a parsing error occurs. Single quotes must be used in the cgi script.


And the RSS script (RSSgenr8.php) requires that the tag looks like this (double quote):

<span class="rss:item">

Question
Is it possible to make the RSSgenr8 script recognize the single quotes? Or, is it possible to use double quotes in the cgi script without getting a parsing error?

Thank you!


The snippet from the cgi script is below (note single quotes on tag):

Code: Select all

$o .=" <span class='rss:item'><a href=$row[page]>$title</a></span>\n";

The RSSgenr8 code is below.
$pageurl is the page URL passed to the script.

Code: Select all

<?php

if ($pageurl) {
  parse_html($pageurl);
} else {
  die ("Query failed...");
}



function parse_html($pageurl){
  $itemregexp = "%rss:item *\" *>(.+?)</span>%is";
  $allowable_tags = "<A><B><br /><br><BLOCKQUOTE><CENTER><DD><DL><DT><HR><I><IMG><LI>&nbsp;<OL><P><PRE><U><UL>";

  $pageurlparts = parse_url($pageurl);
  if ($pageurlparts[path] == "") $pageurl .= "/";

  if ($fp = @fopen($pageurl, "r")) {
    while (!feof($fp)) {
      $data .= fgets($fp, 128);
    }
    fclose($fp);
  }

//  print "<pre>";
//  print htmlentities($data);  

//  eregi("<title>(.*)</title>", $data, $title);
//  $channel_title = $title[1];

  $channel_title = "";
  if (preg_match('/<title>(.+?)<\/title>/i', $data, $regs) > 0) { $channel_title = $regs[1];
  }

  
  if (preg_match('/<meta .*description.*"(.+?)"/i', $data, $regs) > 0) { $channel_desc = $regs[1];
  }
  if ($channel_desc == "") $channel_desc = $pageurl;

  $match_count = preg_match_all($itemregexp, $data, $items);
  $match_count = ($match_count > 25) ? 25 : $match_count;
  
  header("Content-Type: text/xml");

  $output .= "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
  $output .= "<!-- generator=\"rssgenr8/0.92\" -->\n";
  $output .= "<!DOCTYPE rss PUBLIC \"-//W3C//ENTITIES Latin 1 for XHTML//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">\n";
  $output .= "<rss version=\"0.92\">\n";
  $output .= "  <channel>\n";
  $output .= "    <title>". htmlentities(strip_tags($channel_title)) ."</title>\n";
  $output .= "    <link>". htmlentities($pageurl) ."</link>\n";
  $output .= "    <description>". htmlentities($channel_desc) ."</description>\n";
  $output .= "    <webMaster>". htmlentities("webmaster") ."</webMaster>\n";
  $output .= "    <generator>". htmlentities("RSSgenr8 from XMLhub.com") ."</generator>\n";
  $output .= "    <language>en</language>\n";

  for ($i=0; $i< $match_count; $i++) {

    $desc = $items[1][$i];
    $title = wsstrip($desc);
    $descout = $desc;
    

      if (preg_match("/(.+?)(?:<\/P|<\/div|<br|<\/h|<\/td)/i", $title, $regs) > 0) { 
        $title = $regs[1];
        if (strlen(wsstrip(trim(strip_tags($title)))) < 100) {
          $descout = str_replace($title,"",$descout);
        }
      }
    
    $title = wsstrip(trim(strip_tags($title)));
    if (strlen($title) > 100) {
      $title = substr($title,0,100) . " ...";
    }


    
    $item_url = get_link($desc, $pageurl);
    $descout = wsstrip(strip_tags($descout, $allowable_tags));
      $pos = strpos($descout, "<br>");
      if (is_int($pos) and ($pos == 0)) {
        $descout=substr($descout, 4);
      }  
      $pos = strpos($descout, "<br />");
      if (is_int($pos) and ($pos == 0)) {
        $descout=substr($descout, 6);
      }

    $descout = htmlentities(wsstrip($descout));

    $output .= "    <item>\n";
    $output .= "      <title>". htmlentities($title) ."</title>\n";
    $output .= "      <link>". htmlentities($item_url) ."</link>\n";
    $output .= "      <description>". $descout ."</description>\n";
    $output .= "    </item>\n";
  }

  $output .= "  </channel>\n";
  $output .= "</rss>\n";

  print $output;
//  print htmlentities($output);
//  print "</pre>"; 
}

function get_link($desc, $pageurl) {
  if (stristr($desc, "href")) {
    $linkurl = stristr($desc, "href");
    $linkurl = substr($linkurl, strpos($linkurl, "\"")+1);
    $linkurl = substr($linkurl, 0, strpos($linkurl, "\""));
    $linkurl = trim($linkurl);
    $pageurlarray = parse_url($linkurl);
    if (empty($pageurlarray['host'])) {
      $linkurl = make_abs($linkurl, $pageurl);
    }
    return $linkurl;
  } else {
    return $pageurl;
  }
}

function wsstrip($str)
{
 $str=ereg_replace("[\r\t\n]"," ",$str);
 $str=ereg_replace (' +', ' ', trim($str));
return $str;
}

 
function make_abs($rel_uri, $base, $REMOVE_LEADING_DOTS = true) { 
 preg_match("'^([^:]+://[^/]+)/'", $base, $m); 
 $base_start = $m[1]; 
 if (preg_match("'^/'", $rel_uri)) { 
  return $base_start . $rel_uri; 
 } 
 $base = preg_replace("{[^/]+$}", '', $base); 
 $base .= $rel_uri; 
 $base = preg_replace("{^[^:]+://[^/]+}", '', $base); 
 $base_array = explode('/', $base); 
 if (count($base_array) and!strlen($base_array[0])) 
  array_shift($base_array); 
 $i = 1; 
 while ($i < count($base_array)) { 
  if ($base_array[$i - 1] == ".") { 
   array_splice($base_array, $i - 1, 1); 
   if ($i > 1) $i--; 
  } elseif ($base_array[$i] == ".." and $base_array[$i - 1]!= "..") { 
   array_splice($base_array, $i - 1, 2); 
   if ($i > 1) { 
$i--; 
if ($i == count($base_array)) array_push($base_array, ""); 
   } 
  } else { 
   $i++; 
  } 
 } 
 if (count($base_array) and $base_array[-1] == ".") 
  $base_array[-1] = ""; 

 if ($REMOVE_LEADING_DOTS) { 
  while (count($base_array) and preg_match("/^\.\.?$/", $base_array[0])) { 
   array_shift($base_array); 
  } 
 } 
 return($base_start . '/' . implode("/", $base_array)); 
}

?>

feyd
Neighborhood Spidermoddy
Posts: 31559
Joined: Mon Mar 29, 2004 3:24 pm
Location: Bothell, Washington, USA

Post by feyd »

$itemregexp on line 12 can be tweaked to accept single quotes easily:

Code: Select all

$itemregexp = "%<span class=([\"']?)rss:item\\1[^>]*?>(.+?)</span>%is";

line 59 and any other lines that reference $items[1] will need adjustment to use index 2
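
To see why the index moves, here is a minimal sketch rather than the actual RSSgenr8 code. The sample markup in $data is made up for illustration. The quote-matching group becomes capture group 1, so the item body ends up in group 2:

```php
<?php
// Hypothetical sample markup; both quote styles appear on purpose.
$data = "<span class='rss:item'><a href=\"/a.php\">First</a></span>\n"
      . "<span class=\"rss:item\"><a href=\"/b.php\">Second</a></span>\n";

// Group 1 captures the opening quote (if any); \\1 requires the same quote to close.
$itemregexp = "%<span class=([\"']?)rss:item\\1[^>]*?>(.+?)</span>%is";

$match_count = preg_match_all($itemregexp, $data, $items);

// $items[0] = full matches, $items[1] = the quote characters, $items[2] = item bodies.
for ($i = 0; $i < $match_count; $i++) {
    echo $items[2][$i], "\n";
}
```

With the default ordering of preg_match_all, $items[1] only ever holds the quote character that was matched, which is why reading it as the item text produces nothing useful.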


the cgi line can also be changed to:

Code: Select all

$o .=" <span class=\"rss:item\"><a href=\"$row[page]\">$title</a></span>\n";
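
As a rough illustration of the double-quote option, assuming the cgi script is PHP (the thread never says so explicitly; the $o .= syntax only suggests it): a double quote inside a double-quoted string ends the string early unless it is escaped, which would explain the parsing error. The values of $row and $title below are hypothetical stand-ins.

```php
<?php
// Hypothetical stand-ins for the values the cgi script already has.
$row = array('page' => 'http://www.site.com/artman/publish/article_179.php');
$title = 'Conflict: Friend or Foe?';

// Option 1: escape each inner double quote with a backslash.
$escaped = " <span class=\"rss:item\"><a href=\"{$row['page']}\">$title</a></span>\n";

// Option 2: single-quoted pieces joined by concatenation, so no escaping is needed.
$concatenated = ' <span class="rss:item"><a href="' . $row['page'] . '">' . $title . "</a></span>\n";

echo $escaped;
```

Both variables hold the same markup, so either style should keep RSSgenr8's original double-quote pattern working.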

Post by jmueller0823 »

Thank you. I'll give it a shot.

Post by jmueller0823 »

Okay. I made the change as follows:

In the Rssgenr8 script, function parse_html ...

Changed the line:

Code: Select all

$itemregexp = "%rss:item *\" *>(.+?)</span>%is";
to:

Code: Select all

$itemregexp = "%<span class=([\"']?)rss:item\1[^>]*?>(.+?)</span>%is";

I tested the above script with two html page variations:

SINGLE Quotes

Code: Select all

<span class='rss:item'><a href=http://www.site.com/artman/publish/article_179.php>Conflict: Friend or Foe?</a></span>
and DOUBLE quotes

Code: Select all

<span class="rss:item"><a href=http://www.site.com/artman/publish/article_179.php>Conflict: Friend or Foe?</a></span>
In both tests, the script did not generate the items (blank).
Before the changes, the script did work with double quotes.

What am I missing? Thanks again.

Post by feyd »

feyd wrote: line 59 and any other lines that reference $items[1] will need adjustment to use index 2

Post by jmueller0823 »

So, are you saying:

Change all instances of:

$items[1]

to

$items[2]

??

Post by feyd »

yes.

Post by jmueller0823 »

Nope. Same issue.
Testing with an html page that has 'single quotes' on the tags.

Post by feyd »

post the new code.

Post by jmueller0823 »

Code: Select all

<?php
$pageurl = "http://www.site.com/popular.php" ;

if ($pageurl) {
  parse_html($pageurl);
} else {
  die ("Query failed...");
}


function parse_html($pageurl){
//  NOTE Next line is modified from original ;
  $itemregexp = "%<span class=([\"']?)rss:item\1[^>]*?>(.+?)</span>%is";
  $allowable_tags = "<A><B><br /><br><BLOCKQUOTE><CENTER><DD><DL><DT><HR><I><IMG><LI>&nbsp;<OL><P><PRE><U><UL>";

  $pageurlparts = parse_url($pageurl);
  if ($pageurlparts[path] == "") $pageurl .= "/";

  if ($fp = @fopen($pageurl, "r")) {
    while (!feof($fp)) {
      $data .= fgets($fp, 128);
    }
    fclose($fp);
  }

//  print "<pre>";
//  print htmlentities($data);  

//  eregi("<title>(.*)</title>", $data, $title);
//  $channel_title = $title[1];

  $channel_title = "";
  if (preg_match('/<title>(.+?)<\/title>/i', $data, $regs) > 0) { $channel_title = $regs[1];
  }

  
  if (preg_match('/<meta .*description.*"(.+?)"/i', $data, $regs) > 0) { $channel_desc = $regs[1];
  }
  if ($channel_desc == "") $channel_desc = $pageurl;

  $match_count = preg_match_all($itemregexp, $data, $items);
  $match_count = ($match_count > 25) ? 25 : $match_count;
  
  header("Content-Type: text/xml");

  $output .= "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
  $output .= "<!-- generator=\"gtgenerator/0.92\" -->\n";
  $output .= "<!DOCTYPE rss PUBLIC \"-//W3C//ENTITIES Latin 1 for XHTML//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent\">\n";
  $output .= "<rss version=\"0.92\">\n";
  $output .= "  <channel>\n";
  $output .= "    <title>Most Popular Articles, Updated Weekly</title>\n";
  $output .= "    <link>http://www.site.com</link>\n";
  $output .= "    <description>Life Changing Resources for Your Marriage</description>\n";
  $output .= "    <webMaster>webmaster@site.com</webMaster>\n";
  $output .= "    <generator>gtgenerator</generator>\n";
  $output .= "    <language>en-us</language>\n";

  for ($i=0; $i< $match_count; $i++) {

    $desc = $items[2][$i];
    $title = wsstrip($desc);
    $descout = $desc;
    

      if (preg_match("/(.+?)(?:<\/P|<\/div|<br|<\/h|<\/td)/i", $title, $regs) > 0) { 
        $title = $regs[1];
        if (strlen(wsstrip(trim(strip_tags($title)))) < 100) {
          $descout = str_replace($title,"",$descout);
        }
      }
    
    $title = wsstrip(trim(strip_tags($title)));
    if (strlen($title) > 100) {
      $title = substr($title,0,100) . " ...";
    }


    
    $item_url = get_link($desc, $pageurl);
    $descout = wsstrip(strip_tags($descout, $allowable_tags));
      $pos = strpos($descout, "<br>");
      if (is_int($pos) and ($pos == 0)) {
        $descout=substr($descout, 4);
      }  
      $pos = strpos($descout, "<br />");
      if (is_int($pos) and ($pos == 0)) {
        $descout=substr($descout, 6);
      }

    $descout = htmlentities(wsstrip($descout));

    $output .= "    <item>\n";
    $output .= "      <description>". $descout ."</description>\n";
    $output .= "    </item>\n";
  }

  $output .= "  </channel>\n";
  $output .= "</rss>\n";

  print $output;
//  print htmlentities($output);
//  print "</pre>"; 
}

function get_link($desc, $pageurl) {
  if (stristr($desc, "href")) {
    $linkurl = stristr($desc, "href");
    $linkurl = substr($linkurl, strpos($linkurl, "\"")+1);
    $linkurl = substr($linkurl, 0, strpos($linkurl, "\""));
    $linkurl = trim($linkurl);
    $pageurlarray = parse_url($linkurl);
    if (empty($pageurlarray['host'])) {
      $linkurl = make_abs($linkurl, $pageurl);
    }
    return $linkurl;
  } else {
    return $pageurl;
  }
}

function wsstrip($str)
{
 $str=ereg_replace("[\r\t\n]"," ",$str);
 $str=ereg_replace (' +', ' ', trim($str));
return $str;
}

 
function make_abs($rel_uri, $base, $REMOVE_LEADING_DOTS = true) { 
 preg_match("'^([^:]+://[^/]+)/'", $base, $m); 
 $base_start = $m[1]; 
 if (preg_match("'^/'", $rel_uri)) { 
  return $base_start . $rel_uri; 
 } 
 $base = preg_replace("{[^/]+$}", '', $base); 
 $base .= $rel_uri; 
 $base = preg_replace("{^[^:]+://[^/]+}", '', $base); 
 $base_array = explode('/', $base); 
 if (count($base_array) and!strlen($base_array[0])) 
  array_shift($base_array); 
 $i = 1; 
 while ($i < count($base_array)) { 
  if ($base_array[$i - 1] == ".") { 
   array_splice($base_array, $i - 1, 1); 
   if ($i > 1) $i--; 
  } elseif ($base_array[$i] == ".." and $base_array[$i - 1]!= "..") { 
   array_splice($base_array, $i - 1, 2); 
   if ($i > 1) { 
$i--; 
if ($i == count($base_array)) array_push($base_array, ""); 
   } 
  } else { 
   $i++; 
  } 
 } 
 if (count($base_array) and $base_array[-1] == ".") 
  $base_array[-1] = ""; 

 if ($REMOVE_LEADING_DOTS) { 
  while (count($base_array) and preg_match("/^\.\.?$/", $base_array[0])) { 
   array_shift($base_array); 
  } 
 } 
 return($base_start . '/' . implode("/", $base_array)); 
}

?>

Post by feyd »

You copied the code wrong: it's supposed to be 2 backslashes, not 1.

That's not entirely surprising, and it's not your fault. The bbtag has a bug with that bit for some reason, which I'll work out when I get some extra time...

Code: Select all

$itemregexp = "%<span class=([\"']?)rss:item\\1[^>]*?>(.+?)</span>%is";
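
For anyone hitting the same thing, a small sketch of why the backslash count matters. The $html test line is hypothetical, modeled on the single-quoted markup posted earlier. In a double-quoted PHP string, "\1" is an octal escape for the byte 0x01, so the regex engine never sees a backreference:

```php
<?php
// In a double-quoted PHP string, "\1" is an octal escape (the byte 0x01),
// while "\\1" is a backslash followed by 1, which PCRE reads as a backreference.
$one_backslash   = "%<span class=([\"']?)rss:item\1[^>]*?>(.+?)</span>%is";
$two_backslashes = "%<span class=([\"']?)rss:item\\1[^>]*?>(.+?)</span>%is";

// Hypothetical test line, modeled on the single-quoted markup from the thread.
$html = "<span class='rss:item'><a href=http://www.site.com/a.php>Title</a></span>";

var_dump(preg_match($one_backslash, $html));   // int(0): pattern demands a 0x01 byte
var_dump(preg_match($two_backslashes, $html)); // int(1): closing quote matches group 1
```

Writing the pattern in a single-quoted string would sidestep the octal escape, but the literal single quote inside the character class would then need escaping instead.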

Post by jmueller0823 »

That did it. Thank you very much.