PHP Scraping Script Need To Replace Characters

elmode
Forum Newbie
Posts: 1
Joined: Fri Apr 16, 2010 4:49 pm

PHP Scraping Script Need To Replace Characters

Post by elmode »

Hey guys,

This is my first post. I hope you guys can help me out because I've been stuck on this small script for months. I like to figure stuff out on my own, but I just can't figure this one out, so it's time to ask for some help. Anyway, here's the problem.

I have a PHP scraping script that scrapes a page. It has to grab a section of URLs, links and anchors. Everything works except that the anchors often contain invalid characters. When I put them in the database and grab them later, they show up as invalid characters and mess up my RSS feed.

So what I need to do to nip this problem in the bud is replace the characters before they go into the database. I've researched this for days, found just about everything that's out there and tried all of it, but I just can't get it to work.

Here's the list of characters I need to replace:

'‘’``—”€“éó – –á

Here is my entire scraping script:

$DB = mysql_connect('blah', 'blah','blah') or die (mysql_error());
mysql_select_db('blah', $DB);

$Base = 'http://www.website.com';
$data = file_get_contents($Base);

$regexDesc = '/(">).*(<\/a)/';
$regexURL = '/https?:\/\/.*target/';
$regex = '/(<a [^>]+>)(.*?)<\/a>.+br><span.+class.+small.+<\/span>/';

preg_match_all($regex,$data,$match);

$match = $match[0];
$NewsList = array();
foreach($match as $Page) {
  $NewsList[] = $Page;
}

foreach($NewsList as $story) {
  preg_match_all($regexDesc,$story,$matchDesc);
  $matchDesc = $matchDesc[0];

  foreach($matchDesc as $Description) {
    $Description = substr($Description,2,-3);
    $anchor = mysql_real_escape_string($Description);
    echo $anchor." <br>";
  }

  preg_match_all($regexURL,$story,$matchURL);
  $matchURL = $matchURL[0];

  foreach($matchURL as $URL) {
    $URL = substr($URL,0,-8);
    $url = mysql_real_escape_string($URL);
    echo $url." <br>";
  }

  $date = date('l jS \of F Y');
  $time = time();

  $sql="INSERT INTO table (date, time, url, anchor) VALUES ('$date','$time','$url','$anchor')";
  $result = mysql_query($sql, $DB) or die (mysql_error());

}

echo "--> Inserted Values Successfully";
mysql_close($DB);

requinix
Spammer :|
Posts: 6617
Joined: Wed Oct 15, 2008 2:35 am
Location: WA, USA

Re: PHP Scraping Script Need To Replace Characters

Post by requinix »

What you need to do is figure out character encoding.

The page being scraped uses some encoding. Your database may use another. Your HTML output may use something else. When you mix-and-match all those you get weird output.
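To make the mix-and-match problem concrete, here is a small sketch of what happens when bytes written in one encoding are read as another. It assumes the mbstring extension is loaded, and the variable names are only illustrative. The byte values are written out explicitly so the result does not depend on how this file itself is saved.

```php
<?php
// "é" as the two bytes UTF-8 uses for it.
$utf8 = "\xC3\xA9";

// Pretend those bytes are ISO-8859-1 and convert them to UTF-8 again.
// Each original byte is now treated as a character of its own.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";     // c3a9
echo bin2hex($misread), "\n";  // c383c2a9, which displays as "Ã©"
```

One character has turned into two, which is the kind of garbage that tends to show up in feeds when the scraped page, the database and the output disagree about the encoding.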

mb_detect_encoding can usually work out the character encoding of a string when you don't already know it, although it is only making an educated guess. If the result matches what you use in your database and your HTML, then great. If it doesn't, you need to convert the characters, for example with mb_convert_encoding.
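A minimal sketch of that detect-then-convert step, run on each anchor before it is escaped and inserted. It assumes PHP 7 or later with mbstring, a database and feed that both use UTF-8, and a scraped page that is either UTF-8 or a Western single-byte encoding. The helper name to_utf8 and the candidate list are hypothetical, so adjust them to whatever the page really sends.

```php
<?php
// Hypothetical helper, not part of the original script.
// Assumption: text that is not valid UTF-8 is Windows-1252 or ISO-8859-1.
function to_utf8(string $text): string
{
    // Valid UTF-8 is passed through untouched so it is never double-encoded.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // mb_detect_encoding() only guesses; strict mode (third argument) makes it
    // reject candidates in which the bytes are not valid.
    $source = mb_detect_encoding($text, ['Windows-1252', 'ISO-8859-1'], true);
    if ($source === false) {
        $source = 'Windows-1252'; // assumed fallback when detection gives up
    }

    return mb_convert_encoding($text, 'UTF-8', $source);
}

// A lone 0xE9 byte is "é" in both candidate encodings.
$anchor = to_utf8("caf\xE9");
echo bin2hex($anchor), "\n"; // 636166c3a9
```

In the script above, this would be called on $Description right after the substr() line. The database connection and the page that prints the feed still need to be told they are dealing with UTF-8, otherwise the same mismatch just moves somewhere else.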