screen scrape special characters from url

Posted: Tue Jan 25, 2011 9:53 am
by Rahul Dev
Hello guys, I have a problem when I screen scrape a piece of text from a URL and save it to my DB. The text is in French and contains special characters like é, so when I scrape it I receive it as the HTML entity &eacute;. For example, the website shows the word région, but the scraped text contains r&eacute;gion. I want to store the text as it is displayed because I need to perform some operations on it after saving it in the DB.
Is there any way to store the scraped text in the form in which it is displayed, or to convert it to the form I want (like this: région)?
My code is as follows:

$html = file_get_dom('http://www.defimedia.info/news/8425/Gro ... %99appels-');

foreach ($html->find('div[class=PostContent]') as $element)
{
    $tags = array(
        '<div class="PostContent">',
        '<!-- The Adsense will automatically be inserted half way through the content. Applies for both Side and Middle options. -->',
        '<font face="Georgia">',
        '<font size="2">',
        ''
    );
    $new_element = str_replace($tags, "", $element);
    $sql1 = "UPDATE articles SET original_text = '" . mysql_real_escape_string($new_element) . "' WHERE article_id = '$item_id'";
    $result1 = mysql_query($sql1) or die('Query failed: ' . mysql_error());
}

Re: screen scrape special characters from url

Posted: Tue Jan 25, 2011 10:33 am
by AbraCadaver
It is &eacute; in the HTML source of the page you are scraping (check it out). In order to display in a browser it will need to be &eacute;, so why do you want to translate it? If you must, then try html_entity_decode().

Re: screen scrape special characters from url

Posted: Tue Jan 25, 2011 11:05 am
by Rahul Dev
AbraCadaver wrote:It is &eacute; in the HTML source of the page you are scraping (check it out). In order to display in a browser it will need to be &eacute;, so why do you want to translate it? If you must, then try html_entity_decode().
Yes, it is &eacute; in the HTML source itself. As I said, I need to perform other operations on the text after scraping it and storing it in the database. I tried html_entity_decode(), but then the characters become �. Any solution to this?

Re: screen scrape special characters from url

Posted: Tue Jan 25, 2011 11:28 am
by Rahul Dev
It's OK now, I missed something in html_entity_decode(). It should be html_entity_decode($text, ENT_QUOTES, "utf-8");
Thanks for the help.
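
For anyone reading this later, here is a minimal sketch of why the third argument matters. Before PHP 5.4 the default charset of html_entity_decode() was ISO-8859-1, so é came out as a single byte that is not valid UTF-8, which is the likely cause of the � seen above (assuming the page and database were UTF-8). The variable names and the sample string are made up for illustration; in the real script the input would be the scraped element text.

```php
<?php
// Hypothetical sample input standing in for the scraped text.
$scraped = 'r&eacute;gion';

// Decoding to ISO-8859-1 turns the entity into the single byte 0xE9.
// Shown in a UTF-8 page or stored in a UTF-8 column, that byte is invalid
// and typically displays as the replacement character.
$latin1 = html_entity_decode($scraped, ENT_QUOTES, 'ISO-8859-1');

// Decoding to UTF-8 turns the entity into the two bytes 0xC3 0xA9.
$utf8 = html_entity_decode($scraped, ENT_QUOTES, 'UTF-8');

echo bin2hex($latin1), "\n"; // 72e967696f6e
echo bin2hex($utf8), "\n";   // 72c3a967696f6e
```

Passing the charset explicitly, as in the fix above, keeps the result the same regardless of which PHP version or default charset the server uses.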