change encoding from ISO 8859-2 to UTF-8 problem

mahroch
Forum Newbie
Posts: 5
Joined: Mon Jan 19, 2009 5:40 am

change encoding from ISO 8859-2 to UTF-8 problem

Post by mahroch »

Hi,
I have been trying to solve this for hours without success :( Maybe one of you knows the solution.

My scripts and everything else are encoded in UTF-8. I use this code to fetch the content of a page encoded in ISO 8859-2 (the page is in Czech, which uses letters with diacritics):

Code: Select all

 
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_COOKIEFILE, "cookiefile");
curl_setopt($curl, CURLOPT_COOKIEJAR, "cookiefile"); # SAME cookiefile
curl_setopt($curl, CURLOPT_URL, $search_url); # this is where you first time connect - GET method authorization in my case, if you have POST - need to edit code a bit
$content = curl_exec($curl);
 
After this I try to extract some words from the content. Because the page uses a different encoding, I get results like this: pam?tihodnost, dl??d?n?, ...

So I tried to convert the text from the original page encoding (ISO 8859-2) to mine (UTF-8). I used different methods: iconv, libiconv, and various user functions from the internet (iso88592_2utf8(), convert_charset, ...), but nothing helps. The result is even worse.

I don't know what to do to solve it.

There is one strange thing that confuses me. If I run this right after the request:

Code: Select all

 
$content = curl_exec($curl);
$enc = mb_detect_encoding($content);
 
the $enc variable shows UTF-8. So the string is probably being converted wrongly somewhere inside curl_exec().
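
A note on that detection result: by default mb_detect_encoding() runs in non-strict mode and picks the closest match from a short candidate list, so a "UTF-8" answer does not by itself show that cURL converted anything. A minimal sketch of a stricter check, assuming the mbstring and iconv extensions are loaded; the sample string is a stand-in for the fetched page, not real output:

```php
<?php
// Stand-in for the fetched page: "stěna" as raw ISO-8859-2 bytes (0xEC = ě).
$content = "st\xECna";

// mb_check_encoding() validates the bytes instead of guessing.
// Here it fails: 0xEC followed by 'n' is not a valid UTF-8 sequence.
$isUtf8Before = mb_check_encoding($content, 'UTF-8');

// After an explicit conversion the same check passes.
$converted   = iconv('ISO-8859-2', 'UTF-8', $content);
$isUtf8After = mb_check_encoding($converted, 'UTF-8');

var_dump($isUtf8Before, $isUtf8After);
```

If the first check fails on the real $content, the bytes are most likely still in the page's original encoding.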

Any ideas how to solve it ?

Thanx

Maros
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by Apollo »

Can you show an example URL which contains the content you're trying to convert?
mahroch
Forum Newbie
Posts: 5
Joined: Mon Jan 19, 2009 5:40 am

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by mahroch »

Hi,

let's say this example: http://www.fotobanka.cz/show.php?query= ... &total=963 . In the bottom part you can see the "Klíčová slova:" (transl. Keywords).

I want to extract the keywords and then use them for other work. Among them you can find, e.g., "stěna".

The page source says:

<meta http-equiv="Content-Type" content="text/html; charset=iso8859-2">
<meta http-equiv="Content-language" content="cs">



Thanx for help.

M.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by Eran »

You should try the multibyte extension (http://www.php.net/mbstring), especially mb_convert_encoding.
The symbols you showed are usually broken multi-byte characters or HTML entities. For the conversion to succeed, you probably need to decode the HTML entities first, using the proper charset:

Code: Select all

$string = html_entity_decode($string,ENT_COMPAT,'ISO-8859-1');
Unfortunately, html_entity_decode supports only a limited set of encodings (see http://www.php.net/html_entity_decode), but you can try the supported ones and hope one works for you.

After that, use mb_convert_encoding to transform the text to the proper encoding.
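
One way to sidestep the limited charset list is to swap the order: convert the raw bytes to UTF-8 first, then decode the entities as UTF-8, which html_entity_decode does support. A minimal sketch under that assumption; the sample string is made up for illustration and is not taken from the page:

```php
<?php
// Made-up sample: ISO-8859-2 bytes for "stěna" plus two HTML entities.
$raw = "st\xECna &amp; &quot;foto&quot;";

// 1) Convert the byte encoding: ISO-8859-2 -> UTF-8.
$utf8 = mb_convert_encoding($raw, 'UTF-8', 'ISO-8859-2');

// 2) Decode the entities, telling PHP the string is now UTF-8.
$text = html_entity_decode($utf8, ENT_QUOTES, 'UTF-8');

echo $text, "\n";
```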
mahroch
Forum Newbie
Posts: 5
Joined: Mon Jan 19, 2009 5:40 am

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by mahroch »

Thanks for the ideas. Unfortunately it is not working :(

I tried a few combinations, but most of them just remove the special chars from the string. That way you get from dl��d�n� => dldn, which makes no sense :(

As you might know, the function html_entity_decode only removes the quotes ... so it changes almost nothing. The function mb_convert_encoding is more powerful, but still doesn't do what I need.

Any other ideas ?

Thanx.

M.
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by Apollo »

The page seems to be properly encoded as ISO-8859-2. If you do this:

Code: Select all

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body> 
    
<?php
$s = "st\xECna"; // "stena" with caron on e (in iso-8859-2 encoding, website contained exactly same string)
$s = iconv("ISO-8859-2", "UTF-8", $s); // convert to utf-8
print( $s );
?>
 
</body></html>
it should work fine (assuming your webserver has the iconv lib installed).
mahroch
Forum Newbie
Posts: 5
Joined: Mon Jan 19, 2009 5:40 am

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by mahroch »

Hi, thanx for the help.

Now I got it. I had tried iconv before, but without success. The problem was that I converted the whole content of the page right after fetching it from the web. Then I did some work on it (sorting, ...) and the result had the wrong encoding.

Now I tried using iconv just before printing the content (a word with special chars) and it was printed correctly ;-)
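
A symptom like this (conversion works right before output but not earlier) is sometimes caused by the same string passing through a converter twice, although the thread does not confirm that this was the cause here. A minimal sketch of a guard, assuming the mbstring and iconv extensions are available; ensure_utf8 is a hypothetical helper name, not a built-in:

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not
// already valid UTF-8, so calling it twice does no harm.
function ensure_utf8(string $bytes, string $sourceCharset = 'ISO-8859-2'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already valid UTF-8 (this includes plain ASCII)
    }
    $converted = iconv($sourceCharset, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceCharset");
    }
    return $converted;
}

$once  = ensure_utf8("st\xECna"); // ISO-8859-2 bytes in, UTF-8 out
$twice = ensure_utf8($once);      // second call leaves the string unchanged

echo $twice, "\n";
```

The validity check is a heuristic: in rare cases a run of ISO-8859-2 bytes can also happen to form valid UTF-8, so converting exactly once at a known point in the script remains the safer design.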

Thanks again

M.
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: change encoding from ISO 8859-2 to UTF-8 problem

Post by Eran »

html_entity_decode only removes the quotes
As an aside: html_entity_decode decodes all HTML entities, which cover far more than quotes. See http://www.cookwood.com/html/extras/entities.html
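
A small illustration of that point, assuming UTF-8 as the target charset; the input string is made up and mixes named and numeric entities:

```php
<?php
// Made-up input: markup characters, a numeric entity (&#283; = ě),
// and two named entities that have nothing to do with quotes.
$encoded = '&lt;b&gt;st&#283;na&lt;/b&gt; &copy; caf&eacute;';

// Decode every entity into its UTF-8 character.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```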