Annoying Character
Hi All,
I am working on a classified ads website that I converted from English to Farsi, so everything is now RTL. I keep seeing an annoying � character at the end of some ads' previews that I can't seem to get rid of. I am attaching a screenshot so you can see exactly what I mean.
Thanks in advance for your help.
Re: Annoying Character
That means there is an invalid byte sequence: it's supposed to represent a character but it doesn't work in the character encoding you're using.
From context it looks like that character is supposed to be an ellipsis? How does that get added to the text?
Re: Annoying Character
You're right. It's an ellipsis. Here is the code:
Code: Select all
<?php
if($ad_preview_chars)
{
    echo "<span class='adpreview'>";
    $row['addesc'] = preg_replace("/\[\/?URL\]/", "", $row['addesc']);
    echo substr($row['addesc'],0,$ad_preview_chars);
    if (strlen($row['addesc'])>$ad_preview_chars) echo "...";
    echo "</span>";
}
?>
Re: Annoying Character
Ah, no, that's not it: now that I'm on a proper monitor I can see that the ellipsis is intact in the output and the <?> is actually just before it. However, that code does reveal the issue.
Many of the normal string functions are not suitable for multi-byte character strings, including the ones you're using now. They operate only at the byte level, and if you're not extremely careful you're liable to cut off a character in the middle of its byte sequence. That results in <?>s because the leftover bytes don't represent any known character, so the browser has to put something in their place.
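For anyone who wants to see that failure in isolation, here is a minimal sketch. It assumes the mbstring extension is loaded and the script file is saved as UTF-8; the sample word is a made-up stand-in for real ad text, not something from the site.

```php
<?php
// Hypothetical sample text: a 4-letter Farsi word, 2 bytes per letter in UTF-8.
$text = 'سلام';

// substr() counts bytes, so asking for 3 "characters" keeps one and a half letters.
$cut = substr($text, 0, 3);

var_dump(strlen($text));                      // int(8): 4 letters, 8 bytes
var_dump(mb_check_encoding($text, 'UTF-8'));  // bool(true): the original is valid
var_dump(mb_check_encoding($cut, 'UTF-8'));   // bool(false): ends in half a letter
```

The dangling lead byte at the end of $cut is the kind of invalid sequence a browser replaces with a placeholder glyph.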
The line in question:
Code: Select all
echo substr($row['addesc'],0,$ad_preview_chars);
Use mb_substr instead, which operates on logical characters instead of just their bytes:
Code: Select all
echo mb_substr($row['addesc'],0,$ad_preview_chars,'whatever character encoding you are using');
Note that after you do this the preview content will become roughly twice as long as it is now (Farsi letters take two bytes each in UTF-8), but if you count it out you'll see you're actually getting the right number of characters.
Re: Annoying Character
Thank you so much for that
Code: Select all
<?php
if($ad_preview_chars)
{
    echo "<span class='adpreview'>";
    $row['addesc'] = preg_replace("/\[\/?URL\]/", "", $row['addesc']);
    echo mb_substr($row['addesc'],0,$ad_preview_chars,'UTF-8');
    // mb_strlen so the length check counts characters, like mb_substr above
    if (mb_strlen($row['addesc'],'UTF-8')>$ad_preview_chars) echo "...";
    echo "</span>";
}
?>
After digging a little deeper I found the following PHP script in another file, which displays a single ad (not an ad listing page).
Here is what it looks like:
Code: Select all
<?php
}
?>
<div class="adtitle">
<?php echo $ad['adtitle']; ?>
<?php
$loc = "";
if($ad['area']) $loc = $ad['area'];
if($xcityid < 0) $loc .= ($loc ? ", " : "") . $ad['cityname'];
if($loc) echo " <span class=\"adarea\">($loc)</span>";
?>
</div>
Re: Annoying Character
Well, I'd ask where $ad is coming from and where it got its values. Maybe it came directly from a database? What does strlen() (and not mb_strlen()) say about the length of the problematic titles and locations?
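A rough sketch of that check, in case it helps. The two values below are hypothetical stand-ins for $ad['adtitle'] and $ad['area'], and it assumes the mbstring extension is available and the data is meant to be UTF-8.

```php
<?php
// Hypothetical values standing in for $ad['adtitle'] and $ad['area'].
$fields = [
    'adtitle' => 'فروش خودرو',
    'area'    => 'تهران',
];

foreach ($fields as $name => $value) {
    $bytes = strlen($value);              // raw byte count
    $chars = mb_strlen($value, 'UTF-8');  // characters as a reader sees them
    $valid = mb_check_encoding($value, 'UTF-8') ? 'yes' : 'no';
    echo "$name: $bytes bytes, $chars characters, valid UTF-8: $valid\n";
}
```

A byte count well above the character count confirms multi-byte data, and a "no" in the last column would suggest the value was already cut mid-character before it reached this page.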
Re: Annoying Character
$ad gets its value from the code below.
Code: Select all
// Get the ad
$sql = "SELECT a.*, ct.cityname as cityname, UNIX_TIMESTAMP(a.timestamp) AS timestamp, UNIX_TIMESTAMP(a.createdon) AS createdon, UNIX_TIMESTAMP(a.expireson) AS expireson, UNIX_TIMESTAMP(feat.featuredtill) AS featuredtill $xfieldsql
FROM $t_ads a
INNER JOIN $t_subcats scat ON scat.subcatid = a.subcatid
INNER JOIN $t_cities ct ON a.cityid = ct.cityid
LEFT OUTER JOIN $t_adxfields axf ON a.adid = axf.adid
LEFT OUTER JOIN $t_featured feat ON a.adid = feat.adid AND feat.adtype = 'A'
WHERE a.adid = $xadid
AND $visibility_condn_admin";
$ad = mysql_fetch_array(mysql_query($sql));
$isevent = 0;
if($sef_urls) $thisurl = "{$vbasedir}$xcityid/posts/$xcatid/$xsubcatid/{$xadid}_".RemoveBadURLChars($ad['adtitle']).".html";
else $thisurl = "?$qs";
}
if (!$ad)
{
header("Location: $script_url/?view=main&cityid=$xcityid&lang=$xlang");
exit;
}
if ($_POST['email'] && $_POST['mail'] && $ad['showemail'] == EMAIL_USEFORM)
{
if ($image_verification && !$captcha->verify($_POST['captcha']))
{
$err = $lang['ERROR_IMAGE_VERIFICATION_FAILED'];
}
else
{
unset($_GET['mailed'],$_GET['mailerr'],$_GET['reported']);
$qs = "";
foreach ($_GET as $k=>$v) $qs .= "$k=$v&";
$qs = substr($qs, 0, -1);
$thisurl = "$script_url/?$qs";
$thismail_header = file_get_contents("mailtemplates/contact_header.txt");
$thismail_header = str_replace("{@SITENAME}", $site_name, $thismail_header);
$thismail_header = str_replace("{@ADTITLE}", $ad['adtitle'], $thismail_header);
$thismail_header = str_replace("{@ADURL}", $thisurl, $thismail_header);
$thismail_header = str_replace("{@FROM}", $_POST['email'], $thismail_header);
$thismail_footer = file_get_contents("mailtemplates/contact_footer.txt");
$thismail_footer = str_replace("{@SITENAME}", $site_name, $thismail_footer);
$thismail_footer = str_replace("{@ADTITLE}", $ad['adtitle'], $thismail_footer);
$thismail_footer = str_replace("{@ADURL}", $thisurl, $thismail_footer);
$thismail_footer = str_replace("{@FROM}", $_POST['email'], $thismail_footer);
$mail = $thismail_header . "\n" .
stripslashes($_POST['mail']) . "\n" .
$thismail_footer;
Ad Title character length:
Code: Select all
<b><?php echo $lang['POST_ADTITLE']; ?>:</b><br>
<input name="adtitle" type="text" id="adtitle" size="100" maxlength="80" value="<?php echo $data['adtitle']; ?>">
Location character length:
Code: Select all
<?php echo $lang['OR_SPECIFY']; ?>
<input name="area" type="text" size="65" maxlength="50" value="<?php echo $data['area']; ?>" onKeyUp="javascript:if(this.form.arealist.selectedIndex!=<?php echo $other_index; ?>) this.form.arealist.selectedIndex=<?php echo $other_index; ?>;" <?php if($area_inlist) echo "disabled"; ?>>
<?php
}
else
{
?>
<input name="area" type="text" size="65" maxlength="50" value="<?php echo $data['area']; ?>">
<?php
}
?>
and strlen for title:
Code: Select all
if ($_POST['do'] == "post")
{
    $data = $_POST;
    $data['area'] = $data['area']?$data['area']:$data['arealist'];
    foreach ($data as $k=>$v) if(!is_array($v)) $data[$k] = stripslashes($v);
    if(!$data['adtitle'])
    {
        $data['adtitle'] = substr($data['addesc'], 0, $generated_adtitle_length) . ((strlen($data['addesc']) > $generated_adtitle_length) ? $generated_adtitle_append : "");
        if(strpos($data['adtitle'], "\n") > 0) $data['adtitle'] = trim(substr($data['adtitle'], 0, strpos($data['adtitle'], "\n")));
    }
    if(!$data['addesc'] || (!$in_admin && !$data['email']))
        $err .= "• $lang[ERROR_POST_FILL_ALL]<br>";
Re: Annoying Character
What do you get with "SHOW CREATE TABLE $t_ads"?
Re: Annoying Character
My phpMyAdmin knowledge is very limited, but when I enter SHOW CREATE TABLE $t_ads as an SQL query I get the following:
#1146 - Table 'databasename.$t_ads' doesn't exist
It looks like it's more complicated than I thought... It could be a character encoding mismatch with the MySQL database or something, since the database was originally latin1 and now my hosting has changed it to utf8.
Re: Annoying Character
$t_ads is a PHP variable. If you execute the query in phpMyAdmin then you have to put the real table name in there.
Character encoding something, to be sure. Mismatch, possibly. What does one of the problematic ad's data look like in phpMyAdmin?
Re: Annoying Character
Thanks requinix!
As you can see, none of my ads appear correctly in my database when browsing the tables in phpMyAdmin. Character encoding is set to Unicode UTF-8 in my browser, but nothing appears right in the MySQL database. (see screenshot)
I also tried running the SHOW CREATE TABLE command for $adtable and got this:
#1064 - You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'CREAT TABLE $adtable' at line 1
Re: Annoying Character
There's definitely a character encoding issue, but I'm not so sure it's causing the immediate problem of <?> characters.
You misspelled "CREATE", and I think you missed the point of my earlier comment: $t_ads is a PHP variable. If you execute the query in phpMyAdmin then you have to put the real table name in there.
Re: Annoying Character
You're right. Here is what I am getting for the ad table:
Code: Select all
CREATE TABLE `adxfields` (
`adid` int(10) unsigned NOT NULL DEFAULT '0',
`f1` varchar(255) NOT NULL DEFAULT '',
`f2` varchar(255) NOT NULL DEFAULT '',
`f3` varchar(255) NOT NULL DEFAULT '',
`f4` varchar(255) NOT NULL DEFAULT '',
`f5` varchar(255) NOT NULL DEFAULT '',
`f6` varchar(255) NOT NULL DEFAULT '',
`f7` varchar(255) NOT NULL DEFAULT '',
`f8` varchar(255) NOT NULL DEFAULT '',
`f9` varchar(255) NOT NULL DEFAULT '',
`f10` varchar(255) NOT NULL DEFAULT '',
`f11` varchar(255) NOT NULL DEFAULT '',
`f12` varchar(255) NOT NULL DEFAULT '',
`f13` varchar(255) NOT NULL DEFAULT '',
`f14` varchar(255) NOT NULL DEFAULT '',
`f15` varchar(255) NOT NULL DEFAULT '',
KEY `adid` (`adid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
and this is for the 'ad list' table:
Code: Select all
CREATE TABLE `ads` (
`adid` int(10) unsigned NOT NULL AUTO_INCREMENT,
`adtitle` varchar(100) NOT NULL DEFAULT '',
`addesc` longtext NOT NULL,
`area` varchar(50) NOT NULL DEFAULT '',
`email` varchar(50) NOT NULL DEFAULT '',
`showemail` enum('0','1','2') NOT NULL DEFAULT '0',
`password` varchar(50) NOT NULL DEFAULT '',
`code` varchar(35) NOT NULL DEFAULT '',
`cityid` smallint(5) unsigned NOT NULL DEFAULT '0',
`subcatid` smallint(5) unsigned NOT NULL DEFAULT '0',
`price` decimal(10,2) NOT NULL DEFAULT '0.00',
`othercontactok` enum('0','1') NOT NULL DEFAULT '0',
`hits` int(10) unsigned NOT NULL DEFAULT '0',
`ip` varchar(15) NOT NULL DEFAULT '',
`verified` enum('0','1') NOT NULL DEFAULT '0',
`abused` int(10) unsigned NOT NULL DEFAULT '0',
`enabled` enum('0','1') NOT NULL DEFAULT '0',
`createdon` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`expireson` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`adid`),
KEY `subcatid` (`subcatid`),
KEY `cityid` (`cityid`),
KEY `verified` (`verified`),
KEY `enabled` (`enabled`)
) ENGINE=MyISAM AUTO_INCREMENT=28308 DEFAULT CHARSET=latin1
Re: Annoying Character
So, as you may have noticed, the tables are using the latin1 charset. Other tables probably are too. That's not good: they all need to be UTF-8. You'll also need to check the server variables to make sure the default database and table charsets, along with things like the connection charset, are all UTF-8 too. Unless every step along the way uses the right encoding, your data can get messed up.
The actual problem, I think, is that ads.area (as an example) is a VARCHAR(50). In English that translates to "50 characters in the latin1 encoding", and since latin1 has one byte per character that also translates to "50 bytes". However, you're stuffing UTF-8 data in there, and Farsi letters take 2 bytes each in UTF-8 (spaces and ASCII take 1): that means you can only store about 50 / 2 = 25 characters. Anything more will be lost, and because the cut is made by byte count it can fall in the middle of a character, leaving only part of the last one. That "only part of" means you get <?>s because exactly the same thing is happening in MySQL as happened in your code: characters got cut off.
The good news is that your system works despite all the character encoding badness, so I think the minimum you need to do to get everything working again is extend the length of the affected columns (remembering that they need to be at least 2x longer than you'd expect) and then re-enter the corrupted data (because what you've lost is not recoverable).
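To make the byte budget concrete, here is a small sketch of what a 50-byte limit does to UTF-8 text. It only simulates the truncation in PHP: it assumes MySQL is silently truncating (non-strict mode) rather than rejecting the value, that mbstring is loaded, and the $area value is a made-up example.

```php
<?php
$limitBytes = 50; // ads.area is VARCHAR(50) in a latin1 table, i.e. 50 bytes

// Hypothetical input: 32 characters, 56 bytes in UTF-8 (3 letters + 1 space, repeated).
$area = str_repeat('کرج ', 8);

// Roughly what a byte-limited column keeps: the first 50 bytes, wherever they end.
$stored = substr($area, 0, $limitBytes);

// mb_strcut() also cuts by bytes but backs up to the nearest character boundary.
$safe = mb_strcut($area, 0, $limitBytes, 'UTF-8');

printf("input:  %d chars, %d bytes\n", mb_strlen($area, 'UTF-8'), strlen($area));
printf("stored: %d bytes, valid UTF-8: %s\n", strlen($stored),
    var_export(mb_check_encoding($stored, 'UTF-8'), true));
printf("safe:   %d bytes, valid UTF-8: %s\n", strlen($safe),
    var_export(mb_check_encoding($safe, 'UTF-8'), true));
```

Trimming with mb_strcut before saving would only hide the symptom; the text is still shortened, so widening the columns (or converting them to UTF-8) remains the real fix.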
Re: Annoying Character
You are a genius requinix!!! I very much appreciate your help figuring this out for me and also explaining the issue so clearly.
Thank You