Smartquotes: replaced in a test script, but not from POST

Posted: Tue Oct 21, 2008 6:34 am
by batfastad
Hi everyone

Ok it's yet another charset/HTML entity question. But this time with a difference! :?
I'm building a simple CMS on our intranet server, which adds/updates/deletes data stored in a MySQL database on our web host's server. It all works fine, and I'm in the final stages of development.
The users who have access are all decent HTML people, so my validation only checks the basic structure of the XHTML: tags are closed, special chars are converted to entities, comments are closed etc. The users can type whatever HTML they like, so long as it's structurally correct. Only 2 or 3 users will have access to this system. There's also an AJAX call on the intranet page that uses cURL to query the W3C Validator (reading its SOAP response) and displays a message saying whether the code they entered is valid or not.

The database, HTML output and PHP5 are all set to use UTF-8.

On to my exact problem.
I don't mind what characters the users enter, but I'd like to completely get rid of any smart quotes, replacing them with their normal equivalents.
I'm leaving it up to the users to convert any other chars to their entity equivalents, like Euro sign etc.
This was the simple function I came up with:

Code: Select all

function smartquote_conv($var) {
    $search = array('‘', '’', '“', '”');
    $replace = array('\'', '\'', '"', '"');
 
    return str_replace($search, $replace, $var); 
}
 
$var = <<<HTML
THIS ’IS A BIG‘ ”TEST page FROM THE ’NEW“ SYSTEM, ””””””””“““““““““““ WITH SMART QUOTES ALL OVER IT ”“
HTML;
 
echo smartquote_conv($var);
It works fine as a stand-alone test script. All the smart quotes get converted to their proper equivalents.
However when I drop this function into the script that processes the CMS input and saves it to MySQL, they don't get replaced.

Also, when I do a POST to my simple test script they get replaced fine, except when they are POSTed from the CMS edit page. So there seems to be something in that CMS edit page which is causing the receiving script to not carry out the replace properly.

I copied and pasted the smart quote characters from MS Word, and I've been writing these scripts in Notepad++ on Windows XP Pro... just in case there are any OS issues that I don't know about.
According to the Firefox page info, both the edit page and save page are UTF-8 and text/html.

Anyone got any ideas on this?
It's been driving me nuts for the past couple of days

As well as smart quotes, are there any other chars I should replace with their regular equivalents? Different dash/space lengths etc?

Cheers, Ben

Re: Smartquotes: replaced in a test script, but not from POST

Posted: Tue Oct 21, 2008 9:56 am
by inet411
It's hard to tell without seeing more of your script but I see you are echoing out your converted string.
Before you insert the data are you replacing the string with the converted string?
ie
instead of:

Code: Select all

echo smartquote_conv($var);
are you

Code: Select all

$var = smartquote_conv($var);
If you are, then is your script inserting $_POST['var'] or is it inserting $var?
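
A minimal sketch of the difference being asked about. The field name `var` is an assumption, and the function body here is only a stand-in for whatever conversion the real script uses:

```php
<?php
// Stand-in for the conversion function (assumes UTF-8 input).
function smartquote_conv($var) {
    $search  = array("\xE2\x80\x98", "\xE2\x80\x99", "\xE2\x80\x9C", "\xE2\x80\x9D");
    $replace = array("'", "'", '"', '"');
    return str_replace($search, $replace, $var);
}

// Simulated form submission; 'var' is a hypothetical field name.
$_POST['var'] = "\xE2\x80\x9Chello\xE2\x80\x9D";

// No effect: the return value is discarded and $_POST['var'] is unchanged.
smartquote_conv($_POST['var']);

// Keep the converted copy and use only this copy from here on.
$var = smartquote_conv($_POST['var']);

echo $var, "\n"; // this is the value that should go into the INSERT
```

If the query is then built from $_POST['var'] rather than $var, the unconverted text is what gets saved.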


As far as smart quotes and character replacement:
PHP has several built-in functions for doing things like this:
htmlentities()
htmlspecialchars()
Check those out on php.net. There is a lot of useful info on those pages, and make sure to read the user-contributed notes, as some people have already written functions that do exactly what you're looking for and more.

Re: Smartquotes: replaced in a test script, but not from POST

Posted: Thu Oct 23, 2008 10:39 am
by batfastad
Hi, thanks for the reply
Yes I am sure - I was just echoing it out to debug why it wasn't working

I do already use htmlentities and htmlspecialchars for many appropriate tasks throughout the intranet solution.

But I was wondering why my replace function doesn't work when the string comes from my database or when I POST it to another script.
I have a feeling it might be something to do with the way I copied and pasted the original chars for the function, from Word on Windows into Notepad++.

I've also changed my script to this:

Code: Select all

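function smartquote_conv($var) {
    // First attempt: literal smart quotes pasted from Word (overwritten below)
    $search = array('‘', '’', '“', '”');
    $replace = array('\'', '\'', '"', '"');
 
    // Second attempt: the single-byte Windows-1252 values for the same four chars
    $search = array( chr(145), chr(146), chr(147), chr(148));
    $replace = array('\'', '\'', '"', '"');
 
    return str_replace($search, $replace, $var); 
}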
But still no luck :(

Any ideas?

I badly want to eradicate smart quotes from user input!

Thanks, B

Re: Smartquotes: replaced in a test script, but not from POST

Posted: Fri Oct 24, 2008 7:07 am
by batfastad
I am an idiot! :D

Obviously that function I wrote will only convert the chars as they are encoded in the Windows extended ASCII charset (Windows-1252), where each smart quote is a single byte.
When the data's coming from another page through POST, or from my MySQL DB, it's all encoded as UTF-8 already!
So I actually need to be replacing the UTF-8 byte sequences with whatever I want :wink:
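
To make the mismatch concrete, here is a small sketch comparing the bytes of one character (the right single quote, U+2019) in both encodings. The variable names are just for illustration:

```php
<?php
// Right single quotation mark (U+2019) in two encodings.
$cp1252 = chr(146);         // one byte, 0x92, in Windows-1252
$utf8   = "\xE2\x80\x99";   // three bytes in UTF-8

echo bin2hex($cp1252), "\n"; // 92
echo bin2hex($utf8), "\n";   // e28099

// Searching for the Windows-1252 byte finds nothing in UTF-8 input.
$input     = "it" . $utf8 . "s";
$unchanged = str_replace($cp1252, "'", $input);
$fixed     = str_replace($utf8, "'", $input);

echo bin2hex($unchanged), "\n"; // 6974e2809973 (still contains the smart quote)
echo $fixed, "\n";              // it's
```

It can also do damage: str_replace works on bytes, and chr(147) and chr(148) are 0x93 and 0x94, which are the last bytes of the UTF-8 en dash and em dash, so replacing them in UTF-8 text can corrupt those characters.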

After much trial and error and comment #13 by Mark on this page http://shiflett.org/blog/2005/oct/conve ... s-with-php
My function becomes:

Code: Select all

function utf8_char_replace($var) {
 
    $trans_table = array(
        chr(0xe2).chr(0x80).chr(0x9a) => '\'', //SINGLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9e) => '"', //DOUBLE LOW-9 QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0xa6) => '...', //HORIZONTAL ELLIPSIS
        chr(0xe2).chr(0x80).chr(0x98) => '\'', //LEFT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x99) => '\'', //RIGHT SINGLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9c) => '"', //LEFT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x9d) => '"', //RIGHT DOUBLE QUOTATION MARK
        chr(0xe2).chr(0x80).chr(0x93) => '-', //EN DASH
        chr(0xe2).chr(0x80).chr(0x94) => '-' //EM DASH
    );
 
    foreach ($trans_table as $utf8_code => $replace) {
        $var = str_replace($utf8_code, $replace, $var);
    }
 
    return $var;
}
This has given me another question though... where I have the chr(0xe2).chr(....
Is there a better way of writing it? So I could just write the Unicode Dec/Hex equivalents from this table... http://www.manderby.com/mandalex/a/ascii.php
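
One possible answer, as a sketch rather than the definitive way: hex escapes in a double-quoted string avoid the chr() chains, and the UTF-8 bytes can also be derived from the Unicode hex code point by decoding a numeric character reference. The helper codepoint_to_utf8() and the function name utf8_char_replace_alt() are made up for this example:

```php
<?php
// Option 1: hex escapes inside a double-quoted string.
$rsquo_escaped = "\xE2\x80\x99";

// Option 2: build the UTF-8 bytes from the Unicode code point (hex) by
// decoding a numeric character reference. Hypothetical helper, not a built-in.
function codepoint_to_utf8($hex) {
    return html_entity_decode('&#x' . $hex . ';', ENT_QUOTES, 'UTF-8');
}

$trans_table = array(
    codepoint_to_utf8('201A') => "'",   // SINGLE LOW-9 QUOTATION MARK
    codepoint_to_utf8('201E') => '"',   // DOUBLE LOW-9 QUOTATION MARK
    codepoint_to_utf8('2026') => '...', // HORIZONTAL ELLIPSIS
    codepoint_to_utf8('2018') => "'",   // LEFT SINGLE QUOTATION MARK
    codepoint_to_utf8('2019') => "'",   // RIGHT SINGLE QUOTATION MARK
    codepoint_to_utf8('201C') => '"',   // LEFT DOUBLE QUOTATION MARK
    codepoint_to_utf8('201D') => '"',   // RIGHT DOUBLE QUOTATION MARK
    codepoint_to_utf8('2013') => '-',   // EN DASH
    codepoint_to_utf8('2014') => '-',   // EM DASH
);

// strtr() with an array applies every replacement in one call, so no loop is needed.
function utf8_char_replace_alt($var, $trans_table) {
    return strtr($var, $trans_table);
}

$sample = "\xE2\x80\x9CIt\xE2\x80\x99s done\xE2\x80\xA6\xE2\x80\x9D";
echo utf8_char_replace_alt($sample, $trans_table), "\n"; // "It's done..."
```

On PHP 7.0 or later, "\u{2019}" in a double-quoted string gives the same bytes directly from the code point.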

Thanks, B