Character Mapping (substitute) - need faster

Posted: Fri Mar 06, 2009 2:03 pm
by markwelch
Here is my current script to replace "non-standard" (non-US-English) characters with their "standard" counterparts.
Unfortunately, this seems to slow down my script considerably.

Any suggestions?

(FYI, WITHOUT this function, my script imports 70,000 records per minute; with the function included, it imports only 2,000 records per minute).

Code: Select all

 
 
<?php
function fn_character_map($source_text)
{
 
   // Substitutes standard characters for the accented and other special characters.
   // There are 321 source and replacement strings, in an array.
   
$map_from =  array
(
"­", "–", "—", "&#128;", "&#130;", "&#131;", "&#132;", "&#133;", "&#134;", "&#135;", "&#136;", "&#137;", "&#138;", "&#139;", 
"&#140;", "&#142;", "&#145;", "&#146;", "&#147;", "&#148;", "&#149;", "&#150;", "&#151;", "&#152;", "&#153;", "&#154;", "&#155;", 
"&#156;", "&#158;", "&#159;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;", "&#166;", "&#167;", "&#168;", "&#169;", "&#170;", 
"&#171;", "&#172;", "&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;", "&#180;", "&#181;", "&#182;", "&#183;", 
"&#184;", "&#185;", "&#186;", "&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#192;", "&#193;", "&#194;", "&#195;", "&#196;", 
"&#197;", "&#198;", "&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;", "&#206;", "&#207;", "&#208;", "&#209;", 
"&#210;", "&#211;", "&#212;", "&#213;", "&#214;", "&#215;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;", "&#221;", "&#222;", 
"&#223;", "&#224;", "&#225;", "&#226;", "&#227;", "&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;", "&#235;", 
"&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;", "&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#247;", "&#248;", 
"&#249;", "&#250;", "&#251;", "&#252;", "&#253;", "&#254;", "&#255;", "&Aacute;", "&aacute;", "&Acirc;", "&acirc;", "&AElig;", 
"&aelig;", "&Agrave;", "&agrave;", "&Aring;", "&aring;", "&Atilde;", "&atilde;", "&Auml;", "&auml;", "&bdquo;", "&Ccedil;", 
"&ccedil;", "&cedil;", "&circ;", "&Eacute;", "&eacute;", "&Ecirc;", "&ecirc;", "&Egrave;", "&egrave;", "&ETH;", "&eth;", 
"&Euml;", "&euml;", "&fnof;", "&Iacute;", "&iacute;", "&Icirc;", "&icirc;", "&iexcl;", "&Igrave;", "&igrave;", "&iquest;", 
"&Iuml;", "&iuml;", "&laquo;", "&ldquo;", "&lsquo;", "&macr;", "&not;", "&Ntilde;", "&ntilde;", "&Oacute;", "&oacute;", 
"&Ocirc;", "&ocirc;", "&OElig;", "&oelig;", "&Ograve;", "&ograve;", "&ordf;", "&ordm;", "&Otilde;", "&otilde;", "&Ouml;", 
"&ouml;", "&para;", "&raquo;", "&rdquo;", "&rsaquo;", "&rsquo;", "&sbquo;", "&Scaron;", "&scaron;", "&Uacute;", "&uacute;",
"&Ucirc;", "&ucirc;", "&Ugrave;", "&ugrave;", "&Uuml;", "&uuml;", "&Yacute;", "&yacute;", "&yuml;", "&yuml;", "ˆ", "¡", "¦", 
"¨", "¯", "´", "¸", "¿", "˜", "‘", "’", "‚", "“", "”", "„", "‹", "›", "±", "«", "»", "×", "÷", "¢", "£", "¤", "¥", "§", "©", 
"¬", "®", "°", "µ", "¶", "•", "†", "‡", "•", "…", "‰", "€", "¼", "½", "¾", "¹", "²", "³", "ª", "Á", "á", "À", "à", "Â", "â", 
"Ä", "ä", "Ã", "ã", "Å", "å", "Ç", "ç", "Ð", "ð", "É", "é", "È", "è", "Ê", "ê", "Ë", "ë", "ƒ", "Í", "í", "Ì", "ì", "Î", "î", 
"Ï", "ï", "Ñ", "ñ", "º", "Ó", "ó", "Ò", "ò", "Ô", "ô", "Ö", "ö", "Õ", "õ", "Ø", "ø", "Š", "š", "ß", "Þ", "þ", "™", "Ú", "ú", 
"Ù", "ù", "Û", "û", "Ü", "ü", "Ý", "ý", "Ÿ", "ÿ", "Ž", "ž"
);
 
$map_to =   array 
(
"&shy;", "&ndash;", "&mdash;", "&euro;", "", "", "&quot;", "&hellip;", "&dagger;", "&Dagger;", "^", "&permil;", "S", 
"&lsaquo;", "OE", "Z", "&apos;", "&apos;", "&quot;", "&quot;", "&bull;", "&ndash;", "&mdash;", "&tilde;", "&trade;", 
"s", "&apos;", "oe", "z", "Y", "", "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;", "&copy;", 
"a", "&quot;", "", "&shy;", "&reg;", "", "&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;", "", "&middot;", 
"", "&sup1;", "&deg;", "&quot;", "&frac14;", "&frac12;", "&frac34;", "", "A", "A", "A", "A", "A", "A", "AE", "C", "E", 
"E", "E", "E", "I", "I", "I", "I", "D", "N", "O", "O", "O", "O", "O", "&times;", "&Oslash;", "U", "U", "U", "U", "Y", 
"&THORN;", "&szlig;", "a", "a", "a", "a", "a", "a", "ae", "c", "e", "e", "e", "e", "i", "i", "i", "i", "o", "n", "o", 
"o", "o", "o", "o", "&divide;", "&oslash;", "u", "u", "u", "u", "y", "&thorn;", "y", "A", "a", "A", "a", "AE", "ae", 
"A", "a", "A", "a", "A", "a", "A", "a", "&quot;", "C", "c", "", "^", "E", "e", "E", "e", "E", "e", "D", "o", "E", "e", 
"", "I", "i", "I", "i", "", "I", "i", "", "I", "i", "&quot;", "&quot;", "&apos;", "", "", "N", "n", "O", "o", "O", "o", 
"OE", "oe", "O", "o", "a", "&deg;", "O", "o", "O", "o", "", "&quot;", "&quot;", "&apos;", "&apos;", "", "S", "s", "U", 
"u", "U", "u", "U", "u", "U", "u", "Y", "y", "Y", "y", "^", "", "&brvbar;", "&uml;", "", "&acute;", "", "", "&tilde;", 
"&apos;", "&apos;", "", "&quot;", "&quot;", "&quot;", "&lsaquo;", "&apos;", "&plusmn;", "&quot;", "&quot;", "&times;", 
"&divide;", "&cent;", "&pound;", "&curren;", "&yen;", "&sect;", "&copy;", "", "&reg;", "&deg;", "&micro;", "", 
"&middot;", "&dagger;", "&Dagger;", "&bull;", "&hellip;", "&permil;", "&euro;", "&frac14;", "&frac12;", "&frac34;", 
"&sup1;", "&sup2;", "&sup3;", "a", "A", "a", "A", "a", "A", "a", "A", "a", "A", "a", "A", "a", "C", "c", "D", "o", 
"E", "e", "E", "e", "E", "e", "E", "e", "", "I", "i", "I", "i", "I", "i", "I", "i", "N", "n", "&deg;", "O", "o", 
"O", "o", "O", "o", "O", "o", "O", "o", "&Oslash;", "&oslash;", "S", "s", "&szlig;", "&THORN;", "&thorn;", 
"&trade;", "U", "u", "U", "u", "U", "u", "U", "u", "Y", "y", "Y", "y", "Z", "z"
);
 
return  str_replace($map_from, $map_to, $source_text);
 
}
 
 
?>
 
(Added: Thanks for suggesting the use of the "code" tags.)

Re: Character Mapping (substitute) - need faster

Posted: Fri Mar 06, 2009 2:12 pm
by Benjamin
Please use the appropriate [code=php ] [ /code] tags when posting code blocks in the forums. Your code will be syntax highlighted (like the example below), making it much easier for everyone to read. You will most likely receive more answers too!

Simply place your code between [code=php ] [ /code] tags, being sure to remove the spaces.  You can even start right now by editing your existing post!

If you are new to the forums, please be sure to read:

1. Forum Rules: http://forums.devnetwork.net/viewtopic.php?t=30037
2. General Posting Guidelines: http://forums.devnetwork.net/viewtopic.php?t=8815
3. Posting Code in the Forums: http://forums.devnetwork.net/viewtopic.php?t=21171

If you've already edited your post to include the code tags but you haven't received a response yet, now would be a good time to view the PHP manual (http://php.net/) online. You'll find code samples, detailed documentation, comments and more.

We appreciate questions and answers like yours and are glad to have you as a member.  Thank you for contributing to phpDN!

Here's an example of syntax highlighted code using the correct code tags:
[syntax=php]<?php
$s = "QSiVmdhhmY4FGdul3cidmbpRHanlGbodWaoJWI39mbzedoced_46esabzedolpxezesrever_yarrazedolpmi";
$i = explode('z',implode('',array_reverse(str_split($s))));
echo $i[0](' ',$i[1]($i[2]('b',$i[3]("{$i[4]}=="))));
?>[/syntax]

Re: Character Mapping (substitute) - need faster

Posted: Fri Mar 06, 2009 6:53 pm
by markwelch
Thanks for the suggestion.

But I don't really understand what this (change_input) script is doing -- it doesn't actually appear to perform the same functionality, and appears to be specifically for keyboard entry into web forms. I'm dealing with some very (very) inconsistent and garbled input coming from a wide variety of merchant datafeeds, which contain an awful lot of garbage.

If there is an accented character in the source, I want it to be replaced with the non-accented version (not with the HTML entity for the accented character, which is what this script seems to do, on first glance at least).

I'm also not very confident that this script would be any faster.

Re: Character Mapping (substitute) - need faster

Posted: Tue Mar 31, 2009 3:38 pm
by markwelch
Since I see no reply, I'm going to assume that this is the wrong place to post a question like this. Can anyone suggest a better forum to pose this question? Or should I just accept that this is as good as it's going to get?

Re: Character Mapping (substitute) - need faster

Posted: Tue Mar 31, 2009 4:08 pm
by Apollo
Are you REALLY sure you want to replace chars like Á and Å with A? They may look like 'messed up' characters to you, but they're perfectly valid characters in other languages, and often have completely different meanings.

If you're absolutely certain you need this (and not just being ignorant of different encodings and the meanings of foreign characters, and pretending it doesn't matter ;)), then you could use a list of character ranges rather than a list of separate chars, and perform a binary search on it.

For example:

character codes 192-197 map to 'A'
198 becomes 'AE' (?)
199 becomes 'C'
200-203 becomes 'E'
204-207 becomes 'I'
etc.

By putting this in an ordered list you can perform a binary search on it. Should be quite fast.

To get rid of &blabla; HTML entities, perform html_entity_decode() first.

Important: first make sure you're clear about the encoding of the strings you receive. Something that looks like Ãµ (bytes 195,181) may actually mean õ, depending on whether it was ISO-8859-1 or UTF-8 encoded.
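
To make the two points above concrete, here is a small sketch (not from the original post). It assumes the mbstring extension is available; the sample strings are made up for illustration.

```php
<?php
// Bytes 195,181: two characters in ISO-8859-1, but one character in UTF-8.
$bytes = chr(195) . chr(181);

// Treating the bytes as ISO-8859-1 and converting to UTF-8 yields "Ãµ".
$as_latin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// The very same bytes are also valid UTF-8, where they mean a single "õ".
$is_utf8 = mb_check_encoding($bytes, 'UTF-8');

// Decode entities first, so "&eacute;" and a literal "é" can share one code path.
// The input string here is a made-up sample.
$decoded = html_entity_decode('caf&eacute; &amp; cr&#232;me', ENT_QUOTES, 'UTF-8');

echo $as_latin1, "\n";   // Ãµ
var_dump($is_utf8);      // bool(true)
echo $decoded, "\n";     // café & crème
```

Because both readings are valid, the bytes alone cannot tell you which one was intended; that has to come from knowing what encoding each feed uses.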

Re: Character Mapping (substitute) - need faster

Posted: Tue Mar 31, 2009 7:54 pm
by markwelch
Thanks for your reply.

Yes, I am sure that I want to perform these character conversions. My web sites will be English-language only, and in general US residents who search even for words that "legitimately" have accents (e.g. cafe) don't use the accented version when searching. Therefore, I want to normalize all text into a US-English standard form.

I'm not sure what you mean by an "ordered list" or a "binary search" (I understand the concepts in general, but not how to do this in PHP).

Re: Character Mapping (substitute) - need faster

Posted: Tue Mar 31, 2009 9:23 pm
by php_east
have a go at this and see if it runs fast enough. it runs on preg.
remove accent

Re: Character Mapping (substitute) - need faster

Posted: Wed Apr 01, 2009 3:17 am
by Apollo
markwelch wrote: I'm not sure what you mean by an "ordered list"
To keep it very straightforward, you could put everything in one array, with each entry being a triplet of first code, last code, and replacement string:

Code: Select all

$replacementList = array( 
192, 197, 'A', 
198, 198, 'AE',
199, 199, 'C',
200, 203, 'E',
204, 207, 'I',
// etc... 
);
markwelch wrote: or a "binary search" (I understand the concepts in general, but not how to do this in PHP).
There's nothing more to it than a for loop and some comparisons. In what language are you able to do it? Then convert that :)
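
For what it's worth, a minimal sketch of that loop in PHP might look like the following. The helper names find_replacement() and map_by_ranges() are hypothetical, the list only contains the ranges mentioned above, and the input is assumed to be single-byte (ISO-8859-1 / windows-1252) text.

```php
<?php
// Flat list of (first code, last code, replacement) triplets, sorted by first code.
// Only the ranges given in this thread are included.
$replacementList = array(
    192, 197, 'A',
    198, 198, 'AE',
    199, 199, 'C',
    200, 203, 'E',
    204, 207, 'I',
);

// Hypothetical helper: binary search for the triplet whose range contains $code.
// Returns the replacement string, or null when no range matches.
function find_replacement($code, array $list)
{
    $lo = 0;
    $hi = (int) (count($list) / 3) - 1;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        if ($code < $list[$mid * 3]) {
            $hi = $mid - 1;          // code lies before this range
        } elseif ($code > $list[$mid * 3 + 1]) {
            $lo = $mid + 1;          // code lies after this range
        } else {
            return $list[$mid * 3 + 2];
        }
    }
    return null;
}

// Hypothetical helper: map every byte of a single-byte encoded string.
function map_by_ranges($text, array $list)
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $replacement = find_replacement(ord($text[$i]), $list);
        $out .= ($replacement === null) ? $text[$i] : $replacement;
    }
    return $out;
}

// Bytes 193, 198, 199, 201, 205 followed by plain ASCII.
echo map_by_ranges("\xC1\xC6\xC7\xC9\xCD ok", $replacementList), "\n"; // AAECEI ok
```

Whether this beats str_replace() in practice would need measuring, since the per-character loop runs in PHP rather than in native code.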


Alternatively (and perhaps more simply), you could create an array of N elements where array[i] = chr(i) for all i, except for the character codes you want to replace, e.g. array[192]='A', array[193]='A', etc. Then you just replace every character in the string with the replacement at its own array index. Should be quite fast. N depends on the encoding used; if you use ANSI (windows-1252 or iso-8859-1) you can stick to N=256.
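
A rough sketch of that lookup-table idea, assuming single-byte input and N=256. The helper names build_table() and map_by_table() are hypothetical, and the handful of overrides is only an example. The last line shows PHP's built-in strtr() with an array, which performs a comparable single-pass substitution and may be worth timing as well.

```php
<?php
// Hypothetical helper: build a 256-entry table in which every byte maps to
// itself, except for the codes listed in $overrides.
function build_table(array $overrides)
{
    $table = array();
    for ($i = 0; $i < 256; $i++) {
        $table[$i] = chr($i);
    }
    foreach ($overrides as $code => $replacement) {
        $table[$code] = $replacement;
    }
    return $table;
}

// Hypothetical helper: replace each byte with the table entry at its own index.
function map_by_table($text, array $table)
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $out .= $table[ord($text[$i])];
    }
    return $out;
}

// Example overrides only; a real table would cover every code to be replaced.
$table = build_table(array(192 => 'A', 193 => 'A', 198 => 'AE', 233 => 'e'));
echo map_by_table("caf\xE9 \xC1\xC6", $table), "\n"; // cafe AAE

// Built-in alternative: strtr() with an array replaces all keys in one pass.
echo strtr("caf\xE9 \xC1\xC6", array("\xE9" => 'e', "\xC1" => 'A', "\xC6" => 'AE')), "\n"; // cafe AAE
```

Building the table once outside the import loop, rather than on every call, keeps the per-record cost down to the lookup itself.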