Character Mapping (substitute) - need faster

markwelch
Forum Newbie
Posts: 12
Joined: Thu Mar 05, 2009 2:48 pm

Character Mapping (substitute) - need faster

Post by markwelch »

Here is my current script to replace "non-standard" (non-US-English) characters with their "standard" counterparts.
Unfortunately, this seems to slow down my script considerably.

Any suggestions?

(FYI, WITHOUT this function, my script imports 70,000 records per minute; with the function included, it imports only 2,000 records per minute).

Code:

<?php
function fn_character_map($source_text)
{
 
   // Substitutes standard characters for the accented and other special characters.
   // There are 321 source and replacement strings, in an array.
   
$map_from =  array
(
"­", "–", "—", "&#128;", "&#130;", "&#131;", "&#132;", "&#133;", "&#134;", "&#135;", "&#136;", "&#137;", "&#138;", "&#139;", 
"&#140;", "&#142;", "&#145;", "&#146;", "&#147;", "&#148;", "&#149;", "&#150;", "&#151;", "&#152;", "&#153;", "&#154;", "&#155;", 
"&#156;", "&#158;", "&#159;", "&#161;", "&#162;", "&#163;", "&#164;", "&#165;", "&#166;", "&#167;", "&#168;", "&#169;", "&#170;", 
"&#171;", "&#172;", "&#173;", "&#174;", "&#175;", "&#176;", "&#177;", "&#178;", "&#179;", "&#180;", "&#181;", "&#182;", "&#183;", 
"&#184;", "&#185;", "&#186;", "&#187;", "&#188;", "&#189;", "&#190;", "&#191;", "&#192;", "&#193;", "&#194;", "&#195;", "&#196;", 
"&#197;", "&#198;", "&#199;", "&#200;", "&#201;", "&#202;", "&#203;", "&#204;", "&#205;", "&#206;", "&#207;", "&#208;", "&#209;", 
"&#210;", "&#211;", "&#212;", "&#213;", "&#214;", "&#215;", "&#216;", "&#217;", "&#218;", "&#219;", "&#220;", "&#221;", "&#222;", 
"&#223;", "&#224;", "&#225;", "&#226;", "&#227;", "&#228;", "&#229;", "&#230;", "&#231;", "&#232;", "&#233;", "&#234;", "&#235;", 
"&#236;", "&#237;", "&#238;", "&#239;", "&#240;", "&#241;", "&#242;", "&#243;", "&#244;", "&#245;", "&#246;", "&#247;", "&#248;", 
"&#249;", "&#250;", "&#251;", "&#252;", "&#253;", "&#254;", "&#255;", "&Aacute;", "&aacute;", "&Acirc;", "&acirc;", "&AElig;", 
"&aelig;", "&Agrave;", "&agrave;", "&Aring;", "&aring;", "&Atilde;", "&atilde;", "&Auml;", "&auml;", "&bdquo;", "&Ccedil;", 
"&ccedil;", "&cedil;", "&circ;", "&Eacute;", "&eacute;", "&Ecirc;", "&ecirc;", "&Egrave;", "&egrave;", "&ETH;", "&eth;", 
"&Euml;", "&euml;", "&fnof;", "&Iacute;", "&iacute;", "&Icirc;", "&icirc;", "&iexcl;", "&Igrave;", "&igrave;", "&iquest;", 
"&Iuml;", "&iuml;", "&laquo;", "&ldquo;", "&lsquo;", "&macr;", "&not;", "&Ntilde;", "&ntilde;", "&Oacute;", "&oacute;", 
"&Ocirc;", "&ocirc;", "&OElig;", "&oelig;", "&Ograve;", "&ograve;", "&ordf;", "&ordm;", "&Otilde;", "&otilde;", "&Ouml;", 
"&ouml;", "&para;", "&raquo;", "&rdquo;", "&rsaquo;", "&rsquo;", "&sbquo;", "&Scaron;", "&scaron;", "&Uacute;", "&uacute;",
"&Ucirc;", "&ucirc;", "&Ugrave;", "&ugrave;", "&Uuml;", "&uuml;", "&Yacute;", "&yacute;", "&Yuml;", "&yuml;", "ˆ", "¡", "¦", 
"¨", "¯", "´", "¸", "¿", "˜", "‘", "’", "‚", "“", "”", "„", "‹", "›", "±", "«", "»", "×", "÷", "¢", "£", "¤", "¥", "§", "©", 
"¬", "®", "°", "µ", "¶", "•", "†", "‡", "•", "…", "‰", "€", "¼", "½", "¾", "¹", "²", "³", "ª", "Á", "á", "À", "à", "Â", "â", 
"Ä", "ä", "Ã", "ã", "Å", "å", "Ç", "ç", "Ð", "ð", "É", "é", "È", "è", "Ê", "ê", "Ë", "ë", "ƒ", "Í", "í", "Ì", "ì", "Î", "î", 
"Ï", "ï", "Ñ", "ñ", "º", "Ó", "ó", "Ò", "ò", "Ô", "ô", "Ö", "ö", "Õ", "õ", "Ø", "ø", "Š", "š", "ß", "Þ", "þ", "™", "Ú", "ú", 
"Ù", "ù", "Û", "û", "Ü", "ü", "Ý", "ý", "Ÿ", "ÿ", "Ž", "ž"
);
 
$map_to =   array 
(
"&shy;", "&ndash;", "&mdash;", "&euro;", "", "", "&quot;", "&hellip;", "&dagger;", "&Dagger;", "^", "&permil;", "S", 
"&lsaquo;", "OE", "Z", "&apos;", "&apos;", "&quot;", "&quot;", "&bull;", "&ndash;", "&mdash;", "&tilde;", "&trade;", 
"s", "&apos;", "oe", "z", "Y", "", "&cent;", "&pound;", "&curren;", "&yen;", "&brvbar;", "&sect;", "&uml;", "&copy;", 
"a", "&quot;", "", "&shy;", "&reg;", "", "&deg;", "&plusmn;", "&sup2;", "&sup3;", "&acute;", "&micro;", "", "&middot;", 
"", "&sup1;", "&deg;", "&quot;", "&frac14;", "&frac12;", "&frac34;", "", "A", "A", "A", "A", "A", "A", "AE", "C", "E", 
"E", "E", "E", "I", "I", "I", "I", "D", "N", "O", "O", "O", "O", "O", "&times;", "&Oslash;", "U", "U", "U", "U", "Y", 
"&THORN;", "&szlig;", "a", "a", "a", "a", "a", "a", "ae", "c", "e", "e", "e", "e", "i", "i", "i", "i", "o", "n", "o", 
"o", "o", "o", "o", "&divide;", "&oslash;", "u", "u", "u", "u", "y", "&thorn;", "y", "A", "a", "A", "a", "AE", "ae", 
"A", "a", "A", "a", "A", "a", "A", "a", "&quot;", "C", "c", "", "^", "E", "e", "E", "e", "E", "e", "D", "o", "E", "e", 
"", "I", "i", "I", "i", "", "I", "i", "", "I", "i", "&quot;", "&quot;", "&apos;", "", "", "N", "n", "O", "o", "O", "o", 
"OE", "oe", "O", "o", "a", "&deg;", "O", "o", "O", "o", "", "&quot;", "&quot;", "&apos;", "&apos;", "", "S", "s", "U", 
"u", "U", "u", "U", "u", "U", "u", "Y", "y", "Y", "y", "^", "", "&brvbar;", "&uml;", "", "&acute;", "", "", "&tilde;", 
"&apos;", "&apos;", "", "&quot;", "&quot;", "&quot;", "&lsaquo;", "&apos;", "&plusmn;", "&quot;", "&quot;", "&times;", 
"&divide;", "&cent;", "&pound;", "&curren;", "&yen;", "&sect;", "&copy;", "", "&reg;", "&deg;", "&micro;", "", 
"&middot;", "&dagger;", "&Dagger;", "&bull;", "&hellip;", "&permil;", "&euro;", "&frac14;", "&frac12;", "&frac34;", 
"&sup1;", "&sup2;", "&sup3;", "a", "A", "a", "A", "a", "A", "a", "A", "a", "A", "a", "A", "a", "C", "c", "D", "o", 
"E", "e", "E", "e", "E", "e", "E", "e", "", "I", "i", "I", "i", "I", "i", "I", "i", "N", "n", "&deg;", "O", "o", 
"O", "o", "O", "o", "O", "o", "O", "o", "&Oslash;", "&oslash;", "S", "s", "&szlig;", "&THORN;", "&thorn;", 
"&trade;", "U", "u", "U", "u", "U", "u", "U", "u", "Y", "y", "Y", "y", "Z", "z"
);
 
return  str_replace($map_from, $map_to, $source_text);
 
}
 
 
?>
 
(Added: Thanks for suggesting the use of the "code" tags.)
Last edited by markwelch on Fri Mar 06, 2009 4:03 pm, edited 4 times in total.
Benjamin
Site Administrator
Posts: 6935
Joined: Sun May 19, 2002 10:24 pm

Re: Character Mapping (substitute) - need faster

Post by Benjamin »

Please use the appropriate [code=php ] [ /code] tags when posting code blocks in the forums.  Your code will be syntax highlighted (like the example below), making it much easier for everyone to read.  You will most likely receive more answers too!

Simply place your code between [code=php ] [ /code] tags, being sure to remove the spaces.  You can even start right now by editing your existing post!

If you are new to the forums, please be sure to read:

[list=1]
[*][url=http://forums.devnetwork.net/viewtopic.php?t=30037]Forum Rules[/url]
[*][url=http://forums.devnetwork.net/viewtopic.php?t=8815]General Posting Guidelines[/url]
[*][url=http://forums.devnetwork.net/viewtopic.php?t=21171]Posting Code in the Forums[/url][/list]

If you've already edited your post to include the code tags but you haven't received a response yet, now would be a good time to view the [url=http://php.net/]php manual[/url] online.  You'll find code samples, detailed documentation, comments and more.

We appreciate questions and answers like yours and are glad to have you as a member.  Thank you for contributing to phpDN!

Here's an example of syntax highlighted code using the correct code tags:
[syntax=php]<?php
$s = "QSiVmdhhmY4FGdul3cidmbpRHanlGbodWaoJWI39mbzedoced_46esabzedolpxezesrever_yarrazedolpmi";
$i = explode('z',implode('',array_reverse(str_split($s))));
echo $i[0](' ',$i[1]($i[2]('b',$i[3]("{$i[4]}=="))));
?>[/syntax]
markwelch
Forum Newbie
Posts: 12
Joined: Thu Mar 05, 2009 2:48 pm

Re: Character Mapping (substitute) - need faster

Post by markwelch »

Thanks for the suggestion.

But I don't really understand what this (change_input) script is doing -- it doesn't actually appear to perform the same functionality, and appears to be specifically for keyboard entry into web forms. I'm dealing with some very (very) inconsistent and garbled input coming from a wide variety of merchant datafeeds, which contain an awful lot of garbage.

If there is an accented character in the source, I want it to be replaced with the non-accented version (not with the HTML entity for the accented character, which is what this script seems to do, on first glance at least).

I'm also not very confident that this script would be any faster.
markwelch
Forum Newbie
Posts: 12
Joined: Thu Mar 05, 2009 2:48 pm

Re: Character Mapping (substitute) - need faster

Post by markwelch »

Since I see no reply, I'm going to assume that this is the wrong place to post a question like this. Can anyone suggest a better forum to pose this question? Or should I just accept that this is as good as it's going to get?
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character Mapping (substitute) - need faster

Post by Apollo »

Are you REALLY sure you want to replace chars like Á and Å with A? They may look like 'messed up' characters to you, but they're perfectly valid characters in other languages, and often have completely different meanings.

If you're absolutely certain you need this (and not just being ignorant of the different encodings and meanings of foreign characters, and pretending it doesn't matter ;)) then you could use a list of character ranges rather than a list of separate chars, and perform a binary search on it.

For example:

character codes 192-197 map to 'A'
198 becomes 'AE' (?)
199 becomes 'C'
200-203 becomes 'E'
204-207 becomes 'I'
etc.

By putting this in an ordered list you can perform a binary search on it. Should be quite fast.

To get rid of &blabla; HTML entities, perform html_entity_decode() first.

Important: first make sure you're clear about the encoding of the strings you receive. Something that looks like Ãµ (bytes 195,181) may actually mean õ, depending on whether it was iso-8859-1 or utf-8 encoded.
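
A rough sketch of the encoding point, not part of the original post. It assumes the mbstring extension is loaded, and the sample strings are made up for illustration:

```php
<?php
// The two bytes 195,181 from the example above.
$bytes = "\xC3\xB5";

// Read as UTF-8 they form one character (õ);
// read as ISO-8859-1 they are two characters (Ã and µ).
$asUtf8   = mb_strlen($bytes, 'UTF-8');       // 1
$asLatin1 = mb_strlen($bytes, 'ISO-8859-1');  // 2

// Converting ISO-8859-1 input to UTF-8 up front means the rest of the
// script only ever deals with one encoding.
$latin1 = "\xF5";  // õ as a single ISO-8859-1 byte
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');  // "\xC3\xB5"

// Turn named and numeric entities into real characters before mapping,
// so the replacement table only has to list the characters themselves.
$decoded = html_entity_decode('caf&eacute; &#245;', ENT_QUOTES, 'UTF-8');
```

If every feed is decoded and converted this way first, the replacement list no longer needs separate rows for the &#NNN; and &name; spellings of each character.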
markwelch
Forum Newbie
Posts: 12
Joined: Thu Mar 05, 2009 2:48 pm

Re: Character Mapping (substitute) - need faster

Post by markwelch »

Thanks for your reply.

Yes, I am sure that I want to perform these character conversions. My web sites will be English-language only, and in general US residents who search even for words that "legitimately" have accents (e.g. cafe) don't use the accented version when searching. Therefore, I want to normalize all text into a US-English standard form.

I'm not sure what you mean by an "ordered list" or a "binary search" (I understand the concepts in general, but not how to do this in PHP).
php_east
Forum Contributor
Posts: 453
Joined: Sun Feb 22, 2009 1:31 pm
Location: Far Far East.

Re: Character Mapping (substitute) - need faster

Post by php_east »

have a go at this and see if it runs fast enough. it runs on preg.
remove accent
Apollo
Forum Regular
Posts: 794
Joined: Wed Apr 30, 2008 2:34 am

Re: Character Mapping (substitute) - need faster

Post by Apollo »

markwelch wrote: I'm not sure what you mean by an "ordered list"
To keep it very straightforward, you could put everything in one array, with each entry being a triplet of first code, last code, and replacement string:

Code:

$replacementList = array( 
192, 197, 'A', 
198, 198, 'AE',
199, 199, 'C',
200, 203, 'E',
204, 207, 'I',
// etc... 
);
markwelch wrote: or a "binary search" (I understand the concepts in general, but not how to do this in PHP).
Nothing more to that than a for loop and some comparisons..? In what language are you able to do it? Then convert that :)
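
For what it's worth, a minimal sketch of that loop in PHP. The function names are made up for illustration, the table holds only the rows listed above, it uses nested arrays instead of the flat triplet list to keep the indexing simple, and it assumes single-byte (iso-8859-1 / windows-1252) input:

```php
<?php
// Ranges sorted by first code: array(first, last, replacement).
// Only the sample rows from the post; the rest of the table is omitted.
$ranges = array(
    array(192, 197, 'A'),
    array(198, 198, 'AE'),
    array(199, 199, 'C'),
    array(200, 203, 'E'),
    array(204, 207, 'I'),
);

// Binary search: return the replacement for $code, or null if no range covers it.
function find_replacement(array $ranges, $code)
{
    $lo = 0;
    $hi = count($ranges) - 1;
    while ($lo <= $hi) {
        $mid = (int) (($lo + $hi) / 2);
        list($first, $last, $replacement) = $ranges[$mid];
        if ($code < $first) {
            $hi = $mid - 1;       // look in the lower half
        } elseif ($code > $last) {
            $lo = $mid + 1;       // look in the upper half
        } else {
            return $replacement;  // $first <= $code <= $last
        }
    }
    return null;
}

// Walk a single-byte string and swap every byte that falls inside a range.
function map_ranges($text, array $ranges)
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $replacement = find_replacement($ranges, ord($text[$i]));
        $out .= ($replacement === null) ? $text[$i] : $replacement;
    }
    return $out;
}

$result = map_ranges("\xC9cole \xC6", $ranges);  // "Ecole AE"
```

Whether a per-byte loop written in PHP actually beats the original str_replace() call is something to measure on real feed data rather than assume.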


Alternatively (and perhaps simpler), you could also create an array of N elements where array[i] = chr(i) for all i, except for the character codes you want to replace, e.g. array[192]='A', array[193]='A', etc. Then you just replace every character in the string with the replacement at its own array index. Should be quite fast. N depends on the encoding used; if you use ANSI (windows-1252 or iso-8859-1) you can stick to N=256.
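
A sketch of the lookup-table idea, again assuming single-byte input. The function name and sample strings are hypothetical, and only a few rows of the table are filled in. The last lines show the same substitution done with strtr() and an array of pairs, which is a swapped-in alternative rather than what the post describes:

```php
<?php
// Table indexed by character code: every byte maps to itself by default.
$table = array();
for ($i = 0; $i < 256; $i++) {
    $table[$i] = chr($i);
}
// Override only the codes to be replaced (sample rows, not the full list).
for ($i = 192; $i <= 197; $i++) {
    $table[$i] = 'A';
}
$table[198] = 'AE';
$table[199] = 'C';

// Replace each byte with the entry at its own index.
function map_with_table($text, array $table)
{
    $out = '';
    $len = strlen($text);
    for ($i = 0; $i < $len; $i++) {
        $out .= $table[ord($text[$i])];
    }
    return $out;
}

$result = map_with_table("\xC0 la carte, \xC7a va", $table);  // "A la carte, Ca va"

// Alternative: strtr() with an array of only the overrides does the
// substitution in a single native call.
$pairs = array("\xC0" => 'A', "\xC7" => 'C');
$viaStrtr = strtr("\xC0 la carte, \xC7a va", $pairs);
```

Either way, building the table once outside the import loop, rather than on every call, avoids recreating the arrays for each record.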