Page 1 of 1

[HELP!!] How to compare complex string ???

Posted: Fri Jan 26, 2007 4:51 am
by acco
Hi to everybody, i write to you cause i got a big problem: i got a DB, in mysql, with more than 12000 records, each record has some values as name, surname, city, zip code and address.

My task consists to find all record with the same address. Here is the biggest problem because, while city and zip code are entered with a drop down list, and so there are no possibilities to got differences, the address is entered manually in an input text.

Some examples:
main avenue, 21
main avanue 21
or , worst case: abbreviations:
liberty street, 23
liberty st. 23
So, it depends how the users inserts the address, but the differences can be minimal and refer to the same address. So how can i compare the differences with a bit a tollerance??

I hope that someone can help me.


Thanks and sorry for my bad english ;)

Posted: Fri Jan 26, 2007 5:19 am
by Mohamed
try mysql fulltext search

Posted: Fri Jan 26, 2007 5:24 am
by acco
but mysql full-text search act as a compare with result in % to determine how accurate is the comparasion?

Posted: Fri Jan 26, 2007 5:44 am
by onion2k
Mysql doesn't have any really accurate fuzzy matching functions that you can use to compare records. It may be possible to achieve some level of success with the following:

http://uk2.php.net/manual/en/function.soundex.php
http://uk2.php.net/manual/en/function.levenshtein.php
http://uk2.php.net/manual/en/function.metaphone.php
http://uk2.php.net/manual/en/function.similar-text.php

Obviously how you'd approach comparisions would be different for each command. Soundex and metaphone calculate a 'signature' for each string, so you'd calculate that for all the addresses and then group the ones with similar signatures. Levenshtien and similar_text are trickier because they're for comparing 2 strings, so you'd need to compare each string with every other string. You'd do well to limit the number of comparisons, maybe only comparing addresses where the ZIP code is the same.

Now, that should work, but there'll still be problems. For example, if you find 2 records for "Mr Smith, 123 Liberty Road" and "Mr Smith, 123 Liberty Rd" how do are you going to know that they're the same person? It could be a man and his son.

For future reference though, don't rely on the user entering their address properly if you need to know they're unique. Use their email address or SSN or something.

Posted: Fri Jan 26, 2007 6:07 am
by acco
thanks for ur reply onion2k, after lunch i'll look at our links. Anyway, by ur answer, i understood that i've not explained very well my task. The DB contain all contracts of all our customers, so i could have more contracts from the same customers, or have another customer, that is the son of another customer, but i don't always know that. So what i'm traing to find out how many people live at the same address, and how many contracts they got with us. To do this I 'll look for that addresses that have the same zip code and city (here i've no problem, cause the zip code and city are entered by drop down menu) and address, no other values. For this reason im trying all methods to compare address similar, but not the same.

I hope to have explained better my task.

Posted: Fri Jan 26, 2007 6:48 am
by Mordred
I would go and re-parse the addresses in some canonical address format - "s", "st", "st." becomes "street", the number is always before "street" etc. -- analyze the cases in your dataset to infer the appropriate rules. You can then catch misspellings with onion2k's suggestions, but you might be safer if you compare the names with ones taken from another address database.

Posted: Fri Jan 26, 2007 7:53 am
by acco
Mordred wrote:...but you might be safer if you compare the names with ones taken from another address database.
what do u mean with that?

Posted: Fri Jan 26, 2007 8:02 am
by onion2k
Thinking a little more about this ... are these addresses spread out across the whole country? There's only 12,000, the chance of having more than a few hundred duplicates should be very low. It's probably quicker to print out a list ordered by the ZIP code and check them by hand. I doubt it'd take more than a few hours. Get a trainee or a junior coder to do it. :)

If it's a task that you'll need to do regularly though, then code is the only option.

Posted: Fri Jan 26, 2007 4:43 pm
by da404lewzer
acco wrote:
Mordred wrote:...but you might be safer if you compare the names with ones taken from another address database.
what do u mean with that?
soundex outputs a representation of the sounded-out word and is a great way to check for spelling errors...

consider:

echo soundex("Google");
echo soundex("Gewgle");

both output: G240 & G240
this is great for comparing words based on how they sound.

consider:

echo soundex("Google");
echo soundex("Good");
echo soundex("Geronimo");
echo soundex("Dog");

Outputs: G240, G300, G655 & D200

mysql has a soundex function too:
http://www.madirish.net/tech.php?section=4&article=85

Posted: Sun Jan 28, 2007 10:46 am
by Mordred
acco wrote:
Mordred wrote:...but you might be safer if you compare the names with ones taken from another address database.
what do u mean with that?
Find a database with street names from at least several cities. Use the entries in it to correct spelling mistakes in your database with one of the levenstein()-type functions.

Posted: Sun Jan 28, 2007 1:10 pm
by Kieran Huggins
it's also worth checking out the metaphone() function

beware: the mysql version of soundex() is faster, but produces slightly different output. If you're going to match soundexes in a db, transform them both with the mysql function and chop off the first character.