Page 1 of 1
Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 11:18 am
by Reviresco
Okay I must be brain dead here today but I can't come up with a process to do this.
On a medical-related website, when members register we ask them what medical conditions they are interested in learning about. We give them a list of them to choose from, and then just an "Other" field for them to add anything else that's not listed.
Now we're getting into breaking down the data, especially demographics of people by what medical condition they're interested in. Problem is with the "Others" -- with all the misspellings and variations, it's sometimes hard to group them. Examples:
asthma
Asthma
asthma / melanoma
Asthma/Sinus
Asthma and Allergies
asthsma
Athsma
psorasis
Psorasis
psoratic arthritis
psoriasis
psoriasis and poriatic arthrit
psrosis
COLEL-RECTAL CANCER
colon
colon and ovarian
Colon Cancer
Colon cancer
colon cancer
colon cancer (family history)
colon polyps
Colon Problems
colorectal cancer
sinus infections
sinusitis
sinusitus
sinus problems
...You get the idea. Obviously I can deal with the capitalization issues easily, but it's the other ones that are puzzling me.
What I'm looking for is an approach to identifying and grouping these identical or related responses into single categories. I've thought about preg_match, but what am I matching to what?
Also, the prospect of coming up with another list of medical conditions and matching to that is daunting, and I don't think the doctor or his interns have time to create the list, even though he offered to help.
Re: Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 12:08 pm
by andyhoneycutt
This would probably be a good start for your research into solving this problem:
http://php.net/manual/en/function.similar-text.php.
Hope that helps a bit, and if you come up with a great solution, please post!
-Andy
Re: Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 12:22 pm
by prometheuzz
No, this is not something regex is suited for. With regex, you can precisely define how a string "looks" like, but trying to "fuzzy-match" using regex will only result in too many false positives, or the other way around.
Levenshtein has devised an clever algorithm that calculates the distance between two words. It calculates how many "steps" (insertions and deletions of characters) it takes to transform one word into the other, which is called the Levenshtein-distance. This is what you need to solve your problem.
More info:
http://us2.php.net/manual/en/function.levenshtein.php
Good luck!
Re: Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 12:23 pm
by prometheuzz
That looks promising! I didn't know of it's existence.
Thanks Andy.
Re: Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 12:30 pm
by andyhoneycutt
prometheuzz wrote:
That looks promising! I didn't know of it's existence.
Thanks Andy.
I was just reading about the Levenshtein approach. I think in practical usage like what Reviresco is looking for, it would be much better suited! The one I have referenced merely comes up with a 'likeness' value for the two words. I think that's less accurate when you're handling human input to determine how similar the words are; you have to take into account misspellings, and counter to my assumptions, levenshtein takes less processor time. similar_text weighs in at O(n**3), and levenshtein at O(m*n)[where m and n are the lenghts of the words you're comparing].
Anyways, this is a very interesting topic- I'm hoping to find some way to use this in my programming in the future
-andy
Re: Matching and grouping similar words/phrases
Posted: Wed Jul 29, 2009 12:49 pm
by andyhoneycutt
I came up with a down-and-dirty comparison of the similar_text and levenshtein functions. I found that by tweaking the scoring method of the levenshtein procedure, I'm able to come up with a good start for your system:
Code: Select all
<?php
$a = file("/tmp/test.txt");
array_trim($a);
array_tolower($a);
echo "<pre>";
$r = array();
foreach($a as $b)
{
foreach($a as $c)
{
$t = array();
$p = 0;
$t[0] = $b;
$t[1] = $c;
$t[2] = similar_text($b,$c,$p);
$t[3] = $p;
// make sure the larger of the two strings is the first argument. This
// ensures that we don't get penalized for inserting characters when we're
// actually deleting them, and also still maintains a penalty for inserting
// when clearly we are dealing with two [very] different strings.
$d = (strlen($b) <= strlen($c)) ? $c : $b;
$e = (strlen($b) <= strlen($c)) ? $b : $c;
$t[4] = levenshtein($d,$e,1,1,0);
$r[] = $t;
}
}
// EDIT TO CHANGE:
//print_r($r);
// dump print replaced with the following on edit:
foreach($r as $w_scores)
{
if( $w_scores[4] == 0 ) echo "Levenshtein shows {$w_scores[0]} and {$w_scores[1]} are probably the same.\n";
if( $w_scores[3] >= 90) echo "similar_text shows {$w_scores[0]} and {$w_scores[1]} are probably the same.\n";
echo "\n\n";
}
echo "</pre>";
function array_trim(&$a)
{
for($i=0;$i<count($a);$i++)
$a[$i] = trim($a[$i]);
}
function array_tolower(&$a)
{
for($i=0;$i<count($a);$i++)
$a[$i] = strtolower($a[$i]);
}
The text file:
Code: Select all
asthma
Asthma
asthma / melanoma
Asthma/Sinus
Asthma and Allergies
asthsma
Athsma
EDIT: changed the print_r to something meaningful.
Re: Matching and grouping similar words/phrases
Posted: Fri Jul 31, 2009 10:39 am
by Reviresco
Getting mixed results so far. Looking at word/phrase pairings that got Levenshtein numbers of 3 or less (lower is better), sometimes it works great, other times not so great:
Some examples:
acne - add: 3
acne - adhd: 3
acne - aging: 3
acne - all: 3
acne - fitne: 3
brain fog - brain food: 2
colon cancer - colan cancer: 2
colorectal cancer - colel-rectal cancer: 3
death - detox: 3
death - feet: 3
dimentia - dementia: 1
diverliculi - diverticular: 3
diverticulosis - diverticulitis: 2
edema - anemia: 3
exema - anemia: 3
exema - edema: 1
alzheimers - altzheimers: 1
alzheimers - alzheimer: 1
alzheimers - alzhiemers: 2
I must not be trimming my strings properly (I could've sworn I trimmed all white space) -- "colon cancer" and "colan cancer" should only be 1, not 2. Hmmm....
Re: Matching and grouping similar words/phrases
Posted: Fri Jul 31, 2009 10:48 am
by prometheuzz
That's too black-and-white. Besides looking at the Levenshtein distance, you should also take the length of the words into account. And to get more accurate results, you may even want to combine a few of these fuzzy matching algorithms.
Re: Matching and grouping similar words/phrases
Posted: Fri Jul 31, 2009 10:52 am
by andyhoneycutt
I agree. I think combining the phonetic approach using sound_ex could also help -- if it sounds like "asthma", mostly, and is almost written as "asthma" then you've probably got a dead ringer.
Re: Matching and grouping similar words/phrases
Posted: Fri Jul 31, 2009 11:03 am
by prometheuzz
andyhoneycutt wrote:I agree. I think combining the phonetic approach using sound_ex could also help -- if it sounds like "asthma", mostly, and is almost written as "asthma" then you've probably got a dead ringer.
Yes, combining it with a soundex-algorithm should result in far better results.