Matching and grouping similar words/phrases

Any questions involving matching text strings to patterns - the pattern is called a "regular expression."

Moderator: General Moderators

Post Reply
Reviresco
Forum Contributor
Posts: 172
Joined: Tue Feb 19, 2008 4:18 pm
Location: Milwaukee

Matching and grouping similar words/phrases

Post by Reviresco »

Okay I must be brain dead here today but I can't come up with a process to do this.

On a medical-related website, when members register we ask them what medical conditions they are interested in learning about. We give them a list of them to choose from, and then just an "Other" field for them to add anything else that's not listed.

Now we're getting into breaking down the data, especially demographics of people by what medical condition they're interested in. Problem is with the "Others" -- with all the misspellings and variations, it's sometimes hard to group them. Examples:

asthma
Asthma
asthma / melanoma
Asthma/Sinus
Asthma and Allergies
asthsma
Athsma

psorasis
Psorasis
psoratic arthritis
psoriasis
psoriasis and poriatic arthrit
psrosis

COLEL-RECTAL CANCER
colon
colon and ovarian
Colon Cancer
Colon cancer
colon cancer
colon cancer (family history)
colon polyps
Colon Problems
colorectal cancer

sinus infections
sinusitis
sinusitus
sinus problems

...You get the idea. Obviously I can deal with the capitalization issues easily, but it's the other ones that are puzzling me.

What I'm looking for is an approach to identifying and grouping these identical or related responses into single categories. I've thought about preg_match, but what am I matching to what?

Also, the prospect of coming up with another list of medical conditions and matching to that is daunting, and I don't think the doctor or his interns have time to create the list, even though he offered to help.
User avatar
andyhoneycutt
Forum Contributor
Posts: 468
Joined: Wed Aug 27, 2008 10:02 am
Location: Idaho Falls

Re: Matching and grouping similar words/phrases

Post by andyhoneycutt »

This would probably be a good start for your research into solving this problem: http://php.net/manual/en/function.similar-text.php.

Hope that helps a bit, and if you come up with a great solution, please post! :)

-Andy
User avatar
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching and grouping similar words/phrases

Post by prometheuzz »

No, this is not something regex is suited for. With regex, you can precisely define how a string "looks" like, but trying to "fuzzy-match" using regex will only result in too many false positives, or the other way around.
Levenshtein has devised an clever algorithm that calculates the distance between two words. It calculates how many "steps" (insertions and deletions of characters) it takes to transform one word into the other, which is called the Levenshtein-distance. This is what you need to solve your problem.

More info: http://us2.php.net/manual/en/function.levenshtein.php

Good luck!
User avatar
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching and grouping similar words/phrases

Post by prometheuzz »

andyhoneycutt wrote:This would probably be a good start for your research into solving this problem: http://php.net/manual/en/function.similar-text.php.

Hope that helps a bit, and if you come up with a great solution, please post! :)

-Andy
That looks promising! I didn't know of it's existence.
Thanks Andy.
User avatar
andyhoneycutt
Forum Contributor
Posts: 468
Joined: Wed Aug 27, 2008 10:02 am
Location: Idaho Falls

Re: Matching and grouping similar words/phrases

Post by andyhoneycutt »

prometheuzz wrote:
andyhoneycutt wrote:This would probably be a good start for your research into solving this problem: http://php.net/manual/en/function.similar-text.php.

Hope that helps a bit, and if you come up with a great solution, please post! :)

-Andy
That looks promising! I didn't know of it's existence.
Thanks Andy.
I was just reading about the Levenshtein approach. I think in practical usage like what Reviresco is looking for, it would be much better suited! The one I have referenced merely comes up with a 'likeness' value for the two words. I think that's less accurate when you're handling human input to determine how similar the words are; you have to take into account misspellings, and counter to my assumptions, levenshtein takes less processor time. similar_text weighs in at O(n**3), and levenshtein at O(m*n)[where m and n are the lenghts of the words you're comparing].

Anyways, this is a very interesting topic- I'm hoping to find some way to use this in my programming in the future :D

-andy
User avatar
andyhoneycutt
Forum Contributor
Posts: 468
Joined: Wed Aug 27, 2008 10:02 am
Location: Idaho Falls

Re: Matching and grouping similar words/phrases

Post by andyhoneycutt »

I came up with a down-and-dirty comparison of the similar_text and levenshtein functions. I found that by tweaking the scoring method of the levenshtein procedure, I'm able to come up with a good start for your system:

Code: Select all

<?php
$a = file("/tmp/test.txt");
 
array_trim($a);
array_tolower($a);
 
echo "<pre>";
 
$r = array();
 
foreach($a as $b)
{
  foreach($a as $c)
  {
    $t    = array();
    $p    = 0;
    $t[0] = $b;
    $t[1] = $c;
    $t[2] = similar_text($b,$c,$p);
    $t[3] = $p;
    // make sure the larger of the two strings is the first argument. This
    // ensures that we don't get penalized for inserting characters when we're
    // actually deleting them, and also still maintains a penalty for inserting
    // when clearly we are dealing with two [very] different strings.
    $d = (strlen($b) <= strlen($c)) ? $c : $b;
    $e = (strlen($b) <= strlen($c)) ? $b : $c;
    $t[4] = levenshtein($d,$e,1,1,0);
    $r[]  = $t;
  }
}
 
// EDIT TO CHANGE:
//print_r($r);
// dump print replaced with the following on edit:
foreach($r as $w_scores)
{
  if( $w_scores[4] == 0 ) echo "Levenshtein shows {$w_scores[0]} and {$w_scores[1]} are probably the same.\n";
  if( $w_scores[3] >= 90) echo "similar_text shows {$w_scores[0]} and {$w_scores[1]} are probably the same.\n";
  echo "\n\n";
}
 
echo "</pre>";
 
function array_trim(&$a)
{
  for($i=0;$i<count($a);$i++)
    $a[$i] = trim($a[$i]);
}
 
function array_tolower(&$a)
{
  for($i=0;$i<count($a);$i++)
    $a[$i] = strtolower($a[$i]);
}
The text file:

Code: Select all

asthma
Asthma
asthma / melanoma
Asthma/Sinus
Asthma and Allergies
asthsma
Athsma
EDIT: changed the print_r to something meaningful.
Reviresco
Forum Contributor
Posts: 172
Joined: Tue Feb 19, 2008 4:18 pm
Location: Milwaukee

Re: Matching and grouping similar words/phrases

Post by Reviresco »

Getting mixed results so far. Looking at word/phrase pairings that got Levenshtein numbers of 3 or less (lower is better), sometimes it works great, other times not so great:

Some examples:

acne - add: 3
acne - adhd: 3
acne - aging: 3
acne - all: 3
acne - fitne: 3
brain fog - brain food: 2
colon cancer - colan cancer: 2
colorectal cancer - colel-rectal cancer: 3
death - detox: 3
death - feet: 3
dimentia - dementia: 1
diverliculi - diverticular: 3
diverticulosis - diverticulitis: 2
edema - anemia: 3
exema - anemia: 3
exema - edema: 1
alzheimers - altzheimers: 1
alzheimers - alzheimer: 1
alzheimers - alzhiemers: 2

I must not be trimming my strings properly (I could've sworn I trimmed all white space) -- "colon cancer" and "colan cancer" should only be 1, not 2. Hmmm....
User avatar
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching and grouping similar words/phrases

Post by prometheuzz »

That's too black-and-white. Besides looking at the Levenshtein distance, you should also take the length of the words into account. And to get more accurate results, you may even want to combine a few of these fuzzy matching algorithms.
User avatar
andyhoneycutt
Forum Contributor
Posts: 468
Joined: Wed Aug 27, 2008 10:02 am
Location: Idaho Falls

Re: Matching and grouping similar words/phrases

Post by andyhoneycutt »

I agree. I think combining the phonetic approach using sound_ex could also help -- if it sounds like "asthma", mostly, and is almost written as "asthma" then you've probably got a dead ringer.
User avatar
prometheuzz
Forum Regular
Posts: 779
Joined: Fri Apr 04, 2008 5:51 am

Re: Matching and grouping similar words/phrases

Post by prometheuzz »

andyhoneycutt wrote:I agree. I think combining the phonetic approach using sound_ex could also help -- if it sounds like "asthma", mostly, and is almost written as "asthma" then you've probably got a dead ringer.
Yes, combining it with a soundex-algorithm should result in far better results.
Post Reply