Page 1 of 1

Anagram Searching

Posted: Wed Jan 19, 2005 8:50 am
by JayBird
I have a DB that has around 250,000 english words in it.

I was thinking the other day how to create a script that would allow a user to enter some characters i.e. "servant"

It would then search the database and display other words, that can be made up of the same letters, i.e. for "servant" it could return "taverns".

Then, also you could have options to return all words that can be made from them letters, with a minimum length of 3 letters

e.g.

ant
tan
rent
rest

At first i thought it would be easy, then i started thinking about it and my brain melted! :lol:

Any ideas?

Posted: Wed Jan 19, 2005 8:59 am
by magicrobotmonkey
brute force it:

Code: Select all

//if its in the db:
$query = "select wordText from dict order by CHAR_LENGTH(wordText), wordText

while($thisDictWord = $res)
{
    if(count(array_intersect(str_split($testWord), str_spilt($thisDictWord)) >0))
        $anagramsї] = $thisDictWord;

}
you could also play with similar_text()

Posted: Wed Jan 19, 2005 9:43 am
by AGISB
I would find it easier to find anagrams of the same length. As it reduces the brute force approach a lot.

I think the better approach would be to create a list of possible words and get the rows with given a very long where clause.

So basically

anagram
aangram
angaram
.
.

Code: Select all

Select * from dict Where wordText IN (anagram, aangram, angaram ...)

much less system load doing it this way. And make sure you got an index on wordText



By the way: Where did you get that database? Is it downloadable anywhere open source?

Posted: Wed Jan 19, 2005 10:35 am
by JayBird
AGISB wrote:By the way: Where did you get that database? Is it downloadable anywhere open source?
I have just uploaded a zip file containing 12 text files


compound_forms-wordlist-76,205-words.txt
consolidated list.txt
FILE main list -- 103,070 entries.txt
Moby compound forms -- 256,772 entries.txt
Moby single words -- 354,984 entries.txt
Orchy list -- 175, 171 entries.txt
Roger King's dictionary -- 191,625 entries.txt
Roget's Thesaurus list -- 17,474 entries.txt
Shorter Oxford English Dict -- 119,995 entries.txt
UK ACD, no accents -- 230,534 entries.txt
Unix dictionary -- 25,104 entries.txt
wordlist-234,936-words.txt


The file is currenltly uplaoding, it should be 8.18 MB (8,578,681 bytes)

http://www.infinit-e.com/wordlists.zip

Mark

Re: Anagram Searching

Posted: Wed Jan 19, 2005 10:47 am
by onion2k
Bech100 wrote:I have a DB that has around 250,000 english words in it.

I was thinking the other day how to create a script that would allow a user to enter some characters i.e. "servant"

It would then search the database and display other words, that can be made up of the same letters, i.e. for "servant" it could return "taverns".
This is actually really simple so long as you process everything beforehand. Make a script that inserts all the words into a database table.. and also inserts another string thats made up of the letters in the word in alphabetical order. So you'd have:

servant | aenrstv
taverns | aenrstv

Finding an anagram would just be a matter of taking the lookup word, converting it into alphabetical letter order, and pulling out all the entries that match.
Then, also you could have options to return all words that can be made from them letters, with a minimum length of 3 letters

e.g.

ant
tan
rent
rest

At first i thought it would be easy, then i started thinking about it and my brain melted! :lol:

Any ideas?
Again, using the alphabet order index words, you'd have to go through all the possible 3 letter combinations, then 4 letter, 5 letter, and so on.

Theres probably better ways to index the words. That sort of thing is always a challenge..

Posted: Wed Jan 19, 2005 11:22 am
by AGISB
Thanks Mark. Downloading now.

Posted: Wed Jan 19, 2005 12:46 pm
by timvw
building all combination with some input can be easily solved recursively..

untested code

Code: Select all

function buildcombo($chars)
{
      // only one char - thus only one solution
      if (strlen($chars) == 1) return $chars{0};

     $solutions = array();
     for ($i = 1; $i < strlen($chars); ++$i)
     &#123;
           $others = substr($chars, 0, $i);
            $others .= substr($chars, $i+1, strlen($chars));

            $subsolutions = buildcombo($others);

            foreach($subsolutions as $subsolution)
                    $solutions&#1111;] = $chars&#123;$i&#125; . $subsolution;
     &#125;
     return $solutions;
&#125;

Posted: Wed Jan 19, 2005 12:48 pm
by feyd
should be able to do it with a regular expression pretty easy.. and no recursion needed.