Letter generator, according to commonly used letters

Posted: Tue Jul 29, 2008 11:51 am
by ptocheia
Hi!

So I've currently got a chunk of code I'm working with that generates random letters to fill in extra spaces in a word search puzzle grid. Right now it's just this bit sitting inside a few loops for the random letter generator:

$randletter = chr(ord("a") + rand(0, 25));

Everything works fine as-is. However, I got a suggestion that there were perhaps too many Xs and Zs and uncommon letters in general, and not enough common letters such as vowels. So then I started wondering if it would be possible to make a random (err, not so random in this case) letter function that spits out letters according to how common they are in the English language.

Possibly using a multidimensional array to store each letter, plus a percentage value based on how common it is, plus a counter that increases each time that letter is used? And then somehow use the percentage plus the code snippet above to reject generated letters whose percentages are lower than the other letters', until a higher-percentage letter is generated, and use that one instead?

Problem is, it seems like there'd need to be a rather high number of letters generated before 26 separate percentages would be really practical, and while this program could be generating up to several hundred letters, that's still not all that many, considering. Or maybe the percentages could be done away with in favor of something more general, and I could somehow just have low-demand, mid-demand, and high-demand categories of letters? Or something along those lines?

So, this is more of a 'can this be done in a non-painful way' question than any sort of 'asking for help to write any hard code'. First off, I'm not sure if something like this might already exist out there? Second, I'm having a hard time wrapping my wee brain around what sort of logic needs to be used here, as well. So, I'd appreciate any useful thoughts that could be thrown at the matter.

Thanks!

Re: Letter generator, according to commonly used letters

Posted: Tue Jul 29, 2008 12:07 pm
by mabwi
I think you'd want to have some sort of weighting value assigned to each letter, based on actual frequency. Depending on how accurate you wanted to be, you could do something like a 1 to 5 scale, and split the letters into 5 tiers, most common to least common, and explode that out into an array with a weighted number of each value - for example, there would be 5 'E's, 1 'Z', and just do array_rand() for each spot in the grid.

It would produce a more accurate representation than the 1-1 setup you have now. Also, you don't need to do anything to mark "used" letters. Purely weighted random should be good enough.
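
A minimal sketch of the tiered pool described above. The tier assignments are only a guess for illustration (they are not from this thread), and the names $tiers, $pool and randomLetter() are made up for the example. Note that array_rand() returns a random key rather than a value, so the key has to be used to index back into the pool.

```php
<?php
// Hypothetical tiers: the weight (1 = rare, 5 = very common) is the array key.
// The letter groupings are an assumption; adjust them to taste.
$tiers = array(
    5 => 'eta',
    4 => 'oinshr',
    3 => 'dlcum',
    2 => 'wfgypb',
    1 => 'vkjxqz',
);

// Build the pool: each letter appears as many times as its tier weight,
// e.g. five 'e's and a single 'z'.
$pool = array();
foreach ($tiers as $weight => $letters) {
    foreach (str_split($letters) as $letter) {
        for ($i = 0; $i < $weight; $i++) {
            $pool[] = $letter;
        }
    }
}

// array_rand() returns a random KEY, so look the letter up in the pool.
function randomLetter(array $pool) {
    return $pool[array_rand($pool)];
}

// Drop-in replacement for the chr(ord("a") + rand(0, 25)) line.
$randletter = randomLetter($pool);
```

Build the pool once, outside the grid loops, and only call randomLetter($pool) inside them. No "used" counter is needed, since every draw is independent.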

Re: Letter generator, according to commonly used letters

Posted: Tue Jul 29, 2008 12:43 pm
by onion2k
When I wrote my wordsearch generator (currently broken :( ) I used the frequency of tiles in Scrabble as a basis for the frequency of random letters in the grid. It worked pretty well.
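
For reference, a rough sketch of how the Scrabble idea might look. This is not onion2k's actual code; the counts are the standard English tile distribution with the two blanks left out, and $scrabbleCounts and $bag are names invented for the example.

```php
<?php
// Standard English Scrabble tile counts, blanks omitted (98 letter tiles).
$scrabbleCounts = array(
    'a' => 9,  'b' => 2, 'c' => 2, 'd' => 4, 'e' => 12, 'f' => 2, 'g' => 3,
    'h' => 2,  'i' => 9, 'j' => 1, 'k' => 1, 'l' => 4,  'm' => 2, 'n' => 6,
    'o' => 8,  'p' => 2, 'q' => 1, 'r' => 6, 's' => 4,  't' => 6, 'u' => 4,
    'v' => 2,  'w' => 2, 'x' => 1, 'y' => 2, 'z' => 1,
);

// Expand the counts into one string "bag": 'aaaaaaaaabbccdddd...'.
$bag = '';
foreach ($scrabbleCounts as $letter => $count) {
    $bag .= str_repeat($letter, $count);
}

// Pick one character from the bag at a random offset.
$randletter = $bag[mt_rand(0, strlen($bag) - 1)];
```

Unlike a real game of Scrabble, nothing is removed from the bag here, so the proportions stay the same no matter how many letters the grid needs.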

Re: Letter generator, according to commonly used letters

Posted: Tue Jul 29, 2008 12:44 pm
by ptocheia
mabwi wrote: I think you'd want to have some sort of weighting value assigned to each letter, based on actual frequency. Depending on how accurate you wanted to be, you could do something like a 1 to 5 scale, and split the letters into 5 tiers, most common to least common, and explode that out into an array with a weighted number of each value - for example, there would be 5 'E's, 1 'Z', and just do array_rand() for each spot in the grid.

It would produce a more accurate representation than the 1-1 setup you have now. Also, you don't need to do anything to mark "used" letters. Purely weighted random should be good enough.

Sounds simple and effective, thanks for the advice!

Re: Letter generator, according to commonly used letters

Posted: Tue Jul 29, 2008 12:49 pm
by ptocheia
onion2k wrote: When I wrote my wordsearch generator (currently broken :( ) I used the frequency of tiles in Scrabble as a basis for the frequency of random letters in the grid. It worked pretty well.
Scrabble frequency is a good idea; it sure beats trying to use the percentages on the Wikipedia page (http://en.wikipedia.org/wiki/Letter_frequencies) that I was staring at!
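
For what it's worth, percentages can also be used directly without building a pool, by walking a running total until it passes a random target. The sketch below assumes a hypothetical weightedLetter() helper, and the $sample weights are rough, rounded values for just a handful of letters; a real table would need all 26.

```php
<?php
// Return one key of $weights, chosen with probability proportional to its
// weight. Weights can be percentages or any non-negative numbers; they do
// not need to add up to 100.
function weightedLetter(array $weights) {
    $total = array_sum($weights);
    // Random float in the range [0, $total).
    $target = (mt_rand() / (mt_getrandmax() + 1)) * $total;

    $running = 0.0;
    foreach ($weights as $letter => $weight) {
        $running += $weight;
        if ($target < $running) {
            return $letter;
        }
    }

    // Fallback for floating-point rounding: return the last letter.
    end($weights);
    return key($weights);
}

// Illustrative, approximate weights for a few letters only (assumption).
$sample = array('e' => 12.7, 't' => 9.1, 'a' => 8.2, 'q' => 0.1, 'z' => 0.07);
$randletter = weightedLetter($sample);
```

This works the same whether the grid needs ten letters or several hundred; the percentages only describe how likely each draw is, not how many letters have to be generated.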