Page 1 of 1
[SOLVED] Random characters and letters
Posted: Fri Jul 10, 2009 2:12 pm
by William
Hey all,
I'm currently working on a project that I will be generating random identifiers that contain letters and numbers. It will be all uppercase. I'm trying to figure out the best solution to generate them very randomly so it's less likely they will collide. I will of course be checking to see if the identifier is already in use, and if it is regenerate, but I'd prefer to have to do the least amount of checks as possible.
I was thinking that maybe hashing data from urandom and using substr() to take X amount of characters off the hash. Anyone else have any other better ideas?
Example...
Code: Select all
function generateTransactionID( )
{
do {
if ( ( $handle = fopen( '/dev/urandom', 'rb' ) ) ) {
$random = fread( $handle, 8 );
fclose( $handle );
}
$hash = substr( md5( $random ), 0, 12 );
// $transaction = query DB if found true else false
} while ( $transaction !== false );
return strtoupper( $hash );
}
Example output:
93C88EE7E607
4DFD36F7A366
5D20CF60E253
5796307D51EB
67BEE2A79FDF
9D4C58E7B8EB
3C6183CED554
DFDDF00A1289
62DE6F4B0FA9
1510EECCBFDC
Thanks
Re: Random characters and letters
Posted: Fri Jul 10, 2009 3:43 pm
by andyhoneycutt
Code: Select all
echo strtoupper(md5(time() . rand(0,time())));
Seems to solve your problem.
output:
Code: Select all
99996/100000 uniques generated.
real 0m0.974s
user 0m0.929s
sys 0m0.045s
-Andy
Re: Random characters and letters
Posted: Fri Jul 10, 2009 4:51 pm
by pickle
Does the length of the unique string matter? If not, I'd suggest prepending the time in microseconds to string. In pretty much all practical applications this alone is a fairly good guarantee that the strings will be unique (though not a mathematical certainty).
If the length of the string does matter, then I'll point out that only taking the first 12 characters will increase the chance of collision. For example, say you only took the first character - you'd already have collisions in the sample data you generated. 12 characters won't be as bad, but you get the idea.
In any case, I've never heard of using that /dev/urandom file to generate random characters. I wrote a test script (below) to test it's speed versus how I usually generate random strings, and it's relatively REALLY slow - file operations generally are. I'd recommend doing something like this instead:
Code: Select all
$characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$character_length = strlen($characters);
for($j = 0;$j < 8;$j++)
{
$random .= $characters[rand(0,$character_length)];
}
$hash = substr(md5($random),0,12);
Here's the test script I used & the results below it:
Code: Select all
<?PHP
$iterations = 1000;
$file_start = microtime(TRUE);
for($i = 0;$i < $iterations; $i++)
{
if($handle = fopen('/dev/urandom','rb'))
{
$random = fread($handle,8);
fclose($handle);
}
$hash = substr(md5($random),0,12);
}
$file_end = microtime(TRUE);
$file_diff = $file_end - $file_start;
$string_start = microtime(TRUE);
$characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$character_length = strlen($characters);
$random = '';
for($i = 0;$i < $iterations;$i++)
{
$random = '';
for($j = 0;$j < 8;$j++)
{
$random .= $characters[rand(0,$character_length)];
}
$hash = substr(md5($random),0,12);
}
$string_end = microtime(TRUE);
$string_diff = $string_end - $string_start;
echo <<<OUTPUT
Iterations: $iterations<br />
Opening file duration: $file_diff seconds<br />
Random character duration: $string_diff seconds
OUTPUT;
?>
Script wrote:Iterations: 1000
Opening file duration: 1.3470158576965 seconds
Random character duration: 0.008868932723999 seconds
Re: Random characters and letters
Posted: Fri Jul 10, 2009 5:00 pm
by andyhoneycutt
pickle, could you elaborate, at least for my curiosity's sake, how your solution would be better used here than my own? I'm only asking out of sheer curiosity because I'm wondering if I'm missing something key to this problem.
Thanks much,
Andy
Re: Random characters and letters
Posted: Fri Jul 10, 2009 5:06 pm
by andyhoneycutt
When I account for the significantly smaller string length, I'm still seeing nearly the same percentage of uniques being generated by the following script:
Code: Select all
$s = array();
for($i=0;$i<100000;$i++)
$s[] = substr(strtoupper(md5(time() . rand(0,time()))),0,12);
$t = count($s);
$u = count(array_unique($s));
echo "$u/$t uniques generated.";
Yields the following result (5 non-unique terms):
Code: Select all
99995/100000 uniques generated.
real 0m1.076s
user 0m1.021s
sys 0m0.055s
Re: Random characters and letters
Posted: Fri Jul 10, 2009 5:10 pm
by William
pickle wrote:Does the length of the unique string matter? If not, I'd suggest prepending the time in microseconds to string. In pretty much all practical applications this alone is a fairly good guarantee that the strings will be unique (though not a mathematical certainty).
If the length of the string does matter, then I'll point out that only taking the first 12 characters will increase the chance of collision. For example, say you only took the first character - you'd already have collisions in the sample data you generated. 12 characters won't be as bad, but you get the idea.
In any case, I've never heard of using that /dev/urandom file to generate random characters. I wrote a test script (below) to test it's speed versus how I usually generate random strings, and it's relatively REALLY slow - file operations generally are. I'd recommend doing something like this instead:
Code: Select all
$characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$character_length = strlen($characters);
for($j = 0;$j < 8;$j++)
{
$random .= $characters[rand(0,$character_length)];
}
$hash = substr(md5($random),0,12);
Here's the test script I used & the results below it:
Code: Select all
<?PHP
$iterations = 1000;
$file_start = microtime(TRUE);
for($i = 0;$i < $iterations; $i++)
{
if($handle = fopen('/dev/urandom','rb'))
{
$random = fread($handle,8);
fclose($handle);
}
$hash = substr(md5($random),0,12);
}
$file_end = microtime(TRUE);
$file_diff = $file_end - $file_start;
$string_start = microtime(TRUE);
$characters = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
$character_length = strlen($characters);
$random = '';
for($i = 0;$i < $iterations;$i++)
{
$random = '';
for($j = 0;$j < 8;$j++)
{
$random .= $characters[rand(0,$character_length)];
}
$hash = substr(md5($random),0,12);
}
$string_end = microtime(TRUE);
$string_diff = $string_end - $string_start;
echo <<<OUTPUT
Iterations: $iterations<br />
Opening file duration: $file_diff seconds<br />
Random character duration: $string_diff seconds
OUTPUT;
?>
Script wrote:Iterations: 1000
Opening file duration: 1.3470158576965 seconds
Random character duration: 0.008868932723999 seconds
Yeah I was originally doing something similar
Code: Select all
function generate_hash( $len = 8 )
{
$seed = 'A0BCaDbcIJdKeLM1N2OP3QR4ST5UV6WX7fgYhZij7klmHno8p9qrEstFuvGxyz';
$seed_len = strlen( $seed );
$random = '';
for ( $i = 0; $i < $len; ++$i )
{
$random .= substr( $seed, ( mt_rand() % $seed_len ), 1 );
}
return $random;
}
But I'm trying I was wondering if there was a better way (mathamatically) to help stop collisions. My math skills are not that great and I know some of the people here do a lot of research in how hashing algorithms work so I was hoping someone might have a simple solution to make something like the above better. I knew that by using md5 it would reduce how random it was by taking only a part of it, but I'm limited to a 12 character string, I figured that it "might" be better than something like above, although I can't prove that mathamatically.
Thanks a lot, I think I'll end up using one of these functions after all. Although I wouldn't mind if someone can tell me if a substr of a hash has a higher chance of less collisions than something like the function above.
Re: Random characters and letters
Posted: Fri Jul 10, 2009 5:38 pm
by William
Here are few examples:
Code: Select all
$a = 'MRF92JGN4OLQ6IWXC85BYZAP0K7EV1HS3DTU';
for ( $b = 0; $b < 10; ++$b ) {
$c = array();
$d = microtime( true );
for( $e = 0; $e < 100000; ++$e ) {
$f = '';
for ( $g = 0; $g < 12; ++$g ) {
$f .= substr( $a, ( mt_rand() % 36 ), 1 );
}
$c[] = $f;
}
$h = microtime( true );
$i = count( $c );
$j = count( array_unique( $c ) );
echo sprintf( '%u/%u uniques generated in %.4f seconds<br>', $j, $i, $h - $d );
}
Result:
100000/100000 uniques generated in 8.3945 seconds
100000/100000 uniques generated in 8.3961 seconds
100000/100000 uniques generated in 8.3905 seconds
100000/100000 uniques generated in 8.4155 seconds
100000/100000 uniques generated in 8.4325 seconds
100000/100000 uniques generated in 8.4174 seconds
100000/100000 uniques generated in 8.4212 seconds
100000/100000 uniques generated in 8.4405 seconds
100000/100000 uniques generated in 8.4373 seconds
100000/100000 uniques generated in 8.4512 seconds
The server these tests are running on is a Pentium D 2.8GHz computer by the way.
Edit: (Extended)
Modified the loop to do 1,000,000, this is the result
1000000/1000000 uniques generated in 83.6437 seconds
Which is an average of 0.0000836437 random strings per second.
Or I could change this line...
Code: Select all
$f .= substr( $a, ( mt_rand() % 36 ), 1 );
to
Resulting to...
1000000/1000000 uniques generated in 54.2089 seconds or 0.0000542089 per random string.
That is a 64.80% speed increase
or change it to...
Which results in:
1000000/1000000 uniques generated in 46.0932 seconds or 0.0000460932 per random string.
Edit again:
10000000/10000000 uniques generated in 459.8537 seconds or 0.00004598537 per random string.
I think I found my solution. Which would be:
Thanks all.

Re: [SOLVED] Random characters and letters
Posted: Sat Jul 11, 2009 2:44 am
by kaisellgren
If you are worried about collisions, you can just keep while looping until it does not collide.
Your solution results in data that has a strength of 71-bits (optimally). It's unlikely that you are ever going to have clashes; thus, having a while loop will not kill performance as it most likely never runs, but it does add reliability to your scheme.
Re: [SOLVED] Random characters and letters
Posted: Sat Jul 11, 2009 8:44 am
by William
kaisellgren wrote:If you are worried about collisions, you can just keep while looping until it does not collide.
Your solution results in data that has a strength of 71-bits (optimally). It's unlikely that you are ever going to have clashes; thus, having a while loop will not kill performance as it most likely never runs, but it does add reliability to your scheme.
Yeah I have it running on a loop checking if it's valid, I just wanted to make sure it was unique enough that there is a high chance that it won't have to query again.
Thanks!
Re: Random characters and letters
Posted: Mon Jul 13, 2009 9:57 am
by pickle
andyhoneycutt wrote:pickle, could you elaborate, at least for my curiosity's sake, how your solution would be better used here than my own?
To be honest, I didn't spend a whole lot of time looking at your solution before I worked on mine. I was bored at work & this was something to do
Anyway, looking at it now, your seed to MD5 only has numbers - 10 characters, whereas mine has 62. I'm no hash expert, but I'd imagine the number of hashes produced from those 10 characters
has to be smaller than the number of hashes from 62. Additionally, you're calling strtoupper(), which reduces the uniqueness as well.
Guys, we can do all the testing we want with 1001 different ways of seeding md5(). In the end it doesn't matter because it has been proven (not sure where) that 2 different strings, when run through md5, can have the same hash. There are an infinite number of character combinations that can be thrown at md5(), but only a finite number of hashes. No matter how md5() is seeded, you still need to check if it exists.
Re: Random characters and letters
Posted: Mon Jul 13, 2009 10:04 am
by William
pickle wrote:andyhoneycutt wrote:pickle, could you elaborate, at least for my curiosity's sake, how your solution would be better used here than my own?
To be honest, I didn't spend a whole lot of time looking at your solution before I worked on mine. I was bored at work & this was something to do
Anyway, looking at it now, your seed to MD5 only has numbers - 10 characters, whereas mine has 62. I'm no hash expert, but I'd imagine the number of hashes produced from those 10 characters
has to be smaller than the number of hashes from 62. Additionally, you're calling strtoupper(), which reduces the uniqueness as well.
Guys, we can do all the testing we want with 1001 different ways of seeding md5(). In the end it doesn't matter because it has been proven (not sure where) that 2 different strings, when run through md5, can have the same hash. There are an infinite number of character combinations that can be thrown at md5(), but only a finite number of hashes. No matter how md5() is seeded, you still need to check if it exists.
I wasn't trying to find a way to completely stop checking to see if a hash exists in the database already. I was simply trying to find a way to make a hash where the odds of it matching more than once in a row is slim. Basically if I'm doing 1 query per check, I don't want to have to check 10 times in a row because the two hashes I generated were all used. I know 10 query's doesn't seem like a lot, but when you have over 100,000,000 entries in your database, and you constantly are adding new every second and it's changing, then those query's add up. It looks like we found a good solution, thanks again for all your help.
And yeah, two hashes can be the same from the same string, there is only 2^32 possibilities for MD5, there are infinite combiniations, so you'll always be able to find a collision.
Thanks again guys!
Re: Random characters and letters
Posted: Mon Jul 13, 2009 12:29 pm
by andyhoneycutt
pickle wrote:Guys, we can do all the testing we want with 1001 different ways of seeding md5(). In the end it doesn't matter because it has been proven (not sure where) that 2 different strings, when run through md5, can have the same hash. There are an infinite number of character combinations that can be thrown at md5(), but only a finite number of hashes. No matter how md5() is seeded, you still need to check if it exists.
Awesome, thanks very much

I wasn't aware of this, though I should have done some more research on my part as I do rely on md5 for certain tasks. While I'm not assuming 100% uniqueness, I hadn't realized that md5 potentially yielded a much higher volume of non-uniques as compared to other similar operations. Again, thanks for that pickle! I had assumed there was good reason behind your approach
@William: Glad you solved your problem!
-Andy