[SOLVED] Weighted Random MySQL query problems [SOLVED]
Moderator: General Moderators
[SOLVED] Weighted Random MySQL query problems [SOLVED]
Ok, I'm trying to make a weighted random from a table, but I dont think I fully understand the "order by rand()" method.
Here's my problem...
My query is simple:
$query1 = "select user_id from user_table order by rand()*weight limit 1";
The contents of user_table:
user_id weight
1 ------- 1
2 ------- 2
Here are the results:
(out of 1000 tries)
# times user 1 is chosen ~ 750
# times user 2 is chosen ~ 250
Since user 1 has a weight of 1, I would expect user one to be chosen 1/3 of the time (333/1000) and user 2 to be chosen 2/3 of the time (667/1000). But, user 1 is being chosen 3/4 of the time and user 2 is being chosen 1/4 of the time.
Is there something else I should multiply rand() by? Maybe rand()*(1/weight)? Nope, that just switches them so that user 2 gets chosen 75% of the time and user 1 is chosen 25%.
I need to get this simple query working (and understand it) before I can move on and use the results.
Here's my problem...
My query is simple:
$query1 = "select user_id from user_table order by rand()*weight limit 1";
The contents of user_table:
user_id weight
1 ------- 1
2 ------- 2
Here are the results:
(out of 1000 tries)
# times user 1 is chosen ~ 750
# times user 2 is chosen ~ 250
Since user 1 has a weight of 1, I would expect user one to be chosen 1/3 of the time (333/1000) and user 2 to be chosen 2/3 of the time (667/1000). But, user 1 is being chosen 3/4 of the time and user 2 is being chosen 1/4 of the time.
Is there something else I should multiply rand() by? Maybe rand()*(1/weight)? Nope, that just switches them so that user 2 gets chosen 75% of the time and user 1 is chosen 25%.
I need to get this simple query working (and understand it) before I can move on and use the results.
Last edited by paladaxar on Wed Dec 19, 2007 11:17 am, edited 2 times in total.
- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
- Contact:
Probably because..
0.6 * 1 = 0.6
0.6 * 2 = 1.2
0.6 * 3 = 1.8
http://dev.mysql.com/doc/refman/5.0/en/ ... ction_randReturns a random floating-point value v in the range 0 <= v < 1.0.
0.6 * 1 = 0.6
0.6 * 2 = 1.2
0.6 * 3 = 1.8
- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
- Contact:
Thanks for the post, but I dont think I follow everything you said. I have read the man many times and I have tried to find something in it that talked about weighted random searches. I couldn't find any.astions wrote:Probably because..
http://dev.mysql.com/doc/refman/5.0/en/ ... ction_randReturns a random floating-point value v in the range 0 <= v < 1.0.
0.6 * 1 = 0.6
0.6 * 2 = 1.2
0.6 * 3 = 1.8
I realize that rand() just picks a random number between 0 and 1, but could you explain exactly what is happening when my query runs? What is the method behind an "order by rand()"? I realize that a temporary table is generated, but thats about all I know. If you "order by rand()" does it choose ONE random number then order the temp table by it? or does it choose one for each row? or does it choose one number, create a table, then choose another number and create another table, then randomly chooses which one to pick your final output from? I think that if I understand what is going on when this query is run, I will be able to figure out what I need to do to get the results that I am looking for. Unfortunately, no one seems to be able to explain to me exactly how this query works.
So, in still trying to figure this out on my own, I ran some tests. I set user 1's weight to 1 and kept it there as I adjusted user 2's weight. Then I put the query into a for loop that ran it 1000 times to see what kind of results were generated.
With this query:
"select user_id from test2 order by rand()/weight limit 1"
and keeping user 1's weight set to 1...
Here are the results:
User 2 weight:
------------Number of times that user 2 was chosen out of 1000
0.01 ----- 5
0.125 -- 62
0.25 -- 125
0.5 ---- 250
1 ------ 500
1.25 -- 600
1.5 ---- 667
1.75 -- 715
2 ------ 750
3 ------ 832
4 ------ 875
5 ------ 900
6 ------ 915
7 ------ 930
8 ------ 937
9 ------ 946
10 ---- 952
100 -- 995
1000 -999.5
Since all of those numbers are kind of hard to relate to each other, I threw them into a graph to see what they "looked" like. Here it is (interestingly):

Now, if I could find a way to flatten that logarithmic (?) graph into a linear graph, that would be great. My problem would be solved for at least the case where user 1's weight is 1. Beyond that, I'm not sure yet...
I really don't know exactly how it works either, but I do know that the mediawiki software uses either floats or decimals for random results in their table and I don't believe any of them were greater than 1. I think they are aiming for truly random results however, so this may not match what your looking for.
ahh this ones simpler than you think..
rand produces a number between 0 and 1
your first weight is 1
therefore 100% of the time your going to get a result of 1 or less on that row with rand()*weight
== 0.XXXXX*1 you see
where as with a weight of 2 you can get a value of over 1 roughly half of the time as it rand has to produce 0.5 or above for that row to get a result higher than 1.
thus row 1 is picked a hell of a lot more..
the answer is simpler yet though : your query is correct, your just ordering it wrong lol, ORDER BY naturally goes ASC, so change your query to:
and you'll get what you want 
ps astions almost had it with:
0.134122
0.634721
0.342523
not 0.6 each time : unless you did rand(seed) where seed is an int
rand produces a number between 0 and 1
your first weight is 1
therefore 100% of the time your going to get a result of 1 or less on that row with rand()*weight
== 0.XXXXX*1 you see
where as with a weight of 2 you can get a value of over 1 roughly half of the time as it rand has to produce 0.5 or above for that row to get a result higher than 1.
thus row 1 is picked a hell of a lot more..
the answer is simpler yet though : your query is correct, your just ordering it wrong lol, ORDER BY naturally goes ASC, so change your query to:
Code: Select all
$query1 = "select user_id from user_table order by rand()*weight DESC limit 1";ps astions almost had it with:
but rand() would produceastions wrote:Probably because..
http://dev.mysql.com/doc/refman/5.0/en/ ... ction_randReturns a random floating-point value v in the range 0 <= v < 1.0.
0.6 * 1 = 0.6
0.6 * 2 = 1.2
0.6 * 3 = 1.8
0.134122
0.634721
0.342523
not 0.6 each time : unless you did rand(seed) where seed is an int
Man, I really wish it were that simple. If I order by rand()*weight DESC, it does choose user 2 more than user 1, but it chooses user 2 75% of the time and user 1 25% of the time. Thus, still not what I am looking for. I want user 1 to be chosen 1/3 (33%) of the time and user 2 to be chosen 2/3 (67%) of the time.nathanr wrote:ahh this ones simpler than you think..
Is there a formula that I can use in my query to achieve that?
Actually if I lower the weight of user 2 down to 1.5, then I get the ratio that I am looking for. But, I'm still not 100% sure why. Also, I will have many more than just 2 users in the table. What weight do I choose for user 3? user 4? I have no idea how to pick weights for each so that I get the right ratios.
If there is a formula for picking their weights correctly, then couldnt I just choose discrete weights (integers) and put that formula into my query? That way the weights would look right to my human understanding, and still produce the correct results from the computer's understanding.
If there is a formula for picking their weights correctly, then couldnt I just choose discrete weights (integers) and put that formula into my query? That way the weights would look right to my human understanding, and still produce the correct results from the computer's understanding.
No offense, but I cant really deduct a formula out of what you posted.
I tried weights in the range you suggested, I changed user 1 to .1 and user 2 to .2 and still got exactly the same results as when I was using 1 and 2. Then I tried using 10 and 20 and, you guessed it, got the same results.
I know its a lot to ask, but I really wish someone would create a table and try this query on it to see what I mean.
Here's the PHP you can use to test your theories...its what I have been doing for 2 weeks....
I tried weights in the range you suggested, I changed user 1 to .1 and user 2 to .2 and still got exactly the same results as when I was using 1 and 2. Then I tried using 10 and 20 and, you guessed it, got the same results.
I know its a lot to ask, but I really wish someone would create a table and try this query on it to see what I mean.
Here's the PHP you can use to test your theories...its what I have been doing for 2 weeks....
Code: Select all
<?php
include("log_into_db.php");
for($i = 0; $i < 10000; $i++)
{
$query = "select user_id from user_table order by rand()*weight desc limit 1";
$result = mysql_query($query);
$line = mysql_fetch_row($result);
if ($line[0] == 1)
$user1++;
if ($line[0] == 2)
$user2++;
if ($line[0] == 3)
$user3++;
echo "User 1: $user1 <br>";
echo "User 2: $user2 <br>";
echo "User 3: $user3 <br>"
}
?>- Kieran Huggins
- DevNet Master
- Posts: 3635
- Joined: Wed Dec 06, 2006 4:14 pm
- Location: Toronto, Canada
- Contact:
With the weights of 1 and 2 the first random number needs to be double the second in order for the first record to be chosen.
If random2 is over 0.5 user2 will always be chosen.
For values of random1 below 0.5 there is a 50% chance that user2 will be chosen.
Therefore 1/2 the selections where random1 is under 0.5 user2 will be chosen, and all the selections where random2 is over 0.5 user2 will be chosen.
That's 3/4 selections.
(Not a very clear explanation, but MySQL is correct)
If random2 is over 0.5 user2 will always be chosen.
For values of random1 below 0.5 there is a 50% chance that user2 will be chosen.
Therefore 1/2 the selections where random1 is under 0.5 user2 will be chosen, and all the selections where random2 is over 0.5 user2 will be chosen.
That's 3/4 selections.
(Not a very clear explanation, but MySQL is correct)
here's the only viable option I ever seen for sql:
http://www.sqlteam.com/article/returnin ... random-row
It involves adding additional column and recalculating it (for every row) every time a row is inserted
http://www.sqlteam.com/article/returnin ... random-row
It involves adding additional column and recalculating it (for every row) every time a row is inserted