Page 1 of 3
[SOLVED] Weighted Random MySQL query problems [SOLVED]
Posted: Mon Dec 10, 2007 12:53 am
by paladaxar
Ok, I'm trying to make a weighted random from a table, but I dont think I fully understand the "order by rand()" method.
Here's my problem...
My query is simple:
$query1 = "select user_id from user_table order by rand()*weight limit 1";
The contents of user_table:
user_id weight
1 ------- 1
2 ------- 2
Here are the results:
(out of 1000 tries)
# times user 1 is chosen ~ 750
# times user 2 is chosen ~ 250
Since user 1 has a weight of 1, I would expect user one to be chosen 1/3 of the time (333/1000) and user 2 to be chosen 2/3 of the time (667/1000). But, user 1 is being chosen 3/4 of the time and user 2 is being chosen 1/4 of the time.
Is there something else I should multiply rand() by? Maybe rand()*(1/weight)? Nope, that just switches them so that user 2 gets chosen 75% of the time and user 1 is chosen 25%.
I need to get this simple query working (and understand it) before I can move on and use the results.
Posted: Tue Dec 11, 2007 1:08 am
by Kieran Huggins
I'm COMPLETELY stumped on this one as well. 2 and 3 give you the results you're looking for, but I don't understand why either.
Anyone know WTF is going on here?
Posted: Tue Dec 11, 2007 1:22 am
by Benjamin
Probably because..
Returns a random floating-point value v in the range 0 <= v < 1.0.
http://dev.mysql.com/doc/refman/5.0/en/ ... ction_rand
0.6 * 1 = 0.6
0.6 * 2 = 1.2
0.6 * 3 = 1.8
Posted: Tue Dec 11, 2007 2:41 am
by Kieran Huggins
but then wouldn't #2 return 2x as likely as #1? Hence 2 thirds of the time v.s. 1 third?
The current behaviour is #2 returning 3x as likely as #1.
Posted: Tue Dec 11, 2007 10:57 pm
by paladaxar
Thanks for the post, but I dont think I follow everything you said. I have read the man many times and I have tried to find something in it that talked about weighted random searches. I couldn't find any.
I realize that rand() just picks a random number between 0 and 1, but could you explain exactly what is happening when my query runs? What is the method behind an "order by rand()"? I realize that a temporary table is generated, but thats about all I know. If you "order by rand()" does it choose ONE random number then order the temp table by it? or does it choose one for each row? or does it choose one number, create a table, then choose another number and create another table, then randomly chooses which one to pick your final output from? I think that if I understand what is going on when this query is run, I will be able to figure out what I need to do to get the results that I am looking for. Unfortunately, no one seems to be able to explain to me exactly how this query works.
So, in still trying to figure this out on my own, I ran some tests. I set user 1's weight to 1 and kept it there as I adjusted user 2's weight. Then I put the query into a for loop that ran it 1000 times to see what kind of results were generated.
With this query:
"select user_id from test2 order by rand()/weight limit 1"
and keeping user 1's weight set to 1...
Here are the results:
User 2 weight:
------------Number of times that user 2 was chosen out of 1000
0.01 ----- 5
0.125 -- 62
0.25 -- 125
0.5 ---- 250
1 ------ 500
1.25 -- 600
1.5 ---- 667
1.75 -- 715
2 ------ 750
3 ------ 832
4 ------ 875
5 ------ 900
6 ------ 915
7 ------ 930
8 ------ 937
9 ------ 946
10 ---- 952
100 -- 995
1000 -999.5
Since all of those numbers are kind of hard to relate to each other, I threw them into a graph to see what they "looked" like. Here it is (interestingly):
Now, if I could find a way to flatten that logarithmic (?) graph into a linear graph, that would be great. My problem would be solved for at least the case where user 1's weight is 1. Beyond that, I'm not sure yet...
Posted: Wed Dec 12, 2007 12:39 am
by Benjamin
I really don't know exactly how it works either, but I do know that the mediawiki software uses either floats or decimals for random results in their table and I don't believe any of them were greater than 1. I think they are aiming for truly random results however, so this may not match what your looking for.
Posted: Wed Dec 12, 2007 4:19 am
by nathanr
ahh this ones simpler than you think..
rand produces a number between 0 and 1
your first weight is 1
therefore 100% of the time your going to get a result of 1 or less on that row with rand()*weight
== 0.XXXXX*1 you see
where as with a weight of 2 you can get a value of over 1 roughly half of the time as it rand has to produce 0.5 or above for that row to get a result higher than 1.
thus row 1 is picked a hell of a lot more..
the answer is simpler yet though : your query is correct, your just ordering it wrong lol, ORDER BY naturally goes ASC, so change your query to:
Code: Select all
$query1 = "select user_id from user_table order by rand()*weight DESC limit 1";
and you'll get what you want
ps astions almost had it with:
but rand() would produce
0.134122
0.634721
0.342523
not 0.6 each time : unless you did rand(seed) where seed is an int
Posted: Wed Dec 12, 2007 10:09 am
by paladaxar
nathanr wrote:ahh this ones simpler than you think..
Man, I really wish it were that simple. If I order by rand()*weight DESC, it does choose user 2 more than user 1, but it chooses user 2 75% of the time and user 1 25% of the time. Thus, still not what I am looking for. I want user 1 to be chosen 1/3 (33%) of the time and user 2 to be chosen 2/3 (67%) of the time.
Is there a formula that I can use in my query to achieve that?
Posted: Wed Dec 12, 2007 10:12 am
by nathanr
yeah keep it desc and change the weight of 2, lower 2 to get the ratio you want, try 1 & 1.86 (no calcs there, just makes sense in my over tired worked for 49 hours straight head)

Posted: Wed Dec 12, 2007 10:20 am
by paladaxar
Actually if I lower the weight of user 2 down to 1.5, then I get the ratio that I am looking for. But, I'm still not 100% sure why. Also, I will have many more than just 2 users in the table. What weight do I choose for user 3? user 4? I have no idea how to pick weights for each so that I get the right ratios.
If there is a formula for picking their weights correctly, then couldnt I just choose discrete weights (integers) and put that formula into my query? That way the weights would look right to my human understanding, and still produce the correct results from the computer's understanding.
Posted: Wed Dec 12, 2007 10:22 am
by nathanr
there is a formula, i described it roughly up there..
however make life simple for yourself and either use weight values between 0 and 1 (0.120) or weight values higher than 1 (2>) don't use 1 or 0

Posted: Wed Dec 12, 2007 10:33 am
by paladaxar
No offense, but I cant really deduct a formula out of what you posted.
I tried weights in the range you suggested, I changed user 1 to .1 and user 2 to .2 and still got exactly the same results as when I was using 1 and 2. Then I tried using 10 and 20 and, you guessed it, got the same results.
I know its a lot to ask, but I really wish someone would create a table and try this query on it to see what I mean.
Here's the PHP you can use to test your theories...its what I have been doing for 2 weeks....
Code: Select all
<?php
include("log_into_db.php");
for($i = 0; $i < 10000; $i++)
{
$query = "select user_id from user_table order by rand()*weight desc limit 1";
$result = mysql_query($query);
$line = mysql_fetch_row($result);
if ($line[0] == 1)
$user1++;
if ($line[0] == 2)
$user2++;
if ($line[0] == 3)
$user3++;
echo "User 1: $user1 <br>";
echo "User 2: $user2 <br>";
echo "User 3: $user3 <br>"
}
?>
Posted: Wed Dec 12, 2007 11:21 pm
by Kieran Huggins
I've created said table, and am getting exactly your results, paladaxar.
I'm also extremely confused. This does NOT seem correct... not by a long shot.
Let's see if this duplicates in other SQL DB engines. If MySQL is alone we'll file a bug report.
Posted: Thu Dec 13, 2007 4:40 am
by onion2k
With the weights of 1 and 2 the first random number needs to be double the second in order for the first record to be chosen.
If random2 is over 0.5 user2 will always be chosen.
For values of random1 below 0.5 there is a 50% chance that user2 will be chosen.
Therefore 1/2 the selections where random1 is under 0.5 user2 will be chosen, and all the selections where random2 is over 0.5 user2 will be chosen.
That's 3/4 selections.
(Not a very clear explanation, but MySQL is correct)
Posted: Thu Dec 13, 2007 7:55 am
by Weirdan
here's the only viable option I ever seen for sql:
http://www.sqlteam.com/article/returnin ... random-row
It involves adding additional column and recalculating it (for every row) every time a row is inserted