[SOLVED] Weighted Random MySQL query problems [SOLVED]

Questions about the MySQL, PostgreSQL, and most other databases, as well as using it with PHP can be asked here.

Moderator: General Moderators

User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

Something like this could possibly work:

Code: Select all

select min(id) from  (
    select id, weight, (select sum(weight) from t tt where tt.id <= t.id) sum   from t
) z
where sum > rand() * (select sum(weight) from t)
but my testing shows it doesn't:

Code: Select all

mysql> select sum(id=1)/count(*) as 'weight=0.5', sum(id=2) / count(*) as 'weight=0.3', sum(id=3) / count(*) as 'weight=0.2', count(*) as 'total runs' from t2;
+------------+------------+------------+------------+
| weight=0.5 | weight=0.3 | weight=0.2 | total runs |
+------------+------------+------------+------------+
|     0.4737 |     0.3991 |     0.1272 |        228 |
+------------+------------+------------+------------+
1 row in set (0.00 sec)
User avatar
Kieran Huggins
DevNet Master
Posts: 3635
Joined: Wed Dec 06, 2006 4:14 pm
Location: Toronto, Canada
Contact:

Post by Kieran Huggins »

onion2k wrote:With the weights of 1 and 2 the first random number needs to be double the second in order for the first record to be chosen.

If random2 is over 0.5 user2 will always be chosen.

For values of random1 below 0.5 there is a 50% chance that user2 will be chosen.

Therefore 1/2 the selections where random1 is under 0.5 user2 will be chosen, and all the selections where random2 is over 0.5 user2 will be chosen.

That's 3/4 selections.

(Not a very clear explanation, but MySQL is correct)
onion++

Good explanation - makes sense now! This is why probability makes my brain hurt, but in a fun way.
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

Weirdan wrote:Something like this could possibly work:
.......
but my testing shows it doesn't:
Haha...at least you tested it! Most people just say "um...try this..." and then go to bed thinking that they helped out.

I have been working on this problem for about 2 weeks now, trying everything that I can think of. I appreciate anyone and everyone's efforts in helping to come up with a solution that works. Two heads are better than one, especially when the second head actually has some idea of what they are doing. I'm not a complete newbie when it comes to MySQL, but I certainly dont know nearly as much as many of you do.
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

I found that for the simple table that I posted earlier (with just 2 users) that this query works:

$query1 = "select user_id from user_table order by rand()*(((1/3)*3+weight)/((2/3)*3)) desc limit 1";

I noticed that 3 is the sum of all weights, interestingly, but I'm still working on figuring out how to make this work when I add more users and more weight. But I have a feeling that this is a way that should work: have some type of formula that I multiply rand() by to scale the weight to get the results that I am looking for.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Post by VladSun »

I have a banner rotator based on similar principle (implemented in PHP by using arrays).

Suppose you have N data records (R1 ... Rn) - then you have N weights (W1 ... Wn).
You can calculate the sum of all weights = Ws.
So, you want to get a probability for record N to be chosen equal to:

p = Wn/Ws

We can build a new series of these records like these:

R1(V1,W1), R2(V2,W2) ... RN(Vn,Wn)
where Vi = Vi-1+Wi (similar to Fibonacci series)

Having this in mind a possible abstract query would look like this:

select first Ri where Vi > randomNumber * Ws

or its equivalent with Wi

select first Ri where SUM(W1..Wi) > randomNumber * Ws

the randomNumber *should* be constant while MySQL engine executes the query ( @Weirdan - that's why your query gives unexpected results)

Tested (http://ipclassify.relef.net/1.php and refresh several times :))
Full code: http://ipclassify.relef.net/1.phps

Code: Select all

 
        $rand = rand(0,1000)/1000;
    
    $query = "
    select 
        id, weight 
    from 
        t 
    where 
         (select sum(t2.weight) from t as t2 where t.id >= t2.id) > (select sum(weight) from t)*$rand
    limit 
        1
        ";
 

Code: Select all

 
id: 1 (10)    count: 1005
id: 2 (20)    count: 2018
id: 3 (30)    count: 3046
id: 4 (40)    count: 3921
 
PS: It's best case when we have

$rand = rand(0,sum_of_all_weights)/sum_of_all_weights;

instead of

$rand = rand(0,1000)/1000;

It's about precision.
Last edited by VladSun on Tue Jan 22, 2008 6:02 pm, edited 1 time in total.
There are 10 types of people in this world, those who understand binary and those who don't
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

Wow...that's ingenious! Thank you so much.

Now, this script is going to need to be able to run possibly hundreds of times per second. So, I am wondering if there is any way to eliminate the sub queries. I see three queries there (one main and two sub). What if we added two more columns say "fibonacci_weight" and "total_weight". Could we eliminate the sub queries by using these columns? Like maybe select yada yada "where fibonacci_weight > total_weight*$rand"? Then maybe every minute or so we could update these columns. That way the query would run one time per minute instead of running a query with 2 sub queries a hundred times a second.

I'm still trying to understand your query completely so this may or may not work...just brain storming here...

PS...when I click on your link (to test your code) the numbers seem right, but every time I refresh, the numbers get higher. What's going on there?
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Post by VladSun »

Yes, you could calculate the Vi at insert time - and I think it's the best you can do. So... the insert query would look like this:

Code: Select all

INSERT INTO TABLE 
   (
      id, 
      DATA, 
      V
   ) 
   VALUES
      (
         NULL, 
         $data, 
         (
            SELECT 
               sum(V) 
            FROM 
               TABLE
         ) + $Weight)
 

You have the sum of ALL weights IN your last record

Code: Select all

SELECT 
   V 
FROM 
   TABLE 
ORDER BY 
   id DESC 
LIMIT 
   1
Let's have its value in $weight_sum. Then

Code: Select all

$V_rand = rand(0, $weight_sum);
 

Then your select query would look like this:

Code: Select all

select 
   * 
from 
   table 
where 
   V>$V_rand 
order by 
   V 
limit 
   1 
Don't forget to update field V in your records when deleting a record.
PS...when I click on your link (to test your code) the numbers seem right, but every time I refresh, the numbers get higher. What's going on there?
Have you looked at full source code ;) I don't delete any records - every time you refresh you add new records and the value of count() is increasing :)
Last edited by VladSun on Wed Sep 15, 2010 3:26 pm, edited 4 times in total.
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Post by Weirdan »

the randomNumber *should* be constant while MySQL engine executes the query ( @Weirdan - that's why your query gives unexpected results)
Indeed. After modifying it to look like this:

Code: Select all

set @rand=rand(); 
insert into t2 
select 
   (@num := @num + 1) as num, 
   min(id) from  (       
        select id, weight, (select sum(weight) from t tt where tt.id <= t.id) sum       
        from t 
   ) z 
where sum > @rand * (select sum(weight) from t);
this is what i'm getting:

Code: Select all

mysql> select sum(id=1)/count(*) as 'weight=0.5', sum(id=2) / count(*) as 'weight=0.3', sum(id=3) / count(*) as 'weight=0.2', count(*) as 'total runs' from t2;
+------------+------------+------------+------------+
| weight=0.5 | weight=0.3 | weight=0.2 | total runs |
+------------+------------+------------+------------+
|     0.5000 |     0.2900 |     0.2100 |        100 |
+------------+------------+------------+------------+
1 row in set (0.01 sec)
Pretty close to the desired results
patrickwonders
Forum Newbie
Posts: 2
Joined: Wed Dec 12, 2007 4:35 pm

Post by patrickwonders »

The probability is 75% that 1 * rand() will be less than or equal to 2 * rand().

The x-axis respresents the value of 1 * rand(), the y-axis represent 2 * rand(). The green area is the area where 1 * rand is less than or equal to 2 * rand().

Image

But, I believe you're going to get hosed trying to fiddle the weights if you want more than two weights. If you fudge the weights so that 2 is twice as likely as one and three is thrice as likely as one, I'd bet that 3 isn't 50% more likely than 2.

And, in fact, quick calculation bears that out. If you want weights 1, a, and b such that a is chosen twice as often as 1 and b is chosen three times as often as one, then b will get chosen five thirds as often as a, but you probably wanted it to be chosen three halves as often as a.

Oops: Sorry, I tried replying to this thread a day or two ago, but needed to jump through the registration hoop. In the intervening time, onion2k already explained why it's 75%... so this post is redundant. But, I'm leaving it here because the picture is worth quite a number of words.
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

Hey Vlad, I think we're getting really close. I love the new query and how simple and fast it is. But, it breaks down on some tables and I have NO idea why. It makes no sense for it to work great on some tables but not on others. Here are a few examples:

The query on this table works beautifully:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       1       7
The query on this table breaks down:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       4       10
On the first table, I get nicely distributed percentages that match the weights of each id. For the second table, id 1 is chosen 14% and id 4 is chosen 86%!

It makes no sense to me. I have been fiddling around with different numbers trying to figure this out. Some tables work great, other configurations just break completely. I havent yet found a common trend in the tables that are breaking.

PS...How's Sophia these days? I was there back in 2005 and I thought it was a pretty cool place. We went to a few great clubs and we all thought the underground malls were ingenious! (If those are common place in Europe, then sorry, but I'm from the US and we dont have those...)

Lastly, I want to give a big thank you to everyone who has been helping out...I could never have figured out half of this without you guys. And Patrick...the visual is awesome. It really helped to explain things.
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Post by VladSun »

Sorry - I have a mistake in the insert query :(
It should look like this:

Code: Select all

insert into table (id, data, V) values(NULL, $data, (select max(V) from table) + $Weight)
And my second query looks really silly :) Should be:

Code: Select all

select max(V) from table
For the second table, id 1 is chosen 14% and id 4 is chosen 86%!
That means all other records are never selected - are you sure?
PS...How's Sophia these days? I was there back in 2005 and I thought it was a pretty cool place. We went to a few great clubs and we all thought the underground malls were ingenious! (If those are common place in Europe, then sorry, but I'm from the US and we dont have those...)
Yes - it's cold in the winter, naturally :) About the clubs - I know what are you talking about ;)
There are 10 types of people in this world, those who understand binary and those who don't
User avatar
VladSun
DevNet Master
Posts: 4313
Joined: Wed Jun 27, 2007 9:44 am
Location: Sofia, Bulgaria

Post by VladSun »

I've tested it with your second table values - no problems ... works fine with the new queries. Could you post some code and SQL data.
There are 10 types of people in this world, those who understand binary and those who don't
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

VladSun wrote:
For the second table, id 1 is chosen 14% and id 4 is chosen 86%!
That means all other records are never selected - are you sure?
Well, first of all, it wasnt 14% and 86%, I had a silly little mistake, but now that everything is "right" I am getting user 1 10% of the time and user 4 90% of the time.

Using the second table (the one that breaks, giving 10% and 90%),

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       4       10
here is the code that I am running:

Code: Select all

<?php
   include("lidb.php");
   for($i = 0; $i < 10000; $i++)
   {
      
      $v_rand = rand(1,10);
   
 
      $query1 = "
      select
        user_id, weight
      from
        test3
      where
         v >= $v_rand
      order by
        v
      limit 1";
      $result1 = mysql_query($query1);
      $line1 = mysql_fetch_row($result1);
      if ($line1[0] == 1)
         $user1++;
      if ($line1[0] == 2)
         $user2++;
      if ($line1[0] == 3)
         $user3++;
      if ($line1[0] == 4)
         $user4++;
     }
   echo "User 1: $user1 <br>";
   echo "User 2: $user2 <br>";
   echo "User 3: $user3 <br>";
   echo "User 4: $user4 <br><br>";
   
   echo "User 1: ".round($user1/100)."% <br>";
   echo "User 2: ".round($user2/100)."% <br>";
   echo "User 3: ".round($user3/100)."% <br>";
   echo "User 4: ".round($user4/100)."% <br>";
   
?>
And my output is:

Code: Select all

User 1: 990
User 2:
User 3:
User 4: 9010

User 1: 10%
User 2: 0%
User 3: 0%
User 4: 90%
paladaxar
Forum Commoner
Posts: 85
Joined: Fri Jun 18, 2004 11:50 pm

Post by paladaxar »

If I change the table to:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       3       9
My output is:

Code: Select all

User 1: 1087
User 2: 2179
User 3: 3317
User 4: 3417

User 1: 11%
User 2: 22%
User 3: 33%
User 4: 34%
Exactly what I want!

I'm so confused. This seems like a glitch in the DB or something...not my (or should I say Vlad's) code.
User avatar
nathanr
Forum Contributor
Posts: 200
Joined: Wed Jun 07, 2006 5:46 pm

Post by nathanr »

so if you remove the extra columns generated and take it back to the simple user_id, weight table.. is this the sql we'd end up with?

Code: Select all

SELECT
     user_id,
     (SELECT RAND())*((SELECT MAX(weight) from test3)+weight-(weight/(SELECT SUM(weight) FROM test3))) as random
FROM test3
ORDER BY random DESC
LIMIT 1;
Post Reply