Page 2 of 3

Posted: Thu Dec 13, 2007 8:03 am
by Weirdan
Something like this could possibly work:

Code: Select all

select min(id) from  (
    select id, weight, (select sum(weight) from t tt where tt.id <= t.id) sum   from t
) z
where sum > rand() * (select sum(weight) from t)
but my testing shows it doesn't:

Code: Select all

mysql> select sum(id=1)/count(*) as 'weight=0.5', sum(id=2) / count(*) as 'weight=0.3', sum(id=3) / count(*) as 'weight=0.2', count(*) as 'total runs' from t2;
+------------+------------+------------+------------+
| weight=0.5 | weight=0.3 | weight=0.2 | total runs |
+------------+------------+------------+------------+
|     0.4737 |     0.3991 |     0.1272 |        228 |
+------------+------------+------------+------------+
1 row in set (0.00 sec)

Posted: Thu Dec 13, 2007 8:50 am
by Kieran Huggins
onion2k wrote:With the weights of 1 and 2 the first random number needs to be double the second in order for the first record to be chosen.

If random2 is over 0.5 user2 will always be chosen.

For values of random1 below 0.5 there is a 50% chance that user2 will be chosen.

Therefore 1/2 the selections where random1 is under 0.5 user2 will be chosen, and all the selections where random2 is over 0.5 user2 will be chosen.

That's 3/4 selections.

(Not a very clear explanation, but MySQL is correct)
onion++

Good explanation - makes sense now! This is why probability makes my brain hurt, but in a fun way.

Posted: Thu Dec 13, 2007 1:37 pm
by paladaxar
Weirdan wrote:Something like this could possibly work:
.......
but my testing shows it doesn't:
Haha...at least you tested it! Most people just say "um...try this..." and then go to bed thinking that they helped out.

I have been working on this problem for about 2 weeks now, trying everything that I can think of. I appreciate anyone and everyone's efforts in helping to come up with a solution that works. Two heads are better than one, especially when the second head actually has some idea of what they are doing. I'm not a complete newbie when it comes to MySQL, but I certainly dont know nearly as much as many of you do.

Posted: Thu Dec 13, 2007 1:44 pm
by paladaxar
I found that for the simple table that I posted earlier (with just 2 users) that this query works:

$query1 = "select user_id from user_table order by rand()*(((1/3)*3+weight)/((2/3)*3)) desc limit 1";

I noticed that 3 is the sum of all weights, interestingly, but I'm still working on figuring out how to make this work when I add more users and more weight. But I have a feeling that this is a way that should work: have some type of formula that I multiply rand() by to scale the weight to get the results that I am looking for.

Posted: Thu Dec 13, 2007 3:20 pm
by VladSun
I have a banner rotator based on similar principle (implemented in PHP by using arrays).

Suppose you have N data records (R1 ... Rn) - then you have N weights (W1 ... Wn).
You can calculate the sum of all weights = Ws.
So, you want to get a probability for record N to be chosen equal to:

p = Wn/Ws

We can build a new series of these records like these:

R1(V1,W1), R2(V2,W2) ... RN(Vn,Wn)
where Vi = Vi-1+Wi (similar to Fibonacci series)

Having this in mind a possible abstract query would look like this:

select first Ri where Vi > randomNumber * Ws

or its equivalent with Wi

select first Ri where SUM(W1..Wi) > randomNumber * Ws

the randomNumber *should* be constant while MySQL engine executes the query ( @Weirdan - that's why your query gives unexpected results)

Tested (http://ipclassify.relef.net/1.php and refresh several times :))
Full code: http://ipclassify.relef.net/1.phps

Code: Select all

 
        $rand = rand(0,1000)/1000;
    
    $query = "
    select 
        id, weight 
    from 
        t 
    where 
         (select sum(t2.weight) from t as t2 where t.id >= t2.id) > (select sum(weight) from t)*$rand
    limit 
        1
        ";
 

Code: Select all

 
id: 1 (10)    count: 1005
id: 2 (20)    count: 2018
id: 3 (30)    count: 3046
id: 4 (40)    count: 3921
 
PS: It's best case when we have

$rand = rand(0,sum_of_all_weights)/sum_of_all_weights;

instead of

$rand = rand(0,1000)/1000;

It's about precision.

Posted: Thu Dec 13, 2007 11:55 pm
by paladaxar
Wow...that's ingenious! Thank you so much.

Now, this script is going to need to be able to run possibly hundreds of times per second. So, I am wondering if there is any way to eliminate the sub queries. I see three queries there (one main and two sub). What if we added two more columns say "fibonacci_weight" and "total_weight". Could we eliminate the sub queries by using these columns? Like maybe select yada yada "where fibonacci_weight > total_weight*$rand"? Then maybe every minute or so we could update these columns. That way the query would run one time per minute instead of running a query with 2 sub queries a hundred times a second.

I'm still trying to understand your query completely so this may or may not work...just brain storming here...

PS...when I click on your link (to test your code) the numbers seem right, but every time I refresh, the numbers get higher. What's going on there?

Posted: Fri Dec 14, 2007 5:49 am
by VladSun
Yes, you could calculate the Vi at insert time - and I think it's the best you can do. So... the insert query would look like this:

Code: Select all

INSERT INTO TABLE 
   (
      id, 
      DATA, 
      V
   ) 
   VALUES
      (
         NULL, 
         $data, 
         (
            SELECT 
               sum(V) 
            FROM 
               TABLE
         ) + $Weight)
 

You have the sum of ALL weights IN your last record

Code: Select all

SELECT 
   V 
FROM 
   TABLE 
ORDER BY 
   id DESC 
LIMIT 
   1
Let's have its value in $weight_sum. Then

Code: Select all

$V_rand = rand(0, $weight_sum);
 

Then your select query would look like this:

Code: Select all

select 
   * 
from 
   table 
where 
   V>$V_rand 
order by 
   V 
limit 
   1 
Don't forget to update field V in your records when deleting a record.
PS...when I click on your link (to test your code) the numbers seem right, but every time I refresh, the numbers get higher. What's going on there?
Have you looked at full source code ;) I don't delete any records - every time you refresh you add new records and the value of count() is increasing :)

Posted: Fri Dec 14, 2007 6:11 am
by Weirdan
the randomNumber *should* be constant while MySQL engine executes the query ( @Weirdan - that's why your query gives unexpected results)
Indeed. After modifying it to look like this:

Code: Select all

set @rand=rand(); 
insert into t2 
select 
   (@num := @num + 1) as num, 
   min(id) from  (       
        select id, weight, (select sum(weight) from t tt where tt.id <= t.id) sum       
        from t 
   ) z 
where sum > @rand * (select sum(weight) from t);
this is what i'm getting:

Code: Select all

mysql> select sum(id=1)/count(*) as 'weight=0.5', sum(id=2) / count(*) as 'weight=0.3', sum(id=3) / count(*) as 'weight=0.2', count(*) as 'total runs' from t2;
+------------+------------+------------+------------+
| weight=0.5 | weight=0.3 | weight=0.2 | total runs |
+------------+------------+------------+------------+
|     0.5000 |     0.2900 |     0.2100 |        100 |
+------------+------------+------------+------------+
1 row in set (0.01 sec)
Pretty close to the desired results

Posted: Fri Dec 14, 2007 10:22 am
by patrickwonders
The probability is 75% that 1 * rand() will be less than or equal to 2 * rand().

The x-axis respresents the value of 1 * rand(), the y-axis represent 2 * rand(). The green area is the area where 1 * rand is less than or equal to 2 * rand().

Image

But, I believe you're going to get hosed trying to fiddle the weights if you want more than two weights. If you fudge the weights so that 2 is twice as likely as one and three is thrice as likely as one, I'd bet that 3 isn't 50% more likely than 2.

And, in fact, quick calculation bears that out. If you want weights 1, a, and b such that a is chosen twice as often as 1 and b is chosen three times as often as one, then b will get chosen five thirds as often as a, but you probably wanted it to be chosen three halves as often as a.

Oops: Sorry, I tried replying to this thread a day or two ago, but needed to jump through the registration hoop. In the intervening time, onion2k already explained why it's 75%... so this post is redundant. But, I'm leaving it here because the picture is worth quite a number of words.

Posted: Fri Dec 14, 2007 11:03 am
by paladaxar
Hey Vlad, I think we're getting really close. I love the new query and how simple and fast it is. But, it breaks down on some tables and I have NO idea why. It makes no sense for it to work great on some tables but not on others. Here are a few examples:

The query on this table works beautifully:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       1       7
The query on this table breaks down:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       4       10
On the first table, I get nicely distributed percentages that match the weights of each id. For the second table, id 1 is chosen 14% and id 4 is chosen 86%!

It makes no sense to me. I have been fiddling around with different numbers trying to figure this out. Some tables work great, other configurations just break completely. I havent yet found a common trend in the tables that are breaking.

PS...How's Sophia these days? I was there back in 2005 and I thought it was a pretty cool place. We went to a few great clubs and we all thought the underground malls were ingenious! (If those are common place in Europe, then sorry, but I'm from the US and we dont have those...)

Lastly, I want to give a big thank you to everyone who has been helping out...I could never have figured out half of this without you guys. And Patrick...the visual is awesome. It really helped to explain things.

Posted: Fri Dec 14, 2007 11:37 am
by VladSun
Sorry - I have a mistake in the insert query :(
It should look like this:

Code: Select all

insert into table (id, data, V) values(NULL, $data, (select max(V) from table) + $Weight)
And my second query looks really silly :) Should be:

Code: Select all

select max(V) from table
For the second table, id 1 is chosen 14% and id 4 is chosen 86%!
That means all other records are never selected - are you sure?
PS...How's Sophia these days? I was there back in 2005 and I thought it was a pretty cool place. We went to a few great clubs and we all thought the underground malls were ingenious! (If those are common place in Europe, then sorry, but I'm from the US and we dont have those...)
Yes - it's cold in the winter, naturally :) About the clubs - I know what are you talking about ;)

Posted: Fri Dec 14, 2007 11:59 am
by VladSun
I've tested it with your second table values - no problems ... works fine with the new queries. Could you post some code and SQL data.

Posted: Fri Dec 14, 2007 3:00 pm
by paladaxar
VladSun wrote:
For the second table, id 1 is chosen 14% and id 4 is chosen 86%!
That means all other records are never selected - are you sure?
Well, first of all, it wasnt 14% and 86%, I had a silly little mistake, but now that everything is "right" I am getting user 1 10% of the time and user 4 90% of the time.

Using the second table (the one that breaks, giving 10% and 90%),

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       4       10
here is the code that I am running:

Code: Select all

<?php
   include("lidb.php");
   for($i = 0; $i < 10000; $i++)
   {
      
      $v_rand = rand(1,10);
   
 
      $query1 = "
      select
        user_id, weight
      from
        test3
      where
         v >= $v_rand
      order by
        v
      limit 1";
      $result1 = mysql_query($query1);
      $line1 = mysql_fetch_row($result1);
      if ($line1[0] == 1)
         $user1++;
      if ($line1[0] == 2)
         $user2++;
      if ($line1[0] == 3)
         $user3++;
      if ($line1[0] == 4)
         $user4++;
     }
   echo "User 1: $user1 <br>";
   echo "User 2: $user2 <br>";
   echo "User 3: $user3 <br>";
   echo "User 4: $user4 <br><br>";
   
   echo "User 1: ".round($user1/100)."% <br>";
   echo "User 2: ".round($user2/100)."% <br>";
   echo "User 3: ".round($user3/100)."% <br>";
   echo "User 4: ".round($user4/100)."% <br>";
   
?>
And my output is:

Code: Select all

User 1: 990
User 2:
User 3:
User 4: 9010

User 1: 10%
User 2: 0%
User 3: 0%
User 4: 90%

Posted: Fri Dec 14, 2007 3:12 pm
by paladaxar
If I change the table to:

Code: Select all

id   weight      v
1       1       1
2       2       3
3       3       6
4       3       9
My output is:

Code: Select all

User 1: 1087
User 2: 2179
User 3: 3317
User 4: 3417

User 1: 11%
User 2: 22%
User 3: 33%
User 4: 34%
Exactly what I want!

I'm so confused. This seems like a glitch in the DB or something...not my (or should I say Vlad's) code.

Posted: Fri Dec 14, 2007 4:11 pm
by nathanr
so if you remove the extra columns generated and take it back to the simple user_id, weight table.. is this the sql we'd end up with?

Code: Select all

SELECT
     user_id,
     (SELECT RAND())*((SELECT MAX(weight) from test3)+weight-(weight/(SELECT SUM(weight) FROM test3))) as random
FROM test3
ORDER BY random DESC
LIMIT 1;