
Weighted Random select (with a twist)

Posted: Fri Dec 29, 2006 9:55 am
by paladaxar
Alright...I'm not even sure if I can explain what I am trying to do in words, but I'll try.

I have two tables involved in this select. Table 1 has a column that sets the weight of each row in it. Then table 2 has the rows that will be selected, each being relationally connected to one of the rows in table 1. Now, if it were a one-to-one relationship, and both tables were small, it wouldn't be very tough to set up a weighted random select by expanding each row into an array and selecting randomly from it.
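For reference, the naive expand-into-an-array approach described above might look like this in Python (the row ids and weights here are invented for illustration):

```python
import random

# Hypothetical table 1: (row_id, weight) pairs.
weighted_rows = [(1, 5), (2, 2), (3, 3)]

# Expand each row id into the pool once per unit of weight...
pool = [row_id for row_id, weight in weighted_rows for _ in range(weight)]

# ...so a uniform pick from the pool is a weighted pick from table 1.
chosen = random.choice(pool)
```

This is exactly what stops scaling once the tables grow: the pool's size is the sum of all weights, and it has to be rebuilt whenever weights change.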

There are at least two conditions, though, that complicate the situation. First, table 2 can have more than one row associated with one row in table 1. But, the weight cannot change. I.e., if a row in table 1 has a weight of 5, and there are two rows associated with it (in table 2), I don't want the weight to be 10...it needs to remain 5. So, each of the two rows in table 2 has a weight of 2.5.

The next problem is that the tables will not be small, so I cannot expand each row into an array every time a selection needs to be made, nor can I use ORDER BY RAND(), since that too would take a huge amount of time -- possibly seconds. On a table where selections will be made on the order of a few per second, that is simply not acceptable.

Since this query is going to be executed very frequently, it needs to be FAST.

I have written some code to try to do all of this...but I have changed it and re-changed it so many times to try to get this working, that the code is just a big mess now. I think I need to start over with some fresh ideas. Anything is welcome!

There are a few more complications, but I think I have solved them already, so I won't include them here and make my question even worse.

Posted: Fri Dec 29, 2006 10:08 am
by feyd
So what I would likely do is perform the basic weighted random select via a separate table that stores the expanded set. Then perform another random pick if the record is associated with more than one source.
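A sketch of that two-step idea in Python (the data here is invented for illustration): a weighted pick over table 1 via the expanded pool, then a uniform pick among the table-2 rows tied to the chosen parent. The uniform second step is what makes each of two children of a weight-5 parent effectively worth 2.5, as in the example above.

```python
import random

# Hypothetical data: table 1 weights, and table 2 rows grouped by parent.
parent_weights = {1: 5, 2: 2}
children = {1: ["a", "b"], 2: ["c"]}

# Step 1: weighted pick of a table-1 row via the expanded pool.
pool = [pid for pid, w in parent_weights.items() for _ in range(w)]
parent = random.choice(pool)

# Step 2: uniform pick among that parent's table-2 rows.
# Parent 1 (weight 5, two children) gives each child weight 2.5.
row = random.choice(children[parent])
```

Note the cache only ever stores whole weights; the fractional 2.5 never has to be materialized anywhere.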

Posted: Fri Dec 29, 2006 10:20 am
by paladaxar
I don't think I'm following what you said...

What is the separate table? Are you saying to expand the data once and store all of the values in another table? I can't really do that, because table 2 is always growing, so I would have to regenerate the "separate table" far too often. The weights in table 1 will be changing frequently as well.

Posted: Fri Dec 29, 2006 10:33 am
by feyd
If you wrote the code right, a "regeneration" pass wouldn't require much: records to remove and records to add. Literally all that would require is checking how much the weight has changed when that record is being updated.

You're building a cache of the expansion. All the meat should be taken care of when the weight is changed. Nothing about the composition of the weight -- how many records are associated with it -- would affect the expanded weight cache. Because the data associated with the weights are uniform, it's a simple matter of randomly selecting amongst them if there is more than one record of information associated with the weighted record.
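One way to read the incremental-update point: when a weight changes, only the difference between the old and new values needs to be added to or removed from the cache. A minimal in-memory sketch (the real version would be INSERTs/DELETEs against the cache table; the function name is invented):

```python
def update_cache(pool, row_id, old_weight, new_weight):
    """Adjust the expanded pool in place when a row's weight changes."""
    if new_weight > old_weight:
        # Weight grew: append the extra copies.
        pool.extend([row_id] * (new_weight - old_weight))
    elif new_weight < old_weight:
        # Weight shrank: remove the surplus copies.
        for _ in range(old_weight - new_weight):
            pool.remove(row_id)

pool = [1, 1, 1, 2, 2]
update_cache(pool, 1, 3, 5)  # row 1's weight goes from 3 to 5
```

The cost is proportional to the size of the change, not the size of the cache, which is why a full regeneration is never needed.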

Posted: Fri Dec 29, 2006 10:47 am
by paladaxar
Ok, so the new table would basically be a table version of the array with the expanded weights. But, if some rows have a very high weight, say 500, the new table could get very very large. And, how would I write 2.5 rows to a table (assuming a weight of 5, with two rows associated with it)?

Then, once the cache table is created, how would I select something from it at random without using ORDER BY RAND(), which creates way too much load on the server? The cache table may have millions of rows in it (since each row in table 1 would be multiplied by its weight).

Posted: Fri Dec 29, 2006 11:11 am
by feyd
The cache has no concern whatsoever for the records associated with the weights. So there is no 2.5. You would simply have five for the main record. The associated records have uniform distribution against that weighted record.

I have not had any performance issues with performing a RAND() against a very simple query. Alternatively, you could store a cache of that pool already shuffled. Then all you have to do is perform a limited select based on an offset kept somewhere. There are potential race conditions involved in that, but that may not matter.
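The offset idea above can be sketched like this, with SQLite standing in for the real database (the table and column names are invented): count the cache rows once, pick a random offset, and fetch a single row with LIMIT 1 OFFSET, so no full-table sort ever happens.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weight_cache (row_id INTEGER)")
conn.executemany("INSERT INTO weight_cache VALUES (?)",
                 [(i,) for i in (1, 1, 1, 2, 2)])  # pre-expanded pool

# Random offset instead of ORDER BY RAND(): the server scans at most
# `offset` rows rather than sorting the entire cache table.
(total,) = conn.execute("SELECT COUNT(*) FROM weight_cache").fetchone()
offset = random.randrange(total)
(row_id,) = conn.execute(
    "SELECT row_id FROM weight_cache LIMIT 1 OFFSET ?", (offset,)
).fetchone()
```

The count could itself be cached and maintained alongside the pool, which is where the race conditions feyd mentions would come in.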

Posted: Fri Dec 29, 2006 2:04 pm
by paladaxar
Alright, thanks feyd...I think that's going to work. It took a while to think through everything (including the other complications) but I think your idea will be the best way to go about this.

I'll try to remember to post my final solution once I get to it.

P.S...if everyone spells and pronounces your name wrong, and since it's spelled and pronounced "fade", why don't you just spell it f-a-d-e?

Posted: Fri Dec 29, 2006 3:33 pm
by feyd
feyd is the name of a character in a series of books you may have heard of: Dune.

:)

http://en.wikipedia.org/wiki/Feyd_Rautha

Posted: Fri Dec 29, 2006 4:06 pm
by paladaxar
Oh ok. Cool. Never heard of the books, but they sound good. Thanks again for your help. This forum really needs more people like you.

Posted: Fri Dec 29, 2006 4:22 pm
by John Cartwright
paladaxar wrote:This forum really needs more people like you.
I think we're doing o-k, but a couple more feyds wouldn't hurt :)

Posted: Fri Dec 29, 2006 4:27 pm
by RobertGonzalez
Jcart wrote:
paladaxar wrote:This forum really needs more people like you.
I think we're doing o-k, but a couple more feyds wouldn't hurt :)
Of course, if we had a couple more feyds we could probably do away with all the mainframe computers, internet server farms and about 90% of wikipedia. :wink: