Wednesday, December 5, 2007

Reservoir Sampling

Problem Statement

source: http://gregable.com/2007/10/reservoir-sampling.html

Reservoir Sampling is a fun algorithm. Imagine you are given a really large stream of data elements (queries on Google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements, evenly distributed over the original stream. How would you do it?

If you know N (the length of the stream) and can index into it, the answer is easy: generate 1,000 random integers between 0 and N-1, then retrieve the elements at those indices and you have your sample.
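
As a quick illustration, here is a minimal Python sketch of that easy case; the names sample_known_n and get_element are made up for the example, and it assumes you can fetch any element by index:

import random

def sample_known_n(get_element, n, k=1000):
    # Choose k distinct indices uniformly at random, then fetch those elements.
    indices = random.sample(range(n), k)
    return [get_element(i) for i in indices]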

So, let me make the problem harder: you don't know N (the size of the stream) in advance, and you can't index directly into it. You could count the elements first, but that requires making two passes over the data. You can do better. There are heuristics you might try, such as guessing the length and hoping to undershoot, but they will either need more than one pass or not produce an evenly distributed sample.

Simple Solution

A relatively easy and correct solution is to assign a random number to every element as you see it in the stream, and then always keep the 1,000 elements with the highest numbers. This is similar to how MySQL handles "ORDER BY RAND()" queries.
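
A minimal sketch of that idea, assuming Python and a min-heap keyed on the random tag (the function name simple_sample is made up for the example):

import heapq
import random

def simple_sample(stream, k=1000):
    # Keep the k elements with the largest random tags seen so far.
    heap = []  # min-heap of (tag, index, element); index breaks ties
    for i, element in enumerate(stream):
        tag = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (tag, i, element))
        elif tag > heap[0][0]:
            heapq.heapreplace(heap, (tag, i, element))
    return [element for _, _, element in heap]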

tu: what is the problem with this simple solution?
A: It works fine for this problem, but it cannot be extended to the more complex cases, which we will see later.


Classic Reservoir Sampling

Another option is reservoir sampling. For the example I've given you, the algorithm above is a little simpler, but reservoir sampling can be extended in other ways, which I will get to. First, the basic algorithm:

First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the i'th element (starting with i = 1,001) such that at the end of that step, the 1,000 elements in your reservoir are randomly sampled from the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability, after the 1,001'th step, should element 1,001 (or any element, for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001. In other words, choose to add element 1,001 to your reservoir with probability 1,000/1,001. If you choose to add it (which you likely will), then have it replace an element in the reservoir chosen uniformly at random.

I've shown that this produces a 1,000/1,001 chance of selecting the 1,001'th element, but what about the 2nd element? The second element is definitely in the reservoir at step 1,000, and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - 1/1,001 = 1,000/1,001.

This can be extended to the i'th round: keep the i'th element with probability 1,000/i, and if you choose to keep it, replace a random element in the reservoir. It is pretty easy to prove that this works for all values of i using induction. It obviously works for the i'th element itself, since the algorithm selects it with the correct probability outright. The probability that any earlier element is in the reservoir before this step is 1,000/(i-1). The probability that such an element is removed at this step is 1,000/i * 1/1,000 = 1/i, so the probability that it sticks around, given that it is already in the reservoir, is (i-1)/i. Thus each element's overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
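
Here is a minimal Python sketch of the algorithm just described (the function name reservoir_sample is made up for the example):

import random

def reservoir_sample(stream, k=1000):
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            # Base case: fill the reservoir with the first k elements.
            reservoir.append(element)
        elif random.random() < k / i:
            # Keep element i with probability k/i, replacing a uniformly random slot.
            reservoir[random.randrange(k)] = element
    return reservoir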

Weighted Reservoir Sampling Variation

tu: seems not quite right. What is the reservoir size?

Take the same problem above, but add an extra challenge: how would you sample from a weighted distribution where each element is given a weight? This is sorta tricky. The above algorithm can be found on the web easily, but I don't know where to find a weighted reservoir sampling version, so I made one up. I'm 100% sure someone has figured this out before me, though; I don't take credit.

Start the same way by filling in the first 1,000 elements of the reservoir, except also keep a running sum of the weights seen so far. Call this seen_weight. Also, store the weights of the individual elements in the reservoir in a separate array named weight[]. Lastly, to keep things efficient, store a variable for the total weight of the elements in the reservoir, called reservoir_weight. On step i, generate a random number R between 0 and 1 and use this pseudocode:

prob_sum = 0;
for (x = 0; x < reservoir_size; ++x) {
  // Probability mass assigned to replacing slot x (see the explanation below).
  prob_sum += (1 - (weight[x] * seen_weight / reservoir_weight / (seen_weight + input_weight))) / reservoir_size;
  if (prob_sum > R) {
    // replace; the probability of a replace is input_weight / (seen_weight + input_weight)
    reservoir_weight += input_weight - weight[x];
    reservoir[x] = input_element;
    weight[x] = input_weight;
    break;
  }
}
seen_weight += input_weight;

The only magic here is: (1 - (weight[x] * seen_weight / reservoir_weight / (seen_weight + input_weight))) / reservoir_size. We treat the seen_weight so far as the weight of the current reservoir, so if we have a seen weight of 9 and a new element comes in with weight 1, it will get added with 1/(9+1) of the weight. weight[x] * seen_weight / reservoir_weight does the magic by calculating the fraction of the reservoir's weight allocated to element x and then scaling it by seen_weight. After this step, the new reservoir will represent a weight of (seen_weight + input_weight), so we divide by this; then, since we are only talking about a single element at a time, we divide by the reservoir size too.
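
For concreteness, here is a direct, runnable Python transcription of the pseudocode above (the function name weighted_reservoir_sample and the (element, weight) stream format are my own choices; this mirrors the scheme as written, which the note above questions, rather than a verified weighted sampler):

import random

def weighted_reservoir_sample(stream, k=1000):
    # stream yields (element, weight) pairs.
    reservoir, weight = [], []
    seen_weight = 0.0
    reservoir_weight = 0.0
    for element, w in stream:
        if len(reservoir) < k:
            # Base case: fill the reservoir with the first k elements.
            reservoir.append(element)
            weight.append(w)
            reservoir_weight += w
            seen_weight += w
            continue
        R = random.random()
        prob_sum = 0.0
        for x in range(k):
            prob_sum += (1 - (weight[x] * seen_weight / reservoir_weight
                              / (seen_weight + w))) / k
            if prob_sum > R:
                reservoir_weight += w - weight[x]
                reservoir[x] = element
                weight[x] = w
                break
        seen_weight += w
    return reservoir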

Distributed Reservoir Sampling Variation

This is the problem that got me looking at the weighted sample above. Both of the above algorithms can process the stream in O(N) time, where N is the length of the stream; in other words, in a single pass. If I want to break up the problem across, say, 10 machines and solve it close to 10 times faster, how can I do that?

The answer is to have each of the 10 machines take roughly 1/10th of the input, process it, and generate its own reservoir sample from its subset of the data. Then, a final process must take the 10 output reservoirs and merge them with another sample. The trick is that the final process must take into account the reservoir weights of each of the input reservoirs. Basically, the final process should reservoir sample every element in the 10 input reservoirs, but assign each element a weight of weight[x] * seen_weight / reservoir_weight, where weight[x] is the element's original weight, seen_weight is the total weight counted in generating that particular reservoir, and reservoir_weight is the sum of the weights of the elements in that reservoir. Proving this works is a little harder, so I will frustrate you by leaving it as an exercise for the dear reader.
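
A sketch of that merge step, assuming the weighted_reservoir_sample sketch above and that each machine reports its reservoir, the corresponding weights, and its seen_weight (the name merge_reservoirs and this tuple format are made up for the example):

def merge_reservoirs(partials, k=1000):
    # partials is a list of (reservoir, weights, seen_weight), one per machine.
    def reweighted():
        for reservoir, weights, seen_weight in partials:
            reservoir_weight = sum(weights)
            for element, w in zip(reservoir, weights):
                # Each surviving element stands in for a share of everything
                # its machine saw: weight[x] * seen_weight / reservoir_weight.
                yield element, w * seen_weight / reservoir_weight
    return weighted_reservoir_sample(reweighted(), k)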
