Notes on Reservoir Sampling

Reservoir sampling is a useful technique when sample k items for a large data sets.

In this post I outline some ideas on how it works.

First, let’s take a look at the task sampling one number of an array. We want to ensure that each number has equal probability of being chosen.

The straightforward way to sample a number of a list of numbers would be to randomly generate an index in the range of all numbers and pick that one. Here we show some python code

import random
a = [1, 2, 4]
rand_idx = random.randrange(len(a))
rand_sample = a[rand_idx]

There are two problem with this approach when the size of a is very large

(1) This requires knowing the length of a, which might be hard to compute without streaming through terrabytes of data.

(2) This requires keeping all of the data in memory in some implementations.

To tackle this issue, we want to sample a with an alternative approach

import random
a = [1, 2, 4]

def reservoir_sampling_one(input_array):
    buffer = [input_array[0]]
    for i in range(len(input_array)): #this can be checking if out of range
        rand_idx = random.randrange(0,i+1)
        if rand_idx == 0:
            buffer[0] = a[i]
    return buffer

In this version, the buffer in the end will return the one number we sampled. This approach only need to keep the buffer in memory. We can replace the length parameter by checking if the index is out of bounds (for example, by catching an exception or check for a sentry). This way, we don’t need the len(input_array) as well.

If you do the math, the probability of a specific number in the buffer in the end is 1/(n), where n is the len(input_array). The easiest way to think about it is that for the first number, 1, to be in the buffer, the probability is 1 * 1/2 * 2/3 * … * (n-1)/n = 1/n . Things cancel out, and the probability is 1/n. You can try this for all the numbers, and it is true. I believe a proof can be done by induction.

Here’s a small program for you to verify this result with simulation

count = 0
num_trails = 10000
test_idx = 1
for i in range(num_trails):
    output = reservoir_sampling_one(a)
    if a[test_idx] in output:
        count += 1

print(f"a[test_idx] sampled {count} times out of {num_trails} trails")
print("prob of a[test_idx] sampled: ", float(count)/num_trails)

Now we can generalize this to sampling k items with the code below

import random
a = [1, 2, 4]
k = 2

def reservoir_sampling(sample_size, input_array):
    buffer = []
    for i in range(len(input_array)):
        rand_idx = random.randrange(0,i+1)
        if rand_idx < sample_size:
            if len(buffer) < sample_size:
                buffer.append(input_array[i])
            else:
                buffer[rand_idx] = input_array[i]
    #print(buffer)
    return buffer
    
count = 0
num_trails = 10000
test_idx = 1
for i in range(num_trails):
    output = reservoir_sampling(k, a)
    if a[test_idx] in output:
        count += 1

print(f"a[test_idx] sampled {count} times out of {num_trails} trails")
print("prob of a[test_idx] sampled: ", float(count)/num_trails)

With the simulation you will see that the probability a number at the test_idx is chosen is around 0.67 (2/3), which is k / n.

Again, the reason it works is that things cancel out. I won’t go through the math over here. You are welcomed to read through the reference materials on the mathematics.

So there you go, this is a very efficient sampling method for very large datasets.

Reference Materials

https://en.wikipedia.org/wiki/Reservoir_sampling

https://www.educative.io/edpresso/what-is-reservoir-sampling

https://www.geeksforgeeks.org/reservoir-sampling/

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s