Reservoir sampling is a useful technique for sampling k items from a large dataset.

In this post I outline some ideas on how it works.

First, let’s take a look at the task of sampling one number from an array. We want to ensure that each number has an equal probability of being chosen.

The straightforward way to sample a number from a list of numbers is to randomly generate an index within the range of the list and pick the element at that index. Here is some Python code:

```
import random

a = [1, 2, 4]
rand_idx = random.randrange(len(a))  # uniform random index in [0, len(a))
rand_sample = a[rand_idx]
```

There are two problems with this approach when a is very large:

(1) It requires knowing the length of a, which might be hard to compute without streaming through terabytes of data.

(2) It requires keeping all of the data in memory in some implementations.

To tackle these issues, we can sample a with an alternative approach:

```
import random

a = [1, 2, 4]

def reservoir_sampling_one(input_array):
    buffer = [input_array[0]]
    for i in range(len(input_array)):  # could instead check when we run out of input
        # Replace the buffered element with probability 1/(i+1)
        rand_idx = random.randrange(0, i + 1)
        if rand_idx == 0:
            buffer[0] = input_array[i]
    return buffer
```

In this version, the buffer at the end holds the one number we sampled, and this approach only needs to keep the buffer in memory. We can also replace the range(len(input_array)) loop by checking whether the index is out of bounds (for example, by catching an exception or checking for a sentinel value), so we don’t need len(input_array) at all.
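To make the streaming idea concrete, here is a sketch of the same algorithm written against an arbitrary iterable. The function name stream_sample_one is my own; the point is that it never calls len() and holds only a single element in memory, so it works on generators as well as lists:

```
import random

def stream_sample_one(iterable):
    # Sketch of a streaming variant: no len() needed, one element kept.
    buffer = None
    for i, item in enumerate(iterable):
        # Replace the buffered item with probability 1/(i+1)
        if random.randrange(i + 1) == 0:
            buffer = item
    return buffer
```

For example, stream_sample_one(x for x in range(10**6)) samples uniformly from a million numbers without ever materializing the list.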

If you do the math, the probability of any specific number being in the buffer at the end is 1/n, where n is len(input_array). The easiest way to see this is to consider the first number, 1: it enters the buffer with probability 1 and must then survive every later replacement, giving 1 * 1/2 * 2/3 * … * (n-1)/n = 1/n. The terms cancel out, and the probability is 1/n. You can check this for each of the numbers, and it holds. I believe a proof can be done by induction.
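You can also check the telescoping product exactly with Python’s fractions module. This is a small sanity check of my own (survival_probability is a made-up name), multiplying out the survival probabilities for the first element:

```
from fractions import Fraction

def survival_probability(n):
    # Probability that the first element is still in the buffer after
    # processing n elements: the product of i/(i+1) for i = 1..n-1.
    p = Fraction(1)
    for i in range(1, n):
        p *= Fraction(i, i + 1)
    return p
```

For any n, the product telescopes to exactly 1/n.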

Here’s a small program for you to verify this result with a simulation:

```
count = 0
num_trials = 10000
test_idx = 1
for i in range(num_trials):
    output = reservoir_sampling_one(a)
    if a[test_idx] in output:
        count += 1
print(f"a[{test_idx}] sampled {count} times out of {num_trials} trials")
print(f"prob of a[{test_idx}] sampled: {count / num_trials}")
```

Now we can generalize this to sample k items with the code below:

```
import random

a = [1, 2, 4]
k = 2

def reservoir_sampling(sample_size, input_array):
    buffer = []
    for i in range(len(input_array)):
        rand_idx = random.randrange(0, i + 1)
        # For i < sample_size this is always true, so the buffer fills up first
        if rand_idx < sample_size:
            if len(buffer) < sample_size:
                buffer.append(input_array[i])
            else:
                buffer[rand_idx] = input_array[i]
    return buffer

count = 0
num_trials = 10000
test_idx = 1
for i in range(num_trials):
    output = reservoir_sampling(k, a)
    if a[test_idx] in output:
        count += 1
print(f"a[{test_idx}] sampled {count} times out of {num_trials} trials")
print(f"prob of a[{test_idx}] sampled: {count / num_trials}")
```

With the simulation you will see that the probability that the number at test_idx is chosen is around 0.67 (2/3), which is k/n.
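As with the single-item case, the k-item version adapts naturally to a stream. Here is a sketch (the name stream_sample_k is mine) that consumes any iterable while keeping only the k-element buffer in memory:

```
import random

def stream_sample_k(sample_size, iterable):
    # Sketch of a streaming k-item reservoir sampler: fill the buffer with
    # the first k items, then replace a random slot with probability k/(i+1).
    buffer = []
    for i, item in enumerate(iterable):
        if len(buffer) < sample_size:
            buffer.append(item)
        else:
            rand_idx = random.randrange(i + 1)
            if rand_idx < sample_size:
                buffer[rand_idx] = item
    return buffer
```

Filling the first k slots directly is equivalent to the array version above, since for i < k the random index is always below sample_size.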

Again, the reason it works is that the terms cancel out. I won’t go through the math here; you are welcome to read through the reference materials for the details.

So there you go: a very efficient sampling method for very large datasets.

Reference Materials

https://en.wikipedia.org/wiki/Reservoir_sampling

https://www.educative.io/edpresso/what-is-reservoir-sampling