Question
Given a set or stream of values, consider an algorithm to randomly generate sample values.
Solution

If the dataset is very large / stream, we describe reservoir sampling in this note.

  • Goal: sample $k$ items uniformly at random from $n \gg 1$ values, without storing all $n$ values in memory.

Base case $k=1$:

  1. Store the first item as the current sample.
  2. When the $i$-th item arrives $(i \ge 2)$, replace the current sample with probability $1/i$. After processing all items, each item has probability exactly $1/n$.

General $k$ case

  1. Put the first $k$ items into the reservoir.
  2. For item $i=k+1,\dots,n$, generate a random integer $j$ uniformly from $\{1,\dots,i\}$.
  3. If $j \le k$, replace the $j$-th element in the reservoir by the new item.

After processing all $n$ items, each item is included in the reservoir with probability $k/n$.

Question
Why does reservoir sampling work for the general case of sampling $k$ items?

We provide a proof by induction induction on the number of processed items.

For the base case, after seeing the first $k$ items, the reservoir contains exactly those $k$ items, each of them is in the reservoir with probability $$ 1 = \frac{k}{k}. $$ The claim holds when $n=k$.

Now consider that after processing $i-1$ items, each of the $i-1$ items is in the reservoir with probability $ {k}/({i-1}). $

First consider item $i$. It is inserted into the reservoir exactly when $j \le k$. Since $j$ is uniform on $\{1,\dots,i\}$, this happens with probability $ {k}/{i}. $ Item $i$ is in the reservoir with probability $\frac{k}{i}$.

Now consider an older item $r < i$. By inductive hypothesis, before step $i$, item $r$ is in the reservoir with probability $ {k}/({i-1}). $ If it is currently in the reservoir, then it is removed only if the algorithm chooses to replace its slot.

At step $i$:

  • the probability that some replacement happens is $ {k}/{i}, $
  • conditional on replacement happening, each of the $k$ reservoir positions is equally likely to be replaced, so the probability that item $r$ is removed is $ {1}/{k}. $

Therefore, given that item $r$ is currently in the reservoir, the probability it survives step $i$ is $$ 1 - \frac{k}{i}\cdot\frac{1}{k} = 1 - \frac{1}{i} = \frac{i-1}{i}. $$

Hence, the probability that item $r$ is in the reservoir after step $i$ is $$ \frac{k}{i-1}\cdot\frac{i-1}{i} = \frac{k}{i}. $$

After processing item $i$, every one of the $i$ items is in the reservoir with probability $\frac{k}{i}$. We completes the induction.