Original

Reservoir Sampling


Motivation

We would like to select k items from a large data stream of unknown size n. All the items in the stream should be equally likely to be chosen. However, there is usually no sufficient memory to save the data stream and that's where Reservoir Sampling comes into play.

Steps

Assume that we have a data stream of n items [i1, i2, ... in ].

When k = 1, the size of the reservoir x is one. We have to iterate through the data stream to consider each item until all the data are exhausted. The probability of each item being selected should be 1 / n.

  • When the 1st item i1 is considered, let x = i1. So the probability of i1 being selected Pi1 = 1.
  • When the 2rd item i2 is considered, we replace x with i2 with probability 1/2. Therefore, Pi1 = 1 × (1 - 1/2) = 1/2, Pi2 = 1/2.
  • When the 3rd item i3 is considered, we replace x with i3 with probability 1/3. Therefore, Pi1 = 1/2 × (1 - 1/3) = 1/3, Pi2 = 1/2 × (1 - 1/3) = 1/3, pi3 = 1/3.
  • ... ...
  • Finally when the last item in is considered, we replace x with in with probability 1/n. Therefore, Pi1 = Pi2 = .... = Pin = 1/n.

When k > 1, the probability of each item being selected should be k / n.

  • For the first k items, we add them to the reservoir. So, Pi1 = Pi2 = .... = Pik = 1.
  • For the k + 1 th item ik+1, we keep it with probability k / (k + 1) and let it substitute a random item in the current reservoir. For an item among i1 - ik, the probability of it being in the reservoir now is 1 - 1 / k × k / (k + 1) = k / (k + 1).
  • For the k + 2 th item ik+2, we keep it with probalibility k / (k + 2). For an item among i1 - ik+1, the probability of it being in the reservoir now is k / (k + 1) × (1 - 1 / k × k / (k + 2) )= k / (k + 2).
  • ... ...
  • Finally for the n th item, we keep it with probability k / n. Similarly, for an item among i1 - ikn, the probability of it being in the reservoir now is k / n.

Python Code

import random
def reservoirSampling(nums, k):
    n = len(nums)
    results = nums[:k]
    for i in range(k, n):
        rand = random.randrange(n)
        if rand < k:
            results[rand] = nums[i]
    return results


References:

[1] Leetcode 382. Linked List Random Node

[2] Illustration

[3] Reservoir Sampling from Wikipedia

0
  • By Yuechun Wang
  • Published 2022-01-13
  • The materials on this website may be freely copied and distributed for noncommercial personal use only so long as our copyright notice and website address is included.
  • Comments