Don’t know the reservoir sampling algorithm? Then come in and have a look~

Time: 2022-5-26

On LeetCode (Li Kou), only 2 problems carry the official reservoir sampling tag. From the problems I have worked through, the real number is probably three or four. Either way the share is small, so you can decide how deeply to study it based on your own situation.

The idea behind reservoir sampling is very ingenious, and the code is simple and easy to understand. Even if you do not master it, it is worth understanding.

Problem description

Given a data stream, we need to randomly select k numbers from it. Because the stream is very long, it has to be processed while traversing rather than loaded into memory all at once.

Please write a random selection algorithm so that every item in the data stream is selected with equal probability.

There are many variations of this problem. For example, you may be asked to randomly select k points from a rectangle or k words from a word list, always with equal probability. No matter how the description changes, the problem is essentially the same. Today, let's see how to solve it.

Algorithm description

This algorithm is called reservoir sampling.

The basic idea is:

  • Build an array of size k and put the first k elements of the data stream into it.
  • Nothing else is done for the first k numbers of the data stream.
  • Starting from the (k + 1)-th number, pick a random integer rand in [1, i], where i is the position of the current number.
  • If rand is greater than k, do nothing.
  • If rand is less than or equal to k, replace the rand-th element of the array with the i-th number; that is, the current number takes the place of a previously selected number (the "spare tire").
  • Finally, return the spare tires that survive.

The core of this algorithm is to select each number with a certain probability and, in later steps, to replace previously selected numbers with another probability. The probability that a number is finally selected is therefore (probability of being selected) × (probability of not being replaced afterwards).
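As a small worked illustration of that product (the values k = 2 and n = 4 are chosen here only as an example), assume the i-th number is kept with probability k/i and, when kept, evicts one of the k slots uniformly at random. For the third number the calculation would be:

```latex
P(\text{3rd number survives})
  = \underbrace{\frac{2}{3}}_{\text{selected at step 3}}
    \times
    \underbrace{\left(1 - \frac{2}{4} \times \frac{1}{2}\right)}_{\text{not replaced at step 4}}
  = \frac{2}{3} \times \frac{3}{4}
  = \frac{1}{2}
  = \frac{k}{n}
```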

Pseudo code:

The pseudo code is adapted from an algorithms book, with slight modifications.

Init: a reservoir of size k, filled with the first k values
for i = k+1 to N
    M = random(1, i)    // uniform integer in [1, i]
    if (M <= k) {
        SWAP the Mth value and ith value
    }
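The pseudo code above can be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation: the function name `reservoir_sample` and the optional `rng` parameter are assumptions made for illustration. It replaces a slot when the draw is at most k, which is the condition that yields the probability k/i used in the proof.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the 1-based position
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Draw m uniformly from [1, i]; m <= k happens with probability k/i.
            m = rng.randint(1, i)
            if m <= k:
                reservoir[m - 1] = item  # evict the item in slot m
    return reservoir
```

For example, `reservoir_sample(range(1, 1001), 5)` returns five distinct numbers from 1 to 1000 while holding only five items in memory. If the stream has fewer than k items, this sketch simply returns all of them.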

Does this guarantee that every number is selected with equal probability? The answer is yes.

  • When i <= k, the probability that the i-th number is selected is 1.
  • When the (k + 1)-th number arrives, the probability that it is selected (the probability of entering the if branch above) is $\frac{k}{k+1}$; when the (k + 2)-th number arrives, the probability that it is selected is $\frac{k}{k+2}$, and so on. In general, the probability that the i-th number is selected when it arrives is $\frac{k}{i}$.
  • That covers the probability of being selected; next comes the probability of not being replaced. When the (k + 1)-th number is selected, it replaces one of the k numbers in the array, so each of them is replaced with probability $\frac{1}{k}$. The same holds when the (k + 2)-th number is selected, and so on: whenever a new number is selected, each number in the array is replaced with probability $\frac{1}{k}$. The probability of being replaced at step i is therefore $\frac{k}{i} \times \frac{1}{k}$, and the probability of not being replaced is 1 minus that.

Therefore, for each of the first k numbers, the probability of being finally selected is 1 × (probability of not being replaced at step k + 1) × (probability of not being replaced at step k + 2) × ... × (probability of not being replaced at step n), i.e. $1 \times (1 - \frac{k}{k+1} \times \frac{1}{k}) \times (1 - \frac{k}{k+2} \times \frac{1}{k}) \times \dots \times (1 - \frac{k}{n} \times \frac{1}{k}) = \frac{k}{n}$.

For the i-th number (i > k), the probability of being finally selected is (probability of being selected at step i) × (probability of not being replaced at step i + 1) × ... × (probability of not being replaced at step n), i.e. $\frac{k}{i} \times (1 - \frac{k}{i+1} \times \frac{1}{k}) \times \dots \times (1 - \frac{k}{n} \times \frac{1}{k}) = \frac{k}{n}$.

In short, whichever number you look at, its probability of being selected is $\frac{k}{n}$, which meets the equal-probability requirement.
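The simplification that makes these products collapse is not spelled out in the text. One way to see it, sketched here with j as the step index (a symbol introduced only for this illustration), is that each factor reduces to (j - 1)/j, so the product telescopes:

```latex
\prod_{j=i+1}^{n} \left(1 - \frac{k}{j} \times \frac{1}{k}\right)
  = \prod_{j=i+1}^{n} \frac{j-1}{j}
  = \frac{i}{i+1} \times \frac{i+1}{i+2} \times \cdots \times \frac{n-1}{n}
  = \frac{i}{n}
\qquad\Longrightarrow\qquad
\frac{k}{i} \times \frac{i}{n} = \frac{k}{n}
```

Taking i = k in the product covers the first k numbers as well, since their selection probability is 1 and $1 \times \frac{k}{n} = \frac{k}{n}$.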
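The result can also be checked empirically. The following is a rough simulation sketch, assuming a stream made of the numbers 1..n and replacement whenever a draw from [1, i] is at most k; the helper names `sample_once` and `selection_frequencies` are made up for this illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def sample_once(n: int, k: int, rng: random.Random) -> List[int]:
    """Run one pass of reservoir sampling over the numbers 1..n."""
    reservoir = list(range(1, k + 1))  # the first k numbers fill the reservoir
    for i in range(k + 1, n + 1):
        m = rng.randint(1, i)  # uniform integer in [1, i]
        if m <= k:
            reservoir[m - 1] = i  # number i replaces the one in slot m
    return reservoir


def selection_frequencies(n: int, k: int, trials: int, seed: int = 0) -> Dict[int, float]:
    """Estimate how often each number ends up in the final reservoir."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(trials):
        counts.update(sample_once(n, k, rng))
    return {x: counts[x] / trials for x in range(1, n + 1)}
```

Calling `selection_frequencies(n=10, k=3, trials=100_000)` should give ten values that all lie close to k/n = 0.3, up to random noise.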

Related topics

  • 382. Linked List Random Node
  • 398. Random Pick Index
  • 497. Random Point in Non-overlapping Rectangles
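To show how the idea carries over to a problem like 382, here is a hedged sketch of the special case k = 1 on a singly linked list. The `ListNode` class and the function `random_node_value` are hypothetical stand-ins written for this illustration, not the interface of any particular judge.

```python
import random
from typing import Optional


class ListNode:
    """Minimal singly linked list node (hypothetical, for illustration only)."""

    def __init__(self, val: int, next: Optional["ListNode"] = None):
        self.val = val
        self.next = next


def random_node_value(head: ListNode, rng: Optional[random.Random] = None) -> int:
    """Pick one value uniformly at random in a single pass (reservoir of size 1)."""
    rng = rng or random.Random()
    chosen = head.val  # the first node is selected with probability 1
    node, i = head.next, 2
    while node is not None:
        # Keep the i-th node with probability 1/i, replacing the current choice.
        if rng.randint(1, i) == 1:
            chosen = node.val
        node, i = node.next, i + 1
    return chosen
```

The list length never has to be known in advance, which is the situation reservoir sampling is designed for.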

Summary

The core code of the reservoir sampling algorithm is very simple, but it is not easy to come up with, especially if you have never seen it before. Its key point is that the probability of each number being finally selected is (probability of being selected) × (probability of not being replaced). So in each round we select the new number with some probability and let it replace one of the previously selected numbers. The proof of equal probability is given above; you may as well try to prove it yourself. After that, work through the related topics listed at the end of the article, and the effect will be better.
