Reservoir sampling in C++

This article explains Reservoir Sampling in C++. It covers the basic idea, a step-by-step walkthrough of the algorithm, and a C++ implementation with an example.

Understanding Reservoir Sampling

Reservoir Sampling is a family of randomized algorithms for selecting a random sample of k items, without replacement, from a population of n items, where n is either known in advance or unknown (for example, when the items arrive as a stream). Each item is examined exactly once, so a single pass over the input is enough.

Algorithm:

The algorithm keeps an array reservoir[] of size k and reads from an input list of n items, where n need not be known in advance. Items from the input are placed into reservoir[] so that, at every point, the reservoir is a uniform random sample of the items read so far. Because the sample is drawn without replacement, each input item appears in the reservoir at most once. The algorithm proceeds as follows:

Copy the first k items of the input array into reservoir[].
Process the remaining items one at a time, starting from the (k+1)th item; let i be the index of the current item, counting from 0, so i starts at k.
Generate a random index j in the range 0 to i, inclusive.
If j is in the range 0 to k-1, replace reservoir[j] with array[i].

Begin
   define output array of size k
   copy the first k items of array to output
   i = k

   while i < n, do
      j = random integer from 0 to i (inclusive)
      if j < k, then
         output[j] = array[i]
      increase i by 1
   done
   display output array
End
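
Why this procedure gives every item the same chance of being selected can be seen with a short telescoping-product argument. The sketch below numbers the items from 1 and writes m for an item's position; this notation is an assumption made here for the derivation and is not used elsewhere in the article.

```latex
% Items are numbered 1..n. When the i-th item (i > k) is processed, j is drawn
% uniformly from i possible values, so one fixed reservoir slot is overwritten
% with probability 1/i and survives with probability (i-1)/i.

% Case 1: item m is one of the first k items (m \le k). It starts in the
% reservoir and must survive every later step.
P(\text{item } m \text{ in final sample})
  = \prod_{i=k+1}^{n} \frac{i-1}{i}
  = \frac{k}{n}

% Case 2: item m arrives later (m > k). It enters the reservoir with
% probability k/m and must then survive steps m+1, \dots, n.
P(\text{item } m \text{ in final sample})
  = \frac{k}{m} \cdot \prod_{i=m+1}^{n} \frac{i-1}{i}
  = \frac{k}{m} \cdot \frac{m}{n}
  = \frac{k}{n}
```

In both cases the probability is k/n, so no position in the input is favoured.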

C++ Implementation Example

The following C++ program implements the Reservoir Sampling algorithm. It contains one function that displays an array and one function that selects k items from an array using Reservoir Sampling.

#include <iostream>
#include <random>
using namespace std;

void displayArray(const int items[], const int size) {
   // Print the elements separated by spaces
   for (int i = 0; i < size; ++i)
      cout << items[i] << " ";
}

// Assumes 0 <= sampleSize <= populationSize
void reservoirSampling(const int population[], const int populationSize, int sample[], const int sampleSize) {
   // Fill the reservoir with the first sampleSize items
   for (int i = 0; i < sampleSize; ++i)
      sample[i] = population[i];

   // Seeded once per process; a static generator avoids re-seeding on every call
   static mt19937 generator(random_device{}());

   for (int i = sampleSize; i < populationSize; ++i) {
      // Uniform index from 0 to i, inclusive (avoids the slight bias of rand() % (i + 1))
      uniform_int_distribution<int> pick(0, i);
      int randomIndex = pick(generator);

      if (randomIndex < sampleSize)
         sample[randomIndex] = population[i]; // Item i replaces the element at randomIndex
   }

   cout << "Selected items from the given population are: ";
   displayArray(sample, sampleSize);
}

int main() {
   // Example usage
   int population[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
   constexpr int populationSize = 12;
   constexpr int sampleSize = 6; // Compile-time constant, so the array below is standard C++

   int sample[sampleSize];
   reservoirSampling(population, populationSize, sample, sampleSize);
   cout << endl;

   return 0;
}

Output:

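The output differs from run to run, because the random number generator is seeded with a different value each time the program is executed (not each time it is compiled). Each run prints six distinct values taken from the population.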

Comparison with Other Sampling Methods

Simple random sampling is straightforward when the whole dataset is available, but Reservoir Sampling stands out when the dataset size is unknown or the data is too large to fit into memory. Unlike systematic sampling, which selects items at a fixed interval, Reservoir Sampling gives every item the same probability, k/n, of appearing in the sample, regardless of the order in which the data arrives.

When keeping a large dataset in memory is not practical, Reservoir Sampling works well because it processes the data one item at a time in a single pass and stores only the k sampled items. This makes it a preferred choice in streaming and online processing scenarios.
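
To illustrate the streaming case, here is a minimal sketch of a sampler that accepts items one at a time without knowing how many will arrive. The class name StreamSampler, its member functions, and the explicit seed parameter are hypothetical names chosen for this example; they do not come from the program above.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: keeps a uniform sample of up to k items from a stream
// whose total length is not known in advance.
class StreamSampler {
public:
   StreamSampler(std::size_t k, unsigned seed) : capacity(k), seen(0), rng(seed) {}

   // Offer the next stream item to the sampler
   void add(int item) {
      ++seen;
      if (reservoir.size() < capacity) {
         reservoir.push_back(item); // Still filling the reservoir
         return;
      }
      // Draw j uniformly from 0 to seen - 1; the item is kept when j < capacity
      std::uniform_int_distribution<std::size_t> pick(0, seen - 1);
      std::size_t j = pick(rng);
      if (j < capacity)
         reservoir[j] = item;
   }

   const std::vector<int>& sample() const { return reservoir; }
   std::size_t itemsSeen() const { return seen; }

private:
   std::size_t capacity;       // Maximum number of items kept (k)
   std::size_t seen;           // Number of items offered so far
   std::mt19937 rng;
   std::vector<int> reservoir;
};
```

A caller would construct the sampler once, for example StreamSampler sampler(3, 42u), call sampler.add(value) for each arriving item, and read sampler.sample() whenever a sample is needed. Memory use stays at k items no matter how long the stream runs.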

Applications

Reservoir Sampling is used in several fields. In data streaming, it samples data points as they arrive sequentially, which makes it suitable for real-time analytics. Other randomized algorithms use it as a building block for tasks such as approximate counting. In machine learning, it helps create representative training datasets without storing the entire dataset in memory.

Efficiency and Time Complexity

The time complexity of Reservoir Sampling is O(n), since each of the n items is examined once, and it needs only O(k) memory for the reservoir, independent of n. Its ability to produce an unbiased sample in a single pass, with memory that does not grow with the input, makes it an attractive option where efficiency is paramount.
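
The claim that the sample is unbiased can be checked empirically. The sketch below repeats the sampling many times over the items 0 to n-1 and counts how often each item is selected; with an unbiased sampler every count should be close to trials * k / n. The function name countSelections and its parameters are hypothetical, and the code assumes k is not larger than n.

```cpp
#include <random>
#include <vector>

// Hypothetical helper: runs reservoir sampling `trials` times over the items
// 0..n-1 and returns how often each item ended up in the sample.
// Assumes 0 <= k <= n.
std::vector<int> countSelections(int n, int k, int trials, unsigned seed) {
   std::mt19937 rng(seed);
   std::vector<int> counts(n, 0);
   std::vector<int> reservoir(k);

   for (int t = 0; t < trials; ++t) {
      // Fill the reservoir with the first k items
      for (int i = 0; i < k; ++i)
         reservoir[i] = i;

      // Each later item replaces a random slot when the drawn index is below k
      for (int i = k; i < n; ++i) {
         std::uniform_int_distribution<int> pick(0, i);
         int j = pick(rng);
         if (j < k)
            reservoir[j] = i;
      }

      for (int item : reservoir)
         ++counts[item];
   }
   return counts;
}
```

For example, countSelections(12, 6, 20000, 1u) should return twelve counts that each lie near 10000, since every item is expected to be chosen in half of the trials.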

Variations of Reservoir Sampling

Variations of Reservoir Sampling cater to specific needs. Weighted Reservoir Sampling assigns a weight to each element, which changes its likelihood of being chosen. Stratified Reservoir Sampling divides the dataset into strata (layers) and ensures that each one is represented in the final sample.
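
As an illustration of the weighted variant, here is a minimal sketch of one well-known approach, the A-Res scheme of Efraimidis and Spirakis: each item receives the key u^(1/w), where u is a uniform random number and w is the item's weight, and the k items with the largest keys are kept. Whether this is the specific variant the article has in mind is an assumption, and the function name weightedReservoirSample is hypothetical. Weights must be positive.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Sketch of A-Res weighted sampling: returns the indices of up to k items,
// where items with larger weights are more likely to be selected.
std::vector<int> weightedReservoirSample(const std::vector<double>& weights,
                                         std::size_t k, unsigned seed) {
   std::mt19937 rng(seed);
   std::uniform_real_distribution<double> uniform(0.0, 1.0);

   using Entry = std::pair<double, int>; // (key, item index)
   // Min-heap on the key, so the weakest kept item is always on top
   std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> kept;

   for (std::size_t i = 0; i < weights.size(); ++i) {
      assert(weights[i] > 0.0); // Assumption: all weights are positive
      double key = std::pow(uniform(rng), 1.0 / weights[i]);

      if (kept.size() < k) {
         kept.push({key, static_cast<int>(i)});
      } else if (k > 0 && key > kept.top().first) {
         kept.pop(); // Drop the item with the smallest key
         kept.push({key, static_cast<int>(i)});
      }
   }

   std::vector<int> sample;
   while (!kept.empty()) {
      sample.push_back(kept.top().second);
      kept.pop();
   }
   return sample;
}
```

The min-heap holds at most k entries, so memory stays at O(k) as in the unweighted algorithm, at the cost of O(log k) work whenever a kept item is replaced.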

Conclusion

This article covered the idea behind Reservoir Sampling, the algorithm itself, a C++ implementation, and how the technique compares with other sampling methods. After reading this tutorial, you should have a reasonable understanding of Reservoir Sampling.
