Find the median from a running data stream

The median is a statistic used in data analysis and computer science that represents the middle value of a sorted dataset. It is an important measure of central tendency that summarizes where the data is centered. Finding the median of a static dataset is straightforward: sort the values and take the middle one, or the average of the two middle ones. Things become more challenging when working with a live data stream, where new values arrive continually and the median must be updated after each one. In this post, we'll look at the challenges of finding the median of a data stream and present a common algorithmic solution based on two heaps, along with two alternatives.

Finding the Median using Heaps

One typical method for calculating the median of a data stream is to keep two data structures: a max-heap and a min-heap. Together they make it possible to track the median efficiently as new data points arrive.

Max-Heap and Min-Heap

A max-heap is a binary tree in which every parent node is greater than or equal to its children. The largest value in the heap is always at the root.
A min-heap is a binary tree in which every parent node is less than or equal to its children. The smallest value in the heap is always at the root.
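
As a brief illustration of these two definitions, the sketch below uses Python's standard heapq module, which only provides a min-heap. A max-heap is simulated here by storing negated values, which is a common workaround rather than a dedicated API. The sample values are made up for the example.

```python
import heapq

values = [5, 1, 8, 3]  # arbitrary sample data

# heapq maintains a min-heap: index 0 always holds the smallest element.
min_heap = []
for v in values:
    heapq.heappush(min_heap, v)

# Simulated max-heap: push negated values, negate again when reading.
max_heap = []
for v in values:
    heapq.heappush(max_heap, -v)

smallest = min_heap[0]
largest = -max_heap[0]
print(smallest, largest)  # 1 8
```

Reading the root is a constant-time operation in both cases, while each push costs O(log n).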

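The concept is to divide the data stream in half: the smaller half of the values is kept in a max-heap, and the larger half is kept in a min-heap. The root of the max-heap is then the largest of the small values, and the root of the min-heap is the smallest of the large values. As long as the two heaps differ in size by at most one, the one or two middle elements that determine the median are always available at the heap roots in constant time.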

Python Implementation

import heapq


class RunningMedian:
    def __init__(self):
        self.min_heap = []  # larger half of the values
        self.max_heap = []  # smaller half, stored negated because heapq is a min-heap

    def add_number(self, num):
        # Route the new value to the half it belongs in.
        if not self.max_heap or num < -self.max_heap[0]:
            heapq.heappush(self.max_heap, -num)
        else:
            heapq.heappush(self.min_heap, num)
        # Rebalance so that max_heap holds either as many values as
        # min_heap or exactly one more.
        if len(self.max_heap) > len(self.min_heap) + 1:
            heapq.heappush(self.min_heap, -heapq.heappop(self.max_heap))
        elif len(self.min_heap) > len(self.max_heap):
            heapq.heappush(self.max_heap, -heapq.heappop(self.min_heap))

    def find_median(self):
        if len(self.max_heap) == len(self.min_heap):
            # Even count: average the two middle values.
            return (-self.max_heap[0] + self.min_heap[0]) / 2
        # Odd count: the extra value sits at the root of max_heap.
        return float(-self.max_heap[0])


running_median = RunningMedian()
data_stream = [2, 3, 1, 5, 7, 6]
medians = []
for num in data_stream:
    running_median.add_number(num)
    medians.append(running_median.find_median())
print(medians)

Output:

[2.0, 2.5, 2.0, 2.5, 3.0, 4.0]
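
One way to gain confidence in a two-heap implementation is to compare it against a brute-force reference on random input. The sketch below is a self-contained variant, not the class above: the function name running_medians and the heap names lower and upper are illustrative, and it uses a slightly different "push, then rebalance" ordering. It checks the result against statistics.median applied to each prefix of the stream.

```python
import heapq
import random
import statistics


def running_medians(stream):
    """Return the median after each element, using the two-heap idea."""
    lower = []  # max-heap of the smaller half (values stored negated)
    upper = []  # min-heap of the larger half
    result = []
    for x in stream:
        # Push onto the lower half, then move its largest value to the upper half.
        heapq.heappush(lower, -x)
        heapq.heappush(upper, -heapq.heappop(lower))
        # Keep the lower half at least as large as the upper half.
        if len(upper) > len(lower):
            heapq.heappush(lower, -heapq.heappop(upper))
        if len(lower) > len(upper):
            result.append(float(-lower[0]))
        else:
            result.append((-lower[0] + upper[0]) / 2)
    return result


print(running_medians([2, 3, 1, 5, 7, 6]))  # [2.0, 2.5, 2.0, 2.5, 3.0, 4.0]

# Brute-force cross-check on random integers (seed fixed for repeatability).
random.seed(0)
data = [random.randint(-100, 100) for _ in range(200)]
expected = [float(statistics.median(data[:i + 1])) for i in range(len(data))]
print(running_medians(data) == expected)  # True
```

The reference computation re-sorts every prefix, so it is only suitable as a test oracle, not as a streaming solution.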

Using an Augmented Self-Balancing Binary Search Tree

A practical way to find the median of a continuously updated data stream is to use an augmented self-balancing binary search tree, such as an AVL tree or a Red-Black tree. Each node stores one extra piece of data: the size of the subtree rooted at that node. This size information is what makes it possible to locate the median efficiently.

Augmenting a self-balancing binary search tree means storing additional information, here the subtree size, in each node and keeping it up to date during insertions and rotations. Self-balancing trees such as AVL or Red-Black trees keep their height logarithmic in the number of nodes, so insertion and lookup both take O(log n) time.
With the sizes in place, the k-th smallest element can be found by walking down from the root and comparing k with the size of the left subtree to decide whether to go left, stop, or go right. The median is the middle element (or the mean of the two middle elements), so it can be recomputed in O(log n) time after every insertion.
This method is especially useful for applications that need real-time data processing, such as financial data tracking, system performance monitoring, and patient monitoring in healthcare. Unlike the two-heap approach, the tree can return any order statistic, not only the median, at the cost of a more involved implementation.
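
The role of the stored subtree sizes can be sketched separately from the balancing logic. The example below assumes a small binary search tree built by hand (no insertion or rotation code), with Node and kth_smallest as illustrative names. It shows how the size of the left subtree steers the search toward the k-th smallest value.

```python
class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right
        # Size of the subtree rooted here, computed once for this static sketch.
        self.size = 1 + (left.size if left else 0) + (right.size if right else 0)


def kth_smallest(node, k):
    """Return the k-th smallest value (1-indexed) using subtree sizes."""
    while node:
        left_size = node.left.size if node.left else 0
        if k == left_size + 1:
            return node.value
        if k <= left_size:
            node = node.left
        else:
            # Skip the left subtree and the current node.
            k -= left_size + 1
            node = node.right
    raise IndexError("k is out of range")


# Hand-built BST holding 1, 2, 3, 5, 7:
#         3
#       /   \
#      2     7
#     /     /
#    1     5
root = Node(3, Node(2, Node(1)), Node(7, Node(5)))
median = kth_smallest(root, root.size // 2 + 1)
print(median)  # 3
```

Each step descends one level, so on a balanced tree the lookup takes time proportional to the tree height.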

Python Implementation

class TreeNode:
    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None
        self.size = 1    # number of nodes in the subtree rooted here
        self.height = 1  # height of that subtree, used for AVL balancing


class AVLTree:
    def __init__(self):
        self.root = None

    def insert(self, root, value):
        if not root:
            return TreeNode(value)
        if value <= root.value:
            root.left = self.insert(root.left, value)
        else:
            root.right = self.insert(root.right, value)
        self.update(root)
        balance = self.get_balance(root)
        # Left-heavy: rotate right, preceded by a left rotation
        # of the left child in the left-right case.
        if balance > 1:
            if self.get_balance(root.left) < 0:
                root.left = self.left_rotate(root.left)
            return self.right_rotate(root)
        # Right-heavy: the mirror image.
        if balance < -1:
            if self.get_balance(root.right) > 0:
                root.right = self.right_rotate(root.right)
            return self.left_rotate(root)
        return root

    def get_size(self, node):
        return node.size if node else 0

    def get_height(self, node):
        return node.height if node else 0

    def get_balance(self, node):
        # The AVL balance factor is a difference of heights, not of sizes.
        return self.get_height(node.left) - self.get_height(node.right) if node else 0

    def update(self, node):
        # Recompute the cached size and height from the children.
        node.size = 1 + self.get_size(node.left) + self.get_size(node.right)
        node.height = 1 + max(self.get_height(node.left), self.get_height(node.right))

    def left_rotate(self, node):
        right_child = node.right
        node.right = right_child.left
        right_child.left = node
        self.update(node)  # node is now the lower of the two, so update it first
        self.update(right_child)
        return right_child

    def right_rotate(self, node):
        left_child = node.left
        node.left = left_child.right
        left_child.right = node
        self.update(node)  # node is now the lower of the two, so update it first
        self.update(left_child)
        return left_child

    def insert_value(self, value):
        self.root = self.insert(self.root, value)

    def find_median(self):
        total_size = self.get_size(self.root)
        if total_size == 0:
            raise ValueError("median of an empty tree is undefined")
        if total_size % 2 == 0:
            # Even count: average the two middle elements.
            return self.find_kth_element((total_size // 2) + 1) / 2.0 + self.find_kth_element(total_size // 2) / 2.0
        return float(self.find_kth_element((total_size // 2) + 1))

    def find_kth_element(self, k):
        return self._find_kth_element(self.root, k)

    def _find_kth_element(self, node, k):
        # k is 1-indexed; the left subtree size says how many values are smaller.
        left_size = self.get_size(node.left)
        if k == left_size + 1:
            return node.value
        elif k <= left_size:
            return self._find_kth_element(node.left, k)
        else:
            return self._find_kth_element(node.right, k - left_size - 1)


avl_tree = AVLTree()
data_stream = [2, 3, 1, 5, 7, 6]
medians = []
for num in data_stream:
    avl_tree.insert_value(num)
    medians.append(avl_tree.find_median())
print(medians)

Output:

[2.0, 2.5, 2.0, 2.5, 3.0, 4.0]

Using Insertion Sort

Finding the median of a data stream using insertion sort is not the most efficient method, especially for large datasets. Each new value is inserted into its place in an already sorted list, which takes O(n) time in the worst case, so processing n values costs O(n^2) overall. It is, nonetheless, simple and workable for small inputs. Here's an example of how insertion sort may be used to get the median from a data stream:

Python Implementation

def find_median(data_stream):
    sorted_data = []
    medians = []
    for num in data_stream:
        # Grow the list by one slot, then shift larger values to the right
        # until the insertion point for num is found.
        sorted_data.append(num)
        i = len(sorted_data) - 2
        while i >= 0 and sorted_data[i] > num:
            sorted_data[i + 1] = sorted_data[i]
            i -= 1
        sorted_data[i + 1] = num
        n = len(sorted_data)
        if n % 2 == 0:
            median = (sorted_data[n // 2 - 1] + sorted_data[n // 2]) / 2
        else:
            median = float(sorted_data[n // 2])
        medians.append(median)
    return medians


data_stream = [2, 3, 1, 5, 7, 6]
medians = find_median(data_stream)
print(medians)

Output:

[2.0, 2.5, 2.0, 2.5, 3.0, 4.0]
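
The same sorted-list idea can also be sketched with the standard library's bisect.insort, which finds the insertion point by binary search and then inserts the value. The function name running_medians_sorted is illustrative. Shifting the list elements still takes O(n) time per insertion, so the overall worst case remains O(n^2).

```python
import bisect


def running_medians_sorted(stream):
    """Return the median after each element, keeping a sorted list."""
    sorted_data = []
    medians = []
    for num in stream:
        # Binary search for the position, then an O(n) list insertion.
        bisect.insort(sorted_data, num)
        n = len(sorted_data)
        mid = n // 2
        if n % 2 == 0:
            medians.append((sorted_data[mid - 1] + sorted_data[mid]) / 2)
        else:
            medians.append(float(sorted_data[mid]))
    return medians


print(running_medians_sorted([2, 3, 1, 5, 7, 6]))
# [2.0, 2.5, 2.0, 2.5, 3.0, 4.0]
```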

Advantages

1. Decision Making in Real Time

Running median calculations provide immediate insights, allowing organizations to make informed decisions in areas such as finance, healthcare, and emergency response.

2. Resistance to Outliers

Unlike the mean, which outliers may severely distort, the median is resistant to extreme values, making it appropriate for scenarios with skewed or anomalous data.

3. Memory Effectiveness

The max-heap and min-heap method stores each value exactly once, split across the two heaps, and needs no separate sorted copy of the data. Memory still grows linearly with the number of values seen, but each update costs only O(log n) time.

4. Online Learning

The ability to determine the median from a data stream fits well with online machine learning and adaptive systems, in which models adjust to incoming data continually.

5. Detecting Anomalies

Identifying variations in data distribution is crucial for anomaly detection, and the median may assist in locating such deviations in real time.
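
To make the outlier resistance from point 2 concrete, the sketch below compares the mean and the median of a small set of readings before and after a single extreme value is added. The readings are made-up numbers chosen for illustration, and the standard statistics module is used for both measures.

```python
import statistics

readings = [10, 11, 9, 10, 12]       # hypothetical sensor readings
with_outlier = readings + [1000]     # the same readings plus one extreme value

mean_before = statistics.mean(readings)
median_before = statistics.median(readings)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)           # 10.4 10
print(round(mean_after, 2), median_after)   # 175.33 10.5
```

A single outlier moves the mean by more than an order of magnitude here, while the median shifts by only half a unit.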

Applications

Real-time Analytics:
- Financial Data Analysis: Finding the median is useful for tracking asset values, recognizing trends, and assessing risk in the financial sector.
- Stock Market Monitoring: Traders and investors use running medians to analyze market behavior and make informed decisions.
Healthcare:
- Patient Monitoring: Real-time median computations can aid in detecting abnormalities or substantial changes in vital signs (e.g., heart rate, blood pressure).
Network and System Monitoring:
- Network Traffic Analysis: In information technology and networking, measuring the median of network traffic can aid in identifying abnormal spikes or trends.
- System Performance Monitoring: Tracking the median of response times helps system administrators check that a system is performing as expected.
Streaming Data:
- Stream Processing: Real-time analytics solutions process data streams using median calculations, allowing for rapid insights from constantly changing data.
Internet of Things (IoT):
- Sensor Data Analysis: IoT applications consume data streams from many sensors. Real-time median computations can highlight significant readings or events while damping isolated noisy values.

Conclusion

In summary, maintaining the median of a continuously evolving data stream is a common problem with practical uses across many fields. The max-heap and min-heap strategy updates the median in O(log n) time per new data point and reads it in O(1) time, which makes it well suited to real-time applications that analyze data continuously. An augmented self-balancing tree offers logarithmic updates together with more general order-statistic queries, while insertion into a sorted list is simpler but slower. Whether you work in finance, system monitoring, or another domain, knowing how to compute the median of an evolving data stream is a valuable skill for data scientists and engineers.
