
Find the Median from a Data Stream

  • mosesg1123

Finding the median from a data stream pops up in real-time analytics, financial tickers, and sensor dashboards. You need to support two operations—adding a number and retrieving the current median—both in sublinear time.


Problem Statement

Design a data structure that supports:

  • void addNum(int num): Insert integer num into the data structure.

  • double findMedian(): Return the median of all elements inserted so far.

The median is the middle number in a sorted list. If the list length is odd, it’s the middle element; if even, it’s the average of the two middle elements.

Example:
Stream:  [1] → median = 1
         [1, 2] → median = (1+2)/2 = 1.5
         [1, 2, 3] → median = 2
         [1, 2, 3, 4] → median = (2+3)/2 = 2.5

1. Clarify Requirements Before Jumping Into Code

I need to confirm:

  • Can numbers be negative or large? (Yes.)

  • Will we ever call findMedian before any insert? (Assume at least one insert happens first.)


2. Identify the Category of the Question

This is an online order-statistic problem on a stream. Common patterns:

  • Maintaining a sorted array or list

  • Using two heaps (max-heap and min-heap) to split data

  • Balanced BST or order-statistic tree

Two-heaps is the classic pattern for median-of-stream in interviews.


3. Consider a Brute-Force Approach to Understand the Problem

For a brute-force solution, what if I start by just keeping all of my numbers in an ArrayList? Then, for the operations:

  • addNum: append the new element to the list in amortized O(1).

  • findMedian: sort the list in O(n log n) time on every call, or use a selection algorithm like quickselect to find the median in O(n) average time.

So the best I can do with this approach is O(n) per findMedian call.
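A rough sketch of this brute-force idea (the class name here is my own, chosen to mirror the problem's API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Brute-force sketch: store everything, sort on every median query.
class BruteForceMedianFinder {
    private final List<Integer> nums = new ArrayList<>();

    public void addNum(int num) {
        nums.add(num);              // amortized O(1) append
    }

    public double findMedian() {
        Collections.sort(nums);     // O(n log n) on every call
        int n = nums.size();
        if (n % 2 == 1) {
            return nums.get(n / 2);
        }
        return (nums.get(n / 2 - 1) + (double) nums.get(n / 2)) / 2.0;
    }
}
```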


4. Brainstorm More Solutions

Building on my brute-force solution, let's first acknowledge that findMedian is my most time-consuming operation at O(n), so if I want to optimize, that's a great place to start.


Now let's ask: do I have repeat work in this function that I could potentially eliminate to reduce the runtime?


In this case, yes - my current solution requires the findMedian operation to scan the entire ArrayList on every call. Why? Because the array is unsorted. If the array were sorted, then I could easily determine the median by just jumping directly to the middle of the array.


Option 1 - Maintain a sorted list

So what would it look like to maintain a sorted list? I don't want to re-sort an unsorted list on every query, because that takes O(n log n) time. But binary search can find a new element's position in an already sorted list in just O(log n) time!

  • So, in my addNum operation, I'll binary-search for the new number's sorted position and insert it there. The search is O(log n) time.

  • Then findMedian is O(1) time by simply reading the middle index(es).


That seems good at first glance, but there's a hitch - an ArrayList requires me to shift elements to the right after I insert a new element. In the worst case, that's O(n) time.


So really all I've done here is shift things around so that the addNum operation takes O(n) time and the findMedian operation takes O(1) time. :/
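To make the hitch concrete, here's a sketch of the sorted-list approach (class name is my own); the binary search is fast, but the insert itself still shifts elements:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sorted-list sketch: binary search finds the slot in O(log n),
// but ArrayList.add(index, value) still shifts later elements in O(n).
class SortedListMedianFinder {
    private final List<Integer> sorted = new ArrayList<>();

    public void addNum(int num) {
        int idx = Collections.binarySearch(sorted, num);
        if (idx < 0) idx = -(idx + 1);  // convert "not found" to insertion point
        sorted.add(idx, num);           // the O(n) shift hides here
    }

    public double findMedian() {
        int n = sorted.size();
        if (n % 2 == 1) return sorted.get(n / 2);
        return (sorted.get(n / 2 - 1) + (double) sorted.get(n / 2)) / 2.0;
    }
}
```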


Option 2 - Use two heaps

So if an ArrayList can't give me O(log n) time on insert, is there another data structure that could, while still maintaining my ability to immediately jump to the middle?


A doubly linked list would give me constant-time insertion at a known position, but you can't jump around a linked list to binary-search for that position or to read the median element, so that doesn't seem right.


Hash maps/sets have constant look-up time, but can't maintain an ordering.


Min/max heaps seem promising - they maintain an ordering with O(log n) inserts, and you can read the min/max value in constant time. But I don't need the minimum or the maximum value, I need the middle value!


Well, if one data structure isn't enough, can I use two?

Let me focus on the data structures that seem closest to what I need, heaps. How could I structure my data if I wanted to use two heaps?

This seems promising - what if I split my numbers in half and put the smallest numbers in one heap and the largest numbers in the other?

  • If my lowHeap is a maxHeap, then I can identify the largest number in that heap in O(1) time.

  • And if my highHeap is a minHeap, then I can identify the smallest number in that heap in O(1) time as well!

  • The median must be one of these two numbers (or the average of these two numbers)!

  • Inserting a new number would require me to push it into either the lowHeap or the highHeap depending on its value, then rebalance if necessary, all in O(log n) time!


Two heaps give me O(log n) inserts and O(1) median retrieval.


To summarize:

  • Max-heap (lowHeap) holds the smaller half of numbers.

  • Min-heap (highHeap) holds the larger half.

  • Balance sizes so the two heaps are equal in size, or, if the total count is odd, the lowHeap has one extra.

  • addNum: compare to low.peek(), push into one heap, then rebalance in O(log n).

  • findMedian: if sizes equal, average low.peek() and high.peek(); else return low.peek().

  • That's O(log n) per addNum and O(1) per findMedian.
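Tracing the example stream [1, 2, 3, 4] through these rules (heap contents shown with the peek element first):

```
add 1: low empty → low = [1]         high = []       median = 1
add 2: 2 > 1     → high = [2]        low = [1]       median = (1+2)/2 = 1.5
add 3: 3 > 1     → high = [2, 3]; high is now bigger,
       so move 2 into low → low = [2, 1], high = [3]; median = low.peek() = 2
add 4: 4 > 2     → high = [3, 4]     low = [2, 1]    median = (2+3)/2 = 2.5
```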


5. Discuss Trade-Offs Between Your Solutions

| Approach | addNum | findMedian | Space | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Sorted list insert | O(n) | O(1) | O(n) | Simple concept | Slow inserts |
| Two heaps | O(log n) | O(1) | O(n) | Balanced, standard interview pattern | Need two data structures |


6. Write Pseudocode to Structure Your Thoughts

initialize maxHeap low, minHeap high

function addNum(num):
  if low is empty or num ≤ low.peek():
    low.push(num)
  else:
    high.push(num)
  // rebalance sizes
  if low.size > high.size + 1:
    high.push(low.pop())
  else if high.size > low.size:
    low.push(high.pop())

function findMedian():
  if low.size > high.size:
    return low.peek()
  else:
    return (low.peek() + high.peek()) / 2.0

7. Consider Edge Cases

  • First insert: both heaps empty → goes into low.

  • All equal numbers → heaps still balanced.

  • Even vs. odd total count → median formula changes.

  • Large positive/negative ints → ensure no overflow when averaging two ints (cast to double).
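To see why that cast matters, here's a small illustration of the overflow hazard when averaging two ints:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int a = Integer.MAX_VALUE;
        int b = Integer.MAX_VALUE;
        // int addition wraps around to -2 before the division happens
        System.out.println((a + b) / 2.0);          // prints -1.0, not the true average
        // widening one operand to double first avoids the wrap
        System.out.println(((double) a + b) / 2.0); // prints 2.147483647E9
    }
}
```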


8. Write Full Code Syntax

import java.util.Comparator;
import java.util.PriorityQueue;

public class MedianFinder {
    private PriorityQueue<Integer> low;   // max-heap: top is the largest of the smaller half
    private PriorityQueue<Integer> high;  // min-heap: top is the smallest of the larger half

    public MedianFinder() {
        // (a, b) -> b - a overflows for extreme ints, so use reverseOrder() instead
        low  = new PriorityQueue<>(Comparator.reverseOrder());
        high = new PriorityQueue<>();
    }

    public void addNum(int num) {
        if (low.isEmpty() || num <= low.peek()) {
            low.offer(num);
        } else {
            high.offer(num);
        }
        // rebalance
        if (low.size() > high.size() + 1) {
            high.offer(low.poll());
        } else if (high.size() > low.size()) {
            low.offer(high.poll());
        }
    }

    public double findMedian() {
        if (low.size() > high.size()) {
            return low.peek();
        } else {
            // widen to double first so adding two large ints can't overflow
            return ((double) low.peek() + high.peek()) / 2.0;
        }
    }
}

9. Test Your Code

public static void main(String[] args) {
    MedianFinder mf = new MedianFinder();

    mf.addNum(1);
    System.out.println(mf.findMedian() == 1.0);

    mf.addNum(2);
    System.out.println(mf.findMedian() == 1.5);

    mf.addNum(3);
    System.out.println(mf.findMedian() == 2.0);

    mf.addNum(2);
    System.out.println(mf.findMedian() == 2.0);

    // negative numbers
    MedianFinder mf2 = new MedianFinder();
    mf2.addNum(-5);
    mf2.addNum(-10);
    System.out.println(mf2.findMedian() == -7.5);

    // mixed
    mf2.addNum(0);
    System.out.println(mf2.findMedian() == -5.0);
}

10. Key Lessons to Remember for Future Questions

  • Two heaps (max + min) neatly maintain a dynamic median in O(log n)/O(1).

  • This pattern generalizes to other order-statistic streaming problems (e.g., kth largest).

  • Always rebalance after insertion so the heap sizes never differ by more than one.

  • If you're stuck, don't be afraid to just list off data structures and talk through whether each could be useful! And if it turns out one isn't enough, think through combining data structures.
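As one example of that generalization, a streaming "kth largest" needs only a single min-heap capped at size k (a sketch in the same style as above; the class name is my own):

```java
import java.util.PriorityQueue;

// Keep only the k largest elements seen so far in a min-heap;
// its root is then the kth largest element of the stream.
class KthLargest {
    private final PriorityQueue<Integer> heap = new PriorityQueue<>();
    private final int k;

    public KthLargest(int k) {
        this.k = k;
    }

    public int add(int num) {
        heap.offer(num);          // O(log k) insert
        if (heap.size() > k) {
            heap.poll();          // evict the smallest; it can't be the kth largest
        }
        return heap.peek();       // kth largest so far
    }
}
```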


Good Luck and Happy Coding!
