DSA: Heap vs Quickselect for Top-K


Solution to Last Week's Challenge

Last week's problem was: DSA: Kadane’s Algorithm and Variants

The optimal solution applies Kadane's algorithm: maintain the best subarray sum ending at the current index, and update a global maximum as you scan.

Complexity Analysis:

  • Time Complexity: O(n)
  • Space Complexity: O(1)

In the world of high-scale system design and competitive programming, the "Top-K" problem is a recurring theme that tests an engineer's ability to balance time complexity with memory constraints. Whether you are building a real-time leaderboard for a gaming platform, identifying the most frequent search queries in a search engine, or calculating the nearest neighbors in a machine learning model, the core challenge remains the same: efficiently extracting a specific subset of elements from a much larger collection.

Mastering the trade-offs between a Heap-based approach and the Quickselect algorithm is a hallmark of a senior engineer. While both techniques solve the problem of finding the $k^{th}$ largest or smallest elements, they serve different operational needs. A Heap is ideal for streaming data where $k$ is small relative to $N$, whereas Quickselect offers superior average-case performance for static datasets. This post will deconstruct these patterns, providing you with a mental framework to choose the right tool during your next system design review or coding interview.

Theory and Complexity

At its core, the Top-K problem asks us to find the $k$ largest (or smallest) elements in an unsorted array of size $n$. The naive approach of sorting the entire array takes $O(n \log n)$ time, which is often unacceptable when $n$ is in the millions and $k$ is small.

The Heap approach maintains a Min-Heap of size $k$. As we iterate through the $n$ elements, we push to the heap. If the heap exceeds size $k$, we pop the smallest element. By the end, the heap contains the $k$ largest elements. The complexity is $O(n \log k)$.

Quickselect, based on the Quicksort partitioning logic, works by picking a pivot and partitioning the array such that all elements to the left are smaller and all to the right are larger. Unlike Quicksort, which recurses into both halves, Quickselect only recurses into the half containing the $k^{th}$ index. This results in a geometric series $n + n/2 + n/4 \dots$ which sums to $O(n)$ on average.
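To see why the average case is linear, note that each partition is expected to discard about half of the remaining range, so the total work forms a geometric series:

$$T(n) \approx n + \frac{n}{2} + \frac{n}{4} + \cdots < 2n = O(n)$$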

Pattern Visualization

Choosing between these two depends on two factors: the nature of the data (static vs. streaming) and the value of $k$.

Implementation

Let's look at how we implement these in Python, moving from the standard Heap approach to the optimized Quickselect.

Method 1: Min-Heap (O(n log k))

This is the most common interview solution due to its balance of readability and efficiency.

```python
import heapq

def find_top_k_heap(nums, k):
    # Time: O(n log k)
    # Space: O(k)
    # Use a min-heap to keep track of the k largest elements
    min_heap = []

    for num in nums:
        heapq.heappush(min_heap, num)
        if len(min_heap) > k:
            heapq.heappop(min_heap)

    return min_heap
```
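As a quick sanity check, the standard library's `heapq.nlargest` applies the same bounded-heap idea and returns the results already sorted in descending order:

```python
import heapq

# heapq.nlargest maintains a size-k heap internally,
# matching the O(n log k) pattern above.
nums = [3, 1, 5, 12, 2, 11]
print(heapq.nlargest(3, nums))  # [12, 11, 5]
```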

Method 2: Quickselect (O(n) Average)

This is the "optimized" answer often expected when the interviewer asks, "Can we do better than $O(n \log k)$?"

```python
import random

def find_top_k_quickselect(nums, k):
    # Time: O(n) average, O(n^2) worst case
    # Space: O(1) if in-place
    target_index = len(nums) - k

    def partition(left, right, pivot_index):
        pivot_val = nums[pivot_index]
        # Move pivot to end
        nums[pivot_index], nums[right] = nums[right], nums[pivot_index]
        store_index = left
        for i in range(left, right):
            if nums[i] < pivot_val:
                nums[store_index], nums[i] = nums[i], nums[store_index]
                store_index += 1
        # Move pivot to its final place
        nums[right], nums[store_index] = nums[store_index], nums[right]
        return store_index

    def select(left, right):
        if left == right:
            return nums[left]

        pivot_idx = random.randint(left, right)
        pivot_idx = partition(left, right, pivot_idx)

        if pivot_idx == target_index:
            return nums[pivot_idx]
        elif pivot_idx < target_index:
            return select(pivot_idx + 1, right)
        else:
            return select(left, pivot_idx - 1)

    select(0, len(nums) - 1)
    return nums[target_index:]
```
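One caveat: with duplicate-heavy inputs, the strict `<` comparison in the partition can degrade toward the $O(n^2)$ worst case. A common mitigation is a three-way partition. Below is an iterative sketch along those lines (the helper name `quickselect_top_k` is mine, not a standard API; it uses $O(n)$ scratch lists for clarity rather than partitioning in place):

```python
import random

def quickselect_top_k(nums, k):
    # Iterative quickselect with three-way partitioning: elements equal
    # to the pivot are grouped together, so duplicate-heavy inputs
    # shrink the search range quickly instead of degrading to O(n^2).
    nums = list(nums)  # avoid mutating the caller's list
    target = len(nums) - k
    left, right = 0, len(nums) - 1
    while left < right:
        pivot = nums[random.randint(left, right)]
        window = nums[left:right + 1]
        lt = [x for x in window if x < pivot]
        eq = [x for x in window if x == pivot]
        gt = [x for x in window if x > pivot]
        nums[left:right + 1] = lt + eq + gt
        if target < left + len(lt):
            right = left + len(lt) - 1      # target is in the < region
        elif target < left + len(lt) + len(eq):
            break                           # target landed in the == region
        else:
            left = left + len(lt) + len(eq)  # target is in the > region
    return nums[target:]
```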

Complexity Analysis Table

| Method | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Full Sort | $O(n \log n)$ | $O(1)$ | Small $n$, simplicity is priority |
| Min-Heap | $O(n \log k)$ | $O(k)$ | Streaming data, small $k$ |
| Quickselect | $O(n)$ average | $O(1)$ | Large static datasets |
| Bucket Sort | $O(n)$ | $O(n)$ | Known range of frequencies |
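The bucket-sort row deserves a sketch. When the "value" being ranked is a frequency, each count lies between 1 and $n$, so counts can index directly into an array of buckets and we scan from the highest bucket down. A minimal sketch (the function name `top_k_frequent_bucket` is illustrative):

```python
from collections import Counter

def top_k_frequent_bucket(nums, k):
    # O(n): a frequency can never exceed len(nums), so each distinct
    # value is dropped into the bucket indexed by its count.
    count = Counter(nums)
    buckets = [[] for _ in range(len(nums) + 1)]
    for value, freq in count.items():
        buckets[freq].append(value)
    # Walk buckets from the highest frequency down, collecting k values.
    result = []
    for freq in range(len(buckets) - 1, 0, -1):
        for value in buckets[freq]:
            result.append(value)
            if len(result) == k:
                return result
    return result
```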

Common Patterns

1. K-Closest Points to Origin

This pattern applies when the "value" is a calculated distance.

```python
import heapq

# Using heap for K-Closest
def kClosest(points, k):
    # Store (-distance, x, y) in a max-heap to keep the k smallest distances
    heap = []
    for x, y in points:
        dist = -(x**2 + y**2)
        if len(heap) < k:
            heapq.heappush(heap, (dist, x, y))
        else:
            heapq.heappushpop(heap, (dist, x, y))
    return [[x, y] for d, x, y in heap]
```

2. Top K Frequent Elements

This combines a Hash Map with a Heap.

```python
import heapq
from collections import Counter

def topKFrequent(nums, k):
    count = Counter(nums)
    # Return the k elements with the highest frequency
    return heapq.nlargest(k, count.keys(), key=count.get)
```

Practice Problems

This Week's Interview Challenge

Problem: The K-Maintenance Stream

You are designing a system that monitors a stream of incoming integers representing server response times. You need to implement a class KthLargest that maintains the $k^{th}$ largest element of the stream at all times.

Example 1:
Input: k = 3, initial_nums = [4, 5, 8, 2]
Operations: add(3), add(5), add(10), add(9), add(4)
Output: 4, 5, 5, 8, 8

Example 2:
Input: k = 1, initial_nums = []
Operations: add(-3), add(-2)
Output: -3, -2

Constraints:

  • $1 \le k \le 10^4$
  • $0 \le \text{initial_nums.length} \le 10^4$
  • $-10^4 \le \text{val} \le 10^4$
  • At most $10^4$ calls will be made to add.

Hints:

  1. Since the data is a stream, does Quickselect make sense here?
  2. If you need the $k^{th}$ largest, you need to keep track of the $k$ largest elements seen so far.
  3. What happens to the "smallest" of those $k$ largest elements when a new, larger number arrives?
  4. Think about the property of a Min-Heap.

Template Code:

```python
class KthLargest:
    def __init__(self, k: int, nums: list[int]):
        # Initialize your data structure here
        pass

    def add(self, val: int) -> int:
        # Add a new value and return the current kth largest
        pass
```

Submit your solution and check back next week for the detailed answer!

Conclusion

The choice between Heap and Quickselect isn't just about big-O notation; it's about the constraints of your environment. In a distributed system handling live events, the Heap is your best friend because it handles streams gracefully. In a batch processing job on a static file, Quickselect can shave precious seconds off your runtime. When interviewing, always start by clarifying if the data is static or a stream—this single question demonstrates senior-level foresight.
