DSA: Heap vs Quickselect for Top-K
Solution to Last Week's Challenge
Last week's problem was: DSA: Kadane's Algorithm and Variants.
The optimal solution uses Kadane's algorithm with careful state management, tracking the best subarray sum ending at the current index alongside the best sum seen overall.
Complexity Analysis:
- Time Complexity: O(n)
- Space Complexity: O(1)
In the world of high-scale system design and competitive programming, the "Top-K" problem is a recurring theme that tests an engineer's ability to balance time complexity with memory constraints. Whether you are building a real-time leaderboard for a gaming platform, identifying the most frequent search queries in a search engine, or calculating the nearest neighbors in a machine learning model, the core challenge remains the same: efficiently extracting a specific subset of elements from a much larger collection.
Mastering the trade-offs between a Heap-based approach and the Quickselect algorithm is a hallmark of a senior engineer. While both techniques solve the problem of selecting the $k$ largest (or smallest) elements, they serve different operational needs. A Heap is ideal for streaming data where $k$ is small relative to $n$, whereas Quickselect offers superior average-case performance for static datasets. This post will deconstruct these patterns, providing you with a mental framework to choose the right tool during your next system design review or coding interview.
Theory and Complexity
At its core, the Top-K problem asks us to find the $k$ largest (or smallest) elements in an unsorted array of size $n$. The naive approach of sorting the entire array takes $O(n \log n)$ time, which is often unacceptable when $n$ is in the millions and $k$ is small.
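For a concrete baseline, the sort-everything approach is a one-liner; the function name below is ours, used purely for illustration.

def find_top_k_sorted(nums, k):
    # Naive baseline: sort a copy, then slice off the last k elements.
    # Time: O(n log n); Space: O(n) for the sorted copy
    # (an in-place nums.sort() would use O(1) extra space).
    return sorted(nums)[-k:]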
The Heap approach maintains a Min-Heap of size $k$. As we iterate through the $n$ elements, we push to the heap. If the heap exceeds size $k$, we pop the smallest element. By the end, the heap contains the $k$ largest elements. The complexity is $O(n \log k)$.
Quickselect, based on the Quicksort partitioning logic, works by picking a pivot and partitioning the array such that all elements to the left are smaller and all to the right are larger. Unlike Quicksort, which recurses into both halves, Quickselect only recurses into the half containing the $k^{th}$ index. This results in a geometric series $n + n/2 + n/4 \dots$ which sums to $O(n)$ on average.
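A quick back-of-the-envelope check, under the simplifying assumption that each random pivot splits the current range roughly in half:

$$T(n) \approx cn + T\left(\tfrac{n}{2}\right) \le cn\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \dots\right) = 2cn = O(n)$$

The worst case, when every pivot lands at an extreme, degrades to $O(n^2)$, which is why the implementation later in this post picks pivots at random.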
Pattern Visualization
Choosing between these two depends on two factors: the nature of the data (static vs. streaming) and the value of $k$.
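As a rough rule of thumb, the decision can be sketched in code; the function and its thresholds below are illustrative heuristics of ours, not hard rules.

def choose_top_k_strategy(is_streaming: bool, n: int, k: int) -> str:
    # Illustrative heuristic only; real systems also weigh memory limits
    # and whether the input may be mutated.
    if is_streaming:
        # Quickselect needs the full dataset in memory and mutates it,
        # so a size-k min-heap is the natural fit for streams.
        return "min-heap"
    if k < n // 100:
        # Static data with tiny k: O(n log k) is nearly linear, so either works.
        return "min-heap or quickselect"
    # Static data with larger k: quickselect's O(n) average beats O(n log k).
    return "quickselect"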
Implementation
Let's look at how we implement these in Python, moving from the standard Heap approach to the optimized Quickselect.
Method 1: Min-Heap (O(n log k))
This is the most common interview solution due to its balance of readability and efficiency.
import heapq

def find_top_k_heap(nums, k):
    # Time: O(n log k)
    # Space: O(k)
    # Use a min-heap to keep track of the k largest elements
    min_heap = []
    for num in nums:
        heapq.heappush(min_heap, num)
        if len(min_heap) > k:
            heapq.heappop(min_heap)
    return min_heap

Method 2: Quickselect (O(n) Average)
This is the "optimized" answer often expected when the interviewer asks, "Can we do better than $O(n \log k)$?"
import random

def find_top_k_quickselect(nums, k):
    # Time: O(n) average, O(n^2) worst case
    # Space: O(1) if in-place
    target_index = len(nums) - k

    def partition(left, right, pivot_index):
        pivot_val = nums[pivot_index]
        # Move pivot to end
        nums[pivot_index], nums[right] = nums[right], nums[pivot_index]
        store_index = left
        for i in range(left, right):
            if nums[i] < pivot_val:
                nums[store_index], nums[i] = nums[i], nums[store_index]
                store_index += 1
        # Move pivot to its final place
        nums[right], nums[store_index] = nums[store_index], nums[right]
        return store_index

    def select(left, right):
        if left == right:
            return nums[left]
        pivot_idx = random.randint(left, right)
        pivot_idx = partition(left, right, pivot_idx)
        if pivot_idx == target_index:
            return nums[pivot_idx]
        elif pivot_idx < target_index:
            return select(pivot_idx + 1, right)
        else:
            return select(left, pivot_idx - 1)

    select(0, len(nums) - 1)
    return nums[target_index:]

Complexity Analysis Table
| Method | Time Complexity | Space Complexity | Best Use Case |
|---|---|---|---|
| Full Sort | $O(n \log n)$ | $O(1)$ | Small $n$, simplicity is priority |
| Min-Heap | $O(n \log k)$ | $O(k)$ | Streaming data, small $k$ |
| Quickselect | $O(n)$ average | $O(1)$ | Large static datasets |
| Bucket Sort | $O(n)$ | $O(n)$ | Known range of frequencies |
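The Bucket Sort row is the only approach not shown elsewhere in this post, so here is a minimal sketch for the top-k frequent variant; the helper name is ours, and it relies on the fact that a value's count is bounded by the array length.

from collections import Counter

def top_k_frequent_bucket(nums, k):
    # Bucket i holds the values that appear exactly i times; a value can
    # appear at most len(nums) times, so len(nums) + 1 buckets suffice.
    buckets = [[] for _ in range(len(nums) + 1)]
    for val, freq in Counter(nums).items():
        buckets[freq].append(val)
    # Walk the buckets from the highest frequency down and collect k values.
    result = []
    for freq in range(len(buckets) - 1, 0, -1):
        for val in buckets[freq]:
            result.append(val)
            if len(result) == k:
                return result
    return result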
Common Patterns
1. K-Closest Points to Origin
This pattern applies when the "value" is a calculated distance.
import heapq

# Using a heap for K-Closest
def kClosest(points, k):
    # Store (-distance, x, y): negating distances turns Python's min-heap
    # into a max-heap, so the root is the farthest of the k points kept
    heap = []
    for x, y in points:
        dist = -(x**2 + y**2)
        if len(heap) < k:
            heapq.heappush(heap, (dist, x, y))
        else:
            heapq.heappushpop(heap, (dist, x, y))
    return [[x, y] for d, x, y in heap]

2. Top K Frequent Elements
This combines a Hash Map with a Heap.
from collections import Counter
import heapq

def topKFrequent(nums, k):
    count = Counter(nums)
    # Return the k elements with the highest frequency
    return heapq.nlargest(k, count.keys(), key=count.get)

Practice Problems
This Week's Interview Challenge
Problem: The K-Maintenance Stream
You are designing a system that monitors a stream of incoming integers representing server response times. You need to implement a class KthLargest that maintains the $k^{th}$ largest element of the stream at all times.
Example 1:
Input: k = 3, initial_nums = [4, 5, 8, 2]
Operations: add(3), add(5), add(10), add(9), add(4)
Output: 4, 5, 5, 8, 8
Example 2:
Input: k = 1, initial_nums = []
Operations: add(-3), add(-2)
Output: -3, -2
Constraints:
- $1 \le k \le 10^4$
- $0 \le \text{initial\_nums.length} \le 10^4$
- $-10^4 \le \text{val} \le 10^4$
- At most $10^4$ calls will be made to add.
Hints:
- Since the data is a stream, does Quickselect make sense here?
- If you need the $k^{th}$ largest, you need to keep track of the $k$ largest elements seen so far.
- What happens to the "smallest" of those $k$ largest elements when a new, larger number arrives?
- Think about the property of a Min-Heap.
Template Code:
class KthLargest:
def __init__(self, k: int, nums: list[int]):
# Initialize your data structure here
pass
def add(self, val: int) -> int:
# Add a new value and return the current kth largest
passSubmit your solution and check back next week for the detailed answer!
Conclusion
The choice between Heap and Quickselect isn't just about big-O notation; it's about the constraints of your environment. In a distributed system handling live events, the Heap is your best friend because it handles streams gracefully. In a batch processing job on a static file, Quickselect can shave precious seconds off your runtime. When interviewing, always start by clarifying if the data is static or a stream—this single question demonstrates senior-level foresight.
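If you want to sanity-check the static-data claim on your own machine, a rough timing harness along these lines can help; it reuses the two functions defined earlier in this post, the harness name is ours, and absolute numbers will vary by hardware and input.

import random
import time

def benchmark_top_k(n=1_000_000, k=10):
    # Rough, single-run comparison; not a rigorous benchmark.
    data = [random.randint(0, 10**9) for _ in range(n)]

    start = time.perf_counter()
    find_top_k_heap(data, k)
    heap_time = time.perf_counter() - start

    start = time.perf_counter()
    find_top_k_quickselect(list(data), k)  # copy, since quickselect mutates its input
    quickselect_time = time.perf_counter() - start

    print(f"heap: {heap_time:.3f}s, quickselect: {quickselect_time:.3f}s")

if __name__ == "__main__":
    benchmark_top_k()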