DSA: Top-K Elements Using Heaps
Solution to Last Week's Challenge
Last week's problem was: DSA: Sliding Window Pattern Explained
The optimal solution uses dsa: sliding window pattern explained pattern with careful state management.
Complexity Analysis:
- Time Complexity: O(n)
- Space Complexity: O(1)
In the landscape of technical interviews and high-performance system design, the ability to efficiently extract a subset of extreme values from a massive dataset is a fundamental skill. Whether you are building a real-time leaderboard for a gaming platform, identifying the most frequent search queries in a search engine, or calculating the K-closest delivery hubs in a logistics app, you are essentially solving the Top-K elements problem. For a senior engineer, the goal is not just to find a solution, but to find the one that balances memory constraints with computational speed.
While a naive approach involving sorting might suffice for small arrays, it fails to scale when dealing with streaming data or massive datasets that do not fit into memory. This is where the Heap data structure—specifically the Priority Queue—becomes an indispensable tool. By maintaining a specialized tree structure, we can reduce our time complexity from a full sort to a much more manageable logarithmic relationship with K, the number of elements we wish to track.
Theory
The core concept of the Top-K pattern revolves around maintaining a "pool" of the best candidates encountered so far. To find the K largest elements, we use a Min-Heap of size K. As we iterate through the dataset, we compare each new element to the smallest element currently in our "top" set (the root of the Min-Heap). If the new element is larger, it deserves a spot in our Top-K list, so we remove the root and insert the new candidate.
This approach is highly efficient because a Min-Heap allows us to access the smallest element in O(1) time and perform insertions/deletions in O(log K) time. By keeping the heap size restricted to K, we ensure that our memory footprint remains constant regardless of the total number of elements processed (N).
The complexity of the heap-based approach is O(N log K). This is a significant improvement over O(N log N) sorting, especially when K is much smaller than N. In terms of space, we only require O(K) additional memory to store the heap, making it ideal for processing data streams where we cannot store all N elements.
Pattern Visualization
The logic for implementing a Top-K solution follows a consistent pipeline. The following flowchart illustrates the decision-making process for finding the K largest elements in a collection.
Implementation
Let's look at the transition from a brute-force sorting approach to the optimized heap-based implementation using Python.
The Brute Force Approach (Sorting)
def get_top_k_sorting(nums, k):
# Sorting takes O(N log N) time
# Space complexity: O(N) or O(log N) depending on sort implementation
nums.sort(reverse=True)
return nums[:k]The Optimized Approach (Min-Heap)
import heapq
def get_top_k_heap(nums, k):
# Initialize a min-heap with the first k elements: O(K)
min_heap = nums[:k]
heapq.heapify(min_heap)
# Iterate through the remaining elements: O((N-K) log K)
for i in range(k, len(nums)):
if nums[i] > min_heap[0]:
heapq.heapreplace(min_heap, nums[i])
# Total Time Complexity: O(N log K)
# Total Space Complexity: O(K)
return min_heapComplexity Analysis Table
| Approach | Time Complexity | Space Complexity | Best Used When... |
|---|---|---|---|
| Sorting | O(N log N) | O(N) | N is small and K is close to N. |
| Min-Heap | O(N log K) | O(K) | N is large, K is small, or data is streaming. |
| QuickSelect | O(N) average | O(1) | You need the fastest average time and can modify the input. |
| Max-Heap | O(N + K log N) | O(N) | You need to extract elements repeatedly from a static set. |
Common Patterns
The Top-K pattern manifests in several variations during interviews. Here are two of the most common.
1. Top K Frequent Elements
Instead of comparing raw values, we compare frequencies stored in a hash map.
from collections import Counter
import heapq
def top_k_frequent(nums, k):
count = Counter(nums)
# Use a heap to store (frequency, value) tuples
return heapq.nlargest(k, count.keys(), key=count.get)2. K Closest Points to Origin
This variation applies the pattern to spatial data using the Euclidean distance formula.
import heapq
def k_closest(points, k):
# Min-heap stores (-distance, x, y) to simulate a Max-Heap
# This keeps the 'closest' points by removing the 'farthest'
heap = []
for x, y in points:
dist = -(x**2 + y**2)
if len(heap) < k:
heapq.heappush(heap, (dist, x, y))
else:
heapq.heappushpop(heap, (dist, x, y))
return [[x, y] for d, x, y in heap]Practice Problems
To master this pattern, you should progress through problems that increase in complexity, from simple numerical extraction to coordinate geometry and frequency analysis.
This Week's Interview Challenge
Problem: K-th Largest Element in a Stream
Design a class to find the k-th largest element in a stream. Note that it is the k-th largest element in the sorted order, not the k-th distinct element.
Implement the KthLargest class:
KthLargest(int k, int[] nums)Initializes the object with the integerkand the stream of integersnums.int add(int val)Appends the integervalto the stream and returns the element representing the k-th largest element in the stream.
Example 1:
- Input:
["KthLargest", "add", "add", "add"],[[3, [4, 5, 8, 2]], [3], [5], [10]] - Output:
[null, 4, 5, 5] - Explanation:
KthLargest(3, [4, 5, 8, 2])-> Heap:[4, 5, 8]add(3)->[4, 5, 8](3 is smaller than 4), returns 4add(5)->[5, 5, 8](5 replaces 4), returns 5add(10)->[5, 8, 10](10 replaces 5), returns 5
Constraints:
1 <= k <= 10^40 <= nums.length <= 10^4-10^4 <= nums[i], val <= 10^4- At most
10^4calls will be made toadd.
Hints:
- Since we only care about the k-th largest element, do we need to store every single number?
- Think about which type of heap would keep the "smallest of the largest" at the root.
- If the heap size exceeds
k, which element should be removed? - The
addfunction should maintain the heap property and return the root.
Template:
import heapq
class KthLargest:
def __init__(self, k: int, nums: list[int]):
# Initialize your heap here
pass
def add(self, val: int) -> int:
# Add value and return the k-th largest
pass
# Submit your solution and check back next week for the detailed answer!Conclusion
Mastering the Top-K pattern using heaps is a rite of passage for software engineers. It demonstrates a deep understanding of how to optimize for both time and space, especially in environments where data is dynamic. In an interview, always start by clarifying if the data is static or a stream. If it is static, you might mention QuickSelect for O(N) performance, but for almost any other scenario, the Min-Heap is your most robust and reliable tool. Remember: to find the largest elements, use a Min-Heap; to find the smallest elements, use a Max-Heap. This inversion is the key to maintaining the correct boundary of "extreme" values.