DSA: Top-K Elements Using Heaps

7 min read5k

Solution to Last Week's Challenge

Last week's problem was: DSA: Sliding Window Pattern Explained

The optimal solution uses dsa: sliding window pattern explained pattern with careful state management.

Complexity Analysis:

  • Time Complexity: O(n)
  • Space Complexity: O(1)

In the landscape of technical interviews and high-performance system design, the ability to efficiently extract a subset of extreme values from a massive dataset is a fundamental skill. Whether you are building a real-time leaderboard for a gaming platform, identifying the most frequent search queries in a search engine, or calculating the K-closest delivery hubs in a logistics app, you are essentially solving the Top-K elements problem. For a senior engineer, the goal is not just to find a solution, but to find the one that balances memory constraints with computational speed.

While a naive approach involving sorting might suffice for small arrays, it fails to scale when dealing with streaming data or massive datasets that do not fit into memory. This is where the Heap data structure—specifically the Priority Queue—becomes an indispensable tool. By maintaining a specialized tree structure, we can reduce our time complexity from a full sort to a much more manageable logarithmic relationship with K, the number of elements we wish to track.

Theory

The core concept of the Top-K pattern revolves around maintaining a "pool" of the best candidates encountered so far. To find the K largest elements, we use a Min-Heap of size K. As we iterate through the dataset, we compare each new element to the smallest element currently in our "top" set (the root of the Min-Heap). If the new element is larger, it deserves a spot in our Top-K list, so we remove the root and insert the new candidate.

This approach is highly efficient because a Min-Heap allows us to access the smallest element in O(1) time and perform insertions/deletions in O(log K) time. By keeping the heap size restricted to K, we ensure that our memory footprint remains constant regardless of the total number of elements processed (N).

The complexity of the heap-based approach is O(N log K). This is a significant improvement over O(N log N) sorting, especially when K is much smaller than N. In terms of space, we only require O(K) additional memory to store the heap, making it ideal for processing data streams where we cannot store all N elements.

Pattern Visualization

The logic for implementing a Top-K solution follows a consistent pipeline. The following flowchart illustrates the decision-making process for finding the K largest elements in a collection.

Implementation

Let's look at the transition from a brute-force sorting approach to the optimized heap-based implementation using Python.

The Brute Force Approach (Sorting)

python
def get_top_k_sorting(nums, k):
    # Sorting takes O(N log N) time
    # Space complexity: O(N) or O(log N) depending on sort implementation
    nums.sort(reverse=True)
    return nums[:k]

The Optimized Approach (Min-Heap)

python
import heapq

def get_top_k_heap(nums, k):
    # Initialize a min-heap with the first k elements: O(K)
    min_heap = nums[:k]
    heapq.heapify(min_heap)
    
    # Iterate through the remaining elements: O((N-K) log K)
    for i in range(k, len(nums)):
        if nums[i] > min_heap[0]:
            heapq.heapreplace(min_heap, nums[i])
            
    # Total Time Complexity: O(N log K)
    # Total Space Complexity: O(K)
    return min_heap

Complexity Analysis Table

ApproachTime ComplexitySpace ComplexityBest Used When...
SortingO(N log N)O(N)N is small and K is close to N.
Min-HeapO(N log K)O(K)N is large, K is small, or data is streaming.
QuickSelectO(N) averageO(1)You need the fastest average time and can modify the input.
Max-HeapO(N + K log N)O(N)You need to extract elements repeatedly from a static set.

Common Patterns

The Top-K pattern manifests in several variations during interviews. Here are two of the most common.

1. Top K Frequent Elements

Instead of comparing raw values, we compare frequencies stored in a hash map.

python
from collections import Counter
import heapq

def top_k_frequent(nums, k):
    count = Counter(nums)
    # Use a heap to store (frequency, value) tuples
    return heapq.nlargest(k, count.keys(), key=count.get)

2. K Closest Points to Origin

This variation applies the pattern to spatial data using the Euclidean distance formula.

python
import heapq

def k_closest(points, k):
    # Min-heap stores (-distance, x, y) to simulate a Max-Heap
    # This keeps the 'closest' points by removing the 'farthest'
    heap = []
    for x, y in points:
        dist = -(x**2 + y**2)
        if len(heap) < k:
            heapq.heappush(heap, (dist, x, y))
        else:
            heapq.heappushpop(heap, (dist, x, y))
    
    return [[x, y] for d, x, y in heap]

Practice Problems

To master this pattern, you should progress through problems that increase in complexity, from simple numerical extraction to coordinate geometry and frequency analysis.

This Week's Interview Challenge

Problem: K-th Largest Element in a Stream

Design a class to find the k-th largest element in a stream. Note that it is the k-th largest element in the sorted order, not the k-th distinct element.

Implement the KthLargest class:

  • KthLargest(int k, int[] nums) Initializes the object with the integer k and the stream of integers nums.
  • int add(int val) Appends the integer val to the stream and returns the element representing the k-th largest element in the stream.

Example 1:

  • Input: ["KthLargest", "add", "add", "add"], [[3, [4, 5, 8, 2]], [3], [5], [10]]
  • Output: [null, 4, 5, 5]
  • Explanation:
    • KthLargest(3, [4, 5, 8, 2]) -> Heap: [4, 5, 8]
    • add(3) -> [4, 5, 8] (3 is smaller than 4), returns 4
    • add(5) -> [5, 5, 8] (5 replaces 4), returns 5
    • add(10) -> [5, 8, 10] (10 replaces 5), returns 5

Constraints:

  • 1 <= k <= 10^4
  • 0 <= nums.length <= 10^4
  • -10^4 <= nums[i], val <= 10^4
  • At most 10^4 calls will be made to add.

Hints:

  1. Since we only care about the k-th largest element, do we need to store every single number?
  2. Think about which type of heap would keep the "smallest of the largest" at the root.
  3. If the heap size exceeds k, which element should be removed?
  4. The add function should maintain the heap property and return the root.

Template:

python
import heapq

class KthLargest:
    def __init__(self, k: int, nums: list[int]):
        # Initialize your heap here
        pass

    def add(self, val: int) -> int:
        # Add value and return the k-th largest
        pass

# Submit your solution and check back next week for the detailed answer!

Conclusion

Mastering the Top-K pattern using heaps is a rite of passage for software engineers. It demonstrates a deep understanding of how to optimize for both time and space, especially in environments where data is dynamic. In an interview, always start by clarifying if the data is static or a stream. If it is static, you might mention QuickSelect for O(N) performance, but for almost any other scenario, the Min-Heap is your most robust and reliable tool. Remember: to find the largest elements, use a Min-Heap; to find the smallest elements, use a Max-Heap. This inversion is the key to maintaining the correct boundary of "extreme" values.