LeetCode 692 - Top K Frequent Words

LeetCode Problem 692

Difficulty: 🟡 Medium
Topics: Array, Hash Table, String, Trie, Sorting, Heap (Priority Queue), Bucket Sort, Counting

Solution

Problem Understanding

The problem gives us an array of strings called words and an integer k. Each string represents a word, and words may appear multiple times in the array. Our task is to return the k most frequent words.

The result must follow two ordering rules. First, words with higher frequency must appear earlier in the result. Second, if two words have the same frequency, the word with smaller lexicographical order, meaning alphabetical order, must come first.

In other words, we are ranking words using two criteria:

Higher frequency is better.
If frequencies are equal, alphabetically smaller words are better.

For example, if both "apple" and "banana" appear three times, then "apple" must come before "banana" because "apple" is lexicographically smaller.

The input constraints are relatively small. The array length is at most 500, so even an O(n log n) solution is perfectly acceptable. However, the follow-up asks whether we can achieve O(n log k) time, which suggests using a heap to avoid sorting all unique words when we only need the top k.

The constraints also guarantee several important things:

Every word contains only lowercase English letters.
k is always valid.
There is at least one word.
k never exceeds the number of unique words.

These guarantees simplify implementation because we never need to handle invalid input or empty results.

Several edge cases are important to consider:

Multiple words may have exactly the same frequency.
k may equal the number of unique words.
All words may be identical.
Every word may be unique.
Lexicographical ordering becomes critical when frequencies tie.

A naive implementation often fails on tie-breaking rules, especially because the problem requires alphabetical order only when frequencies are equal.

Approaches

Brute Force Approach

The most straightforward solution is to count the frequency of every word using a hash map, then sort all unique words using a custom sorting rule.

The sorting rule is:

Higher frequency first.
Lexicographically smaller word first when frequencies are equal.

After sorting, we simply take the first k words.

This approach is correct because sorting globally guarantees the final ordering satisfies both requirements. However, it may do unnecessary work because we sort all unique words even though we only need the top k.

Key Insight for the Optimal Solution

The key observation is that we do not actually need all words fully sorted if k is much smaller than the number of unique words.

A heap allows us to maintain only the best k candidates seen so far. Instead of sorting everything, we continuously keep the heap size limited to k.

The tricky part is tie-breaking. Since we want:

Higher frequency to rank higher
Lexicographically smaller words to rank higher on ties

We construct a min-heap where the "worst" candidate stays at the top. That way, when the heap grows beyond size k, we remove the least desirable element.

This reduces the complexity from sorting all unique words to maintaining only k elements at a time.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n + m log m)	O(m)	Count frequencies, then sort all unique words
Optimal	O(n log k)	O(m)	Use frequency map and min-heap of size `k`

Here, n is the total number of words and m is the number of unique words.

Algorithm Walkthrough

Optimal Heap-Based Algorithm

Count the frequency of every word using a hash map.

We iterate through the input array once and store counts in a dictionary. This gives constant-time updates and allows efficient frequency lookup later. 2. Create a min-heap to store at most k elements.

The heap will contain pairs representing each word and its frequency. The heap ordering is designed so the least desirable candidate appears at the top. 3. Iterate through all unique words in the frequency map.

For each word:

Push it into the heap.
If the heap size exceeds k, remove the smallest element.

This ensures the heap always contains the current best k words. 4. Extract all words from the heap.

Since the heap is a min-heap, extraction gives words from worst to best. We therefore reverse the final result. 5. Return the reversed list.

After reversal, the result satisfies:

Higher frequencies first
Lexicographically smaller words first for ties

Why it works

The heap invariant is that the heap always contains the best k candidates seen so far. Whenever a new candidate is better than the worst element currently in the heap, it survives while the worse element gets removed.

Because the heap ordering exactly mirrors the ranking rules in reverse priority, removing the smallest heap element always discards the least desirable candidate. After processing all words, the heap therefore contains exactly the top k frequent words.

Python Solution

from typing import List
from collections import Counter
import heapq

class Solution:
    def topKFrequent(self, words: List[str], k: int) -> List[str]:
        frequency = Counter(words)

        heap = []

        for word, count in frequency.items():
            heapq.heappush(heap, (-count, word))

        result = []

        for _ in range(k):
            count, word = heapq.heappop(heap)
            result.append(word)

        return result

The implementation begins by using Counter from Python's collections module to build the frequency map. This is both concise and efficient.

The heap stores tuples of (-count, word). Python's heap implementation is a min-heap, so negating the count effectively simulates max-heap behavior. This allows words with larger frequencies to come out first.

The second tuple element, word, automatically handles lexicographical ordering when frequencies are equal. Python compares tuples element by element, so:

Larger frequency becomes smaller negative number.
Lexicographically smaller word comes first naturally.

We then pop exactly k elements from the heap and append the words to the result list.

Although the earlier walkthrough discussed a heap of size k, Python's tuple ordering allows a simpler implementation that pushes all elements and extracts the top k. This is still efficient for the given constraints and very clean to read.

Go Solution

package main

import (
	"container/heap"
)

type WordFreq struct {
	word  string
	count int
}

type MinHeap []WordFreq

func (h MinHeap) Len() int {
	return len(h)
}

func (h MinHeap) Less(i, j int) bool {
	if h[i].count == h[j].count {
		return h[i].word < h[j].word
	}

	return h[i].count > h[j].count
}

func (h MinHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
}

func (h *MinHeap) Push(x interface{}) {
	*h = append(*h, x.(WordFreq))
}

func (h *MinHeap) Pop() interface{} {
	old := *h
	n := len(old)

	item := old[n-1]
	*h = old[:n-1]

	return item
}

func topKFrequent(words []string, k int) []string {
	frequency := make(map[string]int)

	for _, word := range words {
		frequency[word]++
	}

	h := &MinHeap{}
	heap.Init(h)

	for word, count := range frequency {
		heap.Push(h, WordFreq{
			word:  word,
			count: count,
		})
	}

	result := make([]string, 0, k)

	for i := 0; i < k; i++ {
		item := heap.Pop(h).(WordFreq)
		result = append(result, item.word)
	}

	return result
}

The Go implementation follows the same overall strategy as the Python version but requires explicit heap implementation because Go's container/heap package only provides heap mechanics, not built-in priority queue behavior.

The Less function defines ordering rules. Higher frequency should appear first, so larger counts are treated as higher priority. When counts are equal, lexicographically smaller words rank higher.

Unlike Python, Go does not provide tuple comparison automatically, so the comparison logic must be written explicitly.

Slices are dynamically resized during insertion, and no special handling for nil is required because the constraints guarantee valid input.

Worked Examples

Example 1

Input:

words = ["i","love","leetcode","i","love","coding"]
k = 2

Step 1: Count Frequencies

Word	Frequency
i	2
love	2
leetcode	1
coding	1

Step 2: Build Heap

Heap contents conceptually become:

Priority	Word
2	i
2	love
1	coding
1	leetcode

Because "i" is lexicographically smaller than "love", it ranks higher when frequencies tie.

Step 3: Extract Top K

First pop:

("i", 2)

Second pop:

("love", 2)

Final Result

["i", "love"]

Example 2

Input:

words = ["the","day","is","sunny","the","the","the","sunny","is","is"]
k = 4

Step 1: Count Frequencies

Word	Frequency
the	4
is	3
sunny	2
day	1

Step 2: Build Heap

Conceptual ordering:

Priority	Word
4	the
3	is
2	sunny
1	day

Step 3: Extract Top K

Pop sequence:

Extraction	Word
1	the
2	is
3	sunny
4	day

Final Result

["the", "is", "sunny", "day"]

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n + m log m)	Counting takes O(n), heap insertions take O(m log m)
Space	O(m)	Frequency map and heap store unique words

Here:

n is the total number of words.
m is the number of unique words.

The frequency counting phase is linear because each word is processed exactly once. Heap insertion occurs once per unique word. Since the heap may contain all unique words, insertion complexity becomes O(m log m).

If we instead maintain a heap of size k, the complexity can be improved to O(n log k).

Test Cases

solution = Solution()

# Example 1
assert solution.topKFrequent(
    ["i","love","leetcode","i","love","coding"], 2
) == ["i","love"]

# Example 2
assert solution.topKFrequent(
    ["the","day","is","sunny","the","the","the","sunny","is","is"], 4
) == ["the","is","sunny","day"]

# Single word input
assert solution.topKFrequent(
    ["hello"], 1
) == ["hello"]

# All words identical
assert solution.topKFrequent(
    ["a","a","a","a"], 1
) == ["a"]

# All words unique
assert solution.topKFrequent(
    ["c","b","a"], 3
) == ["a","b","c"]

# Tie frequency with lexicographical ordering
assert solution.topKFrequent(
    ["apple","banana","apple","banana","cat"], 2
) == ["apple","banana"]

# k equals number of unique words
assert solution.topKFrequent(
    ["x","y","z","x"], 3
) == ["x","y","z"]

# Larger frequency dominance
assert solution.topKFrequent(
    ["a","a","a","b","b","c"], 2
) == ["a","b"]

# Lexicographical tie-breaking
assert solution.topKFrequent(
    ["aa","ab","ac","aa","ab","ac"], 3
) == ["aa","ab","ac"]

Test	Why
Example 1	Validates standard frequency counting
Example 2	Validates multiple frequencies
Single word input	Smallest valid input
All words identical	Ensures repeated counting works
All words unique	Tests lexicographical ordering only
Tie frequency case	Validates alphabetical tie-breaking
k equals unique count	Ensures full output handling
Larger frequency dominance	Confirms highest counts come first
Equal frequency ordering	Verifies lexicographical comparison

Edge Cases

One important edge case occurs when all words have the same frequency. In this situation, the output ordering depends entirely on lexicographical order. Many incorrect solutions forget this secondary sorting rule and produce arbitrary ordering. The implementation handles this correctly because tuple comparison in Python automatically compares the word when frequencies tie.

Another important case is when there is only one unique word repeated many times. A buggy implementation may incorrectly assume multiple distinct elements exist or mishandle heap extraction. Since the frequency map stores unique words naturally, the solution correctly returns the single repeated word.

A third edge case happens when k equals the number of unique words. In this scenario, the algorithm must return every unique word in properly sorted order. Some heap-based implementations accidentally discard elements or reverse ordering incorrectly. Because the implementation extracts exactly k elements after building the heap, every unique word appears exactly once in the final result.

A final subtle case involves words that differ only slightly lexicographically, such as "aa", "ab", and "ac". Incorrect custom comparators often mishandle alphabetical ordering. Using tuple ordering in Python guarantees lexicographical comparisons follow the language's built-in string comparison semantics.