LeetCode 192 - Word Frequency

This problem asks us to write a shell script that reads a text file named words.txt and computes how many times each word appears. After counting the occurrences, the script must print every unique word together with its frequency, sorted in descending order of frequency.

LeetCode Problem 192

Difficulty: 🟡 Medium
Topics: Shell

Solution

Problem Understanding

This problem asks us to write a shell script that reads a text file named words.txt and computes how many times each word appears. After counting the occurrences, the script must print every unique word together with its frequency, sorted in descending order of frequency.

The input is represented as a plain text file rather than function arguments. The file contains only lowercase English letters and spaces. A word is guaranteed to contain lowercase letters only, and words are separated by one or more whitespace characters. This guarantee simplifies parsing because we do not need to worry about punctuation, uppercase conversion, or special characters.

The expected output consists of lines where each line contains a word followed by its count, separated by a space. The lines must be ordered from highest frequency to lowest frequency. For example, if "the" appears four times and "sunny" appears twice, "the 4" must appear before "sunny 2".

The constraints tell us several important things about the nature of the problem. Since words are separated by arbitrary whitespace, a naive split based on only single spaces could fail when multiple spaces or newline characters are present. We also know that frequency counts are guaranteed to be unique, meaning there will never be ties in sorting order, so we do not need to handle secondary sorting logic.

Several edge cases are worth identifying upfront. Multiple consecutive spaces or line breaks could cause parsing issues if we assume every word is separated by exactly one space. A file containing only one unique word repeated many times should still work correctly. A file containing only one word must produce exactly one output line. The problem guarantees valid lowercase input, which means we do not need defensive handling for punctuation or malformed words.

Approaches

Brute Force Approach

A brute force solution would repeatedly scan the file for every unique word. First, we could extract all words, then for each word, scan the entire file again and count how many times that word appears.

This approach works because every word's frequency is eventually computed by exhaustive comparison. However, it is inefficient. If there are n total words, and for every word we perform another full traversal of the file, the total work becomes quadratic.

In shell scripting, this might involve nested loops or repeated calls to commands like grep, which would repeatedly reread the file. Since file operations are relatively expensive, this approach performs unnecessary repeated work.

Optimal Approach

The key observation is that we only need to scan the input once to count frequencies. Unix command line utilities already provide efficient tools for text transformation, grouping, and sorting.

The optimal pipeline works as follows:

  1. Convert all whitespace-separated words into one word per line.
  2. Sort the words so identical words become adjacent.
  3. Count consecutive duplicates.
  4. Sort again by frequency in descending order.
  5. Reformat the output to match the required format.

The reason sorting is important is that commands like uniq -c only count consecutive duplicate lines. By sorting first, identical words become grouped together automatically.

A concise one-line Unix pipeline solution is:

cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -rn | awk '{print $2, $1}'

We can improve this slightly by avoiding unnecessary cat usage:

tr -s ' ' '\n' < words.txt | sort | uniq -c | sort -rn | awk '{print $2, $1}'

Here is what each command does:

  • tr -s ' ' '\n' converts spaces into newlines and squeezes repeated spaces.
  • sort groups identical words together.
  • uniq -c counts adjacent duplicates.
  • sort -rn sorts numerically in descending order by count.
  • awk '{print $2, $1}' swaps the output order to word count.
Approach Time Complexity Space Complexity Notes
Brute Force O(n²) O(1) Repeatedly scans the file for each word
Optimal O(n log n) O(n) Uses sorting and grouping utilities

Algorithm Walkthrough

  1. Read the contents of words.txt and normalize the input into one word per line. This is necessary because words may be separated by spaces or newlines, and tools like uniq work line by line.
  2. Sort all words alphabetically. This step is crucial because uniq -c only counts consecutive repeated values. Without sorting, the same word appearing in different parts of the file would be counted separately.
  3. Count consecutive duplicate words using uniq -c. At this stage, every unique word will be paired with its frequency.
  4. Sort the results numerically in descending order based on the frequency count. This ensures the highest frequency words appear first.
  5. Reformat the output into the required word count format. Since uniq -c prints count word, we reverse the order using awk.

The choice of Unix pipes is important because each utility specializes in one task, and pipelines allow data to flow efficiently between commands without creating unnecessary intermediate files.

Python Solution

Since this is a Shell problem, LeetCode expects a Bash script rather than a Python implementation. However, to mirror the counting logic in Python, the equivalent implementation would look like this:

from collections import Counter

class Solution:
    def wordFrequency(self) -> None:
        with open("words.txt", "r", encoding="utf-8") as file:
            words = file.read().split()

        word_counts = Counter(words)

        sorted_words = sorted(
            word_counts.items(),
            key=lambda item: item[1],
            reverse=True
        )

        for word, count in sorted_words:
            print(word, count)

This implementation reads the file, splits words by whitespace, counts frequencies using Counter, sorts by descending frequency, and prints the result.

The code begins by opening words.txt and reading its contents. Calling .split() is sufficient because it automatically handles one or more whitespace characters, including spaces and newlines. The Counter object acts as a hash map that tracks how many times each word appears. After counting, we sort the dictionary items using the frequency value as the sorting key in descending order. Finally, the results are printed in the required format.

Go Solution

Although LeetCode expects Bash for this problem, the same logic can be implemented in Go for comparison:

package main

import (
	"fmt"
	"os"
	"sort"
	"strings"
)

type WordCount struct {
	Word  string
	Count int
}

func wordFrequency() {
	content, err := os.ReadFile("words.txt")
	if err != nil {
		return
	}

	words := strings.Fields(string(content))

	wordCounts := make(map[string]int)

	for _, word := range words {
		wordCounts[word]++
	}

	result := make([]WordCount, 0, len(wordCounts))

	for word, count := range wordCounts {
		result = append(result, WordCount{
			Word:  word,
			Count: count,
		})
	}

	sort.Slice(result, func(i, j int) bool {
		return result[i].Count > result[j].Count
	})

	for _, item := range result {
		fmt.Printf("%s %d\n", item.Word, item.Count)
	}
}

The Go version uses os.ReadFile to load the file content and strings.Fields() to split words using arbitrary whitespace. A map acts as the frequency table, similar to Python's Counter. Because Go maps are unordered, we create a slice of structs and sort it explicitly using sort.Slice().

Unlike Python, Go does not provide a built in frequency counting utility like Counter, so manual counting through a map is necessary. There are no concerns about integer overflow because word frequencies in this problem remain comfortably within Go's integer limits.

Worked Examples

Consider the input:

the day is sunny the the
the sunny is is

Step 1: Split Into Individual Words

After whitespace normalization:

Position Word
1 the
2 day
3 is
4 sunny
5 the
6 the
7 the
8 sunny
9 is
10 is

Step 2: Count Frequencies

The frequency table becomes:

Word Count
the 4
day 1
is 3
sunny 2

Step 3: Sort by Descending Frequency

After sorting:

Word Count
the 4
is 3
sunny 2
day 1

Final Output

the 4
is 3
sunny 2
day 1

If we trace the Unix pipeline:

After tr -s ' ' '\n'

the
day
is
sunny
the
the
the
sunny
is
is

After sort

day
is
is
is
sunny
sunny
the
the
the
the

After uniq -c

1 day
3 is
2 sunny
4 the

After sort -rn

4 the
3 is
2 sunny
1 day

After awk '{print $2, $1}'

the 4
is 3
sunny 2
day 1

Complexity Analysis

Measure Complexity Explanation
Time O(n log n) Sorting dominates the runtime
Space O(n) Sorting and storing words require extra memory

The algorithm spends most of its time sorting the words. If there are n words, sorting requires O(n log n) time. Operations like splitting input and counting duplicates are linear. The space complexity is O(n) because the words and intermediate sorted data structures must be stored.

Edge Cases

One important edge case is multiple consecutive spaces or newline characters between words. A naive parser that splits only on single spaces could produce empty tokens or incorrect counts. The implementation handles this correctly because tr -s compresses repeated spaces in Bash, while Python's .split() and Go's strings.Fields() naturally handle arbitrary whitespace.

Another edge case is a file containing only one unique word repeated many times, such as:

hello hello hello hello

A poor implementation might accidentally print duplicate entries if grouping is not handled properly. The sorting followed by uniq -c guarantees that identical words are merged into a single frequency count.

A third edge case is a file containing exactly one word:

leetcode

The algorithm still behaves correctly because sorting and counting work for single element inputs. The output becomes:

leetcode 1

Finally, words appearing across multiple lines could be problematic if line breaks are treated differently from spaces. Since the problem states words are separated by whitespace characters, using tools that normalize whitespace ensures correct behavior regardless of line formatting.