LeetCode 192: Word Frequency

A clear explanation of the Word Frequency shell problem using Unix text-processing tools.

Problem Restatement

We are given a text file named words.txt.

We need to write a Bash script that counts how many times each word appears in the file.

The output should show each word followed by its frequency, sorted from highest frequency to lowest frequency.

The problem assumes that:

Rule Meaning
Characters The file contains only lowercase letters and spaces
Words Each word contains lowercase letters only
Separators Words are separated by one or more whitespace characters
Ties Frequency counts are unique, so tie handling is not needed

Example input:

the day is sunny the the
the sunny is is

Expected output:

the 4
is 3
sunny 2
day 1

Input and Output

Item Meaning
Input A file named words.txt
Output Each word and its frequency
Order Descending by frequency
Language Bash shell script

Key Insight

The important detail is that uniq -c only counts adjacent duplicate lines.

So before counting, we must make equal words adjacent.

That gives us the pipeline:

  1. Put each word on its own line.
  2. Sort the words alphabetically.
  3. Count adjacent equal words.
  4. Sort the counts in descending order.
  5. Print the word first, then the count.

Algorithm

Start with the file:

cat words.txt

Convert spaces into newlines:

tr -s ' ' '\n'

The -s option squeezes repeated spaces, so multiple spaces behave like one separator.

Then sort the words:

sort

Now equal words are next to each other.

Count repeated adjacent words:

uniq -c

This produces output like:

1 day
3 is
2 sunny
4 the

Sort by count in descending numeric order:

sort -nr

Finally, swap the columns so the word appears before the count:

awk '{ print $2, $1 }'

Correctness

After tr -s ' ' '\n', every word from the file appears on its own line.

After sort, all equal words are grouped together. Therefore, each group contains exactly all occurrences of one word.

The command uniq -c counts the size of each group, so it computes the correct frequency for every distinct word.

Then sort -nr orders the counted rows by descending numeric frequency.

Finally, awk '{ print $2, $1 }' changes the display format from count word to word count.

Therefore, the pipeline prints every word with its correct frequency in descending frequency order.

Complexity

Let n be the number of words in the file.

Metric Value Why
Time O(n log n) Sorting dominates the pipeline
Space O(n) Sorting may need storage proportional to the input size

Implementation

# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'

Code Explanation

The command starts by reading the file:

cat words.txt

Then it normalizes the input so each word appears on a separate line:

tr -s ' ' '\n'

Next, it sorts the words:

sort

This step is required because uniq -c only counts duplicate lines that are next to each other.

Then it counts each group:

uniq -c

The result has the count first and the word second.

Then it sorts by frequency:

sort -nr

The -n option means numeric sort.

The -r option means reverse order, so larger counts come first.

Finally:

awk '{ print $2, $1 }'

prints the second field first, then the first field.

So this:

4 the

becomes:

the 4

Testing

Create a sample file:

cat > words.txt << 'EOF'
the day is sunny the the
the sunny is is
EOF

Run the solution:

cat words.txt | tr -s ' ' '\n' | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'

Expected output:

the 4
is 3
sunny 2
day 1

Another test with repeated spaces:

cat > words.txt << 'EOF'
a   b a
c b   a
EOF

Expected output:

a 3
b 2
c 1