LeetCode 820 - Short Encoding of Words

The problem asks us to find the length of the shortest reference string that can encode a list of words. A reference string is formed by concatenating some of the words with a '' character at the end of each word.

LeetCode Problem 820

Difficulty: 🟡 Medium
Topics: Array, Hash Table, String, Trie

Solution

Problem Understanding

The problem asks us to find the length of the shortest reference string that can encode a list of words. A reference string is formed by concatenating some of the words with a '#' character at the end of each word. Importantly, a word can be represented by a substring of the reference string if it appears as a suffix of another word.

For example, if we have ["time", "me", "bell"], the word "me" is a suffix of "time", so it does not need to be included separately. The valid encoding "time#bell#" covers all words because "time" and "bell" are directly in the string, and "me" can be found as the substring "time"[2:].

The input is an array of words where each word has length up to 7 and there can be up to 2000 words. This is small enough to consider solutions using tries or sets efficiently. The key points are that words are all lowercase and can have overlaps as suffixes.

Edge cases include a single word, words that are identical, or words that are entirely suffixes of others. For example, ["a", "aa", "aaa"] requires careful handling to avoid double counting.

Approaches

Brute Force

The brute-force approach would try every possible order of concatenating words with '#' and then check whether each word is a substring ending before a '#'. While this could eventually find the shortest reference string, the number of permutations grows factorially with the number of words (O(n!)), making it infeasible for n = 2000.

Optimal Approach

The key observation is that a word only needs to be explicitly included in the reference string if it is not a suffix of another word. This means we can focus on removing words that are suffixes of other words.

One efficient method is to use a set of all words. For each word, we remove all its suffixes (other than itself) from the set. At the end, only words that are not suffixes of any other word remain, and the total encoding length is the sum of their lengths plus one '#' per word.

Alternatively, we can use a Trie built on the reversed words. Words that end in leaf nodes of the Trie are the ones that must be included in the encoding. This method scales well and is clean in implementation.

Approach Time Complexity Space Complexity Notes
Brute Force O(n!) O(n) Try all concatenations, impractical for n=2000
Suffix Set / Trie O(n * k^2) O(n * k) Remove suffixes or use reversed Trie to find necessary words

Algorithm Walkthrough

  1. Initialize a set containing all words.
  2. Iterate over each word in the list.
  3. For each word, generate all possible non-empty suffixes (excluding the word itself) and remove them from the set.

This ensures only words that are not suffixes of other words remain. 4. After processing all words, each remaining word in the set must appear in the encoding. 5. The shortest reference string length is the sum of the lengths of the remaining words plus one '#' per word. 6. Return this total length.

Why it works: By removing all words that are suffixes of another, we guarantee no word is redundantly encoded. Each remaining word contributes exactly once to the final encoding, ensuring minimal length.

Python Solution

from typing import List

class Solution:
    def minimumLengthEncoding(self, words: List[str]) -> int:
        unique_words = set(words)
        for word in words:
            for k in range(1, len(word)):
                suffix = word[k:]
                if suffix in unique_words:
                    unique_words.discard(suffix)
        return sum(len(word) + 1 for word in unique_words)

Explanation:

We start by creating a set of all words to allow O(1) removal of suffixes. For each word, we generate every possible suffix and remove it from the set if it exists. Finally, we sum the lengths of the remaining words plus one for the '#' character per word to get the minimal encoding length.

Go Solution

func minimumLengthEncoding(words []string) int {
    uniqueWords := make(map[string]struct{})
    for _, word := range words {
        uniqueWords[word] = struct{}{}
    }

    for _, word := range words {
        for k := 1; k < len(word); k++ {
            suffix := word[k:]
            delete(uniqueWords, suffix)
        }
    }

    total := 0
    for word := range uniqueWords {
        total += len(word) + 1
    }
    return total
}

Explanation:

In Go, we use a map[string]struct{} to represent the set of unique words. For each word, we delete all its suffixes from the map. Finally, we iterate over the remaining words and sum their lengths plus one for '#' to get the result.

Worked Examples

Example 1: ["time", "me", "bell"]

Step Set contents Action
Initial {"time", "me", "bell"} Start
Word "time" {"time", "me", "bell"} Remove "ime", "me", "e" → "me" removed
Word "me" {"time", "bell"} Remove "e" → no change
Word "bell" {"time", "bell"} Remove "ell", "ll", "l" → no change
Sum lengths 4+1 + 4+1 = 10 Final answer 10

Example 2: ["t"]

Step Set contents Action
Initial {"t"} Start
Word "t" {"t"} No suffixes to remove
Sum lengths 1+1 = 2 Final answer 2

Complexity Analysis

Measure Complexity Explanation
Time O(n * k^2) Each word generates k suffixes, each removal is O(1) in a set, n words total
Space O(n * k) Set stores up to n words of length k

The complexity is efficient for n ≤ 2000 and k ≤ 7.

Test Cases

# Basic examples
assert Solution().minimumLengthEncoding(["time", "me", "bell"]) == 10  # "time#bell#"
assert Solution().minimumLengthEncoding(["t"]) == 2  # "t#"

# Edge cases
assert Solution().minimumLengthEncoding(["a", "aa", "aaa"]) == 4  # "aaa#"
assert Solution().minimumLengthEncoding(["me", "time", "ime"]) == 5  # "time#"
assert Solution().minimumLengthEncoding(["apple", "ple", "e"]) == 6  # "apple#"

# Duplicate words
assert Solution().minimumLengthEncoding(["time", "time"]) == 5  # "time#"
Test Why
["time", "me", "bell"] Checks standard suffix removal
["t"] Single word minimal case
["a", "aa", "aaa"] Nested suffixes
["me", "time", "ime"] Suffixes in different order
["apple", "ple", "e"] All suffixes included
["time", "time"] Duplicate words should not increase length

Edge Cases

One important edge case is when a single character word exists alongside longer words that end with the same character, e.g., ["a", "ba"]. The single character "a" should be removed as it is a suffix of "ba".

Another edge case is multiple identical words. The algorithm must handle duplicates correctly by using a set, ensuring each word contributes only once to the encoding length.

Finally, completely nested suffix chains such as ["a", "aa", "aaa", "aaaa"] must be handled so only the longest words that are not suffixes themselves are included, ensuring minimal encoding length.