LeetCode 1044 - Longest Duplicate Substring

Here is the complete, detailed technical solution guide for LeetCode 1044 - Longest Duplicate Substring, following your formatting rules exactly. The problem asks us to find the longest duplicated substring in a given string s.

LeetCode Problem 1044

Difficulty: 🔴 Hard
Topics: String, Binary Search, Sliding Window, Rolling Hash, Suffix Array, Hash Function

Solution

Here is the complete, detailed technical solution guide for LeetCode 1044 - Longest Duplicate Substring, following your formatting rules exactly.

Problem Understanding

The problem asks us to find the longest duplicated substring in a given string s. A duplicated substring is a contiguous sequence of characters that appears at least twice in s. The occurrences may overlap. For example, in the string "banana", "ana" occurs twice and is the longest such substring. If no duplicated substring exists, we return an empty string "".

The input is a string s of length between 2 and 30,000 consisting only of lowercase English letters. The output is a single string that is one of the longest duplicated substrings. The problem does not require all possible duplicates, only one longest duplicate.

The constraints indicate that the input can be large, so naive solutions that examine all substring pairs will not be efficient. Edge cases include strings with all unique characters, strings where the longest duplicate occurs at the start or end, and strings where duplicates overlap.

Approaches

The brute-force approach is to generate all possible substrings of s, store their occurrences in a hash map, and track the longest substring that occurs at least twice. This approach is correct but extremely inefficient: for a string of length n, there are O(n²) substrings, and checking each substring for duplicates requires O(n) comparisons, leading to O(n³) time complexity. This is impractical for n up to 30,000.

A more efficient approach combines binary search with rolling hash (Rabin-Karp). The key insight is that the length of the longest duplicate substring can be found using binary search. For a given candidate length L, we can efficiently check whether a duplicate of length L exists using a rolling hash to compute substring hashes in O(n) time. By adjusting L using binary search, we can find the maximum length with duplicates efficiently.

Approach Time Complexity Space Complexity Notes
Brute Force O(n³) O(n²) Generate all substrings and check for duplicates
Binary Search + Rolling Hash O(n log n) average O(n) Use binary search on length and rolling hash to check duplicates

Algorithm Walkthrough

  1. Initialize search bounds: Set low = 1 and high = len(s). These represent the possible lengths of duplicate substrings.
  2. Binary search: While low < high, check the middle length mid = (low + high) // 2. Use a helper function to determine whether a duplicate substring of length mid exists.
  3. Rolling hash: Convert each substring of length mid into a numeric hash using a base (e.g., 26) and modulus (large prime to avoid collisions). Use a sliding window to compute the next hash from the previous hash in O(1) time.
  4. Check duplicates: Store each hash in a set. If a new hash already exists in the set, a duplicate exists. If a duplicate is found, update the result substring and move the binary search to the right half (low = mid + 1) to try longer lengths. Otherwise, move to the left half (high = mid).
  5. Return the result: After binary search finishes, return the longest substring found.

Why it works: Binary search ensures we efficiently search for the maximum length. Rolling hash guarantees that substring comparisons are O(1) on average, and the set of hashes efficiently detects duplicates. Using modulus reduces hash collisions, and in practice the algorithm reliably finds the longest duplicate substring.

Python Solution

class Solution:
    def longestDupSubstring(self, s: str) -> str:
        def search(length: int) -> str:
            base = 26
            mod = 2**63 - 1
            n = len(s)
            h = 0
            for i in range(length):
                h = (h * base + ord(s[i]) - ord('a')) % mod
            seen = {h}
            baseL = pow(base, length, mod)
            for start in range(1, n - length + 1):
                h = (h * base - (ord(s[start - 1]) - ord('a')) * baseL + (ord(s[start + length - 1]) - ord('a'))) % mod
                if h in seen:
                    return s[start:start + length]
                seen.add(h)
            return ""
        
        low, high = 1, len(s)
        result = ""
        while low <= high:
            mid = (low + high) // 2
            dup = search(mid)
            if dup:
                result = dup
                low = mid + 1
            else:
                high = mid - 1
        return result

The Python implementation uses a nested function search(length) to find any duplicated substring of the given length using a rolling hash. Binary search determines the maximum length for which a duplicate exists. ord(s[i]) - ord('a') converts characters to integers for hashing. baseL is used to remove the leftmost character from the rolling hash efficiently.

Go Solution

func longestDupSubstring(s string) string {
    n := len(s)
    base := int64(26)
    mod := int64(1<<63 - 1)
    
    search := func(length int) string {
        var h int64 = 0
        for i := 0; i < length; i++ {
            h = (h*base + int64(s[i]-'a')) % mod
        }
        seen := map[int64]bool{h: true}
        baseL := int64(1)
        for i := 0; i < length; i++ {
            baseL = (baseL * base) % mod
        }
        for start := 1; start <= n-length; start++ {
            h = (h*base - int64(s[start-1]-'a')*baseL + int64(s[start+length-1]-'a')) % mod
            if h < 0 {
                h += mod
            }
            if seen[h] {
                return s[start : start+length]
            }
            seen[h] = true
        }
        return ""
    }
    
    low, high := 1, n
    result := ""
    for low <= high {
        mid := (low + high) / 2
        dup := search(mid)
        if dup != "" {
            result = dup
            low = mid + 1
        } else {
            high = mid - 1
        }
    }
    return result
}

The Go implementation mirrors the Python approach. Key differences include using int64 to prevent integer overflow and explicitly adjusting negative hash values due to Go’s modulo behavior. Maps are used for hash sets.

Worked Examples

Example 1: s = "banana"

Binary search lengths: low = 1, high = 6.

Iteration mid Duplicate Found Result
1 3 "ana" "ana"
2 5 "" "ana"
3 4 "" "ana"

Final output: "ana"

Example 2: s = "abcd"

Binary search lengths: low = 1, high = 4.

Iteration mid Duplicate Found Result
1 2 "" ""
2 1 "" ""

Final output: ""

Complexity Analysis

Measure Complexity Explanation
Time O(n log n) average Binary search over lengths (log n) and rolling hash over n characters for each length
Space O(n) Hash set stores substring hashes for the current length

The average time is O(n log n) assuming hash collisions are rare. Space is O(n) for storing hashes.

Test Cases

# Provided examples
assert Solution().longestDupSubstring("banana") == "ana"  # longest duplicate in "banana"
assert Solution().longestDupSubstring("abcd") == ""       # no duplicates

# Edge and stress cases
assert Solution().longestDupSubstring("aaaaa") == "aaaa"   # repeated single character
assert Solution().longestDupSubstring("abcabcabc") == "abcabc"  # repeated pattern
assert Solution().longestDupSubstring("abababab") == "ababab"  # overlapping duplicates
assert Solution().longestDupSubstring("a"*30000) == "a"*29999  # max input size
assert Solution().longestDupSubstring("abacaba") in ["aba","aca"]  # multiple options
Test Why
"banana" Standard example with overlapping duplicate
"abcd" No duplicates
"aaaaa" Single character repeated, tests large duplicate
"abcabcabc" Repeated pattern, non-overlapping
"abababab" Overlapping duplicates
"a"*30000 Max input size stress test
"abacaba" Multiple valid longest duplicates

Edge Cases

  1. All unique characters: Input like "abcdef" has no duplicates. The rolling hash correctly returns "" because no hash repeats.
  2. Single character repeated: Input