LeetCode 3557 - Find Maximum Number of Non Intersecting Substrings
This problem asks us to identify the maximum number of non-intersecting substrings from a given string word such that each substring is at least four characters long and starts and ends with the same character.
Difficulty: 🟡 Medium
Topics: Hash Table, String, Dynamic Programming, Greedy
Solution
Problem Understanding
This problem asks us to identify the maximum number of non-intersecting substrings from a given string word such that each substring is at least four characters long and starts and ends with the same character. In other words, we are scanning the string for patterns where the first and last character match and the substring is long enough, while ensuring that chosen substrings do not overlap.
The input is a single string word of lowercase English letters, with a length up to 2 * 10^5. The output is a single integer representing the maximum number of valid, non-overlapping substrings.
Key observations and edge cases include cases where the string has repeated letters close together (allowing many valid substrings) versus strings with sparse repeated letters (limiting substrings). Strings shorter than four characters cannot contribute to the count. Intersections are disallowed, so we cannot pick overlapping candidates even if they are valid individually.
Approaches
The brute-force approach is to iterate over all possible substrings of word of length at least four, check if the substring starts and ends with the same character, and then attempt to select a set of non-overlapping substrings that maximizes the count. This approach is correct but inefficient because it generates O(n^2) substrings and checking non-overlap combinations would lead to exponential complexity, which is infeasible for n up to 200,000.
The key insight for an optimal solution is that for a substring starting at index i with character c, the earliest valid substring must end at the furthest occurrence of c that includes all characters that appear between the first and last occurrence of c. This is similar to the "interval merging" problem. Once we calculate the minimal valid intervals for all characters, we can use a greedy strategy: sort intervals by end position and choose the next interval that starts after the previous one ends. This guarantees the maximum number of non-overlapping substrings.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n^3) | O(n^2) | Generate all substrings and check overlap, infeasible for large n |
| Optimal | O(n) | O(n) | Calculate minimal intervals per character, then select non-overlapping intervals greedily |
Algorithm Walkthrough
- First, record the first and last occurrences of every character in the string. This can be done in a single pass over the string using a hash map or array of size 26.
- Iterate over each character, treating the first occurrence as a potential start of a substring. Expand the substring to the right to include the last occurrence of all characters seen within the interval. This ensures the substring is minimal but valid.
- Store all intervals of substrings that are at least four characters long.
- Sort these intervals by their end index to prepare for greedy selection.
- Initialize a variable
prev_endto-1to track the end of the last selected substring. Iterate over intervals in sorted order. If the start index of the current interval is greater thanprev_end, select it and updateprev_end. - Count the number of selected intervals. This is the maximum number of non-overlapping substrings satisfying the conditions.
Why it works: By extending intervals to cover all last occurrences of characters inside, we ensure that each candidate substring is valid and minimal. Sorting by end positions ensures that when we select an interval, we leave maximal space for future intervals, achieving the maximum count.
We are given a string word of length $n$, where $1 \le n \le 2 \cdot 10^5$. The task is to select as many substrings as possible such that each selected substring satisfies three constraints simultaneously.
First, every substring must have length at least four, meaning that if a substring spans indices $[i, j]$, then $j - i + 1 \ge 4$, equivalently $j \ge i + 3$. Second, each substring must start and end with the same character, so $word[i] = word[j]$. Third, no two chosen substrings may intersect, meaning their index intervals must be disjoint.
The output is the maximum cardinality of such a set of valid substrings.
The constraints immediately imply that a naive enumeration of all substrings is infeasible, since there are $O(n^2)$ substrings. The input size up to $2 \cdot 10^5$ strongly suggests an $O(n \log n)$ or $O(n)$ solution.
Edge cases arise when no character repeats with sufficient distance (answer is 0), when many overlapping candidate substrings exist (requiring greedy selection), and when multiple valid endpoints exist for the same starting index (requiring a consistent tie-breaking rule).
Approaches
Brute Force
A direct approach considers every pair of indices $(i, j)$ such that $i < j$, checks whether $word[i] = word[j]$, and verifies whether $j - i + 1 \ge 4$. Each such pair defines a candidate interval. Once all intervals are collected, the problem reduces to selecting the maximum number of non-overlapping intervals, which can be solved via sorting by end time and backtracking or dynamic programming.
However, generating all valid intervals already costs $O(n^2)$, and even with optimal interval scheduling, the preprocessing dominates. This is therefore computationally infeasible for the input constraints.
Key Insight
The critical observation is that for each starting position $i$, it is never beneficial to consider multiple ending positions for the same character occurrence. If multiple valid endpoints exist for a fixed $i$, the optimal choice is always the smallest valid endpoint $j$, since it leaves the most remaining space for future intervals.
Thus, each index $i$ can contribute at most one candidate interval: the earliest index $j \ge i + 3$ such that $word[i] = word[j]$.
Once we construct this reduced set of at most $n$ intervals, the problem becomes a classic interval scheduling maximization problem with unit weight, solvable greedily by sorting intervals by their end points and selecting compatible ones.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | $O(n^2)$ | $O(n^2)$ | Enumerate all valid substrings and run interval scheduling |
| Optimal | $O(n \log n)$ | $O(n)$ | Build one interval per index using binary search, then greedy selection |
Algorithm Walkthrough
We construct a set of candidate intervals and then apply greedy selection.
- We preprocess the string by building a mapping from each character $c \in [a..z]$ to a sorted list of indices where $c$ appears. This structure allows efficient successor queries.
- For each index $i$, we attempt to treat it as the left endpoint of a substring. Let $c = word[i]$. We require an index $j \ge i + 3$ such that $word[j] = c$.
- Using binary search on the precomputed list for character $c$, we find the smallest valid index $j$ satisfying $j \ge i + 3$. If no such $j$ exists, index $i$ cannot serve as a valid substring start.
- We construct an interval $(i, j)$ for each valid pair. This produces at most $n$ intervals.
- We sort all intervals by their right endpoint $j$ in ascending order. This ordering is essential for greedy optimality.
- We iterate through sorted intervals and maintain a variable
last_end, initially $-1$. For each interval $(l, r)$, if $l > last_end$, we select it and updatelast_end = r. - The count of selected intervals is returned as the final answer.
Why it works
The correctness follows from the classical exchange argument for interval scheduling. Since all intervals have equal weight, any optimal solution can be transformed into one that always picks the interval with the smallest finishing time among those compatible with previous choices. Our preprocessing ensures we never include a dominated interval for a fixed start index, so greedy selection over endpoints yields a maximal compatible set.
Python Solution
class Solution:
def maxSubstrings(self, word: str) -> int:
n = len(word)
first = {}
last = {}
for i, c in enumerate(word):
if c not in first:
first[c] = i
last[c] = i
intervals = []
for i, c in enumerate(word):
if i != first[c]:
continue
start = i
end = last[c]
j = start
while j <= end:
end = max(end, last[word[j]])
j += 1
if end - start + 1 >= 4:
intervals.append((start, end))
intervals.sort(key=lambda x: x[1])
count = 0
prev_end = -1
for start, end in intervals:
if start > prev_end:
count += 1
prev_end = end
return count
The code first records first and last occurrences of each character, then constructs valid minimal substrings as intervals. Intervals are sorted by their end positions, and a greedy pass selects the maximum number of non-overlapping substrings. from bisect import bisect_left from collections import defaultdict from typing import List
class Solution: def maxSubstrings(self, word: str) -> int: n = len(word) pos = defaultdict(list)
for i, ch in enumerate(word):
pos[ch].append(i)
intervals: List[tuple[int, int]] = []
for i, ch in enumerate(word):
arr = pos[ch]
target = i + 3
idx = bisect_left(arr, target)
if idx < len(arr):
j = arr[idx]
intervals.append((i, j))
intervals.sort(key=lambda x: x[1])
res = 0
last_end = -1
for l, r in intervals:
if l > last_end:
res += 1
last_end = r
return res
The implementation first constructs occurrence lists for each character. For every index $i$, it performs a binary search to find the earliest valid endpoint $j$. Each such pair forms a candidate interval. Sorting by right endpoint enables greedy selection. The variable `last_end` enforces non-intersection.
## Go Solution
```go
func maxSubstrings(word string) int {
n := len(word)
first := make(map[byte]int)
last := make(map[byte]int)
for i := 0; i < n; i++ {
c := word[i]
if _, ok := first[c]; !ok {
first[c] = i
}
last[c] = i
}
type Interval struct {
start int
end int
}
intervals := []Interval{}
for i := 0; i < n; i++ {
c := word[i]
if i != first[c] {
continue
}
start := i
end := last[c]
j := start
for j <= end {
if last[word[j]] > end {
end = last[word[j]]
}
j++
}
if end-start+1 >= 4 {
intervals = append(intervals, Interval{start, end})
import (
"sort"
)
func maxSubstrings(word string) int {
n := len(word)
pos := make(map[byte][]int)
for i := 0; i < n; i++ {
c := word[i]
pos[c] = append(pos[c], i)
}
type interval struct {
l int
r int
}
intervals := make([]interval, 0, n)
for i := 0; i < n; i++ {
c := word[i]
arr := pos[c]
target := i + 3
idx := sort.Search(len(arr), func(k int) bool {
return arr[k] >= target
})
if idx < len(arr) {
intervals = append(intervals, interval{i, arr[idx]})
}
}
sort.Slice(intervals, func(i, j int) bool {
return intervals[i].end < intervals[j].end
})
count := 0
prevEnd := -1
for _, inter := range intervals {
if inter.start > prevEnd {
count++
prevEnd = inter.end
}
}
return count
}
The Go implementation mirrors the Python logic, using maps for first and last occurrences and a slice of custom Interval structs. Sorting is done with sort.Slice, and the greedy selection is identical.
Worked Examples
Example 1: word = "abcdeafdef"
First and last occurrences:
| Char | First | Last |
|---|---|---|
| a | 0 | 5 |
| b | 1 | 1 |
| c | 2 | 2 |
| d | 3 | 7 |
| e | 4 | 8 |
| f | 6 | 9 |
Intervals generated:
aat index 0 → end expands to 5 (abcdea)dat index 3 → end expands to 9 (defoverlaps with previous, only non-overlapping selected)
Final greedy selection picks abcdea and fdef → count = 2.
Example 2: word = "bcdaaaab"
Intervals:
bat 0 → end = 0 (length < 4, discard)aat 3 → end = 6 → interval length 4 (aaaa)
Only one substring possible → count = 1. return intervals[i].r < intervals[j].r })
res := 0
lastEnd := -1
for _, in := range intervals {
if in.l > lastEnd {
res++
lastEnd = in.r
}
}
return res
}
The Go implementation mirrors the Python logic. The primary difference is the explicit use of `sort.Search` for binary search and struct-based interval representation. Slices are used instead of lists, and character indexing relies on bytes due to the lowercase alphabet constraint.
## Worked Examples
### Example 1: `"abcdeafdef"`
We first build intervals.
| i | char | valid j | interval |
| --- | --- | --- | --- |
| 0 | a | 5 | (0,5) |
| 1 | b | none | - |
| 2 | c | none | - |
| 3 | d | none | - |
| 4 | e | none | - |
| 5 | a | none | - |
| 6 | f | 9 | (6,9) |
| 7 | d | none | - |
| 8 | e | none | - |
| 9 | f | none | - |
After sorting by end: $(0,5), (6,9)$
Greedy selection:
We pick $(0,5)$, then $(6,9)$ since $6 > 5$. Result is 2.
### Example 2: `"bcdaaaab"`
Intervals:
| i | char | valid j | interval |
| --- | --- | --- | --- |
| 0 | b | 7 | (0,7) |
| 1 | c | none | - |
| 2 | d | none | - |
| 3 | a | 6 | (3,6) |
| 4 | a | 6 | (4,6) |
| 5 | a | 6 | (5,6) |
| 6 | a | none | - |
| 7 | b | none | - |
After reduction, intervals are $(0,7)$, $(3,6)$, $(4,6)$, $(5,6)$.
Sorting by end gives: $(3,6), (4,6), (5,6), (0,7)$
Greedy selection:
We pick $(3,6)$, then skip all others starting before or equal to 6, and finally cannot take $(0,7)$ since it overlaps. Result is 1.
## Complexity Analysis
| Measure | Complexity | Explanation |
| --- | --- | --- |
| Time | O(n) | Single pass to compute first/last, expand intervals, and sort intervals (at most 26 intervals per char), greedy selection is linear |
| Space | O(n) | Storage for first/last maps and intervals list |
The algorithm scales linearly with the string length, making it efficient for the maximum constraint of 200,000 characters.
| Time | $O(n \log n)$ | Each index performs a binary search over its character list, followed by sorting intervals |
| Space | $O(n)$ | Stores position lists and at most one interval per index |
The logarithmic factor arises from binary searching within character occurrence lists. Since each index contributes at most one lookup, the total cost is bounded by $O(n \log n)$.
## Test Cases
s = Solution() assert s.maxSubstrings("abcdeafdef") == 2 # example 1 assert s.maxSubstrings("bcdaaaab") == 1 # example 2 assert s.maxSubstrings("aaaa") == 1 # minimal length 4 substring assert s.maxSubstrings("abcd") == 0 # no valid substring assert s.maxSubstrings("aabbccddeeff") == 3 # multiple disjoint substrings assert s.maxSubstrings("a"*100000 + "b"*100000) == 2 # stress test large input assert Solution().maxSubstrings("abcdeafdef") == 2 # example 1 basic case assert Solution().maxSubstrings("bcdaaaab") == 1 # example 2 overlapping constraint assert Solution().maxSubstrings("aaaa") == 1 # single minimal valid substring assert Solution().maxSubstrings("ab") == 0 # no valid length >= 4 assert Solution().maxSubstrings("abcdabcd") == 2 # multiple disjoint repeats assert Solution().maxSubstrings("aabbccddaa") == 2 # mixed repeats with spacing assert Solution().maxSubstrings("abcabcabcabc") == 3 # dense repeating structure
| Test | Why |
| --- | --- |
| "abcdeafdef" | Multiple overlapping substrings, tests greedy selection |
| "bcdaaaab" | Only one valid substring |
| "aaaa" | Minimal length substring allowed |
| "abcd" | No substring satisfies length and start-end condition |
| "aabbccddeeff" | Multiple disjoint valid substrings |
| large input of repeated letters | Stress test for efficiency |
## Edge Cases
1. **String shorter than four characters**: Any string with length less than 4 cannot contain a valid substring. The algorithm correctly handles this by checking `end - start + 1 >= 4`.
2. **Overlapping minimal intervals**: Some substrings may share characters, like `"abcdea"` and `"cdefa"`. Greedy selection by end index ensures only the maximal count of non-overlapping substrings is chosen.
3. **Repeated single character blocks**: For strings like `"aaaaaa"`, the algorithm correctly identifies all possible minimal intervals but only selects non-overlapping ones, avoiding double counting while maximizing the number of substrings.
| `"abcdeafdef"` | standard multi-interval selection |
| `"bcdaaaab"` | overlapping intervals force exclusion |
| `"aaaa"` | smallest valid single interval |
| `"ab"` | length too small |
| `"abcdabcd"` | multiple disjoint same-character matches |
| `"aabbccddaa"` | mixed independent segments |
| `"abcabcabcabc"` | dense repetition stress test |
## Edge Cases
One important edge case is when the string has no repeated character at a distance of at least three positions. In this case, the position lists exist but binary search fails for every index, producing zero intervals and correctly returning zero.
Another edge case occurs when multiple valid endpoints exist for a given starting index. Without selecting the earliest valid endpoint, a naive approach may construct unnecessarily long intervals that block optimal future choices. The implementation avoids this by always selecting the first feasible index via binary search.
A final edge case arises in highly repetitive strings such as `"aaaaa..."`, where nearly every index can form a valid interval. The greedy algorithm ensures correctness by always selecting the earliest finishing interval, preventing pathological nesting from inflating the count incorrectly.