LeetCode 3303 - Find the Occurrence of First Almost Equal Substring

We are given two strings, s and pattern. The task is to find the smallest starting index in s such that the substring of length len(pattern) is "almost equal" to pattern.

LeetCode Problem 3303

Difficulty: 🔴 Hard
Topics: String, String Matching

Solution

Problem Understanding

We are given two strings, s and pattern. The task is to find the smallest starting index in s such that the substring of length len(pattern) is "almost equal" to pattern.

Two strings are considered almost equal if we can modify at most one character in one string to make the two strings identical. That means:

  • The substring may already be exactly equal to pattern
  • The substring may differ in exactly one position
  • If it differs in two or more positions, it is invalid

Suppose the pattern length is m. For every valid starting position i in s, we examine the substring:

s[i : i + m]

and determine whether its Hamming distance from pattern is at most 1.

The goal is to return the smallest valid index. If no such substring exists, we return -1.

The constraints are extremely important:

  • 1 <= pattern.length < s.length <= 10^5

A naive comparison of every substring against the pattern would require comparing up to m characters for each starting position. Since there are roughly n starting positions, that leads to O(n * m) time complexity, which becomes too slow when both strings are close to 10^5.

This means we need a near linear time string matching solution.

Several edge cases are important:

  • The substring may already exactly match the pattern
  • The mismatch may occur at the beginning, middle, or end
  • The pattern may have length 1
  • There may be no valid substring at all
  • Multiple valid substrings may exist, but we must return the smallest index

The key challenge is efficiently determining whether two strings differ in at most one position.

Approaches

Brute Force Approach

The simplest solution is to check every possible substring of length m in s.

For each starting index i:

  1. Compare s[i + j] with pattern[j]
  2. Count mismatches
  3. If mismatches exceed 1, stop early
  4. Otherwise, return i

This approach is correct because it explicitly checks the exact condition required by the problem.

However, its worst case complexity is too large. There are O(n) substrings, and each comparison may require O(m) work. Therefore the total complexity becomes O(n * m).

With input sizes up to 10^5, this can reach 10^10 operations, which is far beyond acceptable limits.

Optimal Approach

The main insight is that two strings differ in at most one position if and only if:

  • Their matching prefix before the mismatch is identical
  • Their matching suffix after the mismatch is identical

Suppose a mismatch occurs at position k.

Then:

substring[0:k] == pattern[0:k]
substring[k+1:] == pattern[k+1:]

So instead of directly comparing every substring character by character, we can precompute:

  • The longest prefix match length at every alignment
  • The longest suffix match length at every alignment

Using the Z algorithm, we can compute these efficiently in linear time.

For each starting position i:

  • Let leftMatch be the number of matching characters from the front
  • Let rightMatch be the number of matching characters from the back

If:

leftMatch + rightMatch >= m - 1

then the substring differs in at most one position.

This works because all positions except possibly one are matched.

Approach Comparison

Approach Time Complexity Space Complexity Notes
Brute Force O(n * m) O(1) Compares every substring directly
Optimal O(n + m) O(n + m) Uses Z algorithm for prefix and suffix matching

Algorithm Walkthrough

Step 1: Compute prefix matches using the Z algorithm

We build the string:

pattern + "#" + s

Then we compute its Z array.

The Z value at a position tells us how many characters match the prefix of the entire string.

Since the prefix is pattern, the Z values corresponding to positions inside s tell us how many characters match between:

pattern
and
s[i:]

This gives the longest matching prefix length for every alignment.

Step 2: Compute suffix matches

To compute matching suffixes efficiently, we reverse both strings:

reverse(pattern)
reverse(s)

Then we repeat the same Z algorithm process.

This lets us determine how many characters match from the end.

Step 3: Evaluate every possible starting position

For every valid index i:

  • prefix[i] tells us how many characters match from the front
  • suffix[i] tells us how many characters match from the back

If:

prefix[i] + suffix[i] >= m - 1

then all characters except possibly one match.

We immediately return the first valid index.

Step 4: Return -1 if no valid substring exists

If we finish scanning all positions without finding a valid substring, then no answer exists.

Why it works

The algorithm relies on the fact that a substring and the pattern can differ in at most one position only if all remaining positions match.

Suppose the mismatch occurs at index k.

Then:

  • The prefix before k must match
  • The suffix after k must match

The total number of matched positions becomes:

k + (m - k - 1) = m - 1

Therefore checking whether:

prefixMatch + suffixMatch >= m - 1

is sufficient and necessary.

The Z algorithm allows these prefix and suffix matches to be computed in linear time.

Python Solution

class Solution:
    def minStartingIndex(self, s: str, pattern: str) -> int:
        n = len(s)
        m = len(pattern)

        def z_function(text: str) -> list[int]:
            length = len(text)
            z = [0] * length

            left = 0
            right = 0

            for i in range(1, length):
                if i <= right:
                    z[i] = min(right - i + 1, z[i - left])

                while (
                    i + z[i] < length
                    and text[z[i]] == text[i + z[i]]
                ):
                    z[i] += 1

                if i + z[i] - 1 > right:
                    left = i
                    right = i + z[i] - 1

            return z

        forward = pattern + "#" + s
        z_forward = z_function(forward)

        prefix_match = [0] * (n - m + 1)

        for i in range(n - m + 1):
            prefix_match[i] = min(m, z_forward[m + 1 + i])

        reversed_pattern = pattern[::-1]
        reversed_s = s[::-1]

        backward = reversed_pattern + "#" + reversed_s
        z_backward = z_function(backward)

        suffix_match = [0] * (n - m + 1)

        for i in range(n - m + 1):
            reversed_index = n - (i + m)
            suffix_match[i] = min(
                m,
                z_backward[m + 1 + reversed_index]
            )

        for i in range(n - m + 1):
            if prefix_match[i] + suffix_match[i] >= m - 1:
                return i

        return -1

The implementation begins with a standard Z algorithm helper function. The Z function computes, for every position, the longest substring starting there that matches the prefix of the string.

Next, we concatenate:

pattern + "#" + s

This structure ensures that any match length computed inside the s portion corresponds directly to a prefix match against pattern.

The same process is repeated with reversed strings to compute suffix matches.

The arrays prefix_match and suffix_match store, for each starting position in s, how many characters match from the front and back respectively.

Finally, we scan every possible substring position. If the combined matched characters cover at least m - 1 positions, then the substring differs in at most one place, so we return that index.

Go Solution

func minStartingIndex(s string, pattern string) int {
	n := len(s)
	m := len(pattern)

	zFunction := func(text string) []int {
		length := len(text)
		z := make([]int, length)

		left, right := 0, 0

		for i := 1; i < length; i++ {
			if i <= right {
				if z[i-left] < right-i+1 {
					z[i] = z[i-left]
				} else {
					z[i] = right - i + 1
				}
			}

			for i+z[i] < length &&
				text[z[i]] == text[i+z[i]] {
				z[i]++
			}

			if i+z[i]-1 > right {
				left = i
				right = i + z[i] - 1
			}
		}

		return z
	}

	forward := pattern + "#" + s
	zForward := zFunction(forward)

	prefixMatch := make([]int, n-m+1)

	for i := 0; i <= n-m; i++ {
		value := zForward[m+1+i]
		if value > m {
			value = m
		}
		prefixMatch[i] = value
	}

	reversedPattern := reverseString(pattern)
	reversedS := reverseString(s)

	backward := reversedPattern + "#" + reversedS
	zBackward := zFunction(backward)

	suffixMatch := make([]int, n-m+1)

	for i := 0; i <= n-m; i++ {
		reversedIndex := n - (i + m)

		value := zBackward[m+1+reversedIndex]
		if value > m {
			value = m
		}

		suffixMatch[i] = value
	}

	for i := 0; i <= n-m; i++ {
		if prefixMatch[i]+suffixMatch[i] >= m-1 {
			return i
		}
	}

	return -1
}

func reverseString(s string) string {
	bytes := []byte(s)

	left := 0
	right := len(bytes) - 1

	for left < right {
		bytes[left], bytes[right] = bytes[right], bytes[left]
		left++
		right--
	}

	return string(bytes)
}

The Go implementation closely mirrors the Python version. Since Go strings are immutable, reversing requires converting the string into a byte slice.

The Z algorithm implementation uses slices instead of Python lists, but the logic remains identical.

Because the input contains only lowercase English letters, using byte indexing is completely safe and avoids Unicode complications.

Worked Examples

Example 1

s = "abcdefg"
pattern = "bcdffg"

Pattern length:

m = 6

Possible substrings:

Index Substring Differences
0 abcdef many
1 bcdefg 1
2 cdefg invalid length

At index 1:

bcdefg
bcdffg
    ^

Only one mismatch exists.

Prefix matches:

Position Match Length
1 3

Suffix matches:

Position Match Length
1 2

Total:

3 + 2 = 5
m - 1 = 5

Valid answer:

1

Example 2

s = "ababbababa"
pattern = "bacaba"

Substring at index 4:

bababa
bacaba
  ^

Exactly one mismatch.

Prefix match length:

2

Suffix match length:

3

Total:

5 >= 5

Return:

4

Example 3

s = "abcd"
pattern = "dba"

Substrings:

Index Substring Differences
0 abc 3
1 bcd 3

No valid substring exists.

Return:

-1

Example 4

s = "dde"
pattern = "d"

Substring at index 0:

d
d

Exact match.

Return:

0

Complexity Analysis

Measure Complexity Explanation
Time O(n + m) Two Z algorithm runs plus one linear scan
Space O(n + m) Storage for Z arrays and match arrays

The Z algorithm runs in linear time because each character is processed at most a constant number of times while maintaining the current matching interval.

We execute the Z algorithm twice:

  • Once for prefix matches
  • Once for suffix matches

All remaining work is linear, so the total complexity remains O(n + m).

Test Cases

sol = Solution()

assert sol.minStartingIndex("abcdefg", "bcdffg") == 1  # one mismatch
assert sol.minStartingIndex("ababbababa", "bacaba") == 4  # mismatch in middle
assert sol.minStartingIndex("abcd", "dba") == -1  # no valid substring
assert sol.minStartingIndex("dde", "d") == 0  # single character exact match

assert sol.minStartingIndex("aaaaa", "aaa") == 0  # many exact matches
assert sol.minStartingIndex("aaaaa", "aab") == 0  # one mismatch at end
assert sol.minStartingIndex("xyzabc", "abc") == 3  # exact match at end
assert sol.minStartingIndex("xyzabc", "abd") == 3  # one mismatch at end
assert sol.minStartingIndex("xyzabc", "zzz") == -1  # too many mismatches

assert sol.minStartingIndex("abcde", "xbcde") == 0  # mismatch at beginning
assert sol.minStartingIndex("abcde", "abxde") == 0  # mismatch in middle
assert sol.minStartingIndex("abcde", "abcdx") == 0  # mismatch at end

assert sol.minStartingIndex("bbbbbbbb", "bbbb") == 0  # repeated characters
assert sol.minStartingIndex("abababab", "abac") == 0  # periodic structure

assert sol.minStartingIndex("qwerty", "asdfg") == -1  # completely different

assert sol.minStartingIndex("aaab", "aabb") == 0  # exactly one mismatch
assert sol.minStartingIndex("aaab", "bbbb") == -1  # too many mismatches
Test Why
"abcdefg", "bcdffg" Standard one mismatch case
"ababbababa", "bacaba" Mismatch in middle
"abcd", "dba" No valid substring
"dde", "d" Single character pattern
"aaaaa", "aaa" Multiple exact matches
"aaaaa", "aab" One mismatch near end
"xyzabc", "abc" Match at final valid index
"xyzabc", "abd" One mismatch at final valid index
"xyzabc", "zzz" No overlap at all
"abcde", "xbcde" Mismatch at first character
"abcde", "abxde" Mismatch in center
"abcde", "abcdx" Mismatch at last character
"bbbbbbbb", "bbbb" Repeated characters
"abababab", "abac" Alternating pattern
"qwerty", "asdfg" Completely different strings
"aaab", "aabb" Exactly one mismatch
"aaab", "bbbb" More than one mismatch

Edge Cases

One important edge case occurs when the substring already exactly matches the pattern. Since the problem allows changing at most one character, zero mismatches is completely valid. The implementation handles this naturally because exact matches produce:

prefix_match + suffix_match >= m

which certainly satisfies the required condition.

Another important edge case is when the mismatch occurs at the very beginning or very end of the pattern. Some implementations incorrectly assume both prefix and suffix lengths must be positive. In this solution, either side may contribute zero characters, and the combined total still correctly determines validity.

A particularly tricky case involves patterns of length 1. In this situation, every substring of length 1 differs in at most one position, because a single replacement can transform any character into any other character. The condition:

prefix_match + suffix_match >= m - 1

becomes:

>= 0

which is always true, correctly returning index 0.

Repeated character strings are another common source of bugs for string matching algorithms. The Z algorithm handles repeated prefixes correctly because it carefully maintains the current matching interval and reuses previously computed information without missing overlaps.