LeetCode 28 - Find the Index of the First Occurrence in a String

This problem asks us to locate the first occurrence of one string inside another string. The string we are searching for is called needle, and the larger string we search inside is called haystack.

LeetCode Problem 28

Difficulty: 🟢 Easy
Topics: Two Pointers, String, String Matching

Solution

Problem Understanding

This problem asks us to locate the first occurrence of one string inside another string. The string we are searching for is called needle, and the larger string we search inside is called haystack.

More formally, we need to determine whether needle appears as a contiguous substring inside haystack. If it does, we return the starting index of its first appearance. If it does not exist anywhere in haystack, we return -1.

For example, if haystack = "sadbutsad" and needle = "sad", the substring "sad" appears twice, once starting at index 0 and again at index 6. Since the problem asks for the first occurrence, the correct answer is 0.

The input consists of two lowercase English strings. The constraints tell us that both strings have lengths between 1 and 10^4. This is important because it means even a quadratic solution may still pass in practice, but we should still aim for a more efficient and scalable approach.

Several edge cases are important to think about before implementing the solution.

If needle is longer than haystack, it is impossible for a match to exist. If both strings are identical, the answer should be 0. If the matching substring appears multiple times, we must return only the first occurrence. Another important case is overlapping patterns, such as searching for "aaa" inside "aaaaa". A naive implementation can easily make indexing mistakes in such scenarios.

Approaches

Brute Force Approach

The most straightforward solution is to check every possible starting position in haystack.

At each index i, we compare characters one by one between haystack[i:] and needle. If all characters match, we return i. If a mismatch occurs, we move to the next starting position and try again.

This approach is correct because every possible alignment of needle inside haystack is examined. If a valid occurrence exists, the algorithm will eventually find it.

However, the downside is efficiency. In the worst case, for every starting position, we may compare almost the entire needle before discovering a mismatch. This leads to a worst case time complexity of O(n * m), where:

  • n is the length of haystack
  • m is the length of needle

For the given constraints, this is still acceptable, but there is a more elegant linear-time solution.

Optimal Approach, Knuth-Morris-Pratt (KMP)

The key insight behind KMP is that when a mismatch occurs, we do not need to restart matching from scratch.

Suppose we have already matched several characters successfully. Some portion of that matched prefix may also be a suffix. Instead of rechecking characters we already know match, we can reuse previous information to skip unnecessary comparisons.

KMP preprocesses the needle to build an array called the Longest Prefix Suffix array, commonly called the LPS array. For each position, the array stores the length of the longest proper prefix that is also a suffix.

This preprocessing allows the search phase to move intelligently after mismatches, ensuring that each character is processed at most a constant number of times.

As a result, KMP achieves linear time complexity.

Approach Time Complexity Space Complexity Notes
Brute Force O(n * m) O(1) Tries every starting position and compares characters directly
Optimal, KMP O(n + m) O(m) Uses prefix information to avoid redundant comparisons

Algorithm Walkthrough

Step 1: Build the LPS Array

The KMP algorithm begins by preprocessing the needle.

The LPS array stores, for each position, the length of the longest proper prefix that is also a suffix for the substring ending at that position.

For example, for the pattern "ababaca":

Index Character LPS
0 a 0
1 b 0
2 a 1
3 b 2
4 a 3
5 c 0
6 a 1

This preprocessing tells us how far we can shift the pattern after a mismatch without losing useful information.

Step 2: Initialize Search Pointers

We use two pointers:

  1. One pointer i for haystack
  2. One pointer j for needle

Initially, both start at 0.

Step 3: Compare Characters

At each step:

  1. If haystack[i] == needle[j], both pointers move forward.
  2. If j reaches the end of needle, we found a complete match, so we return i - j.

Step 4: Handle Mismatches Efficiently

When characters do not match:

  1. If j > 0, we do not restart from the beginning.
  2. Instead, we set j = lps[j - 1].

This moves the pattern pointer to the longest valid prefix position.

If j == 0, there is no reusable prefix, so we simply move i forward.

Step 5: Return -1 if No Match Exists

If we finish scanning the entire haystack without fully matching needle, the substring does not exist, so we return -1.

Why it works

The correctness of KMP comes from the invariant that all characters before the current pointers have already been processed consistently. The LPS array guarantees that after a mismatch, we only jump to positions that could still produce a valid match. No possible occurrence is skipped, and no character in haystack is unnecessarily rechecked multiple times. This ensures both correctness and linear efficiency.

Python Solution

class Solution:
    def strStr(self, haystack: str, needle: str) -> int:
        def build_lps(pattern: str) -> list[int]:
            lps = [0] * len(pattern)

            length = 0
            i = 1

            while i < len(pattern):
                if pattern[i] == pattern[length]:
                    length += 1
                    lps[i] = length
                    i += 1
                else:
                    if length != 0:
                        length = lps[length - 1]
                    else:
                        lps[i] = 0
                        i += 1

            return lps

        lps = build_lps(needle)

        haystack_index = 0
        needle_index = 0

        while haystack_index < len(haystack):
            if haystack[haystack_index] == needle[needle_index]:
                haystack_index += 1
                needle_index += 1

                if needle_index == len(needle):
                    return haystack_index - needle_index
            else:
                if needle_index != 0:
                    needle_index = lps[needle_index - 1]
                else:
                    haystack_index += 1

        return -1

The implementation is divided into two logical phases.

The first phase builds the LPS array. The variable length tracks the current longest prefix-suffix length, while i scans through the pattern. When characters match, the prefix length grows. When they do not match, we fall back using previously computed LPS values instead of restarting entirely.

The second phase performs the actual substring search. The two indices move through haystack and needle. Matching characters advance both pointers. A complete match occurs when needle_index reaches the end of the pattern.

The important optimization happens during mismatches. Instead of restarting the pattern comparison from index 0, the algorithm reuses previously matched prefix information from the LPS array.

Go Solution

func strStr(haystack string, needle string) int {
	buildLPS := func(pattern string) []int {
		lps := make([]int, len(pattern))

		length := 0
		i := 1

		for i < len(pattern) {
			if pattern[i] == pattern[length] {
				length++
				lps[i] = length
				i++
			} else {
				if length != 0 {
					length = lps[length-1]
				} else {
					lps[i] = 0
					i++
				}
			}
		}

		return lps
	}

	lps := buildLPS(needle)

	haystackIndex := 0
	needleIndex := 0

	for haystackIndex < len(haystack) {
		if haystack[haystackIndex] == needle[needleIndex] {
			haystackIndex++
			needleIndex++

			if needleIndex == len(needle) {
				return haystackIndex - needleIndex
			}
		} else {
			if needleIndex != 0 {
				needleIndex = lps[needleIndex-1]
			} else {
				haystackIndex++
			}
		}
	}

	return -1
}

The Go implementation follows the same logic as the Python version. Since Go strings are byte indexed, direct indexing works correctly here because the problem guarantees lowercase English letters only.

The LPS array is implemented using a slice of integers. Go does not have Python style nested helper functions with type inference in exactly the same way, so the helper function is assigned to a variable using an anonymous function.

No special handling for integer overflow is needed because all indices remain within the problem constraints.

Worked Examples

Example 1

Input:

haystack = "sadbutsad"
needle = "sad"

First, construct the LPS array.

Index Character LPS
0 s 0
1 a 0
2 d 0

Now perform the search.

Step haystack_index needle_index Compared Characters Action
1 0 0 s vs s Match
2 1 1 a vs a Match
3 2 2 d vs d Match

At this point, needle_index == len(needle).

The starting index is:

3 - 3 = 0

So the algorithm returns:

0

Example 2

Input:

haystack = "leetcode"
needle = "leeto"

Construct the LPS array.

Index Character LPS
0 l 0
1 e 0
2 e 0
3 t 0
4 o 0

Now search through the string.

Step haystack_index needle_index Compared Characters Action
1 0 0 l vs l Match
2 1 1 e vs e Match
3 2 2 e vs e Match
4 3 3 t vs t Match
5 4 4 c vs o Mismatch

Since needle_index > 0, we fall back using the LPS array.

Eventually, no valid match exists, and the algorithm returns:

-1

Complexity Analysis

Measure Complexity Explanation
Time O(n + m) Each character in both strings is processed at most a constant number of times
Space O(m) The LPS array stores one value per character in needle

The preprocessing phase for the LPS array takes O(m) time, where m is the length of needle. The search phase scans through haystack in O(n) time. Since the algorithm never backtracks through the text, the total complexity remains linear.

The extra memory usage comes entirely from the LPS array.

Test Cases

solution = Solution()

assert solution.strStr("sadbutsad", "sad") == 0  # basic match at beginning
assert solution.strStr("leetcode", "leeto") == -1  # substring does not exist
assert solution.strStr("hello", "ll") == 2  # match in middle
assert solution.strStr("aaaaa", "bba") == -1  # completely absent pattern
assert solution.strStr("aaaaa", "aaa") == 0  # overlapping matches
assert solution.strStr("abc", "abc") == 0  # entire string match
assert solution.strStr("mississippi", "issip") == 4  # repeated prefixes
assert solution.strStr("a", "a") == 0  # single character match
assert solution.strStr("abc", "c") == 2  # match at end
assert solution.strStr("ababababca", "abababca") == 2  # complex prefix reuse
assert solution.strStr("aaaab", "aaab") == 1  # partial fallback behavior
assert solution.strStr("abcd", "e") == -1  # single character absent
Test Why
"sadbutsad", "sad" Verifies basic successful match
"leetcode", "leeto" Verifies missing substring
"hello", "ll" Tests middle substring detection
"aaaaa", "bba" Tests complete mismatch
"aaaaa", "aaa" Tests overlapping matches
"abc", "abc" Tests full string equality
"mississippi", "issip" Tests repeated prefix patterns
"a", "a" Smallest valid matching input
"abc", "c" Tests match at final position
"ababababca", "abababca" Tests efficient KMP fallback
"aaaab", "aaab" Tests repeated character fallback logic
"abcd", "e" Tests single character absence

Edge Cases

Needle Longer Than Haystack

If needle is longer than haystack, no match is possible because a larger string cannot fit inside a smaller one. Some implementations accidentally access out of bounds indices in this scenario. The KMP implementation handles this naturally because the search loop never reaches a complete match.

Overlapping Patterns

Patterns such as "aaa" inside "aaaaa" are tricky because matches overlap heavily. A naive implementation may restart too aggressively and miss valid occurrences. KMP handles this correctly using the LPS array, which preserves partial match information and avoids redundant work.

Repeated Prefixes in the Pattern

Patterns like "ababaca" contain repeated prefix structures. These are exactly the situations where KMP provides the greatest benefit. Incorrect LPS construction can easily cause skipped matches or infinite loops. The implementation carefully falls back using previously computed prefix lengths, ensuring correctness.

Match at the Very End

Cases such as searching for "c" inside "abc" can expose off by one indexing bugs. The implementation checks for a complete match immediately after incrementing the pattern pointer, ensuring that matches ending at the final character are detected correctly.