LeetCode 187 - Repeated DNA Sequences

This problem asks us to find all repeated DNA subsequences of a fixed length, specifically length 10, inside a given DNA string. The input is a string s, where each character represents a DNA nucleotide.

LeetCode Problem 187

Difficulty: 🟡 Medium
Topics: Hash Table, String, Bit Manipulation, Sliding Window, Rolling Hash, Hash Function

Solution

Problem Understanding

This problem asks us to find all repeated DNA subsequences of a fixed length, specifically length 10, inside a given DNA string.

The input is a string s, where each character represents a DNA nucleotide. The string contains only four possible characters:

  • 'A'
  • 'C'
  • 'G'
  • 'T'

We must examine every substring of length 10 and determine which ones appear more than once in the DNA sequence. The output should contain every repeated 10-character sequence exactly once, regardless of how many times it appears.

For example, if a substring appears three or four times, it should still appear only once in the final result.

The key detail is that we are looking for fixed-length windows of size 10, not arbitrary repeated substrings. This immediately suggests a sliding window style approach because we repeatedly examine overlapping substrings of the same length.

The constraints are important:

  • 1 <= s.length <= 10^5
  • Only four possible characters exist

A string length of 10^5 means we cannot afford expensive repeated substring comparisons or nested scanning approaches. We need something close to linear time.

Several edge cases stand out immediately:

If s.length < 10, there are no valid 10-character substrings, so the answer must be empty.

If the same sequence repeats many times, we should include it only once in the result. For example, "AAAAAAAAAAAAA" contains multiple overlapping occurrences of "AAAAAAAAAA", but the output contains only one copy.

Overlapping repetitions are valid. A repeated sequence does not need to start at non-overlapping positions.

Approaches

Brute Force Approach

The simplest way to solve this problem is to generate every possible substring of length 10 and count how many times each appears.

We slide through the string one character at a time. For each position i, we extract:

s[i:i+10]

We store each substring in a hash map where the key is the sequence and the value is its frequency.

After processing all windows, we iterate through the frequency map and collect every substring whose count is greater than 1.

This approach is correct because every possible 10-character sequence is explicitly counted. If a sequence appears multiple times, its frequency will exceed 1.

However, substring creation costs time. Since each substring has length 10, this technically becomes O(10 × n), which simplifies to O(n) because 10 is constant. In practice, this solution is already efficient enough for the problem constraints.

Still, we can do better conceptually by exploiting the fact that DNA only contains four characters.

Key Insight for an Optimal Solution

Since DNA characters come from only four possible values, we can encode each nucleotide using 2 bits:

Character Binary
A 00
C 01
G 10
T 11

A sequence of length 10 therefore requires only:

10 × 2 = 20 bits

This means every 10-character DNA substring can be represented as a single integer.

Instead of storing actual substrings in a hash set, we can maintain a rolling hash encoded as bits. As the sliding window moves:

  • Shift left by 2 bits
  • Add the new nucleotide
  • Remove bits that fall outside the 20-bit window

This allows us to represent every window in constant time without repeatedly constructing strings.

We then track:

  • Encodings we have seen before
  • Encodings already added to the answer

When we encounter an encoding a second time, we append the actual substring to the result.

Approach Comparison

Approach Time Complexity Space Complexity Notes
Brute Force O(n) O(n) Stores actual substrings and counts frequencies
Optimal O(n) O(n) Uses rolling hash with bit encoding for constant time window updates

Algorithm Walkthrough

Optimal Algorithm, Rolling Hash with Bit Manipulation

  1. First, check whether the input length is smaller than 10.

If len(s) < 10, return an empty list immediately because no valid substring exists. 2. Create a mapping for DNA characters.

Since only four nucleotides exist, encode them using two bits:

A → 00
C → 01
G → 10
T → 11

This compact representation allows efficient rolling updates. 3. Initialize data structures.

Create:

  • A set called seen to track hashes already encountered
  • A set called repeated to avoid duplicate insertions
  • A result list for final answers
  1. Build the rolling hash.

As we iterate through the string:

  • Shift the current hash left by 2
  • Add the encoded value of the new character
  • Keep only the last 20 bits

The mask:

(1 << 20) - 1

ensures we discard older bits outside the current 10-character window. 5. Start processing once the first 10 characters are available.

We cannot form a valid sequence before index 9.

For each valid window:

  • If the encoded sequence already exists in seen, it means this DNA substring has appeared before.
  • If it has not yet been added to repeated, append the substring to the result and mark it as repeated.
  • Otherwise, add the encoding to seen.
  1. Continue until the end of the string.

The result list will contain all repeated 10-character DNA sequences.

Why it works

The algorithm works because every 10-character DNA sequence maps to a unique 20-bit integer. Since each character uses exactly 2 bits and only four characters exist, no collisions occur. The rolling hash always represents the current window of length 10, and the seen set guarantees we detect repeated sequences. The repeated set ensures each repeated sequence is added to the answer exactly once.

Python Solution

from typing import List

class Solution:
    def findRepeatedDnaSequences(self, s: str) -> List[str]:
        sequence_length = 10

        if len(s) < sequence_length:
            return []

        char_to_bits = {
            'A': 0,
            'C': 1,
            'G': 2,
            'T': 3
        }

        seen = set()
        repeated = set()
        result = []

        rolling_hash = 0
        mask = (1 << (2 * sequence_length)) - 1

        for index, char in enumerate(s):
            rolling_hash = (
                (rolling_hash << 2) | char_to_bits[char]
            ) & mask

            if index >= sequence_length - 1:
                if rolling_hash in seen:
                    if rolling_hash not in repeated:
                        result.append(
                            s[index - sequence_length + 1:index + 1]
                        )
                        repeated.add(rolling_hash)
                else:
                    seen.add(rolling_hash)

        return result

The implementation begins with an early return for strings shorter than 10, because such inputs cannot contain valid DNA sequences.

The char_to_bits dictionary converts DNA characters into compact 2-bit representations. This encoding is crucial because it enables us to store an entire 10-character sequence inside a single integer.

The variable rolling_hash stores the current encoded window. At each iteration, we shift the previous value left by two bits and append the new nucleotide encoding. The mask ensures only the most recent 20 bits remain.

The seen set stores hashes already encountered. If the current hash already exists, we know the sequence has appeared before. The repeated set prevents duplicates from being appended to the answer multiple times.

Finally, the substring corresponding to the repeated window is extracted and added to result.

Go Solution

func findRepeatedDnaSequences(s string) []string {
	const sequenceLength = 10

	if len(s) < sequenceLength {
		return []string{}
	}

	charToBits := map[byte]int{
		'A': 0,
		'C': 1,
		'G': 2,
		'T': 3,
	}

	seen := make(map[int]bool)
	repeated := make(map[int]bool)
	result := []string{}

	rollingHash := 0
	mask := (1 << (2 * sequenceLength)) - 1

	for i := 0; i < len(s); i++ {
		rollingHash = ((rollingHash << 2) | charToBits[s[i]]) & mask

		if i >= sequenceLength-1 {
			if seen[rollingHash] {
				if !repeated[rollingHash] {
					result = append(
						result,
						s[i-sequenceLength+1:i+1],
					)
					repeated[rollingHash] = true
				}
			} else {
				seen[rollingHash] = true
			}
		}
	}

	return result
}

The Go implementation follows exactly the same logic as the Python version.

One notable difference is that Go strings are indexed using byte, so the nucleotide encoding map uses map[byte]int. Instead of Python sets, Go uses map[int]bool to simulate membership checking.

Go does not distinguish between nil and empty slices in LeetCode submissions for this problem, but returning an empty slice is more explicit and consistent.

Integer overflow is not a concern because the rolling hash uses only 20 bits.

Worked Examples

Example 1

Input

s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"

We examine every 10-character window.

Index Range Substring Seen Before? Result
0-9 AAAAACCCCC No []
1-10 AAAACCCCCA No []
2-11 AAACCCCCAA No []
... ... ... ...
10-19 AAAAACCCCC Yes ["AAAAACCCCC"]
16-25 CCCCCAAAAA Yes ["AAAAACCCCC", "CCCCCAAAAA"]

Final output:

["AAAAACCCCC", "CCCCCAAAAA"]

The important observation is that both repeated sequences occur multiple times and are added only once.

Example 2

Input

s = "AAAAAAAAAAAAA"

The windows are:

Index Range Substring
0-9 AAAAAAAAAA
1-10 AAAAAAAAAA
2-11 AAAAAAAAAA
3-12 AAAAAAAAAA

Processing:

Step Window Seen Repeated Result
1 AAAAAAAAAA Add No []
2 AAAAAAAAAA Yes Add ["AAAAAAAAAA"]
3 AAAAAAAAAA Yes Already Exists ["AAAAAAAAAA"]

Final output:

["AAAAAAAAAA"]

Complexity Analysis

Measure Complexity Explanation
Time O(n) Each character is processed once with constant time rolling updates
Space O(n) Hash sets may store up to O(n) unique sequences

The time complexity is linear because we scan the string exactly once. Each iteration performs constant time work, shifting bits, masking, and hash set lookups.

The space complexity is O(n) in the worst case because every 10-character substring may be unique and stored in the hash sets.

Test Cases

solution = Solution()

# Example 1
assert sorted(solution.findRepeatedDnaSequences(
    "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
)) == sorted(["AAAAACCCCC", "CCCCCAAAAA"])

# Example 2
assert solution.findRepeatedDnaSequences(
    "AAAAAAAAAAAAA"
) == ["AAAAAAAAAA"]  # overlapping repetition

# Length smaller than 10
assert solution.findRepeatedDnaSequences(
    "ACGT"
) == []  # no valid windows

# Length exactly 10, no repetition
assert solution.findRepeatedDnaSequences(
    "ACGTACGTAC"
) == []  # single valid sequence

# No repeated sequences
assert solution.findRepeatedDnaSequences(
    "ACGTACGTAG"
) == []  # all unique

# One repeated sequence multiple times
assert solution.findRepeatedDnaSequences(
    "AAAAAAAAAAAAAA"
) == ["AAAAAAAAAA"]  # repeated many times

# Multiple overlapping repeated sequences
assert sorted(solution.findRepeatedDnaSequences(
    "AAAAAAAAAAA"
)) == sorted(["AAAAAAAAAA"])  # overlap handling

# Mixed repeated patterns
assert sorted(solution.findRepeatedDnaSequences(
    "CCCCCAAAAACCCCCAAAAACCCCC"
)) == sorted([
    "CCCCCAAAAA",
    "CCCCAAAAAC",
    "CCCAAAAACC",
    "CCAAAAACCC",
    "CAAAAACCCC",
    "AAAAACCCCC"
])  # multiple overlaps

Test Summary

Test Why
Example 1 Validates multiple repeated sequences
Example 2 Validates overlapping repetition
Length smaller than 10 Ensures early return works
Length exactly 10 Ensures single window case works
No repeated sequences Confirms empty output when nothing repeats
Same sequence repeated many times Ensures no duplicate entries
Overlapping repeats Verifies overlapping windows count
Mixed repeated patterns Stresses repeated detection

Edge Cases

Input Length Smaller Than 10

If the input string has fewer than 10 characters, there are no valid DNA windows to examine. A naive implementation might accidentally attempt substring extraction and cause indexing issues. This implementation prevents that with an immediate early return.

Overlapping Repeated Sequences

Repeated DNA sequences can overlap. For example:

"AAAAAAAAAAAAA"

contains many overlapping "AAAAAAAAAA" windows. A buggy implementation may accidentally ignore overlaps by moving too far ahead after detecting a repeat. Since this solution processes every adjacent window, overlapping repetitions are handled naturally.

Repeated Sequence Appearing Many Times

A sequence may appear more than twice. For example:

"AAAAAAAAAAAAAA"

contains the same sequence repeatedly. Without the repeated set, the algorithm could add the same substring multiple times to the result. The implementation avoids this by recording sequences already included in the answer.

All Unique Sequences

Some inputs contain no repeated 10-character substrings at all. In these cases, the algorithm correctly returns an empty list because no rolling hash is encountered twice.