LeetCode 187 - Repeated DNA Sequences
This problem asks us to find all repeated DNA subsequences of a fixed length, specifically length 10, inside a given DNA string. The input is a string s, where each character represents a DNA nucleotide.
Difficulty: 🟡 Medium
Topics: Hash Table, String, Bit Manipulation, Sliding Window, Rolling Hash, Hash Function
Solution
Problem Understanding
This problem asks us to find all repeated DNA subsequences of a fixed length, specifically length 10, inside a given DNA string.
The input is a string s, where each character represents a DNA nucleotide. The string contains only four possible characters:
'A''C''G''T'
We must examine every substring of length 10 and determine which ones appear more than once in the DNA sequence. The output should contain every repeated 10-character sequence exactly once, regardless of how many times it appears.
For example, if a substring appears three or four times, it should still appear only once in the final result.
The key detail is that we are looking for fixed-length windows of size 10, not arbitrary repeated substrings. This immediately suggests a sliding window style approach because we repeatedly examine overlapping substrings of the same length.
The constraints are important:
1 <= s.length <= 10^5- Only four possible characters exist
A string length of 10^5 means we cannot afford expensive repeated substring comparisons or nested scanning approaches. We need something close to linear time.
Several edge cases stand out immediately:
If s.length < 10, there are no valid 10-character substrings, so the answer must be empty.
If the same sequence repeats many times, we should include it only once in the result. For example, "AAAAAAAAAAAAA" contains multiple overlapping occurrences of "AAAAAAAAAA", but the output contains only one copy.
Overlapping repetitions are valid. A repeated sequence does not need to start at non-overlapping positions.
Approaches
Brute Force Approach
The simplest way to solve this problem is to generate every possible substring of length 10 and count how many times each appears.
We slide through the string one character at a time. For each position i, we extract:
s[i:i+10]
We store each substring in a hash map where the key is the sequence and the value is its frequency.
After processing all windows, we iterate through the frequency map and collect every substring whose count is greater than 1.
This approach is correct because every possible 10-character sequence is explicitly counted. If a sequence appears multiple times, its frequency will exceed 1.
However, substring creation costs time. Since each substring has length 10, this technically becomes O(10 × n), which simplifies to O(n) because 10 is constant. In practice, this solution is already efficient enough for the problem constraints.
Still, we can do better conceptually by exploiting the fact that DNA only contains four characters.
Key Insight for an Optimal Solution
Since DNA characters come from only four possible values, we can encode each nucleotide using 2 bits:
| Character | Binary |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |
A sequence of length 10 therefore requires only:
10 × 2 = 20 bits
This means every 10-character DNA substring can be represented as a single integer.
Instead of storing actual substrings in a hash set, we can maintain a rolling hash encoded as bits. As the sliding window moves:
- Shift left by 2 bits
- Add the new nucleotide
- Remove bits that fall outside the 20-bit window
This allows us to represent every window in constant time without repeatedly constructing strings.
We then track:
- Encodings we have seen before
- Encodings already added to the answer
When we encounter an encoding a second time, we append the actual substring to the result.
Approach Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n) | O(n) | Stores actual substrings and counts frequencies |
| Optimal | O(n) | O(n) | Uses rolling hash with bit encoding for constant time window updates |
Algorithm Walkthrough
Optimal Algorithm, Rolling Hash with Bit Manipulation
- First, check whether the input length is smaller than
10.
If len(s) < 10, return an empty list immediately because no valid substring exists.
2. Create a mapping for DNA characters.
Since only four nucleotides exist, encode them using two bits:
A → 00
C → 01
G → 10
T → 11
This compact representation allows efficient rolling updates. 3. Initialize data structures.
Create:
- A set called
seento track hashes already encountered - A set called
repeatedto avoid duplicate insertions - A result list for final answers
- Build the rolling hash.
As we iterate through the string:
- Shift the current hash left by
2 - Add the encoded value of the new character
- Keep only the last
20bits
The mask:
(1 << 20) - 1
ensures we discard older bits outside the current 10-character window.
5. Start processing once the first 10 characters are available.
We cannot form a valid sequence before index 9.
For each valid window:
- If the encoded sequence already exists in
seen, it means this DNA substring has appeared before. - If it has not yet been added to
repeated, append the substring to the result and mark it as repeated. - Otherwise, add the encoding to
seen.
- Continue until the end of the string.
The result list will contain all repeated 10-character DNA sequences.
Why it works
The algorithm works because every 10-character DNA sequence maps to a unique 20-bit integer. Since each character uses exactly 2 bits and only four characters exist, no collisions occur. The rolling hash always represents the current window of length 10, and the seen set guarantees we detect repeated sequences. The repeated set ensures each repeated sequence is added to the answer exactly once.
Python Solution
from typing import List
class Solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
sequence_length = 10
if len(s) < sequence_length:
return []
char_to_bits = {
'A': 0,
'C': 1,
'G': 2,
'T': 3
}
seen = set()
repeated = set()
result = []
rolling_hash = 0
mask = (1 << (2 * sequence_length)) - 1
for index, char in enumerate(s):
rolling_hash = (
(rolling_hash << 2) | char_to_bits[char]
) & mask
if index >= sequence_length - 1:
if rolling_hash in seen:
if rolling_hash not in repeated:
result.append(
s[index - sequence_length + 1:index + 1]
)
repeated.add(rolling_hash)
else:
seen.add(rolling_hash)
return result
The implementation begins with an early return for strings shorter than 10, because such inputs cannot contain valid DNA sequences.
The char_to_bits dictionary converts DNA characters into compact 2-bit representations. This encoding is crucial because it enables us to store an entire 10-character sequence inside a single integer.
The variable rolling_hash stores the current encoded window. At each iteration, we shift the previous value left by two bits and append the new nucleotide encoding. The mask ensures only the most recent 20 bits remain.
The seen set stores hashes already encountered. If the current hash already exists, we know the sequence has appeared before. The repeated set prevents duplicates from being appended to the answer multiple times.
Finally, the substring corresponding to the repeated window is extracted and added to result.
Go Solution
func findRepeatedDnaSequences(s string) []string {
const sequenceLength = 10
if len(s) < sequenceLength {
return []string{}
}
charToBits := map[byte]int{
'A': 0,
'C': 1,
'G': 2,
'T': 3,
}
seen := make(map[int]bool)
repeated := make(map[int]bool)
result := []string{}
rollingHash := 0
mask := (1 << (2 * sequenceLength)) - 1
for i := 0; i < len(s); i++ {
rollingHash = ((rollingHash << 2) | charToBits[s[i]]) & mask
if i >= sequenceLength-1 {
if seen[rollingHash] {
if !repeated[rollingHash] {
result = append(
result,
s[i-sequenceLength+1:i+1],
)
repeated[rollingHash] = true
}
} else {
seen[rollingHash] = true
}
}
}
return result
}
The Go implementation follows exactly the same logic as the Python version.
One notable difference is that Go strings are indexed using byte, so the nucleotide encoding map uses map[byte]int. Instead of Python sets, Go uses map[int]bool to simulate membership checking.
Go does not distinguish between nil and empty slices in LeetCode submissions for this problem, but returning an empty slice is more explicit and consistent.
Integer overflow is not a concern because the rolling hash uses only 20 bits.
Worked Examples
Example 1
Input
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
We examine every 10-character window.
| Index Range | Substring | Seen Before? | Result |
|---|---|---|---|
| 0-9 | AAAAACCCCC | No | [] |
| 1-10 | AAAACCCCCA | No | [] |
| 2-11 | AAACCCCCAA | No | [] |
| ... | ... | ... | ... |
| 10-19 | AAAAACCCCC | Yes | ["AAAAACCCCC"] |
| 16-25 | CCCCCAAAAA | Yes | ["AAAAACCCCC", "CCCCCAAAAA"] |
Final output:
["AAAAACCCCC", "CCCCCAAAAA"]
The important observation is that both repeated sequences occur multiple times and are added only once.
Example 2
Input
s = "AAAAAAAAAAAAA"
The windows are:
| Index Range | Substring |
|---|---|
| 0-9 | AAAAAAAAAA |
| 1-10 | AAAAAAAAAA |
| 2-11 | AAAAAAAAAA |
| 3-12 | AAAAAAAAAA |
Processing:
| Step | Window | Seen | Repeated | Result |
|---|---|---|---|---|
| 1 | AAAAAAAAAA | Add | No | [] |
| 2 | AAAAAAAAAA | Yes | Add | ["AAAAAAAAAA"] |
| 3 | AAAAAAAAAA | Yes | Already Exists | ["AAAAAAAAAA"] |
Final output:
["AAAAAAAAAA"]
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each character is processed once with constant time rolling updates |
| Space | O(n) | Hash sets may store up to O(n) unique sequences |
The time complexity is linear because we scan the string exactly once. Each iteration performs constant time work, shifting bits, masking, and hash set lookups.
The space complexity is O(n) in the worst case because every 10-character substring may be unique and stored in the hash sets.
Test Cases
solution = Solution()
# Example 1
assert sorted(solution.findRepeatedDnaSequences(
"AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
)) == sorted(["AAAAACCCCC", "CCCCCAAAAA"])
# Example 2
assert solution.findRepeatedDnaSequences(
"AAAAAAAAAAAAA"
) == ["AAAAAAAAAA"] # overlapping repetition
# Length smaller than 10
assert solution.findRepeatedDnaSequences(
"ACGT"
) == [] # no valid windows
# Length exactly 10, no repetition
assert solution.findRepeatedDnaSequences(
"ACGTACGTAC"
) == [] # single valid sequence
# No repeated sequences
assert solution.findRepeatedDnaSequences(
"ACGTACGTAG"
) == [] # all unique
# One repeated sequence multiple times
assert solution.findRepeatedDnaSequences(
"AAAAAAAAAAAAAA"
) == ["AAAAAAAAAA"] # repeated many times
# Multiple overlapping repeated sequences
assert sorted(solution.findRepeatedDnaSequences(
"AAAAAAAAAAA"
)) == sorted(["AAAAAAAAAA"]) # overlap handling
# Mixed repeated patterns
assert sorted(solution.findRepeatedDnaSequences(
"CCCCCAAAAACCCCCAAAAACCCCC"
)) == sorted([
"CCCCCAAAAA",
"CCCCAAAAAC",
"CCCAAAAACC",
"CCAAAAACCC",
"CAAAAACCCC",
"AAAAACCCCC"
]) # multiple overlaps
Test Summary
| Test | Why |
|---|---|
| Example 1 | Validates multiple repeated sequences |
| Example 2 | Validates overlapping repetition |
| Length smaller than 10 | Ensures early return works |
| Length exactly 10 | Ensures single window case works |
| No repeated sequences | Confirms empty output when nothing repeats |
| Same sequence repeated many times | Ensures no duplicate entries |
| Overlapping repeats | Verifies overlapping windows count |
| Mixed repeated patterns | Stresses repeated detection |
Edge Cases
Input Length Smaller Than 10
If the input string has fewer than 10 characters, there are no valid DNA windows to examine. A naive implementation might accidentally attempt substring extraction and cause indexing issues. This implementation prevents that with an immediate early return.
Overlapping Repeated Sequences
Repeated DNA sequences can overlap. For example:
"AAAAAAAAAAAAA"
contains many overlapping "AAAAAAAAAA" windows. A buggy implementation may accidentally ignore overlaps by moving too far ahead after detecting a repeat. Since this solution processes every adjacent window, overlapping repetitions are handled naturally.
Repeated Sequence Appearing Many Times
A sequence may appear more than twice. For example:
"AAAAAAAAAAAAAA"
contains the same sequence repeatedly. Without the repeated set, the algorithm could add the same substring multiple times to the result. The implementation avoids this by recording sequences already included in the answer.
All Unique Sequences
Some inputs contain no repeated 10-character substrings at all. In these cases, the algorithm correctly returns an empty list because no rolling hash is encountered twice.