LeetCode 433 - Minimum Genetic Mutation
The problem describes a mutation process between genetic sequences. Each gene is represented as a string of exactly 8 characters, and every character must be one of four possible nucleotides: 'A', 'C', 'G', or 'T'.
Difficulty: 🟡 Medium
Topics: Hash Table, String, Breadth-First Search
Solution
Problem Understanding
The problem describes a mutation process between genetic sequences. Each gene is represented as a string of exactly 8 characters, and every character must be one of four possible nucleotides: 'A', 'C', 'G', or 'T'.
We are given three inputs:
startGene, the initial gene sequenceendGene, the target gene sequence we want to reachbank, a list of valid intermediate gene sequences
A mutation consists of changing exactly one character in the gene string. However, there is an important restriction: every intermediate mutation must exist inside the gene bank. Even if a mutation differs by only one character, it cannot be used unless it appears in bank.
The goal is to determine the minimum number of mutations required to transform startGene into endGene. If no valid sequence of mutations exists, we return -1.
This is fundamentally a shortest path problem. Each valid gene can be thought of as a node in a graph. Two genes are connected if they differ by exactly one character. We want the shortest number of transformations from the start node to the target node.
The constraints are very small:
bank.length <= 10- every gene has length
8
These constraints tell us several things:
- brute force exploration is feasible
- generating neighbors dynamically is inexpensive
- graph traversal algorithms like Breadth-First Search are ideal
- performance is not the main challenge, correctness and clean traversal logic are more important
There are several important edge cases:
endGenemay not exist in the bank at all, in which case the answer is immediately-1- the bank may be empty
- there may be multiple possible mutation paths, and we specifically need the shortest one
- cycles are possible if we revisit previously explored genes, so we must track visited states
- the start gene itself may not appear in the bank, which is allowed by the problem statement
Approaches
Brute Force Approach
A brute-force solution would attempt every possible mutation sequence recursively. Starting from startGene, we could try mutating each position into each of the four nucleotide characters, check whether the resulting gene exists in the bank, and continue recursively until either endGene is found or all possibilities are exhausted.
This approach is correct because it explores all possible valid mutation sequences. Eventually, if a path exists, it will discover it.
However, the brute-force approach is inefficient because it repeatedly explores the same states. Different recursive paths may revisit identical genes many times, creating redundant computation. Without careful pruning, the recursion tree grows rapidly.
Even though the constraints are small enough that brute force might pass, it is not the cleanest or most scalable solution.
Key Insight
The key observation is that this problem asks for the minimum number of mutations. Whenever we need the shortest number of steps in an unweighted graph, Breadth-First Search is usually the correct tool.
Each gene represents a node. An edge exists between two genes if they differ by exactly one character.
Breadth-First Search works perfectly here because:
- every mutation has equal cost
- BFS explores all states at distance
1before distance2 - the first time we reach
endGene, we are guaranteed to have found the shortest mutation path
To efficiently check whether a generated mutation is valid, we store the bank in a hash set. This gives constant-time lookup when validating candidate mutations.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(4^8 * n) approximately | O(n) | Explores many redundant mutation paths recursively |
| Optimal BFS | O(n * 8 * 4) | O(n) | Uses BFS to find the shortest mutation sequence |
Here, n is the number of genes in the bank.
Algorithm Walkthrough
Optimal BFS Algorithm
- Convert the gene bank into a hash set.
We use a hash set because we frequently need to check whether a generated mutation is valid. A set provides average O(1) lookup time.
2. Check whether endGene exists in the bank.
If the target gene is not present, no valid sequence can ever reach it. We can immediately return -1.
3. Initialize a BFS queue.
The queue stores pairs:
- the current gene
- the number of mutations taken to reach it
We begin with:
(startGene, 0)
- Maintain a visited set.
Once we process a gene, we should not revisit it. Revisiting states creates cycles and redundant work. 5. Process the queue level by level.
Remove the front gene from the queue. If it matches endGene, return the mutation count immediately.
6. Generate all possible one-character mutations.
For each of the 8 positions:
- try replacing the current character with
'A','C','G', and'T' - construct the new gene string
- check whether it exists in the bank and has not been visited
- Add valid mutations to the queue.
Every valid next mutation is added with mutation count + 1. 8. Continue until the queue becomes empty.
If BFS finishes without finding endGene, no valid mutation path exists. Return -1.
Why it works
Breadth-First Search explores genes in increasing order of mutation count. Every edge represents exactly one mutation, so all edges have equal weight.
Because BFS fully explores all paths requiring k mutations before exploring paths requiring k + 1 mutations, the first time we encounter endGene must correspond to the minimum possible number of mutations.
The visited set guarantees that each gene is processed only once, preventing infinite loops and redundant exploration.
Python Solution
from collections import deque
from typing import List
class Solution:
def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
bank_set = set(bank)
if endGene not in bank_set:
return -1
queue = deque([(startGene, 0)])
visited = {startGene}
possible_chars = ["A", "C", "G", "T"]
while queue:
current_gene, mutations = queue.popleft()
if current_gene == endGene:
return mutations
gene_chars = list(current_gene)
for i in range(8):
original_char = gene_chars[i]
for char in possible_chars:
if char == original_char:
continue
gene_chars[i] = char
next_gene = "".join(gene_chars)
if next_gene in bank_set and next_gene not in visited:
visited.add(next_gene)
queue.append((next_gene, mutations + 1))
gene_chars[i] = original_char
return -1
The implementation begins by converting the bank into a set for efficient membership checks. This is important because every generated mutation must be validated quickly.
The early return for endGene not in bank_set prevents unnecessary traversal when the target is impossible to reach.
A BFS queue is then initialized with the starting gene and mutation count 0. The queue ensures we explore states in increasing order of distance from the start.
For every dequeued gene, the code generates all possible one-step mutations by iterating through each character position and replacing it with every nucleotide option.
Whenever a generated mutation is both valid and unvisited, it is added to the queue with mutation count incremented by one.
Restoring the original character after each mutation attempt is important because the same character array is reused across iterations.
If BFS exhausts all reachable genes without finding endGene, the function returns -1.
Go Solution
package main
func minMutation(startGene string, endGene string, bank []string) int {
bankSet := make(map[string]bool)
for _, gene := range bank {
bankSet[gene] = true
}
if !bankSet[endGene] {
return -1
}
type State struct {
gene string
mutations int
}
queue := []State{{startGene, 0}}
visited := make(map[string]bool)
visited[startGene] = true
chars := []byte{'A', 'C', 'G', 'T'}
for len(queue) > 0 {
current := queue[0]
queue = queue[1:]
if current.gene == endGene {
return current.mutations
}
geneBytes := []byte(current.gene)
for i := 0; i < 8; i++ {
originalChar := geneBytes[i]
for _, ch := range chars {
if ch == originalChar {
continue
}
geneBytes[i] = ch
nextGene := string(geneBytes)
if bankSet[nextGene] && !visited[nextGene] {
visited[nextGene] = true
queue = append(queue, State{
gene: nextGene,
mutations: current.mutations + 1,
})
}
}
geneBytes[i] = originalChar
}
}
return -1
}
The Go implementation follows the same BFS structure as the Python version. Since Go does not provide a built-in deque, a slice is used as the queue.
A small State struct stores both the gene string and mutation count together.
Instead of converting strings repeatedly, the implementation works with byte slices while generating mutations. This is more efficient because Go strings are immutable.
Go maps are used for both the bank set and visited tracking.
Worked Examples
Example 1
startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA"]
Initial state:
| Queue | Visited |
|---|---|
| [("AACCGGTT", 0)] | {"AACCGGTT"} |
Process "AACCGGTT":
We try all one-character mutations.
One valid mutation exists:
"AACCGGTA"
It exists in the bank, so enqueue it.
| Queue | Visited |
|---|---|
| [("AACCGGTA", 1)] | {"AACCGGTT", "AACCGGTA"} |
Process "AACCGGTA":
This equals endGene, so return 1.
Final answer:
1
Example 2
startGene = "AACCGGTT"
endGene = "AAACGGTA"
bank = ["AACCGGTA","AACCGCTA","AAACGGTA"]
Initial state:
| Queue | Visited |
|---|---|
| [("AACCGGTT", 0)] | {"AACCGGTT"} |
Process "AACCGGTT":
Valid mutation found:
"AACCGGTA"
| Queue | Visited |
|---|---|
| [("AACCGGTA", 1)] | {"AACCGGTT", "AACCGGTA"} |
Process "AACCGGTA":
Possible valid mutations:
"AAACGGTA"
"AACCGCTA"
Enqueue both.
| Queue | Visited |
|---|---|
| [("AAACGGTA", 2), ("AACCGCTA", 2)] | {"AACCGGTT", "AACCGGTA", "AAACGGTA", "AACCGCTA"} |
Process "AAACGGTA":
This matches endGene.
Return:
2
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n * 8 * 4) | For each gene, we try 8 positions and 4 possible mutations |
| Space | O(n) | Queue, visited set, and bank set each store at most n genes |
The BFS processes each valid gene at most once because of the visited set. For every processed gene, we generate at most 8 * 4 = 32 candidate mutations.
Since the bank size is extremely small, this solution is very efficient in practice.
Test Cases
from typing import List
class Solution:
def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
from collections import deque
bank_set = set(bank)
if endGene not in bank_set:
return -1
queue = deque([(startGene, 0)])
visited = {startGene}
possible_chars = ["A", "C", "G", "T"]
while queue:
current_gene, mutations = queue.popleft()
if current_gene == endGene:
return mutations
gene_chars = list(current_gene)
for i in range(8):
original_char = gene_chars[i]
for char in possible_chars:
if char == original_char:
continue
gene_chars[i] = char
next_gene = "".join(gene_chars)
if next_gene in bank_set and next_gene not in visited:
visited.add(next_gene)
queue.append((next_gene, mutations + 1))
gene_chars[i] = original_char
return -1
solution = Solution()
assert solution.minMutation(
"AACCGGTT",
"AACCGGTA",
["AACCGGTA"]
) == 1 # single mutation
assert solution.minMutation(
"AACCGGTT",
"AAACGGTA",
["AACCGGTA", "AACCGCTA", "AAACGGTA"]
) == 2 # two-step mutation path
assert solution.minMutation(
"AAAAACCC",
"AACCCCCC",
["AAAACCCC", "AAACCCCC", "AACCCCCC"]
) == 3 # longer valid chain
assert solution.minMutation(
"AACCGGTT",
"AACCGGTA",
[]
) == -1 # empty bank
assert solution.minMutation(
"AACCGGTT",
"AAACGGTA",
["AACCGCTA"]
) == -1 # end gene missing from bank
assert solution.minMutation(
"AACCGGTT",
"AACCGGTT",
["AACCGGTT"]
) == 0 # already at target
assert solution.minMutation(
"AACCGGTT",
"CCCCCCCC",
["CACCGGTT", "CCCCGGTT", "CCCCCGTT"]
) == -1 # unreachable target
assert solution.minMutation(
"AAAAAAAA",
"CCCCCCCC",
[
"CAAAAAAA",
"CCAAAAAA",
"CCCAAAAA",
"CCCCAAAA",
"CCCCCAAA",
"CCCCCCAA",
"CCCCCCCA",
"CCCCCCCC"
]
) == 8 # maximum-length mutation chain
| Test | Why |
|---|---|
| Single mutation example | Validates the simplest successful transformation |
| Two-step example | Confirms BFS finds shortest path |
| Longer chain | Verifies multi-level traversal |
| Empty bank | Ensures impossible cases return -1 |
| Missing end gene | Tests early termination optimization |
| Start equals end | Verifies zero-mutation scenario |
| Unreachable target | Confirms BFS handles disconnected graphs |
| Long mutation chain | Stress-tests repeated BFS expansion |
Edge Cases
End Gene Missing from the Bank
One of the most common mistakes is forgetting that every valid mutation must exist in the bank, including the final target gene. If endGene is absent, the transformation is impossible regardless of intermediate states.
The implementation handles this immediately with:
if endGene not in bank_set:
return -1
This prevents unnecessary BFS traversal.
Start Gene Already Equals End Gene
Another subtle case occurs when the start and end genes are identical. In this situation, no mutations are required, so the correct answer is 0.
The BFS naturally handles this because the first dequeued state is the start gene itself. The equality check succeeds immediately.
Cycles and Repeated Exploration
Mutation graphs can contain cycles. For example:
AACCGGTT -> AACCGGTA -> AACCGGTT
Without a visited set, BFS could revisit the same genes indefinitely or repeatedly process identical states.
The implementation prevents this by adding genes to visited as soon as they are enqueued. Each gene is therefore processed at most once.
Empty Gene Bank
An empty bank means there are no valid mutations available. Unless the target is already reached, the answer must be -1.
The implementation handles this correctly because:
- the end gene will fail the bank membership check
- BFS will never enqueue additional states
This guarantees correct behavior without requiring special-case logic.