LeetCode 433 - Minimum Genetic Mutation

Difficulty: 🟡 Medium
Topics: Hash Table, String, Breadth-First Search

Solution

Problem Understanding

The problem describes a mutation process between genetic sequences. Each gene is represented as a string of exactly 8 characters, and every character must be one of four possible nucleotides: 'A', 'C', 'G', or 'T'.

We are given three inputs:

startGene, the initial gene sequence
endGene, the target gene sequence we want to reach
bank, a list of valid intermediate gene sequences

A mutation consists of changing exactly one character in the gene string. However, there is an important restriction: every intermediate mutation must exist inside the gene bank. Even if a mutation differs by only one character, it cannot be used unless it appears in bank.

The goal is to determine the minimum number of mutations required to transform startGene into endGene. If no valid sequence of mutations exists, we return -1.

This is fundamentally a shortest path problem. Each valid gene can be thought of as a node in a graph. Two genes are connected if they differ by exactly one character. We want the shortest number of transformations from the start node to the target node.

The constraints are very small:

bank.length <= 10
every gene has length 8

These constraints tell us several things:

brute force exploration is feasible
generating neighbors dynamically is inexpensive
graph traversal algorithms like Breadth-First Search are ideal
performance is not the main challenge, correctness and clean traversal logic are more important

There are several important edge cases:

endGene may not exist in the bank at all, in which case the answer is immediately -1
the bank may be empty
there may be multiple possible mutation paths, and we specifically need the shortest one
cycles are possible if we revisit previously explored genes, so we must track visited states
the start gene itself may not appear in the bank, which is allowed by the problem statement

Approaches

Brute Force Approach

A brute-force solution would attempt every possible mutation sequence recursively. Starting from startGene, we could try mutating each position into each of the four nucleotide characters, check whether the resulting gene exists in the bank, and continue recursively until either endGene is found or all possibilities are exhausted.

This approach is correct because it explores all possible valid mutation sequences. Eventually, if a path exists, it will discover it.

However, the brute-force approach is inefficient because it repeatedly explores the same states. Different recursive paths may revisit identical genes many times, creating redundant computation. Without careful pruning, the recursion tree grows rapidly.

Even though the constraints are small enough that brute force might pass, it is not the cleanest or most scalable solution.

Key Insight

The key observation is that this problem asks for the minimum number of mutations. Whenever we need the shortest number of steps in an unweighted graph, Breadth-First Search is usually the correct tool.

Each gene represents a node. An edge exists between two genes if they differ by exactly one character.

Breadth-First Search works perfectly here because:

every mutation has equal cost
BFS explores all states at distance 1 before distance 2
the first time we reach endGene, we are guaranteed to have found the shortest mutation path

To efficiently check whether a generated mutation is valid, we store the bank in a hash set. This gives constant-time lookup when validating candidate mutations.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(4^8 * n) approximately	O(n)	Explores many redundant mutation paths recursively
Optimal BFS	O(n * 8 * 4)	O(n)	Uses BFS to find the shortest mutation sequence

Here, n is the number of genes in the bank.

Algorithm Walkthrough

Optimal BFS Algorithm

Convert the gene bank into a hash set.

We use a hash set because we frequently need to check whether a generated mutation is valid. A set provides average O(1) lookup time. 2. Check whether endGene exists in the bank.

If the target gene is not present, no valid sequence can ever reach it. We can immediately return -1. 3. Initialize a BFS queue.

The queue stores pairs:

the current gene
the number of mutations taken to reach it

We begin with:

(startGene, 0)

Maintain a visited set.

Once we process a gene, we should not revisit it. Revisiting states creates cycles and redundant work. 5. Process the queue level by level.

Remove the front gene from the queue. If it matches endGene, return the mutation count immediately. 6. Generate all possible one-character mutations.

For each of the 8 positions:

try replacing the current character with 'A', 'C', 'G', and 'T'
construct the new gene string
check whether it exists in the bank and has not been visited

Add valid mutations to the queue.

Every valid next mutation is added with mutation count + 1. 8. Continue until the queue becomes empty.

If BFS finishes without finding endGene, no valid mutation path exists. Return -1.

Why it works

Breadth-First Search explores genes in increasing order of mutation count. Every edge represents exactly one mutation, so all edges have equal weight.

Because BFS fully explores all paths requiring k mutations before exploring paths requiring k + 1 mutations, the first time we encounter endGene must correspond to the minimum possible number of mutations.

The visited set guarantees that each gene is processed only once, preventing infinite loops and redundant exploration.

Python Solution

from collections import deque
from typing import List

class Solution:
    def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
        bank_set = set(bank)

        if endGene not in bank_set:
            return -1

        queue = deque([(startGene, 0)])
        visited = {startGene}

        possible_chars = ["A", "C", "G", "T"]

        while queue:
            current_gene, mutations = queue.popleft()

            if current_gene == endGene:
                return mutations

            gene_chars = list(current_gene)

            for i in range(8):
                original_char = gene_chars[i]

                for char in possible_chars:
                    if char == original_char:
                        continue

                    gene_chars[i] = char
                    next_gene = "".join(gene_chars)

                    if next_gene in bank_set and next_gene not in visited:
                        visited.add(next_gene)
                        queue.append((next_gene, mutations + 1))

                gene_chars[i] = original_char

        return -1

The implementation begins by converting the bank into a set for efficient membership checks. This is important because every generated mutation must be validated quickly.

The early return for endGene not in bank_set prevents unnecessary traversal when the target is impossible to reach.

A BFS queue is then initialized with the starting gene and mutation count 0. The queue ensures we explore states in increasing order of distance from the start.

For every dequeued gene, the code generates all possible one-step mutations by iterating through each character position and replacing it with every nucleotide option.

Whenever a generated mutation is both valid and unvisited, it is added to the queue with mutation count incremented by one.

Restoring the original character after each mutation attempt is important because the same character array is reused across iterations.

If BFS exhausts all reachable genes without finding endGene, the function returns -1.

Go Solution

package main

func minMutation(startGene string, endGene string, bank []string) int {
	bankSet := make(map[string]bool)

	for _, gene := range bank {
		bankSet[gene] = true
	}

	if !bankSet[endGene] {
		return -1
	}

	type State struct {
		gene      string
		mutations int
	}

	queue := []State{{startGene, 0}}
	visited := make(map[string]bool)
	visited[startGene] = true

	chars := []byte{'A', 'C', 'G', 'T'}

	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]

		if current.gene == endGene {
			return current.mutations
		}

		geneBytes := []byte(current.gene)

		for i := 0; i < 8; i++ {
			originalChar := geneBytes[i]

			for _, ch := range chars {
				if ch == originalChar {
					continue
				}

				geneBytes[i] = ch
				nextGene := string(geneBytes)

				if bankSet[nextGene] && !visited[nextGene] {
					visited[nextGene] = true
					queue = append(queue, State{
						gene:      nextGene,
						mutations: current.mutations + 1,
					})
				}
			}

			geneBytes[i] = originalChar
		}
	}

	return -1
}

The Go implementation follows the same BFS structure as the Python version. Since Go does not provide a built-in deque, a slice is used as the queue.

A small State struct stores both the gene string and mutation count together.

Instead of converting strings repeatedly, the implementation works with byte slices while generating mutations. This is more efficient because Go strings are immutable.

Go maps are used for both the bank set and visited tracking.

Worked Examples

Example 1

startGene = "AACCGGTT"
endGene   = "AACCGGTA"
bank      = ["AACCGGTA"]

Initial state:

Queue	Visited
[("AACCGGTT", 0)]	{"AACCGGTT"}

Process "AACCGGTT":

We try all one-character mutations.

One valid mutation exists:

"AACCGGTA"

It exists in the bank, so enqueue it.

Queue	Visited
[("AACCGGTA", 1)]	{"AACCGGTT", "AACCGGTA"}

Process "AACCGGTA":

This equals endGene, so return 1.

Final answer:

Example 2

startGene = "AACCGGTT"
endGene   = "AAACGGTA"
bank      = ["AACCGGTA","AACCGCTA","AAACGGTA"]

Initial state:

Queue	Visited
[("AACCGGTT", 0)]	{"AACCGGTT"}

Process "AACCGGTT":

Valid mutation found:

"AACCGGTA"

Queue	Visited
[("AACCGGTA", 1)]	{"AACCGGTT", "AACCGGTA"}

Process "AACCGGTA":

Possible valid mutations:

"AAACGGTA"
"AACCGCTA"

Enqueue both.

Queue	Visited
[("AAACGGTA", 2), ("AACCGCTA", 2)]	{"AACCGGTT", "AACCGGTA", "AAACGGTA", "AACCGCTA"}

Process "AAACGGTA":

This matches endGene.

Return:

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n * 8 * 4)	For each gene, we try 8 positions and 4 possible mutations
Space	O(n)	Queue, visited set, and bank set each store at most n genes

The BFS processes each valid gene at most once because of the visited set. For every processed gene, we generate at most 8 * 4 = 32 candidate mutations.

Since the bank size is extremely small, this solution is very efficient in practice.

Test Cases

from typing import List

class Solution:
    def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
        from collections import deque

        bank_set = set(bank)

        if endGene not in bank_set:
            return -1

        queue = deque([(startGene, 0)])
        visited = {startGene}

        possible_chars = ["A", "C", "G", "T"]

        while queue:
            current_gene, mutations = queue.popleft()

            if current_gene == endGene:
                return mutations

            gene_chars = list(current_gene)

            for i in range(8):
                original_char = gene_chars[i]

                for char in possible_chars:
                    if char == original_char:
                        continue

                    gene_chars[i] = char
                    next_gene = "".join(gene_chars)

                    if next_gene in bank_set and next_gene not in visited:
                        visited.add(next_gene)
                        queue.append((next_gene, mutations + 1))

                gene_chars[i] = original_char

        return -1

solution = Solution()

assert solution.minMutation(
    "AACCGGTT",
    "AACCGGTA",
    ["AACCGGTA"]
) == 1  # single mutation

assert solution.minMutation(
    "AACCGGTT",
    "AAACGGTA",
    ["AACCGGTA", "AACCGCTA", "AAACGGTA"]
) == 2  # two-step mutation path

assert solution.minMutation(
    "AAAAACCC",
    "AACCCCCC",
    ["AAAACCCC", "AAACCCCC", "AACCCCCC"]
) == 3  # longer valid chain

assert solution.minMutation(
    "AACCGGTT",
    "AACCGGTA",
    []
) == -1  # empty bank

assert solution.minMutation(
    "AACCGGTT",
    "AAACGGTA",
    ["AACCGCTA"]
) == -1  # end gene missing from bank

assert solution.minMutation(
    "AACCGGTT",
    "AACCGGTT",
    ["AACCGGTT"]
) == 0  # already at target

assert solution.minMutation(
    "AACCGGTT",
    "CCCCCCCC",
    ["CACCGGTT", "CCCCGGTT", "CCCCCGTT"]
) == -1  # unreachable target

assert solution.minMutation(
    "AAAAAAAA",
    "CCCCCCCC",
    [
        "CAAAAAAA",
        "CCAAAAAA",
        "CCCAAAAA",
        "CCCCAAAA",
        "CCCCCAAA",
        "CCCCCCAA",
        "CCCCCCCA",
        "CCCCCCCC"
    ]
) == 8  # maximum-length mutation chain

Test	Why
Single mutation example	Validates the simplest successful transformation
Two-step example	Confirms BFS finds shortest path
Longer chain	Verifies multi-level traversal
Empty bank	Ensures impossible cases return `-1`
Missing end gene	Tests early termination optimization
Start equals end	Verifies zero-mutation scenario
Unreachable target	Confirms BFS handles disconnected graphs
Long mutation chain	Stress-tests repeated BFS expansion

Edge Cases

End Gene Missing from the Bank

One of the most common mistakes is forgetting that every valid mutation must exist in the bank, including the final target gene. If endGene is absent, the transformation is impossible regardless of intermediate states.

The implementation handles this immediately with:

if endGene not in bank_set:
    return -1

This prevents unnecessary BFS traversal.

Start Gene Already Equals End Gene

Another subtle case occurs when the start and end genes are identical. In this situation, no mutations are required, so the correct answer is 0.

The BFS naturally handles this because the first dequeued state is the start gene itself. The equality check succeeds immediately.

Cycles and Repeated Exploration

Mutation graphs can contain cycles. For example:

AACCGGTT -> AACCGGTA -> AACCGGTT

Without a visited set, BFS could revisit the same genes indefinitely or repeatedly process identical states.

The implementation prevents this by adding genes to visited as soon as they are enqueued. Each gene is therefore processed at most once.

Empty Gene Bank

An empty bank means there are no valid mutations available. Unless the target is already reached, the answer must be -1.

The implementation handles this correctly because:

the end gene will fail the bank membership check
BFS will never enqueue additional states

This guarantees correct behavior without requiring special-case logic.