LeetCode 940 - Distinct Subsequences II

The problem asks us to count how many distinct non-empty subsequences can be formed from a given string s. A subsequence is formed by deleting zero or more characters while keeping the remaining characters in their original order.

LeetCode Problem 940

Difficulty: 🔴 Hard
Topics: String, Dynamic Programming

Solution

Problem Understanding

The problem asks us to count how many distinct non-empty subsequences can be formed from a given string s.

A subsequence is formed by deleting zero or more characters while keeping the remaining characters in their original order. For example, from "abc" we can form subsequences like "a", "ac", and "bc", but not "ca" because that changes the order.

The important detail is that we only count distinct subsequences. If the same subsequence can be formed in multiple ways, it should only be counted once.

For example, consider "aba":

  • Using the first 'a' gives subsequence "a"
  • Using the last 'a' also gives subsequence "a"

Even though there are multiple ways to produce "a", it is counted only once.

The input is a string s consisting only of lowercase English letters, with length up to 2000. Since the number of subsequences grows exponentially, we clearly cannot generate all subsequences directly for the upper bound.

The answer can become extremely large, so the problem requires returning the result modulo 10^9 + 7.

Several edge cases are important:

  • Strings with all unique characters, such as "abcde", maximize the number of subsequences.
  • Strings with many repeated characters, such as "aaaaa", create many duplicate subsequences that must not be double counted.
  • Very long strings require an efficient dynamic programming solution.
  • Single-character strings should correctly return 1.

Approaches

Brute Force Approach

The brute force solution generates every possible subsequence using recursion or bitmask enumeration.

For a string of length n, each character has two choices:

  • Include it
  • Exclude it

This produces 2^n total subsequences, including the empty subsequence.

After generating them, we could insert each subsequence into a set to remove duplicates, then return the size of the set minus one for the empty subsequence.

This approach is correct because it explicitly enumerates every subsequence and uses a set to ensure uniqueness.

However, it is far too slow for n = 2000. Even for n = 30, 2^30 is already over one billion subsequences.

Optimal Dynamic Programming Approach

The key observation is that when we process a new character, every existing subsequence can either:

  • Stay unchanged
  • Append the new character

If all characters were unique, each new character would simply double the number of subsequences.

However, repeated characters create duplicates.

For example:

  • Suppose we already processed "ab"
  • Existing subsequences are: "", "a", "b", "ab"

Now we process another 'a'.

Appending 'a' creates:

  • "a"
  • "aa"
  • "ba"
  • "aba"

But "a" already existed before, so we must avoid double counting.

The main insight is:

When processing character c, we double the number of subsequences, then subtract the subsequences that were already counted the last time we saw c.

We track:

  • total, the number of distinct subsequences including the empty subsequence
  • last[c], the number of subsequences that existed before the previous occurrence of character c

This allows us to remove duplicates efficiently.

Approach Time Complexity Space Complexity Notes
Brute Force O(2^n * n) O(2^n * n) Generates all subsequences explicitly
Optimal O(n) O(1) Dynamic programming with last occurrence tracking

Algorithm Walkthrough

  1. Initialize total = 1.

This represents the number of distinct subsequences including the empty subsequence. Initially, only the empty subsequence exists. 2. Create an array last of size 26 initialized to 0.

last[i] stores the value of total before the previous occurrence of character i. 3. Iterate through each character ch in the string. 4. Before updating total, store the current value in previous_total.

We need this because after processing the current character, future duplicates must know what the total count was before this occurrence. 5. Compute the new total:

new_total = total * 2 - last[ch]

The multiplication by 2 happens because every existing subsequence can either:

  • Ignore the current character
  • Append the current character

But this creates duplicate subsequences for repeated characters. The duplicates are exactly the subsequences counted before the previous occurrence of this character, so we subtract last[ch]. 6. Apply modulo arithmetic.

Since subtraction can temporarily produce negative values, we use:

total = (new_total + MOD) % MOD
  1. Update last[ch].

Set:

last[ch] = previous_total

This records how many subsequences existed before this occurrence of the character. 8. After processing all characters, subtract one to exclude the empty subsequence.

return total - 1

Why it works

The algorithm maintains the invariant that total always equals the number of distinct subsequences including the empty subsequence for the processed prefix of the string.

When a new character appears, every existing subsequence can either include or exclude it, which doubles the count. However, if the character appeared before, appending it to subsequences that already existed before the previous occurrence creates duplicates. Subtracting last[ch] removes exactly those duplicates, ensuring every subsequence is counted once.

Python Solution

class Solution:
    def distinctSubseqII(self, s: str) -> int:
        MOD = 10**9 + 7
        
        # total includes the empty subsequence
        total = 1
        
        # last[i] stores the total count before the previous
        # occurrence of character i
        last = [0] * 26
        
        for ch in s:
            index = ord(ch) - ord('a')
            
            previous_total = total
            
            total = (2 * total - last[index]) % MOD
            
            last[index] = previous_total
        
        # remove the empty subsequence
        return (total - 1) % MOD

The implementation follows the dynamic programming idea directly.

total tracks the number of distinct subsequences including the empty subsequence. Starting from 1 simplifies the recurrence because every step can uniformly double the count.

The last array tracks duplicate-producing states for each character. Since the string only contains lowercase English letters, an array of size 26 is sufficient and more efficient than a hash map.

Inside the loop, we first save previous_total because that value becomes the new last entry for the current character.

The recurrence:

total = (2 * total - last[index]) % MOD

implements the core logic:

  • Double all subsequences
  • Remove duplicates caused by repeated characters

Finally, we subtract one to exclude the empty subsequence from the final answer.

Go Solution

func distinctSubseqII(s string) int {
	const MOD int = 1000000007

	total := 1

	last := make([]int, 26)

	for _, ch := range s {
		index := ch - 'a'

		previousTotal := total

		total = (2*total - last[index] + MOD) % MOD

		last[index] = previousTotal
	}

	return (total - 1 + MOD) % MOD
}

The Go implementation mirrors the Python solution closely.

Go requires explicit handling of modulo subtraction because negative values can appear temporarily. Adding MOD before applying % MOD ensures the result remains non-negative.

The last structure is implemented as a fixed-size integer slice of length 26, since only lowercase English letters are possible.

Go uses rune iteration in the for _, ch := range s loop, but because all characters are lowercase English letters, indexing remains straightforward.

Worked Examples

Example 1: s = "abc"

Initial state:

Step Character Previous Total New Total last Update
Start - - 1 all zeros

Process 'a':

Step Character Previous Total Calculation New Total
1 a 1 2 × 1 - 0 2

Update:

last['a'] = 1

Process 'b':

Step Character Previous Total Calculation New Total
2 b 2 2 × 2 - 0 4

Update:

last['b'] = 2

Process 'c':

Step Character Previous Total Calculation New Total
3 c 4 2 × 4 - 0 8

Final answer:

8 - 1 = 7

Distinct subsequences:

a, b, c, ab, ac, bc, abc

Example 2: s = "aba"

Initial:

total = 1

Process 'a':

Character Previous Total Calculation New Total
a 1 2 × 1 - 0 2

Update:

last['a'] = 1

Process 'b':

Character Previous Total Calculation New Total
b 2 2 × 2 - 0 4

Update:

last['b'] = 2

Process second 'a':

Character Previous Total Calculation New Total
a 4 2 × 4 - 1 7

Why subtract 1?

The previous occurrence of 'a' already created subsequences beginning from the old state, so those duplicates must be removed.

Final answer:

7 - 1 = 6

Distinct subsequences:

a, b, ab, aa, ba, aba

Example 3: s = "aaa"

Initial:

total = 1

Process first 'a':

Character Previous Total Calculation New Total
a 1 2 × 1 - 0 2

Update:

last['a'] = 1

Process second 'a':

Character Previous Total Calculation New Total
a 2 2 × 2 - 1 3

Update:

last['a'] = 2

Process third 'a':

Character Previous Total Calculation New Total
a 3 2 × 3 - 2 4

Final answer:

4 - 1 = 3

Distinct subsequences:

a, aa, aaa

Complexity Analysis

Measure Complexity Explanation
Time O(n) Each character is processed exactly once
Space O(1) The last array always has fixed size 26

The algorithm performs a constant amount of work for every character in the string, producing linear time complexity.

The space complexity is constant because the auxiliary storage does not depend on input size. The last array always contains exactly 26 entries.

Test Cases

solution = Solution()

assert solution.distinctSubseqII("abc") == 7       # all unique characters
assert solution.distinctSubseqII("aba") == 6       # repeated character creates duplicates
assert solution.distinctSubseqII("aaa") == 3       # all characters identical

assert solution.distinctSubseqII("a") == 1         # single character
assert solution.distinctSubseqII("ab") == 3        # simple two-character case
assert solution.distinctSubseqII("aa") == 2        # duplicate removal

assert solution.distinctSubseqII("abcd") == 15     # 2^4 - 1 for unique chars
assert solution.distinctSubseqII("zzzz") == 4      # repeated same character

assert solution.distinctSubseqII("abab") == 11     # alternating duplicates
assert solution.distinctSubseqII("leetcode") > 0   # larger realistic string

long_unique = "abcdefghijklmnopqrstuvwxyz"
assert solution.distinctSubseqII(long_unique) == (2**26 - 1) % (10**9 + 7)
# maximum unique lowercase letters

long_repeat = "a" * 2000
assert solution.distinctSubseqII(long_repeat) == 2000
# only subsequences are "a", "aa", ..., repeated length
Test Why
"abc" Basic unique-character example
"aba" Verifies duplicate removal
"aaa" Heavy repetition case
"a" Smallest valid input
"ab" Minimal unique expansion
"aa" Minimal duplicate scenario
"abcd" Pure doubling behavior
"zzzz" Repeated single character
"abab" Interleaved duplicate handling
"leetcode" General realistic input
"abcdefghijklmnopqrstuvwxyz" Maximum unique lowercase letters
"a" * 2000 Large repeated-character stress test

Edge Cases

All Characters Are Identical

A string like "aaaaaa" is deceptively tricky because the number of possible subsequences grows slowly despite many positions existing in the string.

A naive implementation may accidentally count the same subsequence many times. For example, "a" can be formed from any individual position.

The dynamic programming solution handles this correctly because each repeated 'a' subtracts previously duplicated subsequences using last['a'].

All Characters Are Unique

For a string like "abcdef", every new character doubles the number of subsequences because no duplicates are introduced.

This is the ideal growth scenario:

2^n - 1

The algorithm naturally handles this because every last[ch] value remains zero for first occurrences.

Very Long Input

The constraint allows strings up to length 2000. A brute force solution would require processing 2^2000 subsequences, which is computationally impossible.

The dynamic programming approach processes each character once and uses constant extra space, making it efficient even at the maximum constraint size.