LeetCode 2272 - Substring With Largest Variance

This problem asks us to find the maximum possible variance among all substrings of a given string. The string contains only lowercase English letters, and the variance of a substring is defined as the largest difference between the counts of any two characters that both appear…

LeetCode Problem 2272

Difficulty: 🔴 Hard
Topics: Hash Table, String, Dynamic Programming, Enumeration

Solution

Problem Understanding

This problem asks us to find the maximum possible variance among all substrings of a given string. The string contains only lowercase English letters, and the variance of a substring is defined as the largest difference between the counts of any two characters that both appear in that substring.

The important detail is that the two characters being compared must both exist in the substring. For example, if a substring contains four b characters and one a character, then the variance is 4 - 1 = 3. However, if a substring contains only b characters, its variance is 0, because there is no second character present to compare against.

The input is a single string s with length up to 10^4. The expected output is a single integer representing the largest variance among every possible substring.

A naive interpretation might suggest checking every substring and computing character frequencies for each one. Since there are O(n^2) substrings and each substring may require frequency analysis, this quickly becomes too slow for n = 10^4.

Several edge cases are important:

  • Strings where all characters are unique, such as "abcde", always produce variance 0.
  • Strings with only one distinct character also produce variance 0.
  • The optimal substring may not contain the globally most frequent character.
  • The order of characters matters because we are working with substrings, not subsequences.
  • A valid substring for variance must contain both chosen characters at least once.

The main challenge is efficiently finding the best substring for every pair of characters.

Approaches

Brute Force Approach

The brute force solution considers every possible substring of s. For each substring, we count the frequency of all characters and compute the maximum difference between any two character counts that both appear.

There are O(n^2) substrings. Computing frequencies for each substring can take up to O(26) if prefix sums are used, or O(n) without preprocessing. Even with optimization, the total complexity becomes too large.

This approach is correct because it explicitly checks every substring and directly computes the required variance. However, it is impractical for strings of length 10^4.

Optimal Approach

The key insight is that variance only depends on two characters at a time.

Suppose we choose two characters:

  • major, the more frequent character
  • minor, the less frequent character

For a substring containing these two characters, the variance becomes:

count(major) - count(minor)

This resembles the classic maximum subarray problem solved by Kadane's algorithm.

We can transform the string into values:

  • +1 for major
  • -1 for minor
  • ignore all other characters

Now we want the maximum subarray sum that contains at least one minor.

The tricky part is ensuring the substring includes both characters. We cannot allow substrings containing only the major character.

We iterate over all ordered pairs of distinct characters. Since there are only 26 lowercase letters, this gives at most 26 * 26 = 676 pairs, which is manageable.

For each pair, we run a modified Kadane scan across the string.

Approach Comparison

Approach Time Complexity Space Complexity Notes
Brute Force O(n² × 26) O(26) Enumerates every substring and computes frequencies
Optimal O(26² × n) O(1) Uses modified Kadane's algorithm for every character pair

Algorithm Walkthrough

Step 1: Iterate over all ordered character pairs

We consider every ordered pair (major, minor) where the two characters are different.

The order matters because variance is directional:

count(major) - count(minor)

A substring maximizing a - b is different from one maximizing b - a.

Step 2: Initialize Kadane-style variables

For each pair:

  • major_count tracks occurrences of major
  • minor_count tracks occurrences of minor

We scan the string from left to right.

Step 3: Ignore irrelevant characters

If the current character is neither major nor minor, it does not affect the variance for this pair, so we skip it.

Step 4: Update counts

When we encounter:

  • major, increment major_count
  • minor, increment minor_count

The current variance candidate becomes:

major_count - minor_count

Step 5: Only update the answer when both characters exist

A valid substring must contain at least one minor.

Therefore, we only update the global answer if:

minor_count > 0

Step 6: Reset when the balance becomes unfavorable

If:

major_count < minor_count

then the current substring is harmful for future extensions.

This is exactly the Kadane principle. Any future substring would perform better starting fresh after this point.

So we reset:

major_count = 0
minor_count = 0

Step 7: Handle future minor characters

There is one subtle issue.

Suppose we reset immediately after seeing more minors than majors. We might lose the possibility of forming a valid substring later.

To fix this, we track whether additional minor characters remain ahead in the string. We only reset if another minor still exists later.

Why it works

For each ordered pair of characters, the algorithm effectively computes the maximum subarray sum where:

  • major contributes +1
  • minor contributes -1

The reset condition follows the same correctness argument as Kadane's algorithm. Once the running balance becomes negative, continuing the current substring can never help future results.

By checking all ordered character pairs, we guarantee that every possible variance configuration is considered.

Python Solution

class Solution:
    def largestVariance(self, s: str) -> int:
        answer = 0
        
        for major in range(26):
            for minor in range(26):
                if major == minor:
                    continue
                
                major_char = chr(ord('a') + major)
                minor_char = chr(ord('a') + minor)
                
                remaining_minor = s.count(minor_char)
                
                major_count = 0
                minor_count = 0
                
                for ch in s:
                    if ch != major_char and ch != minor_char:
                        continue
                    
                    if ch == major_char:
                        major_count += 1
                    else:
                        minor_count += 1
                        remaining_minor -= 1
                    
                    if minor_count > 0:
                        answer = max(answer, major_count - minor_count)
                    
                    if major_count < minor_count and remaining_minor > 0:
                        major_count = 0
                        minor_count = 0
        
        return answer

The implementation follows the algorithm directly.

The outer two loops enumerate all ordered character pairs. Since only lowercase English letters are allowed, the total number of pairs is small and fixed.

For each pair, we maintain a Kadane-style running window using major_count and minor_count.

Characters unrelated to the current pair are ignored because they do not influence the variance between the selected characters.

The variable remaining_minor is important for correctness. It tracks how many future minor characters still exist. If the current balance becomes negative and future minor characters remain, we reset the running counts. Otherwise, we keep the current counts because this may be the only valid substring containing a minor.

The answer is updated only when minor_count > 0, ensuring both characters exist in the substring.

Go Solution

func largestVariance(s string) int {
    answer := 0

    for major := 0; major < 26; major++ {
        for minor := 0; minor < 26; minor++ {
            if major == minor {
                continue
            }

            majorChar := byte('a' + major)
            minorChar := byte('a' + minor)

            remainingMinor := 0
            for i := 0; i < len(s); i++ {
                if s[i] == minorChar {
                    remainingMinor++
                }
            }

            majorCount := 0
            minorCount := 0

            for i := 0; i < len(s); i++ {
                ch := s[i]

                if ch != majorChar && ch != minorChar {
                    continue
                }

                if ch == majorChar {
                    majorCount++
                } else {
                    minorCount++
                    remainingMinor--
                }

                if minorCount > 0 {
                    variance := majorCount - minorCount
                    if variance > answer {
                        answer = variance
                    }
                }

                if majorCount < minorCount && remainingMinor > 0 {
                    majorCount = 0
                    minorCount = 0
                }
            }
        }
    }

    return answer
}

The Go implementation closely mirrors the Python version.

Since Go strings are byte slices for ASCII characters, we access characters using s[i] directly as byte values. This is efficient because the problem guarantees lowercase English letters only.

Unlike Python, Go does not provide a built in max function for integers, so the comparison is written manually.

No special handling for nil or empty values is required because the constraints guarantee at least one character.

Worked Examples

Example 1

Input:

s = "aababbb"

Consider the pair:

major = 'b'
minor = 'a'

This is the pair that produces the optimal variance.

Iteration Trace

Character major_count minor_count remaining_minor Current Variance Answer
a 0 1 2 invalid 0
a 0 2 1 invalid 0
b 1 2 1 -1 0
a 0 1 0 invalid 0
b 1 1 0 0 0
b 2 1 0 1 1
b 3 1 0 2 2

This scan alone does not produce the optimal answer because the reset timing matters.

Now consider another valid substring during the scan:

"babbb"

Counts:

b = 4
a = 1
variance = 3

The algorithm eventually records the maximum answer as 3.

Example 2

Input:

s = "abcde"

Every character appears exactly once.

For any pair of characters:

  • The best possible substring contains both once.
  • Variance becomes 1 - 1 = 0.

No substring can produce a larger difference.

Final answer:

0

Complexity Analysis

Measure Complexity Explanation
Time O(26² × n) We scan the string once for every ordered character pair
Space O(1) Only a few counters are maintained

The alphabet size is fixed at 26, so 26² is effectively constant. This makes the solution linear with respect to the input length in practice.

The algorithm uses constant extra space because it only stores counters and loop variables.

Test Cases

solution = Solution()

assert solution.largestVariance("aababbb") == 3  # example case
assert solution.largestVariance("abcde") == 0  # all unique characters
assert solution.largestVariance("aaaa") == 0  # single distinct character
assert solution.largestVariance("ab") == 0  # equal counts only
assert solution.largestVariance("aab") == 1  # simple positive variance
assert solution.largestVariance("bbbaaa") == 2  # balanced large groups
assert solution.largestVariance("ababab") == 1  # alternating characters
assert solution.largestVariance("baaab") == 2  # optimal substring in middle
assert solution.largestVariance("bbbbba") == 4  # dominant major character
assert solution.largestVariance("aabbbbb") == 4  # strong variance near end
assert solution.largestVariance("abcabcabc") == 1  # repeated balanced pattern
assert solution.largestVariance("aaabbbccc") == 2  # multiple candidate pairs
assert solution.largestVariance("bbaaab") == 2  # reset logic validation
assert solution.largestVariance("abbabbb") == 4  # optimal substring includes one minor

Test Summary

Test Why
"aababbb" Official example with variance 3
"abcde" All unique characters
"aaaa" Only one distinct character
"ab" Minimum valid two-character case
"aab" Simple nonzero variance
"bbbaaa" Balanced grouped characters
"ababab" Alternating sequence
"baaab" Best substring inside string
"bbbbba" Extreme frequency imbalance
"aabbbbb" Large variance near end
"abcabcabc" Repeated balanced structure
"aaabbbccc" Multiple character combinations
"bbaaab" Kadane reset edge case
"abbabbb" Large variance with mixed ordering

Edge Cases

Strings with only one distinct character

Consider:

"aaaaaa"

A naive implementation might incorrectly return the count of a because it appears many times. However, variance requires two characters to exist in the substring.

The implementation handles this correctly because the answer is updated only when minor_count > 0.

Strings where all characters are unique

Consider:

"abcdef"

Every substring containing two characters has equal counts:

1 - 1 = 0

The algorithm still scans all character pairs, but no pair produces a positive variance. The final answer remains 0.

Reset logic when the running balance becomes negative

Consider:

"bbaaab"

A naive Kadane reset may discard useful future substrings too aggressively.

The implementation avoids this issue using remaining_minor. We only reset when another minor character still exists later in the string. This preserves valid substrings that may become optimal near the end of the scan.