LeetCode 393 - UTF-8 Validation

Difficulty: 🟡 Medium
Topics: Array, Bit Manipulation

Solution

Problem Understanding

The problem asks us to determine whether a sequence of integers represents a valid UTF-8 encoded byte stream.

Each integer in the input array represents one byte, meaning only its lowest 8 bits matter. UTF-8 is a variable-length encoding system where a character may occupy between 1 and 4 bytes. The structure of the leading bits determines how many bytes belong to a character.

A valid UTF-8 encoding follows these rules:

A 1-byte character starts with 0
A 2-byte character starts with 110
A 3-byte character starts with 1110
A 4-byte character starts with 11110
Every continuation byte must begin with 10

The task is to scan the array and verify that every byte sequence matches one of these legal UTF-8 patterns.

For example:

11000101 10000010 is valid because the first byte indicates a 2-byte character and the second byte correctly starts with 10
11101011 10001100 00000100 is invalid because the third byte should also begin with 10, but it begins with 00

The constraints tell us that the input length can be up to 2 * 10^4, which is large enough that we should aim for a linear-time solution. Fortunately, UTF-8 validation only requires examining bit patterns locally, so we can process the data in a single pass.

Several edge cases are important:

A multi-byte character may end prematurely before enough continuation bytes appear
A continuation byte may appear without a valid leading byte
A leading byte may indicate an invalid length, such as five leading 1 bits
A continuation byte may not begin with 10
Single-byte characters and multi-byte characters may appear mixed together

The problem guarantees that every integer is already within the valid byte range 0 to 255, so we do not need to worry about masking larger integers.

Approaches

Brute Force Approach

A brute-force strategy would explicitly convert every integer into an 8-character binary string and then validate the UTF-8 structure using string operations.

For each byte:

Convert it into binary form
Count the leading 1s
Determine whether it starts a valid UTF-8 character
Verify the required continuation bytes

This works because UTF-8 validity depends entirely on bit prefixes. By checking those prefixes directly as strings, we can determine correctness.

However, this approach is inefficient because:

Converting integers to binary strings repeatedly creates extra overhead
String slicing and comparisons are slower than direct bit operations
Additional memory is allocated for every conversion

Although the asymptotic complexity is still linear, the implementation is less elegant and significantly less efficient than using bitwise operations directly.

Optimal Approach

The key insight is that UTF-8 validation only depends on the highest few bits of each byte. Bit manipulation allows us to inspect those bits directly without converting anything to strings.

We process the array from left to right while tracking how many continuation bytes are still expected.

When we encounter a new leading byte:

We determine whether it represents a 1-byte, 2-byte, 3-byte, or 4-byte character
We record how many continuation bytes must follow

When processing continuation bytes:

We verify that the byte starts with 10

This works efficiently because every byte is processed exactly once and each validation step uses constant-time bit operations.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n)	O(n)	Uses binary string conversion and string processing
Optimal	O(n)	O(1)	Uses direct bit manipulation in a single pass

Algorithm Walkthrough

Initialize a variable called remaining_bytes to 0.

This variable tracks how many continuation bytes are still required for the current UTF-8 character. 2. Iterate through each byte in the input array.

We process the bytes sequentially because UTF-8 characters are encoded in order. 3. If remaining_bytes == 0, the current byte must begin a new character.

We inspect the leading bits to determine the character length. 4. Check the byte pattern using bit masks.

The patterns are:

0xxxxxxx → single-byte character
110xxxxx → expect 1 continuation byte
1110xxxx → expect 2 continuation bytes
11110xxx → expect 3 continuation bytes

Any other leading pattern is invalid. 5. If the byte starts a multi-byte character, set remaining_bytes accordingly.

For example:

110xxxxx means one continuation byte is required
1110xxxx means two continuation bytes are required

If remaining_bytes > 0, the current byte must be a continuation byte.

A continuation byte must match:

$10xxxxxx$

We verify this using bit masks. 7. If a continuation byte is valid, decrement remaining_bytes.

This means one expected continuation byte has been successfully consumed. 8. After processing all bytes, ensure remaining_bytes == 0.

If continuation bytes are still expected, the encoding ended prematurely and is invalid.

Why it works

The algorithm maintains a simple invariant:

remaining_bytes always represents the exact number of continuation bytes still required to complete the current UTF-8 character.

Whenever we encounter a leading byte, we determine the required continuation count. Every continuation byte reduces that count by one. If any byte violates the UTF-8 format rules, we immediately return False.

Because every UTF-8 character must strictly follow these bit-prefix rules, successfully processing the entire array with remaining_bytes == 0 guarantees the encoding is valid.

Python Solution

from typing import List

class Solution:
    def validUtf8(self, data: List[int]) -> bool:
        remaining_bytes = 0

        for byte in data:
            if remaining_bytes == 0:
                # 1-byte character: 0xxxxxxx
                if (byte >> 7) == 0:
                    continue

                # 2-byte character: 110xxxxx
                elif (byte >> 5) == 0b110:
                    remaining_bytes = 1

                # 3-byte character: 1110xxxx
                elif (byte >> 4) == 0b1110:
                    remaining_bytes = 2

                # 4-byte character: 11110xxx
                elif (byte >> 3) == 0b11110:
                    remaining_bytes = 3

                else:
                    return False

            else:
                # Continuation byte must start with 10
                if (byte >> 6) != 0b10:
                    return False

                remaining_bytes -= 1

        return remaining_bytes == 0

The implementation directly mirrors the algorithm described earlier.

The variable remaining_bytes tracks how many continuation bytes are still needed.

When remaining_bytes == 0, the current byte must start a new UTF-8 character. The code uses right shifts to inspect the leading bits:

byte >> 7 == 0 checks for a single-byte character
byte >> 5 == 0b110 checks for a 2-byte sequence
byte >> 4 == 0b1110 checks for a 3-byte sequence
byte >> 3 == 0b11110 checks for a 4-byte sequence

If none of these patterns match, the byte is invalid.

When continuation bytes are expected, the implementation checks:

$(byte >> 6) = 0b10$

This confirms the byte begins with the required 10 prefix.

Finally, the algorithm ensures no unfinished multi-byte sequence remains after processing the entire array.

Go Solution

func validUtf8(data []int) bool {
    remainingBytes := 0

    for _, b := range data {
        if remainingBytes == 0 {
            // 1-byte character: 0xxxxxxx
            if (b >> 7) == 0 {
                continue
            }

            // 2-byte character: 110xxxxx
            if (b >> 5) == 0b110 {
                remainingBytes = 1

            // 3-byte character: 1110xxxx
            } else if (b >> 4) == 0b1110 {
                remainingBytes = 2

            // 4-byte character: 11110xxx
            } else if (b >> 3) == 0b11110 {
                remainingBytes = 3

            } else {
                return false
            }

        } else {
            // Continuation byte must start with 10
            if (b >> 6) != 0b10 {
                return false
            }

            remainingBytes--
        }
    }

    return remainingBytes == 0
}

The Go implementation is almost identical to the Python version because the logic is fundamentally simple and language-independent.

One small difference is that Go uses explicit integer types and := variable initialization syntax. Go also uses slices instead of Python lists, but iteration works similarly using range.

No special handling for integer overflow is required because all input values are guaranteed to be between 0 and 255.

Worked Examples

Example 1

Input:

data = [197, 130, 1]

Binary representation:

197 = 11000101
130 = 10000010
1   = 00000001

Index	Byte	Binary	remaining_bytes Before	Action	remaining_bytes After
0	197	11000101	0	Start of 2-byte character	1
1	130	10000010	1	Valid continuation byte	0
2	1	00000001	0	Valid 1-byte character	0

At the end, remaining_bytes == 0, so the encoding is valid.

Result:

true

Example 2

Input:

data = [235, 140, 4]

Binary representation:

235 = 11101011
140 = 10001100
4   = 00000100

Index	Byte	Binary	remaining_bytes Before	Action	remaining_bytes After
0	235	11101011	0	Start of 3-byte character	2
1	140	10001100	2	Valid continuation byte	1
2	4	00000100	1	Invalid continuation byte	-

The last byte does not begin with 10, so the encoding is invalid.

Result:

false

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n)	Each byte is processed exactly once
Space	O(1)	Only a few integer variables are used

The algorithm performs a constant amount of work per byte. No auxiliary data structures proportional to the input size are created, so the space usage remains constant.

Test Cases

sol = Solution()

assert sol.validUtf8([197, 130, 1]) == True   # valid 2-byte character followed by 1-byte
assert sol.validUtf8([235, 140, 4]) == False  # invalid continuation byte

assert sol.validUtf8([0]) == True             # single valid ASCII byte
assert sol.validUtf8([255]) == False          # invalid leading byte

assert sol.validUtf8([240, 162, 138, 147]) == True
# valid 4-byte character

assert sol.validUtf8([240, 162, 138]) == False
# incomplete 4-byte character

assert sol.validUtf8([145]) == False
# continuation byte without leading byte

assert sol.validUtf8([226, 130, 172]) == True
# valid 3-byte UTF-8 character

assert sol.validUtf8([226, 130, 1]) == False
# invalid final continuation byte

assert sol.validUtf8([65, 66, 67]) == True
# multiple ASCII characters

assert sol.validUtf8([248, 130, 130, 130]) == False
# invalid 5-byte prefix

assert sol.validUtf8([230, 136, 145]) == True
# another valid 3-byte sequence

assert sol.validUtf8([194, 155, 194, 155]) == True
# multiple valid 2-byte sequences

Test	Why
`[197,130,1]`	Standard valid mixed encoding
`[235,140,4]`	Invalid continuation byte
`[0]`	Smallest valid ASCII byte
`[255]`	Invalid leading pattern
`[240,162,138,147]`	Valid 4-byte character
`[240,162,138]`	Incomplete multi-byte sequence
`[145]`	Standalone continuation byte
`[226,130,172]`	Valid 3-byte character
`[226,130,1]`	Incorrect continuation termination
`[65,66,67]`	Multiple ASCII bytes
`[248,130,130,130]`	Invalid UTF-8 prefix length
`[230,136,145]`	Additional valid multi-byte sequence
`[194,155,194,155]`	Multiple consecutive UTF-8 characters

Edge Cases

Continuation Byte Appearing First

A byte beginning with 10 cannot start a new UTF-8 character. This is a common bug source because continuation bytes are only legal after a valid leading byte.

For example:

[145]

Binary:

10010001

The implementation correctly rejects this because when remaining_bytes == 0, the byte does not match any valid leading-byte pattern.

Incomplete Multi-Byte Character

A multi-byte character may begin correctly but end before all continuation bytes appear.

For example:

[240, 162, 138]

The first byte indicates a 4-byte character, meaning three continuation bytes are required. Only two continuation bytes follow.

The implementation handles this by checking whether remaining_bytes == 0 after processing the entire array.

Invalid Continuation Byte

A continuation byte must always start with 10.

For example:

[226, 130, 1]

The last byte begins with 00 instead of 10.

This edge case is important because a naive implementation might only count bytes without validating the continuation-byte prefix. The algorithm explicitly checks (byte >> 6) == 0b10, ensuring correctness.