LeetCode 393 - UTF-8 Validation
The problem asks us to determine whether a sequence of integers represents a valid UTF-8 encoded byte stream. Each integer in the input array represents one byte, meaning only its lowest 8 bits matter.
Difficulty: 🟡 Medium
Topics: Array, Bit Manipulation
Solution
Problem Understanding
The problem asks us to determine whether a sequence of integers represents a valid UTF-8 encoded byte stream.
Each integer in the input array represents one byte, meaning only its lowest 8 bits matter. UTF-8 is a variable-length encoding system where a character may occupy between 1 and 4 bytes. The structure of the leading bits determines how many bytes belong to a character.
A valid UTF-8 encoding follows these rules:
- A 1-byte character starts with
0 - A 2-byte character starts with
110 - A 3-byte character starts with
1110 - A 4-byte character starts with
11110 - Every continuation byte must begin with
10
The task is to scan the array and verify that every byte sequence matches one of these legal UTF-8 patterns.
For example:
11000101 10000010is valid because the first byte indicates a 2-byte character and the second byte correctly starts with1011101011 10001100 00000100is invalid because the third byte should also begin with10, but it begins with00
The constraints tell us that the input length can be up to 2 * 10^4, which is large enough that we should aim for a linear-time solution. Fortunately, UTF-8 validation only requires examining bit patterns locally, so we can process the data in a single pass.
Several edge cases are important:
- A multi-byte character may end prematurely before enough continuation bytes appear
- A continuation byte may appear without a valid leading byte
- A leading byte may indicate an invalid length, such as five leading
1bits - A continuation byte may not begin with
10 - Single-byte characters and multi-byte characters may appear mixed together
The problem guarantees that every integer is already within the valid byte range 0 to 255, so we do not need to worry about masking larger integers.
Approaches
Brute Force Approach
A brute-force strategy would explicitly convert every integer into an 8-character binary string and then validate the UTF-8 structure using string operations.
For each byte:
- Convert it into binary form
- Count the leading
1s - Determine whether it starts a valid UTF-8 character
- Verify the required continuation bytes
This works because UTF-8 validity depends entirely on bit prefixes. By checking those prefixes directly as strings, we can determine correctness.
However, this approach is inefficient because:
- Converting integers to binary strings repeatedly creates extra overhead
- String slicing and comparisons are slower than direct bit operations
- Additional memory is allocated for every conversion
Although the asymptotic complexity is still linear, the implementation is less elegant and significantly less efficient than using bitwise operations directly.
Optimal Approach
The key insight is that UTF-8 validation only depends on the highest few bits of each byte. Bit manipulation allows us to inspect those bits directly without converting anything to strings.
We process the array from left to right while tracking how many continuation bytes are still expected.
When we encounter a new leading byte:
- We determine whether it represents a 1-byte, 2-byte, 3-byte, or 4-byte character
- We record how many continuation bytes must follow
When processing continuation bytes:
- We verify that the byte starts with
10
This works efficiently because every byte is processed exactly once and each validation step uses constant-time bit operations.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n) | O(n) | Uses binary string conversion and string processing |
| Optimal | O(n) | O(1) | Uses direct bit manipulation in a single pass |
Algorithm Walkthrough
- Initialize a variable called
remaining_bytesto0.
This variable tracks how many continuation bytes are still required for the current UTF-8 character. 2. Iterate through each byte in the input array.
We process the bytes sequentially because UTF-8 characters are encoded in order.
3. If remaining_bytes == 0, the current byte must begin a new character.
We inspect the leading bits to determine the character length. 4. Check the byte pattern using bit masks.
The patterns are:
0xxxxxxx→ single-byte character110xxxxx→ expect 1 continuation byte1110xxxx→ expect 2 continuation bytes11110xxx→ expect 3 continuation bytes
Any other leading pattern is invalid.
5. If the byte starts a multi-byte character, set remaining_bytes accordingly.
For example:
110xxxxxmeans one continuation byte is required1110xxxxmeans two continuation bytes are required
- If
remaining_bytes > 0, the current byte must be a continuation byte.
A continuation byte must match:
$10xxxxxx$
We verify this using bit masks.
7. If a continuation byte is valid, decrement remaining_bytes.
This means one expected continuation byte has been successfully consumed.
8. After processing all bytes, ensure remaining_bytes == 0.
If continuation bytes are still expected, the encoding ended prematurely and is invalid.
Why it works
The algorithm maintains a simple invariant:
remaining_bytesalways represents the exact number of continuation bytes still required to complete the current UTF-8 character.
Whenever we encounter a leading byte, we determine the required continuation count. Every continuation byte reduces that count by one. If any byte violates the UTF-8 format rules, we immediately return False.
Because every UTF-8 character must strictly follow these bit-prefix rules, successfully processing the entire array with remaining_bytes == 0 guarantees the encoding is valid.
Python Solution
from typing import List
class Solution:
def validUtf8(self, data: List[int]) -> bool:
remaining_bytes = 0
for byte in data:
if remaining_bytes == 0:
# 1-byte character: 0xxxxxxx
if (byte >> 7) == 0:
continue
# 2-byte character: 110xxxxx
elif (byte >> 5) == 0b110:
remaining_bytes = 1
# 3-byte character: 1110xxxx
elif (byte >> 4) == 0b1110:
remaining_bytes = 2
# 4-byte character: 11110xxx
elif (byte >> 3) == 0b11110:
remaining_bytes = 3
else:
return False
else:
# Continuation byte must start with 10
if (byte >> 6) != 0b10:
return False
remaining_bytes -= 1
return remaining_bytes == 0
The implementation directly mirrors the algorithm described earlier.
The variable remaining_bytes tracks how many continuation bytes are still needed.
When remaining_bytes == 0, the current byte must start a new UTF-8 character. The code uses right shifts to inspect the leading bits:
byte >> 7 == 0checks for a single-byte characterbyte >> 5 == 0b110checks for a 2-byte sequencebyte >> 4 == 0b1110checks for a 3-byte sequencebyte >> 3 == 0b11110checks for a 4-byte sequence
If none of these patterns match, the byte is invalid.
When continuation bytes are expected, the implementation checks:
$(byte >> 6) = 0b10$
This confirms the byte begins with the required 10 prefix.
Finally, the algorithm ensures no unfinished multi-byte sequence remains after processing the entire array.
Go Solution
func validUtf8(data []int) bool {
remainingBytes := 0
for _, b := range data {
if remainingBytes == 0 {
// 1-byte character: 0xxxxxxx
if (b >> 7) == 0 {
continue
}
// 2-byte character: 110xxxxx
if (b >> 5) == 0b110 {
remainingBytes = 1
// 3-byte character: 1110xxxx
} else if (b >> 4) == 0b1110 {
remainingBytes = 2
// 4-byte character: 11110xxx
} else if (b >> 3) == 0b11110 {
remainingBytes = 3
} else {
return false
}
} else {
// Continuation byte must start with 10
if (b >> 6) != 0b10 {
return false
}
remainingBytes--
}
}
return remainingBytes == 0
}
The Go implementation is almost identical to the Python version because the logic is fundamentally simple and language-independent.
One small difference is that Go uses explicit integer types and := variable initialization syntax. Go also uses slices instead of Python lists, but iteration works similarly using range.
No special handling for integer overflow is required because all input values are guaranteed to be between 0 and 255.
Worked Examples
Example 1
Input:
data = [197, 130, 1]
Binary representation:
197 = 11000101
130 = 10000010
1 = 00000001
| Index | Byte | Binary | remaining_bytes Before | Action | remaining_bytes After |
|---|---|---|---|---|---|
| 0 | 197 | 11000101 | 0 | Start of 2-byte character | 1 |
| 1 | 130 | 10000010 | 1 | Valid continuation byte | 0 |
| 2 | 1 | 00000001 | 0 | Valid 1-byte character | 0 |
At the end, remaining_bytes == 0, so the encoding is valid.
Result:
true
Example 2
Input:
data = [235, 140, 4]
Binary representation:
235 = 11101011
140 = 10001100
4 = 00000100
| Index | Byte | Binary | remaining_bytes Before | Action | remaining_bytes After |
|---|---|---|---|---|---|
| 0 | 235 | 11101011 | 0 | Start of 3-byte character | 2 |
| 1 | 140 | 10001100 | 2 | Valid continuation byte | 1 |
| 2 | 4 | 00000100 | 1 | Invalid continuation byte | - |
The last byte does not begin with 10, so the encoding is invalid.
Result:
false
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each byte is processed exactly once |
| Space | O(1) | Only a few integer variables are used |
The algorithm performs a constant amount of work per byte. No auxiliary data structures proportional to the input size are created, so the space usage remains constant.
Test Cases
sol = Solution()
assert sol.validUtf8([197, 130, 1]) == True # valid 2-byte character followed by 1-byte
assert sol.validUtf8([235, 140, 4]) == False # invalid continuation byte
assert sol.validUtf8([0]) == True # single valid ASCII byte
assert sol.validUtf8([255]) == False # invalid leading byte
assert sol.validUtf8([240, 162, 138, 147]) == True
# valid 4-byte character
assert sol.validUtf8([240, 162, 138]) == False
# incomplete 4-byte character
assert sol.validUtf8([145]) == False
# continuation byte without leading byte
assert sol.validUtf8([226, 130, 172]) == True
# valid 3-byte UTF-8 character
assert sol.validUtf8([226, 130, 1]) == False
# invalid final continuation byte
assert sol.validUtf8([65, 66, 67]) == True
# multiple ASCII characters
assert sol.validUtf8([248, 130, 130, 130]) == False
# invalid 5-byte prefix
assert sol.validUtf8([230, 136, 145]) == True
# another valid 3-byte sequence
assert sol.validUtf8([194, 155, 194, 155]) == True
# multiple valid 2-byte sequences
| Test | Why |
|---|---|
[197,130,1] |
Standard valid mixed encoding |
[235,140,4] |
Invalid continuation byte |
[0] |
Smallest valid ASCII byte |
[255] |
Invalid leading pattern |
[240,162,138,147] |
Valid 4-byte character |
[240,162,138] |
Incomplete multi-byte sequence |
[145] |
Standalone continuation byte |
[226,130,172] |
Valid 3-byte character |
[226,130,1] |
Incorrect continuation termination |
[65,66,67] |
Multiple ASCII bytes |
[248,130,130,130] |
Invalid UTF-8 prefix length |
[230,136,145] |
Additional valid multi-byte sequence |
[194,155,194,155] |
Multiple consecutive UTF-8 characters |
Edge Cases
Continuation Byte Appearing First
A byte beginning with 10 cannot start a new UTF-8 character. This is a common bug source because continuation bytes are only legal after a valid leading byte.
For example:
[145]
Binary:
10010001
The implementation correctly rejects this because when remaining_bytes == 0, the byte does not match any valid leading-byte pattern.
Incomplete Multi-Byte Character
A multi-byte character may begin correctly but end before all continuation bytes appear.
For example:
[240, 162, 138]
The first byte indicates a 4-byte character, meaning three continuation bytes are required. Only two continuation bytes follow.
The implementation handles this by checking whether remaining_bytes == 0 after processing the entire array.
Invalid Continuation Byte
A continuation byte must always start with 10.
For example:
[226, 130, 1]
The last byte begins with 00 instead of 10.
This edge case is important because a naive implementation might only count bytes without validating the continuation-byte prefix. The algorithm explicitly checks (byte >> 6) == 0b10, ensuring correctness.