UTF-8 Validation
Problem
Given an integer array data where each element represents one byte (only the lowest 8 bits matter), return true if it is a valid UTF-8 encoding. UTF-8 rules: a leading byte is 0xxxxxxx (1-byte), 110xxxxx (2-byte), 1110xxxx (3-byte), or 11110xxx (4-byte); every continuation byte must be 10xxxxxx.
data = [197, 130, 1]truedef valid_utf8(data):
cont = 0
for b in data:
if cont == 0:
if (b >> 7) == 0: cont = 0
elif (b >> 5) == 0b110: cont = 1
elif (b >> 4) == 0b1110: cont = 2
elif (b >> 3) == 0b11110:cont = 3
else: return False
else:
if (b >> 6) != 0b10: return False
cont -= 1
return cont == 0
function validUtf8(data) {
let cont = 0;
for (const b of data) {
if (cont === 0) {
if ((b >> 7) === 0) cont = 0;
else if ((b >> 5) === 0b110) cont = 1;
else if ((b >> 4) === 0b1110) cont = 2;
else if ((b >> 3) === 0b11110) cont = 3;
else return false;
} else {
if ((b >> 6) !== 0b10) return false;
cont--;
}
}
return cont === 0;
}
class Solution {
public boolean validUtf8(int[] data) {
int cont = 0;
for (int b : data) {
if (cont == 0) {
if ((b >> 7) == 0) cont = 0;
else if ((b >> 5) == 0b110) cont = 1;
else if ((b >> 4) == 0b1110) cont = 2;
else if ((b >> 3) == 0b11110) cont = 3;
else return false;
} else {
if ((b >> 6) != 0b10) return false;
cont--;
}
}
return cont == 0;
}
}
bool validUtf8(vector<int>& data) {
int cont = 0;
for (int b : data) {
if (cont == 0) {
if ((b >> 7) == 0) cont = 0;
else if ((b >> 5) == 0b110) cont = 1;
else if ((b >> 4) == 0b1110) cont = 2;
else if ((b >> 3) == 0b11110) cont = 3;
else return false;
} else {
if ((b >> 6) != 0b10) return false;
cont--;
}
}
return cont == 0;
}
Explanation
UTF-8 encodes a character in 1 to 4 bytes, and the rules are all about the top bits of each byte. The clever part of this solution is that you never need to decode anything — you just check the high bits and keep a tiny counter of how many bytes you still owe.
The counter is cont, the number of continuation bytes still expected. When cont == 0 we are looking at a leading byte, so we peek at its top bits: 0xxxxxxx means a 1-byte character (cont stays 0), 110… means 2 bytes so we owe 1 more, 1110… owes 2, and 11110… owes 3. Anything else is illegal.
When cont > 0 we must be on a continuation byte, which has to start with 10. The check (b >> 6) != 0b10 rejects any byte that doesn't, and otherwise we do cont -= 1 to cross one expected byte off the list.
At the very end we return cont == 0. If the counter isn't zero, a multi-byte character was promised but never finished, so the input is invalid.
Example: [197, 130, 1]. 197 = 11000101 starts with 110, so cont = 1. 130 = 10000010 starts with 10, a valid continuation, so cont = 0. 1 = 00000001 starts with 0, a clean 1-byte character. We finish with cont == 0, so the answer is true.