UTF-8 Validation

medium array bit manipulation

Problem

Given an integer array data where each element represents one byte (only the lowest 8 bits matter), return true if it is a valid UTF-8 encoding. UTF-8 rules: a leading byte is 0xxxxxxx (1-byte), 110xxxxx (2-byte), 1110xxxx (3-byte), or 11110xxx (4-byte); every continuation byte must be 10xxxxxx.

Inputdata = [197, 130, 1]

Outputtrue

197 = 11000101 (start of 2-byte), 130 = 10000010 (continuation), 1 = 00000001 (1-byte).

data (comma-separated bytes)

def valid_utf8(data):
    cont = 0
    for b in data:
        if cont == 0:
            if   (b >> 7) == 0:    cont = 0
            elif (b >> 5) == 0b110:  cont = 1
            elif (b >> 4) == 0b1110: cont = 2
            elif (b >> 3) == 0b11110:cont = 3
            else: return False
        else:
            if (b >> 6) != 0b10: return False
            cont -= 1
    return cont == 0

function validUtf8(data) {
  let cont = 0;
  for (const b of data) {
    if (cont === 0) {
      if      ((b >> 7) === 0)      cont = 0;
      else if ((b >> 5) === 0b110)   cont = 1;
      else if ((b >> 4) === 0b1110)  cont = 2;
      else if ((b >> 3) === 0b11110) cont = 3;
      else return false;
    } else {
      if ((b >> 6) !== 0b10) return false;
      cont--;
    }
  }
  return cont === 0;
}

class Solution {
    public boolean validUtf8(int[] data) {
        int cont = 0;
        for (int b : data) {
            if (cont == 0) {
                if      ((b >> 7) == 0)       cont = 0;
                else if ((b >> 5) == 0b110)   cont = 1;
                else if ((b >> 4) == 0b1110)  cont = 2;
                else if ((b >> 3) == 0b11110) cont = 3;
                else return false;
            } else {
                if ((b >> 6) != 0b10) return false;
                cont--;
            }
        }
        return cont == 0;
    }
}

bool validUtf8(vector<int>& data) {
    int cont = 0;
    for (int b : data) {
        if (cont == 0) {
            if      ((b >> 7) == 0)       cont = 0;
            else if ((b >> 5) == 0b110)   cont = 1;
            else if ((b >> 4) == 0b1110)  cont = 2;
            else if ((b >> 3) == 0b11110) cont = 3;
            else return false;
        } else {
            if ((b >> 6) != 0b10) return false;
            cont--;
        }
    }
    return cont == 0;
}

Explanation

UTF-8 encodes a character in 1 to 4 bytes, and the rules are all about the top bits of each byte. The clever part of this solution is that you never need to decode anything — you just check the high bits and keep a tiny counter of how many bytes you still owe.

The counter is cont, the number of continuation bytes still expected. When cont == 0 we are looking at a leading byte, so we peek at its top bits: 0xxxxxxx means a 1-byte character (cont stays 0), 110… means 2 bytes so we owe 1 more, 1110… owes 2, and 11110… owes 3. Anything else is illegal.

When cont > 0 we must be on a continuation byte, which has to start with 10. The check (b >> 6) != 0b10 rejects any byte that doesn't, and otherwise we do cont -= 1 to cross one expected byte off the list.

At the very end we return cont == 0. If the counter isn't zero, a multi-byte character was promised but never finished, so the input is invalid.

Example: [197, 130, 1]. 197 = 11000101 starts with 110, so cont = 1. 130 = 10000010 starts with 10, a valid continuation, so cont = 0. 1 = 00000001 starts with 0, a clean 1-byte character. We finish with cont == 0, so the answer is true.

Time: O(n) Space: O(1)