Short Encoding of Words

medium array hash set string trie

Problem

An encoding of a list of words is a string s plus an array of indices so each word is the substring of s up to the next '#'. Find the minimum length of such an encoding string.

Inputwords = ["time", "me", "bell"]

Output10

"time#bell#" works: "me" is a suffix of "time", so it's free. Length = 4 + 1 + 4 + 1 = 10.

words (comma-separated)

def minimum_length_encoding(words):
    s = set(words)
    for w in words:
        for k in range(1, len(w)):
            s.discard(w[k:])
    return sum(len(w) + 1 for w in s)

function minimumLengthEncoding(words) {
  const s = new Set(words);
  for (const w of words) {
    for (let k = 1; k < w.length; k++) s.delete(w.slice(k));
  }
  let total = 0;
  for (const w of s) total += w.length + 1;
  return total;
}

class Solution {
    public int minimumLengthEncoding(String[] words) {
        Set<String> s = new HashSet<>(Arrays.asList(words));
        for (String w : words) {
            for (int k = 1; k < w.length(); k++) s.remove(w.substring(k));
        }
        int total = 0;
        for (String w : s) total += w.length() + 1;
        return total;
    }
}

class Solution {
public:
    int minimumLengthEncoding(vector<string>& words) {
        unordered_set<string> s(words.begin(), words.end());
        for (auto& w : words) {
            for (int k = 1; k < (int)w.size(); k++) s.erase(w.substr(k));
        }
        int total = 0;
        for (auto& w : s) total += w.size() + 1;
        return total;
    }
};

Explanation

The key insight is that a shorter word is free if it is a suffix (the ending) of a longer word — both share the same spot in the encoded string. So our job is simply to throw away every word that is a suffix of another and add up what is left.

We start by putting all words in a set. Then for each word w, we look at every one of its suffixes w[1:], w[2:], … and discard them from the set. If a suffix happens to be one of our words, it disappears.

Whatever survives in the set are the root words that actually need to be written out. Each costs its own length plus one for the '#' separator, so the answer is sum(len(w) + 1).

Example: ["time", "me", "bell"]. The suffixes of time include me, so me is discarded. Left with time and bell: (4+1) + (4+1) = 10.

This works because the #-separated encoding lets any word reuse the tail of a longer one, and a set makes the "is this a word?" check instant.

Time: O(Σ|wᵢ|²) Space: O(Σ|wᵢ|)