Repeated DNA Sequences
Problem
Given a DNA string s (over {A, C, G, T}), return every 10-letter substring that appears more than once. Order doesn't matter; each repeated substring appears once in the result.
s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTT"["AAAAACCCCC", "CCCCCAAAAA"]def find_repeated_dna_sequences(s):
seen, out = set(), set()
for i in range(len(s) - 9):
sub = s[i:i + 10]
if sub in seen:
out.add(sub)
else:
seen.add(sub)
return list(out)
function findRepeatedDnaSequences(s) {
const seen = new Set(), out = new Set();
for (let i = 0; i + 10 <= s.length; i++) {
const sub = s.slice(i, i + 10);
if (seen.has(sub)) out.add(sub);
else seen.add(sub);
}
return [...out];
}
class Solution {
public List<String> findRepeatedDnaSequences(String s) {
Set<String> seen = new HashSet<>(), out = new HashSet<>();
for (int i = 0; i + 10 <= s.length(); i++) {
String sub = s.substring(i, i + 10);
if (!seen.add(sub)) out.add(sub);
}
return new ArrayList<>(out);
}
}
vector<string> findRepeatedDnaSequences(string s) {
unordered_set<string> seen, out;
for (int i = 0; i + 10 <= (int) s.size(); i++) {
string sub = s.substr(i, 10);
if (!seen.insert(sub).second) out.insert(sub);
}
return { out.begin(), out.end() };
}
Explanation
We need every 10-letter chunk that shows up more than once. The simple, fast idea is to slide a 10-character window across the string and remember each chunk we have already seen in a hash set.
We keep two sets: seen for every substring encountered so far, and out for the ones we have confirmed as repeats. As the window moves one step at a time, we take the current chunk sub.
If sub is already in seen, it is a duplicate, so we add it to out. Otherwise we record it in seen. Using a set for out means even a chunk appearing three or more times is reported only once.
Example: in "AAAAACCCCCAAAAACCCCCCAAAAAGGGTT", the chunk "AAAAACCCCC" appears earlier and then again later, so the second time it lands in out. The same happens for "CCCCCAAAAA", giving the final answer.
Because set lookups are effectively instant, we only pass through the string once, which keeps it efficient.