Delete Duplicate Emails

easy hash map database deduplication

Problem

Write a SQL statement to delete all duplicate emails from the Person table, keeping only the row with the smallest id for each duplicate email.

SQL: DELETE p1 FROM Person p1, Person p2 WHERE p1.email = p2.email AND p1.id > p2.id;

InputPerson = [(1,a@b),(2,c@d),(3,a@b)]

Output[(1,a@b),(2,c@d)]

Person (id|email; …)

DELETE p1 FROM Person p1, Person p2
WHERE p1.email = p2.email AND p1.id > p2.id;

Explanation

For every duplicated email we must keep just one row — the one with the smallest id — and delete the others. The trick is a self-join: pair the table with itself and delete a row whenever a "better" copy exists.

We alias the same table twice as p1 and p2. The condition p1.email = p2.email lines up rows that share an email, and p1.id > p2.id means p1 has a larger id than some other row with the same email.

Any row that has such a partner is not the minimum, so DELETE p1 removes it. The row holding the smallest id for each email never has a partner with a smaller id, so it survives.

Example: Person = [(1,a@b),(2,c@d),(3,a@b)]. Row 3 shares a@b with row 1 and has the bigger id, so it gets deleted. Rows 1 and 2 stay, giving [(1,a@b),(2,c@d)].

Conceptually the database groups matching emails (often with a hash on email), so each row is checked against its same-email peers efficiently.

Time: O(n²) (self-join; O(n log n) with an index on email) Space: O(n)