Delete Duplicate Emails
Problem
Write a SQL statement to delete all duplicate emails from the Person table, keeping only the row with the smallest id for each duplicate email.
SQL: DELETE p1 FROM Person p1, Person p2 WHERE p1.email = p2.email AND p1.id > p2.id;
Person = [(1,a@b),(2,c@d),(3,a@b)][(1,a@b),(2,c@d)]DELETE p1 FROM Person p1, Person p2
WHERE p1.email = p2.email AND p1.id > p2.id;
Explanation
For every duplicated email we must keep just one row — the one with the smallest id — and delete the others. The trick is a self-join: pair the table with itself and delete a row whenever a "better" copy exists.
We alias the same table twice as p1 and p2. The condition p1.email = p2.email lines up rows that share an email, and p1.id > p2.id means p1 has a larger id than some other row with the same email.
Any row that has such a partner is not the minimum, so DELETE p1 removes it. The row holding the smallest id for each email never has a partner with a smaller id, so it survives.
Example: Person = [(1,a@b),(2,c@d),(3,a@b)]. Row 3 shares a@b with row 1 and has the bigger id, so it gets deleted. Rows 1 and 2 stay, giving [(1,a@b),(2,c@d)].
Conceptually the database groups matching emails (often with a hash on email), so each row is checked against its same-email peers efficiently.