LeetCode 196 - Delete Duplicate Emails

Difficulty: 🟢 Easy
Topics: Database

Solution

Problem Understanding

This problem provides a database table named Person with two columns:

id      int
email   varchar

The id column is unique because it is the primary key. The email column may contain duplicate values, meaning multiple rows can share the same email address.

The task is to delete duplicate email rows while keeping exactly one row for each unique email address. The row that remains must be the one with the smallest id.

In other words, for every group of rows that share the same email:

Keep the row whose id is the minimum
Delete all other rows in that group

The final ordering of the rows does not matter.

The important detail is that the problem specifically asks for a DELETE statement, not a SELECT query. This means the table itself must be modified.

For the example:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
| 3  | [email protected] |
+----+------------------+

The email [email protected] appears twice. Since row 1 has the smaller id, row 3 must be deleted.

The resulting table becomes:

+----+------------------+
| id | email            |
+----+------------------+
| 1  | [email protected] |
| 2  | [email protected]  |
+----+------------------+

The constraints are relatively small because this is an SQL database problem, but correctness is more important than raw performance. Still, an inefficient solution could perform unnecessary repeated scans of the table.

Several edge cases are important:

A table with no duplicate emails should remain unchanged.
A table where all emails are identical should keep only the row with the smallest id.
Duplicate emails may appear many times, not just twice.
The rows are not guaranteed to be sorted by id.

A naive implementation might accidentally delete the smallest id instead of preserving it, so careful comparison logic is required.

Approaches

Brute Force Approach

A brute force solution would compare every row against every other row.

For each row:

Scan the entire table again.
Find another row with the same email and a smaller id.
If such a row exists, delete the current row.

This approach works because every duplicate row eventually discovers that another row with the same email and a smaller id already exists.

However, this method repeatedly scans the table for every row, leading to quadratic time complexity. On large datasets, this becomes inefficient because the same comparisons are performed many times.

Optimal Approach

The key observation is that for each email, only the smallest id should survive.

Instead of repeatedly comparing rows pairwise, we can first identify the minimum id for every email group. Any row whose id is not that minimum should be deleted.

In SQL, this is commonly implemented using a self join:

Join the table with itself on matching emails
Identify rows where one row has a larger id
Delete the larger id

This works because if two rows share the same email, the row with the larger id is always the duplicate that should be removed.

Another valid approach uses a subquery with MIN(id), but the self join solution is concise and widely used.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n²)	O(1)	Compares every row against every other row
Optimal	O(n log n) typical database execution	O(1) extra query space	Uses SQL join logic to remove larger duplicate ids

Algorithm Walkthrough

Optimal Self Join Deletion Algorithm

Start by treating the Person table as two logical copies of itself, often called p1 and p2.
Compare rows between these two copies. We are interested only in rows where:

The email addresses are the same
The ids are different

If two rows share the same email, determine which one has the larger id.
The row with the larger id is considered the duplicate because the problem requires keeping the smallest id.
Delete every row that has a matching row with:

The same email
A smaller id

After all such rows are removed, exactly one row per email remains, specifically the one with the minimum id.

This approach is efficient because the database engine can optimize joins internally, avoiding unnecessary repeated scans.

Python Solution

# Write your MySQL query statement below

DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id > p2.id;

The solution uses a self join between two references to the same table.

The aliases p1 and p2 represent two logical views of the Person table. The condition:

p1.email = p2.email

finds rows that share the same email.

The condition:

p1.id > p2.id

ensures that only the row with the larger id is deleted.

Suppose two rows exist:

(1, [email protected])
(3, [email protected])

The comparison identifies that 3 > 1, so row 3 is deleted.

The smallest id never satisfies the condition p1.id > p2.id, so it is preserved automatically.

Go Solution

// Write your MySQL query statement below

DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id > p2.id;

Because this is a database problem, the Go solution is identical to the Python solution. LeetCode database problems do not require language-specific implementations. Instead, the platform expects a valid SQL query.

There are no Go-specific concerns such as slices, pointers, or integer overflow because the execution occurs entirely within the SQL database engine.

Worked Examples

Example 1

Initial table:

id	email
1	[email protected]
2	[email protected]
3	[email protected]

The query compares rows pairwise.

Step 1: Compare Row 1 and Row 3

p1.id	p1.email	p2.id	p2.email	Same Email	p1.id > p2.id
3	[email protected]	1	[email protected]	Yes	Yes

Since both conditions are true, row 3 is deleted.

Step 2: Compare Remaining Rows

p1.id	p1.email	p2.id	p2.email	Same Email	p1.id > p2.id
1	[email protected]	2	[email protected]	No	No

No deletion occurs because the emails differ.

Final table:

id	email
1	[email protected]
2	[email protected]

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n²) worst case	Self joins may compare many row pairs
Space	O(1)	No extra data structures are created

Although the logical comparison resembles quadratic behavior, modern database engines optimize joins heavily using indexes and query planners. In practice, performance is usually much better than a literal nested loop implementation.

The solution uses constant extra space because it modifies rows directly without storing additional collections.

Edge Cases

No Duplicate Emails

If every email in the table is unique, the join condition:

p1.email = p2.email

never finds matching rows with different ids.

As a result, no rows are deleted and the table remains unchanged.

Multiple Duplicates of the Same Email

Consider:

(1, [email protected])
(2, [email protected])
(3, [email protected])
(4, [email protected])

Every row except 1 finds another row with the same email and a smaller id.

Rows 2, 3, and 4 are deleted, leaving only row 1.

This confirms that the algorithm handles more than two duplicates correctly.

Rows Not Sorted by ID

The table may appear as:

(5, [email protected])
(1, [email protected])
(3, [email protected])

The algorithm does not depend on row ordering.

The comparison logic directly checks numeric id values, ensuring that the smallest id survives regardless of physical table order.

Single Row Table

If the table contains only one row, there cannot be duplicates.

The self join conditions never produce a deletable match, so the row remains intact.