LeetCode 196 - Delete Duplicate Emails
This problem provides a database table named Person with two columns: The id column is unique because it is the primary key. The email column may contain duplicate values, meaning multiple rows can share the same email address.
Difficulty: 🟢 Easy
Topics: Database
Solution
Problem Understanding
This problem provides a database table named Person with two columns:
id int
email varchar
The id column is unique because it is the primary key. The email column may contain duplicate values, meaning multiple rows can share the same email address.
The task is to delete duplicate email rows while keeping exactly one row for each unique email address. The row that remains must be the one with the smallest id.
In other words, for every group of rows that share the same email:
- Keep the row whose
idis the minimum - Delete all other rows in that group
The final ordering of the rows does not matter.
The important detail is that the problem specifically asks for a DELETE statement, not a SELECT query. This means the table itself must be modified.
For the example:
+----+------------------+
| id | email |
+----+------------------+
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
+----+------------------+
The email [email protected] appears twice. Since row 1 has the smaller id, row 3 must be deleted.
The resulting table becomes:
+----+------------------+
| id | email |
+----+------------------+
| 1 | [email protected] |
| 2 | [email protected] |
+----+------------------+
The constraints are relatively small because this is an SQL database problem, but correctness is more important than raw performance. Still, an inefficient solution could perform unnecessary repeated scans of the table.
Several edge cases are important:
- A table with no duplicate emails should remain unchanged.
- A table where all emails are identical should keep only the row with the smallest
id. - Duplicate emails may appear many times, not just twice.
- The rows are not guaranteed to be sorted by
id.
A naive implementation might accidentally delete the smallest id instead of preserving it, so careful comparison logic is required.
Approaches
Brute Force Approach
A brute force solution would compare every row against every other row.
For each row:
- Scan the entire table again.
- Find another row with the same email and a smaller
id. - If such a row exists, delete the current row.
This approach works because every duplicate row eventually discovers that another row with the same email and a smaller id already exists.
However, this method repeatedly scans the table for every row, leading to quadratic time complexity. On large datasets, this becomes inefficient because the same comparisons are performed many times.
Optimal Approach
The key observation is that for each email, only the smallest id should survive.
Instead of repeatedly comparing rows pairwise, we can first identify the minimum id for every email group. Any row whose id is not that minimum should be deleted.
In SQL, this is commonly implemented using a self join:
- Join the table with itself on matching emails
- Identify rows where one row has a larger
id - Delete the larger
id
This works because if two rows share the same email, the row with the larger id is always the duplicate that should be removed.
Another valid approach uses a subquery with MIN(id), but the self join solution is concise and widely used.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n²) | O(1) | Compares every row against every other row |
| Optimal | O(n log n) typical database execution | O(1) extra query space | Uses SQL join logic to remove larger duplicate ids |
Algorithm Walkthrough
Optimal Self Join Deletion Algorithm
- Start by treating the
Persontable as two logical copies of itself, often calledp1andp2. - Compare rows between these two copies. We are interested only in rows where:
- The email addresses are the same
- The ids are different
- If two rows share the same email, determine which one has the larger
id. - The row with the larger
idis considered the duplicate because the problem requires keeping the smallestid. - Delete every row that has a matching row with:
- The same email
- A smaller
id
- After all such rows are removed, exactly one row per email remains, specifically the one with the minimum
id.
This approach is efficient because the database engine can optimize joins internally, avoiding unnecessary repeated scans.
Python Solution
# Write your MySQL query statement below
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id > p2.id;
The solution uses a self join between two references to the same table.
The aliases p1 and p2 represent two logical views of the Person table. The condition:
p1.email = p2.email
finds rows that share the same email.
The condition:
p1.id > p2.id
ensures that only the row with the larger id is deleted.
Suppose two rows exist:
(1, [email protected])
(3, [email protected])
The comparison identifies that 3 > 1, so row 3 is deleted.
The smallest id never satisfies the condition p1.id > p2.id, so it is preserved automatically.
Go Solution
// Write your MySQL query statement below
DELETE p1
FROM Person p1, Person p2
WHERE p1.email = p2.email
AND p1.id > p2.id;
Because this is a database problem, the Go solution is identical to the Python solution. LeetCode database problems do not require language-specific implementations. Instead, the platform expects a valid SQL query.
There are no Go-specific concerns such as slices, pointers, or integer overflow because the execution occurs entirely within the SQL database engine.
Worked Examples
Example 1
Initial table:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
The query compares rows pairwise.
Step 1: Compare Row 1 and Row 3
| p1.id | p1.email | p2.id | p2.email | Same Email | p1.id > p2.id |
|---|---|---|---|---|---|
| 3 | [email protected] | 1 | [email protected] | Yes | Yes |
Since both conditions are true, row 3 is deleted.
Step 2: Compare Remaining Rows
| p1.id | p1.email | p2.id | p2.email | Same Email | p1.id > p2.id |
|---|---|---|---|---|---|
| 1 | [email protected] | 2 | [email protected] | No | No |
No deletion occurs because the emails differ.
Final table:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n²) worst case | Self joins may compare many row pairs |
| Space | O(1) | No extra data structures are created |
Although the logical comparison resembles quadratic behavior, modern database engines optimize joins heavily using indexes and query planners. In practice, performance is usually much better than a literal nested loop implementation.
The solution uses constant extra space because it modifies rows directly without storing additional collections.
Edge Cases
No Duplicate Emails
If every email in the table is unique, the join condition:
p1.email = p2.email
never finds matching rows with different ids.
As a result, no rows are deleted and the table remains unchanged.
Multiple Duplicates of the Same Email
Consider:
(1, [email protected])
(2, [email protected])
(3, [email protected])
(4, [email protected])
Every row except 1 finds another row with the same email and a smaller id.
Rows 2, 3, and 4 are deleted, leaving only row 1.
This confirms that the algorithm handles more than two duplicates correctly.
Rows Not Sorted by ID
The table may appear as:
(5, [email protected])
(1, [email protected])
(3, [email protected])
The algorithm does not depend on row ordering.
The comparison logic directly checks numeric id values, ensuring that the smallest id survives regardless of physical table order.
Single Row Table
If the table contains only one row, there cannot be duplicates.
The self join conditions never produce a deletable match, so the row remains intact.