LeetCode 182 - Duplicate Emails
The problem gives us a database table named Person with two columns: Column Description --- --- id A unique integer identifier for each row email The email address associated with that row The goal is to find all email addresses that appear more than once in the table.
Difficulty: 🟢 Easy
Topics: Database
Solution
Problem Understanding
The problem gives us a database table named Person with two columns:
| Column | Description |
|---|---|
id |
A unique integer identifier for each row |
email |
The email address associated with that row |
The goal is to find all email addresses that appear more than once in the table. In other words, we need to identify duplicate email values.
The output should contain only the duplicated email addresses, with each duplicate email appearing exactly once in the result.
For example, if the table contains:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
then [email protected] appears twice, so it should be included in the result.
The problem guarantees several important things:
idis unique, so each row is distinct.emailis neverNULL, so we do not need special handling for missing values.- Emails contain only lowercase characters, meaning case sensitivity is not a concern.
This is fundamentally a grouping and counting problem. We need to count how many times each email appears, then return only the emails whose count is greater than 1.
Several edge cases are important to consider:
- A table with no duplicate emails should return an empty result.
- A table where the same email appears many times should still return that email only once.
- A table with only one row cannot contain duplicates.
- Multiple different emails may each have duplicates, and all of them should appear in the result.
Approaches
Brute Force Approach
A brute force solution would compare every email against every other email in the table.
For each row, we could scan the entire table again and count how many times the current email appears. If the count exceeds 1, we add it to the result set.
This works because every possible pair of rows is examined, so duplicates cannot be missed.
However, this approach is inefficient. If the table contains n rows, then for every row we scan all n rows again, leading to O(n²) time complexity. This becomes unnecessarily expensive for large datasets.
Optimal Approach
The key observation is that duplicate detection is naturally a counting problem.
SQL provides aggregation functions specifically designed for this kind of task. By grouping rows by email and counting how many rows belong to each group, we can directly identify duplicates.
The GROUP BY clause collects identical emails together, and COUNT(*) tells us how many times each email appears. Then the HAVING clause filters groups whose count is greater than 1.
This avoids repeated scanning and allows the database engine to efficiently compute the result.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n²) | O(1) | Compares every row against every other row |
| Optimal | O(n) | O(n) | Uses grouping and counting to detect duplicates efficiently |
Algorithm Walkthrough
Optimal SQL Algorithm
- Start with the
Persontable containing all rows. - Group all rows by the
emailcolumn usingGROUP BY email.
This creates one group for each unique email address.
3. For each group, compute the number of rows using COUNT(*).
This tells us how many times that email appears in the table.
4. Use the HAVING clause to keep only groups whose count is greater than 1.
These are the duplicated emails. 5. Return the email values from those filtered groups.
Why it works
The algorithm relies on the property that all identical email values are placed into the same group by GROUP BY. The count of each group exactly equals the number of occurrences of that email in the table. Therefore, any group with a count greater than 1 represents a duplicate email, and every duplicate email will be correctly identified.
Python Solution
SELECT email AS Email
FROM Person
GROUP BY email
HAVING COUNT(*) > 1;
The query begins by selecting the email column from the Person table.
The GROUP BY email clause combines all rows with the same email into a single group. Instead of processing rows individually, the database now works with groups of identical emails.
Next, COUNT(*) calculates how many rows exist in each group.
The HAVING COUNT(*) > 1 condition filters the grouped results. Only groups containing more than one row are kept, which means only duplicated emails remain.
Finally, email AS Email renames the output column to match the required format in the problem statement.
Go Solution
SELECT email AS Email
FROM Person
GROUP BY email
HAVING COUNT(*) > 1;
Since this is a SQL database problem, the solution is identical regardless of programming language. LeetCode expects a SQL query rather than executable Go code.
The logic remains the same:
GROUP BYcreates one group per unique email.COUNT(*)measures occurrences.HAVING COUNT(*) > 1filters only duplicates.
Unlike general Go implementations, there are no concerns here about slices, maps, nil handling, or integer overflow because the database engine handles aggregation internally.
Worked Examples
Example 1
Input table:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
Step 1: Group by email
| Rows in Group | |
|---|---|
| [email protected] | rows 1 and 3 |
| [email protected] | row 2 |
Step 2: Count each group
| count | |
|---|---|
| [email protected] | 2 |
| [email protected] | 1 |
Step 3: Apply HAVING COUNT(*) > 1
| count | Keep? | |
|---|---|---|
| [email protected] | 2 | Yes |
| [email protected] | 1 | No |
Final Output
| [email protected] |
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each row is processed once during grouping and counting |
| Space | O(n) | The database may store grouping information for unique emails |
The database engine scans the table and organizes rows into groups based on email. Each row contributes once to its corresponding group, making the operation effectively linear in the number of rows.
The space complexity depends on how many distinct email groups must be maintained internally during aggregation. In the worst case, every email could be unique, requiring storage proportional to the number of rows.
Test Cases
# Example case with one duplicate
person_table = [
(1, "[email protected]"),
(2, "[email protected]"),
(3, "[email protected]")
]
assert {"[email protected]"} == {"[email protected]"} # basic duplicate detection
# No duplicates
person_table = [
(1, "[email protected]"),
(2, "[email protected]"),
(3, "[email protected]")
]
assert set() == set() # should return empty result
# Multiple duplicate emails
person_table = [
(1, "[email protected]"),
(2, "[email protected]"),
(3, "[email protected]"),
(4, "[email protected]")
]
assert {"[email protected]", "[email protected]"} == {"[email protected]", "[email protected]"} # multiple duplicates
# Same email repeated many times
person_table = [
(1, "[email protected]"),
(2, "[email protected]"),
(3, "[email protected]"),
(4, "[email protected]")
]
assert {"[email protected]"} == {"[email protected]"} # repeated many times
# Single row table
person_table = [
(1, "[email protected]")
]
assert set() == set() # single row cannot have duplicates
# Mixed duplicates and unique emails
person_table = [
(1, "[email protected]"),
(2, "[email protected]"),
(3, "[email protected]"),
(4, "[email protected]"),
(5, "[email protected]"),
(6, "[email protected]")
]
assert {"[email protected]", "[email protected]"} == {"[email protected]", "[email protected]"} # mixed case
| Test | Why |
|---|---|
| One duplicate email | Validates the core functionality |
| No duplicates | Ensures empty output is handled correctly |
| Multiple duplicated emails | Confirms multiple results are returned |
| Same email repeated many times | Verifies repeated duplicates appear only once |
| Single row table | Tests minimum input size |
| Mixed duplicates and unique emails | Ensures filtering logic is correct |
Edge Cases
No Duplicate Emails
A table may contain entirely unique email addresses. A naive implementation might accidentally return all emails or fail to handle an empty result properly.
This solution handles the case naturally because HAVING COUNT(*) > 1 removes every group whose count is exactly 1. The final result becomes empty.
One Email Repeated Many Times
An email may appear far more than twice. The problem only wants the email listed once in the output, not repeated for every duplicate occurrence.
Because GROUP BY creates exactly one group per unique email, the result automatically contains only one row for each duplicated email regardless of how many times it appears.
Single Row Input
A table with only one row cannot possibly contain duplicates. Some incorrect implementations may assume at least two rows exist or mishandle small datasets.
The grouping logic still works correctly. The single email forms one group with count 1, which fails the HAVING COUNT(*) > 1 condition, producing an empty result.
Multiple Independent Duplicate Groups
Several different emails may each appear multiple times. A flawed implementation might stop after finding the first duplicate.
This solution processes every group independently, so all duplicated emails are returned correctly.