LeetCode 2882 - Drop Duplicate Rows
The problem asks us to process a DataFrame representing customers and remove rows that have duplicate email addresses. Specifically, if multiple rows share the same email value, only the first occurrence should be kept, and all subsequent duplicates should be discarded.
Difficulty: 🟢 Easy
Topics: —
Solution
Problem Understanding
The problem asks us to process a DataFrame representing customers and remove rows that have duplicate email addresses. Specifically, if multiple rows share the same email value, only the first occurrence should be kept, and all subsequent duplicates should be discarded.
The input is a structured table with columns customer_id, name, and email. Each row represents a unique customer, but the email column may contain duplicates. The expected output is the same table with duplicate emails removed, preserving the order of the first appearance.
Constraints are fairly straightforward: the DataFrame can have any number of rows, and emails can be duplicated multiple times. Important edge cases include situations where every email is unique, all emails are the same, or the DataFrame is empty. The problem guarantees that email is a valid field for comparison and that we only need to remove duplicates based on this single column.
Approaches
A brute-force approach would iterate over each row and check if its email has been seen before. This can be implemented using nested loops or by maintaining a list of already-seen emails and checking membership on each iteration. While this works, the repeated membership checks in a list have O(n) complexity per check, leading to an overall O(n²) time complexity. This is inefficient for large datasets.
The optimal approach leverages the fact that we only need to track whether an email has been seen. Using a hash set to store encountered emails allows O(1) average lookup and insertion time per row. By iterating through the DataFrame once and adding rows with unseen emails to the result, we achieve linear time complexity and minimal extra space.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n²) | O(n) | Iterates over each row and checks all previously seen emails in a list |
| Optimal | O(n) | O(n) | Uses a hash set to track seen emails and preserves first occurrence |
Algorithm Walkthrough
- Initialize an empty hash set called
seen_emailsto keep track of emails that have already been encountered. - Initialize an empty list called
filtered_rowsto store rows that will be part of the output. - Iterate through each row in the DataFrame.
- For each row, check if the
emailis present inseen_emails. - If the email is not in
seen_emails, add it toseen_emailsand append the row tofiltered_rows. - If the email is already in
seen_emails, skip the row to avoid duplicates. - After completing the iteration, return a DataFrame created from
filtered_rows, preserving the original column order.
Why it works: By using a hash set to track seen emails, we ensure that only the first occurrence of each email is included. The order of rows is preserved because we append rows to the result list in the original iteration order, and duplicates are skipped immediately when detected.
Python Solution
import pandas as pd
class Solution:
def dropDuplicateRows(self, customers: pd.DataFrame) -> pd.DataFrame:
seen_emails = set()
filtered_rows = []
for _, row in customers.iterrows():
email = row['email']
if email not in seen_emails:
seen_emails.add(email)
filtered_rows.append(row)
return pd.DataFrame(filtered_rows, columns=customers.columns)
This implementation iterates over each row using iterrows(). For each row, it checks if the email has been seen. If not, the email is added to the set and the row to the list. Finally, a new DataFrame is created from the filtered rows while preserving column order.
Go Solution
package main
import (
"fmt"
)
type Customer struct {
CustomerID int
Name string
Email string
}
func DropDuplicateRows(customers []Customer) []Customer {
seenEmails := make(map[string]bool)
filtered := make([]Customer, 0, len(customers))
for _, customer := range customers {
if !seenEmails[customer.Email] {
seenEmails[customer.Email] = true
filtered = append(filtered, customer)
}
}
return filtered
}
In Go, we use a map to track seen emails. The slice filtered stores the rows to be returned. Maps in Go provide O(1) average lookup, similar to a Python set. Slice capacity is pre-allocated for efficiency, though not strictly necessary.
Worked Examples
Example 1:
Input:
| customer_id | name | |
|---|---|---|
| 1 | Ella | [email protected] |
| 2 | David | [email protected] |
| 3 | Zachary | [email protected] |
| 4 | Alice | [email protected] |
| 5 | Finn | [email protected] |
| 6 | Violet | [email protected] |
Step-by-step state of seen_emails and filtered_rows:
| Iteration | Current Email | Action | seen_emails | filtered_rows |
|---|---|---|---|---|
| 1 | [email protected] | Add | {[email protected]} | [row 1] |
| 2 | [email protected] | Add | {[email protected], [email protected]} | [row 1, row 2] |
| 3 | [email protected] | Add | {[email protected], [email protected], [email protected]} | [row 1, row 2, row 3] |
| 4 | [email protected] | Add | {..., [email protected]} | [row 1, row 2, row 3, row 4] |
| 5 | [email protected] | Skip | unchanged | [row 1, row 2, row 3, row 4] |
| 6 | [email protected] | Add | {..., [email protected]} | [row 1, row 2, row 3, row 4, row 6] |
Output:
| customer_id | name | |
|---|---|---|
| 1 | Ella | [email protected] |
| 2 | David | [email protected] |
| 3 | Zachary | [email protected] |
| 4 | Alice | [email protected] |
| 6 | Violet | [email protected] |
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each row is visited once, and hash set operations are O(1) on average |
| Space | O(n) | Storing seen emails and filtered rows requires additional memory proportional to the number of unique emails |
The time complexity is linear because every row is checked exactly once, and the space complexity scales with the number of unique emails rather than the total number of rows if duplicates exist.
Test Cases
import pandas as pd
# Provided example
df1 = pd.DataFrame([
[1, "Ella", "[email protected]"],
[2, "David", "[email protected]"],
[3, "Zachary", "[email protected]"],
[4, "Alice", "[email protected]"],
[5, "Finn", "[email protected]"],
[6, "Violet", "[email protected]"]
], columns=["customer_id", "name", "email"])
solution = Solution()
result1 = solution.dropDuplicateRows(df1)
assert len(result1) == 5 # basic removal of duplicates
assert result1.iloc[3]['name'] == "Alice" # first occurrence kept
# All unique emails
df2 = pd.DataFrame([
[1, "A", "[email protected]"],
[2, "B", "[email protected]"]
], columns=["customer_id", "name", "email"])
assert len(solution.dropDuplicateRows(df2)) == 2
# All duplicates
df3 = pd.DataFrame([
[1, "A", "[email protected]"],
[2, "B", "[email protected]"],
[3, "C", "[email protected]"]
], columns=["customer_id", "name", "email"])
assert len(solution.dropDuplicateRows(df3)) == 1
# Empty DataFrame
df4 = pd.DataFrame(columns=["customer_id", "name", "email"])
assert solution.dropDuplicateRows(df4).empty
| Test | Why |
|---|---|
| Provided example | Checks normal duplicate removal |
| All unique emails | Verifies no rows are removed when no duplicates exist |
| All duplicates | Verifies only the first row is kept when all are duplicates |
| Empty DataFrame | Ensures function handles empty input gracefully |
Edge Cases
One important edge case is when the DataFrame is empty. A naive iteration without checking could cause errors or return None, but our implementation correctly returns an empty DataFrame.
Another edge case is when all emails are identical. The algorithm must only keep the first row and discard all others. Using a hash set ensures this occurs efficiently without repeatedly scanning previous rows.
A third edge case is when the first occurrence of a duplicate email is not at the top. This tests whether the algorithm preserves the original order and does not remove the first occurrence mistakenly. Our approach guarantees this because it only skips rows whose