LeetCode 2882 - Drop Duplicate Rows

Difficulty: 🟢 Easy
Topics: —

Solution

Problem Understanding

The problem asks us to process a DataFrame representing customers and remove rows that have duplicate email addresses. Specifically, if multiple rows share the same email value, only the first occurrence should be kept, and all subsequent duplicates should be discarded.

The input is a structured table with columns customer_id, name, and email. Each row represents a unique customer, but the email column may contain duplicates. The expected output is the same table with duplicate emails removed, preserving the order of the first appearance.

Constraints are fairly straightforward: the DataFrame can have any number of rows, and emails can be duplicated multiple times. Important edge cases include situations where every email is unique, all emails are the same, or the DataFrame is empty. The problem guarantees that email is a valid field for comparison and that we only need to remove duplicates based on this single column.

Approaches

A brute-force approach would iterate over each row and check if its email has been seen before. This can be implemented using nested loops or by maintaining a list of already-seen emails and checking membership on each iteration. While this works, the repeated membership checks in a list have O(n) complexity per check, leading to an overall O(n²) time complexity. This is inefficient for large datasets.

The optimal approach leverages the fact that we only need to track whether an email has been seen. Using a hash set to store encountered emails allows O(1) average lookup and insertion time per row. By iterating through the DataFrame once and adding rows with unseen emails to the result, we achieve linear time complexity and minimal extra space.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n²)	O(n)	Iterates over each row and checks all previously seen emails in a list
Optimal	O(n)	O(n)	Uses a hash set to track seen emails and preserves first occurrence

Algorithm Walkthrough

Initialize an empty hash set called seen_emails to keep track of emails that have already been encountered.
Initialize an empty list called filtered_rows to store rows that will be part of the output.
Iterate through each row in the DataFrame.
For each row, check if the email is present in seen_emails.
If the email is not in seen_emails, add it to seen_emails and append the row to filtered_rows.
If the email is already in seen_emails, skip the row to avoid duplicates.
After completing the iteration, return a DataFrame created from filtered_rows, preserving the original column order.

Why it works: By using a hash set to track seen emails, we ensure that only the first occurrence of each email is included. The order of rows is preserved because we append rows to the result list in the original iteration order, and duplicates are skipped immediately when detected.

Python Solution

import pandas as pd

class Solution:
    def dropDuplicateRows(self, customers: pd.DataFrame) -> pd.DataFrame:
        seen_emails = set()
        filtered_rows = []

        for _, row in customers.iterrows():
            email = row['email']
            if email not in seen_emails:
                seen_emails.add(email)
                filtered_rows.append(row)

        return pd.DataFrame(filtered_rows, columns=customers.columns)

This implementation iterates over each row using iterrows(). For each row, it checks if the email has been seen. If not, the email is added to the set and the row to the list. Finally, a new DataFrame is created from the filtered rows while preserving column order.

Go Solution

package main

import (
    "fmt"
)

type Customer struct {
    CustomerID int
    Name       string
    Email      string
}

func DropDuplicateRows(customers []Customer) []Customer {
    seenEmails := make(map[string]bool)
    filtered := make([]Customer, 0, len(customers))

    for _, customer := range customers {
        if !seenEmails[customer.Email] {
            seenEmails[customer.Email] = true
            filtered = append(filtered, customer)
        }
    }
    return filtered
}

In Go, we use a map to track seen emails. The slice filtered stores the rows to be returned. Maps in Go provide O(1) average lookup, similar to a Python set. Slice capacity is pre-allocated for efficiency, though not strictly necessary.

Worked Examples

Example 1:

Input:

customer_id	name	email
1	Ella	[email protected]
2	David	[email protected]
3	Zachary	[email protected]
4	Alice	[email protected]
5	Finn	[email protected]
6	Violet	[email protected]

Step-by-step state of seen_emails and filtered_rows:

Iteration	Current Email	Action	seen_emails	filtered_rows
1	[email protected]	Add	{[email protected]}	[row 1]
2	[email protected]	Add	{[email protected], [email protected]}	[row 1, row 2]
3	[email protected]	Add	{[email protected], [email protected], [email protected]}	[row 1, row 2, row 3]
4	[email protected]	Add	{..., [email protected]}	[row 1, row 2, row 3, row 4]
5	[email protected]	Skip	unchanged	[row 1, row 2, row 3, row 4]
6	[email protected]	Add	{..., [email protected]}	[row 1, row 2, row 3, row 4, row 6]

Output:

customer_id	name	email
1	Ella	[email protected]
2	David	[email protected]
3	Zachary	[email protected]
4	Alice	[email protected]
6	Violet	[email protected]

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n)	Each row is visited once, and hash set operations are O(1) on average
Space	O(n)	Storing seen emails and filtered rows requires additional memory proportional to the number of unique emails

The time complexity is linear because every row is checked exactly once, and the space complexity scales with the number of unique emails rather than the total number of rows if duplicates exist.

Test Cases

import pandas as pd

# Provided example
df1 = pd.DataFrame([
    [1, "Ella", "[email protected]"],
    [2, "David", "[email protected]"],
    [3, "Zachary", "[email protected]"],
    [4, "Alice", "[email protected]"],
    [5, "Finn", "[email protected]"],
    [6, "Violet", "[email protected]"]
], columns=["customer_id", "name", "email"])
solution = Solution()
result1 = solution.dropDuplicateRows(df1)
assert len(result1) == 5  # basic removal of duplicates
assert result1.iloc[3]['name'] == "Alice"  # first occurrence kept

# All unique emails
df2 = pd.DataFrame([
    [1, "A", "[email protected]"],
    [2, "B", "[email protected]"]
], columns=["customer_id", "name", "email"])
assert len(solution.dropDuplicateRows(df2)) == 2

# All duplicates
df3 = pd.DataFrame([
    [1, "A", "[email protected]"],
    [2, "B", "[email protected]"],
    [3, "C", "[email protected]"]
], columns=["customer_id", "name", "email"])
assert len(solution.dropDuplicateRows(df3)) == 1

# Empty DataFrame
df4 = pd.DataFrame(columns=["customer_id", "name", "email"])
assert solution.dropDuplicateRows(df4).empty

Test	Why
Provided example	Checks normal duplicate removal
All unique emails	Verifies no rows are removed when no duplicates exist
All duplicates	Verifies only the first row is kept when all are duplicates
Empty DataFrame	Ensures function handles empty input gracefully

Edge Cases

One important edge case is when the DataFrame is empty. A naive iteration without checking could cause errors or return None, but our implementation correctly returns an empty DataFrame.

Another edge case is when all emails are identical. The algorithm must only keep the first row and discard all others. Using a hash set ensures this occurs efficiently without repeatedly scanning previous rows.

A third edge case is when the first occurrence of a duplicate email is not at the top. This tests whether the algorithm preserves the original order and does not remove the first occurrence mistakenly. Our approach guarantees this because it only skips rows whose