LeetCode 609 - Find Duplicate File in System

Difficulty: 🟡 Medium
Topics: Array, Hash Table, String

Solution

Problem Understanding

The problem gives us a list of directory descriptions, where each string represents a directory and the files inside it. Each file entry includes both the filename and its content.

For example:

"root/a 1.txt(abcd) 2.txt(efgh)"

means:

The directory path is root/a
Inside this directory:
1.txt contains "abcd"
2.txt contains "efgh"

Our task is to identify files that have identical content and return groups of their full file paths.

The important detail is that duplicate files are determined only by content, not filename. Two files with completely different names or in different directories are duplicates if their content matches.

For instance:

root/a/1.txt -> "abcd"
root/c/3.txt -> "abcd"

These belong in the same duplicate group because the contents are identical.

The output should contain only groups with at least two files, because a single file with unique content is not considered a duplicate.

The constraints give useful hints about scale:

paths.length <= 2 * 10^4
Total input size is at most 5 * 10^5 characters.

This means we should avoid expensive pairwise comparisons between files. Since string operations and hash maps are efficient, an approach based on grouping file contents is appropriate.

A few important guarantees simplify the implementation:

Directory names are unique.
Files inside the same directory never share the same name.
The input format is valid.
Every directory contains at least one file.

Important edge cases include directories containing only one file, multiple duplicate groups existing simultaneously, all files being unique, or many files sharing the exact same content. A naive implementation could accidentally include singleton groups or incorrectly parse filenames and contents.

Approaches

Brute Force Approach

A straightforward solution is to parse every file and compare it against every other file.

We would first extract all files into a list of:

(full_path, content)

Then, for every file, compare its content with every other file to determine duplicates.

For example:

[
    ("root/a/1.txt", "abcd"),
    ("root/a/2.txt", "efgh"),
    ("root/c/3.txt", "abcd")
]

We compare:

File 1 vs File 2
File 1 vs File 3
File 2 vs File 3

If contents match, group them together.

This works because eventually every pair of files is checked. However, it is too slow. If there are n files, pairwise comparison requires O(n²) comparisons, which becomes expensive for large inputs.

Optimal Approach, Hash Map Grouping

The key insight is that duplicate detection is essentially a grouping problem.

Instead of comparing files pairwise, we can use a hash map:

content -> list of file paths

As we parse each file:

Extract its content.
Build its full file path.
Store the path in the hash map under that content.

For example:

{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt"]
}

After processing all files, we simply collect values whose list length is greater than 1.

This avoids redundant comparisons entirely. Hash map insertion and lookup are O(1) on average, making the solution linear relative to input size.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n² × k)	O(n)	Compare every file pair, `k` is average content length
Optimal	O(S)	O(S)	Hash map grouping, `S` is total input size

Here, S represents the total number of characters across all input strings.

Algorithm Walkthrough

Optimal Algorithm

Create a hash map where:

key = file content
value = list of file paths

This lets us immediately group files with identical content. 2. Iterate through every directory string in paths. 3. Split the string by spaces.

The first token is the directory path:

root/a

The remaining tokens are file descriptions:

1.txt(abcd)
2.txt(efgh)

For each file description:

Find the index of '('
Extract filename before '('
Extract content inside parentheses

Example:

"1.txt(abcd)"

becomes:

filename = "1.txt"
content = "abcd"

Construct the full file path:

directory + "/" + filename

Store the path inside:

content_to_paths[content]

After processing all directories, iterate through the hash map.
Add only groups with length greater than 1 to the result.
Return the final result.

Why it works

The algorithm relies on the invariant that every processed file is placed into exactly one bucket corresponding to its content. Since duplicates are defined solely by identical content, all duplicate files naturally accumulate in the same hash map entry. By returning only entries containing at least two file paths, we guarantee that every returned group consists exclusively of duplicates, and no duplicate file is missed.

Python Solution

from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        content_to_paths = defaultdict(list)

        for path_info in paths:
            parts = path_info.split(" ")
            directory = parts[0]

            for file_info in parts[1:]:
                left_paren_index = file_info.index("(")

                file_name = file_info[:left_paren_index]
                content = file_info[left_paren_index + 1:-1]

                full_path = f"{directory}/{file_name}"
                content_to_paths[content].append(full_path)

        result = []

        for file_paths in content_to_paths.values():
            if len(file_paths) > 1:
                result.append(file_paths)

        return result

The implementation follows the algorithm directly.

We first create a defaultdict(list) so we can append file paths without manually checking if a key already exists.

For every directory string, we split by spaces. The first element becomes the directory path, while the remaining elements represent file descriptions.

For each file description, we find the opening parenthesis to separate the filename from content. Since the problem guarantees valid formatting, we can safely slice the string.

The full file path is created by concatenating:

directory/file_name

We then group paths by content inside the hash map.

Finally, we iterate through the grouped values and only keep lists containing more than one file path, since those represent duplicates.

Go Solution

func findDuplicate(paths []string) [][]string {
	contentToPaths := make(map[string][]string)

	for _, pathInfo := range paths {
		parts := strings.Split(pathInfo, " ")
		directory := parts[0]

		for _, fileInfo := range parts[1:] {
			leftParenIndex := strings.Index(fileInfo, "(")

			fileName := fileInfo[:leftParenIndex]
			content := fileInfo[leftParenIndex+1 : len(fileInfo)-1]

			fullPath := directory + "/" + fileName
			contentToPaths[content] = append(contentToPaths[content], fullPath)
		}
	}

	result := [][]string{}

	for _, filePaths := range contentToPaths {
		if len(filePaths) > 1 {
			result = append(result, filePaths)
		}
	}

	return result
}

The Go implementation closely mirrors the Python solution.

The main difference is that Go requires explicit use of map[string][]string instead of defaultdict. Appending to a non-existent map key works naturally because a missing slice defaults to nil, and append handles it correctly.

String slicing is also slightly more manual. We use strings.Index() to find the opening parenthesis and extract substrings using index ranges.

Do not forget to include the import:

import "strings"

Worked Examples

Example 1

Input

[
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]

Step-by-step execution

Step	Current File	Content	Hash Map State
1	`root/a/1.txt`	`abcd`	`{abcd: [root/a/1.txt]}`
2	`root/a/2.txt`	`efgh`	`{abcd: [root/a/1.txt], efgh: [root/a/2.txt]}`
3	`root/c/3.txt`	`abcd`	`{abcd: [root/a/1.txt, root/c/3.txt], efgh: [root/a/2.txt]}`
4	`root/c/d/4.txt`	`efgh`	`{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt]}`
5	`root/4.txt`	`efgh`	`{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt, root/4.txt]}`

Final duplicate groups:

[
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
    ["root/a/1.txt", "root/c/3.txt"]
]

Example 2

Input

[
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)"
]

Step-by-step execution

Step	Current File	Content	Hash Map State
1	`root/a/1.txt`	`abcd`	`{abcd: [root/a/1.txt]}`
2	`root/a/2.txt`	`efgh`	`{abcd: [root/a/1.txt], efgh: [root/a/2.txt]}`
3	`root/c/3.txt`	`abcd`	`{abcd: [root/a/1.txt, root/c/3.txt], efgh: [root/a/2.txt]}`
4	`root/c/d/4.txt`	`efgh`	`{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt]}`

Final result:

[
    ["root/a/2.txt", "root/c/d/4.txt"],
    ["root/a/1.txt", "root/c/3.txt"]
]

Complexity Analysis

Measure	Complexity	Explanation
Time	O(S)	We process every character in the input once
Space	O(S)	Hash map stores file contents and paths

The runtime is linear in the total input size S, because each directory string is parsed once, and each file entry is processed once. Hash map insertions and lookups are constant time on average.

The space complexity is also O(S) because we store file contents as keys and file paths inside lists.

Test Cases

def normalize(result):
    return sorted(sorted(group) for group in result)

solution = Solution()

# Example 1
assert normalize(solution.findDuplicate([
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
])) == normalize([
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
    ["root/a/1.txt", "root/c/3.txt"]
])  # Multiple duplicate groups

# Example 2
assert normalize(solution.findDuplicate([
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)"
])) == normalize([
    ["root/a/2.txt", "root/c/d/4.txt"],
    ["root/a/1.txt", "root/c/3.txt"]
])  # Standard duplicate detection

# No duplicates
assert solution.findDuplicate([
    "root/a 1.txt(abcd)",
    "root/b 2.txt(efgh)"
]) == []  # Unique contents only

# All files have same content
assert normalize(solution.findDuplicate([
    "root/a 1.txt(x)",
    "root/b 2.txt(x)",
    "root/c 3.txt(x)"
])) == normalize([
    ["root/a/1.txt", "root/b/2.txt", "root/c/3.txt"]
])  # Single large duplicate group

# Single directory, multiple duplicates
assert normalize(solution.findDuplicate([
    "root 1.txt(a) 2.txt(a) 3.txt(b)"
])) == normalize([
    ["root/1.txt", "root/2.txt"]
])  # Duplicates within same directory

# Deep directory paths
assert normalize(solution.findDuplicate([
    "root/a/b/c 1.txt(test)",
    "root/x/y/z 2.txt(test)"
])) == normalize([
    ["root/a/b/c/1.txt", "root/x/y/z/2.txt"]
])  # Nested directories

# Minimum input
assert solution.findDuplicate([
    "root 1.txt(unique)"
]) == []  # Smallest valid case

Test	Why
Example 1	Verifies multiple duplicate groups
Example 2	Validates standard functionality
No duplicates	Ensures singleton groups are excluded
All files same content	Tests one large duplicate cluster
Same directory duplicates	Verifies local duplicates work
Deep nested paths	Ensures path construction correctness
Minimum input	Tests smallest valid constraint

Edge Cases

No Duplicate Files

A common bug is accidentally returning groups containing only one file. For example:

["root/a 1.txt(abcd)"]

Since there is only one file with content "abcd", it should not appear in the output. Our implementation explicitly filters groups using:

if len(file_paths) > 1

which guarantees singleton groups are excluded.

When many files share identical content, the algorithm must correctly place them all in a single group rather than creating separate duplicate pairs.

Example:

root/a/1.txt(x)
root/b/2.txt(x)
root/c/3.txt(x)

Our hash map naturally handles this because all paths are appended to the same content key.

Deeply Nested Directories

Directory paths may contain multiple nested folders:

root/a/b/c/d

A fragile implementation might incorrectly reconstruct paths. Our solution safely builds file paths using:

directory + "/" + filename

which works regardless of directory depth.

Multiple Duplicate Groups

Different content values may each form separate duplicate groups. A buggy implementation might merge unrelated groups.

Example:

"a" -> files 1, 2
"b" -> files 3, 4

Because the hash map uses file content as the key, unrelated contents remain completely separated, guaranteeing correct grouping.

Follow Up Discussion

In a real file system, both DFS and BFS can work for traversal. DFS is often preferred because it uses less memory when traversing deeply nested directories, especially when implemented recursively or with an explicit stack. BFS may consume substantial memory if a directory contains many sibling folders.

For very large file contents, comparing raw content directly becomes expensive. Instead, we can first group files by file size, since files with different sizes cannot be duplicates. Then we compute hashes such as SHA-256 for files with matching sizes. This avoids loading huge files unnecessarily.

If files can only be read in 1 KB chunks, we compute the hash incrementally while streaming the file. Modern hash functions support chunked updates, so the entire file never needs to be loaded into memory.

In the modified large-file solution, the most time-consuming part becomes disk I/O and hashing, not traversal. Memory usage improves because files are streamed rather than fully loaded. To reduce false positives, especially from hash collisions, we can perform a final byte-by-byte comparison between files that share the same hash before declaring them duplicates.

LeetCode 609 - Find Duplicate File in System

Solution

Problem Understanding

Approaches

Brute Force Approach

Optimal Approach, Hash Map Grouping

Algorithm Walkthrough

Optimal Algorithm

Why it works

Python Solution

Go Solution

Worked Examples

Example 1

Step-by-step execution

Example 2

Step-by-step execution

Complexity Analysis

Test Cases

Edge Cases

No Duplicate Files

Multiple Files Sharing the Same Content

Deeply Nested Directories

Multiple Duplicate Groups

Follow Up Discussion