LeetCode 609 - Find Duplicate File in System
The problem gives us a list of directory descriptions, where each string represents a directory and the files inside it. Each file entry includes both the filename and its content. For example: means: - The directory path is root/a - Inside this directory: - 1.
Difficulty: 🟡 Medium
Topics: Array, Hash Table, String
Solution
Problem Understanding
The problem gives us a list of directory descriptions, where each string represents a directory and the files inside it. Each file entry includes both the filename and its content.
For example:
"root/a 1.txt(abcd) 2.txt(efgh)"
means:
-
The directory path is
root/a -
Inside this directory:
-
1.txtcontains"abcd" -
2.txtcontains"efgh"
Our task is to identify files that have identical content and return groups of their full file paths.
The important detail is that duplicate files are determined only by content, not filename. Two files with completely different names or in different directories are duplicates if their content matches.
For instance:
root/a/1.txt -> "abcd"
root/c/3.txt -> "abcd"
These belong in the same duplicate group because the contents are identical.
The output should contain only groups with at least two files, because a single file with unique content is not considered a duplicate.
The constraints give useful hints about scale:
paths.length <= 2 * 10^4- Total input size is at most
5 * 10^5characters.
This means we should avoid expensive pairwise comparisons between files. Since string operations and hash maps are efficient, an approach based on grouping file contents is appropriate.
A few important guarantees simplify the implementation:
- Directory names are unique.
- Files inside the same directory never share the same name.
- The input format is valid.
- Every directory contains at least one file.
Important edge cases include directories containing only one file, multiple duplicate groups existing simultaneously, all files being unique, or many files sharing the exact same content. A naive implementation could accidentally include singleton groups or incorrectly parse filenames and contents.
Approaches
Brute Force Approach
A straightforward solution is to parse every file and compare it against every other file.
We would first extract all files into a list of:
(full_path, content)
Then, for every file, compare its content with every other file to determine duplicates.
For example:
[
("root/a/1.txt", "abcd"),
("root/a/2.txt", "efgh"),
("root/c/3.txt", "abcd")
]
We compare:
- File 1 vs File 2
- File 1 vs File 3
- File 2 vs File 3
If contents match, group them together.
This works because eventually every pair of files is checked. However, it is too slow. If there are n files, pairwise comparison requires O(n²) comparisons, which becomes expensive for large inputs.
Optimal Approach, Hash Map Grouping
The key insight is that duplicate detection is essentially a grouping problem.
Instead of comparing files pairwise, we can use a hash map:
content -> list of file paths
As we parse each file:
- Extract its content.
- Build its full file path.
- Store the path in the hash map under that content.
For example:
{
"abcd": ["root/a/1.txt", "root/c/3.txt"],
"efgh": ["root/a/2.txt", "root/c/d/4.txt"]
}
After processing all files, we simply collect values whose list length is greater than 1.
This avoids redundant comparisons entirely. Hash map insertion and lookup are O(1) on average, making the solution linear relative to input size.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n² × k) | O(n) | Compare every file pair, k is average content length |
| Optimal | O(S) | O(S) | Hash map grouping, S is total input size |
Here, S represents the total number of characters across all input strings.
Algorithm Walkthrough
Optimal Algorithm
- Create a hash map where:
- key = file content
- value = list of file paths
This lets us immediately group files with identical content.
2. Iterate through every directory string in paths.
3. Split the string by spaces.
The first token is the directory path:
root/a
The remaining tokens are file descriptions:
1.txt(abcd)
2.txt(efgh)
- For each file description:
- Find the index of
'(' - Extract filename before
'(' - Extract content inside parentheses
Example:
"1.txt(abcd)"
becomes:
filename = "1.txt"
content = "abcd"
- Construct the full file path:
directory + "/" + filename
- Store the path inside:
content_to_paths[content]
- After processing all directories, iterate through the hash map.
- Add only groups with length greater than
1to the result. - Return the final result.
Why it works
The algorithm relies on the invariant that every processed file is placed into exactly one bucket corresponding to its content. Since duplicates are defined solely by identical content, all duplicate files naturally accumulate in the same hash map entry. By returning only entries containing at least two file paths, we guarantee that every returned group consists exclusively of duplicates, and no duplicate file is missed.
Python Solution
from collections import defaultdict
from typing import List
class Solution:
def findDuplicate(self, paths: List[str]) -> List[List[str]]:
content_to_paths = defaultdict(list)
for path_info in paths:
parts = path_info.split(" ")
directory = parts[0]
for file_info in parts[1:]:
left_paren_index = file_info.index("(")
file_name = file_info[:left_paren_index]
content = file_info[left_paren_index + 1:-1]
full_path = f"{directory}/{file_name}"
content_to_paths[content].append(full_path)
result = []
for file_paths in content_to_paths.values():
if len(file_paths) > 1:
result.append(file_paths)
return result
The implementation follows the algorithm directly.
We first create a defaultdict(list) so we can append file paths without manually checking if a key already exists.
For every directory string, we split by spaces. The first element becomes the directory path, while the remaining elements represent file descriptions.
For each file description, we find the opening parenthesis to separate the filename from content. Since the problem guarantees valid formatting, we can safely slice the string.
The full file path is created by concatenating:
directory/file_name
We then group paths by content inside the hash map.
Finally, we iterate through the grouped values and only keep lists containing more than one file path, since those represent duplicates.
Go Solution
func findDuplicate(paths []string) [][]string {
contentToPaths := make(map[string][]string)
for _, pathInfo := range paths {
parts := strings.Split(pathInfo, " ")
directory := parts[0]
for _, fileInfo := range parts[1:] {
leftParenIndex := strings.Index(fileInfo, "(")
fileName := fileInfo[:leftParenIndex]
content := fileInfo[leftParenIndex+1 : len(fileInfo)-1]
fullPath := directory + "/" + fileName
contentToPaths[content] = append(contentToPaths[content], fullPath)
}
}
result := [][]string{}
for _, filePaths := range contentToPaths {
if len(filePaths) > 1 {
result = append(result, filePaths)
}
}
return result
}
The Go implementation closely mirrors the Python solution.
The main difference is that Go requires explicit use of map[string][]string instead of defaultdict. Appending to a non-existent map key works naturally because a missing slice defaults to nil, and append handles it correctly.
String slicing is also slightly more manual. We use strings.Index() to find the opening parenthesis and extract substrings using index ranges.
Do not forget to include the import:
import "strings"
Worked Examples
Example 1
Input
[
"root/a 1.txt(abcd) 2.txt(efgh)",
"root/c 3.txt(abcd)",
"root/c/d 4.txt(efgh)",
"root 4.txt(efgh)"
]
Step-by-step execution
| Step | Current File | Content | Hash Map State |
|---|---|---|---|
| 1 | root/a/1.txt |
abcd |
{abcd: [root/a/1.txt]} |
| 2 | root/a/2.txt |
efgh |
{abcd: [root/a/1.txt], efgh: [root/a/2.txt]} |
| 3 | root/c/3.txt |
abcd |
{abcd: [root/a/1.txt, root/c/3.txt], efgh: [root/a/2.txt]} |
| 4 | root/c/d/4.txt |
efgh |
{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt]} |
| 5 | root/4.txt |
efgh |
{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt, root/4.txt]} |
Final duplicate groups:
[
["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
["root/a/1.txt", "root/c/3.txt"]
]
Example 2
Input
[
"root/a 1.txt(abcd) 2.txt(efgh)",
"root/c 3.txt(abcd)",
"root/c/d 4.txt(efgh)"
]
Step-by-step execution
| Step | Current File | Content | Hash Map State |
|---|---|---|---|
| 1 | root/a/1.txt |
abcd |
{abcd: [root/a/1.txt]} |
| 2 | root/a/2.txt |
efgh |
{abcd: [root/a/1.txt], efgh: [root/a/2.txt]} |
| 3 | root/c/3.txt |
abcd |
{abcd: [root/a/1.txt, root/c/3.txt], efgh: [root/a/2.txt]} |
| 4 | root/c/d/4.txt |
efgh |
{abcd: [...], efgh: [root/a/2.txt, root/c/d/4.txt]} |
Final result:
[
["root/a/2.txt", "root/c/d/4.txt"],
["root/a/1.txt", "root/c/3.txt"]
]
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(S) | We process every character in the input once |
| Space | O(S) | Hash map stores file contents and paths |
The runtime is linear in the total input size S, because each directory string is parsed once, and each file entry is processed once. Hash map insertions and lookups are constant time on average.
The space complexity is also O(S) because we store file contents as keys and file paths inside lists.
Test Cases
def normalize(result):
return sorted(sorted(group) for group in result)
solution = Solution()
# Example 1
assert normalize(solution.findDuplicate([
"root/a 1.txt(abcd) 2.txt(efgh)",
"root/c 3.txt(abcd)",
"root/c/d 4.txt(efgh)",
"root 4.txt(efgh)"
])) == normalize([
["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
["root/a/1.txt", "root/c/3.txt"]
]) # Multiple duplicate groups
# Example 2
assert normalize(solution.findDuplicate([
"root/a 1.txt(abcd) 2.txt(efgh)",
"root/c 3.txt(abcd)",
"root/c/d 4.txt(efgh)"
])) == normalize([
["root/a/2.txt", "root/c/d/4.txt"],
["root/a/1.txt", "root/c/3.txt"]
]) # Standard duplicate detection
# No duplicates
assert solution.findDuplicate([
"root/a 1.txt(abcd)",
"root/b 2.txt(efgh)"
]) == [] # Unique contents only
# All files have same content
assert normalize(solution.findDuplicate([
"root/a 1.txt(x)",
"root/b 2.txt(x)",
"root/c 3.txt(x)"
])) == normalize([
["root/a/1.txt", "root/b/2.txt", "root/c/3.txt"]
]) # Single large duplicate group
# Single directory, multiple duplicates
assert normalize(solution.findDuplicate([
"root 1.txt(a) 2.txt(a) 3.txt(b)"
])) == normalize([
["root/1.txt", "root/2.txt"]
]) # Duplicates within same directory
# Deep directory paths
assert normalize(solution.findDuplicate([
"root/a/b/c 1.txt(test)",
"root/x/y/z 2.txt(test)"
])) == normalize([
["root/a/b/c/1.txt", "root/x/y/z/2.txt"]
]) # Nested directories
# Minimum input
assert solution.findDuplicate([
"root 1.txt(unique)"
]) == [] # Smallest valid case
| Test | Why |
|---|---|
| Example 1 | Verifies multiple duplicate groups |
| Example 2 | Validates standard functionality |
| No duplicates | Ensures singleton groups are excluded |
| All files same content | Tests one large duplicate cluster |
| Same directory duplicates | Verifies local duplicates work |
| Deep nested paths | Ensures path construction correctness |
| Minimum input | Tests smallest valid constraint |
Edge Cases
No Duplicate Files
A common bug is accidentally returning groups containing only one file. For example:
["root/a 1.txt(abcd)"]
Since there is only one file with content "abcd", it should not appear in the output. Our implementation explicitly filters groups using:
if len(file_paths) > 1
which guarantees singleton groups are excluded.
Multiple Files Sharing the Same Content
When many files share identical content, the algorithm must correctly place them all in a single group rather than creating separate duplicate pairs.
Example:
root/a/1.txt(x)
root/b/2.txt(x)
root/c/3.txt(x)
Our hash map naturally handles this because all paths are appended to the same content key.
Deeply Nested Directories
Directory paths may contain multiple nested folders:
root/a/b/c/d
A fragile implementation might incorrectly reconstruct paths. Our solution safely builds file paths using:
directory + "/" + filename
which works regardless of directory depth.
Multiple Duplicate Groups
Different content values may each form separate duplicate groups. A buggy implementation might merge unrelated groups.
Example:
"a" -> files 1, 2
"b" -> files 3, 4
Because the hash map uses file content as the key, unrelated contents remain completely separated, guaranteeing correct grouping.
Follow Up Discussion
In a real file system, both DFS and BFS can work for traversal. DFS is often preferred because it uses less memory when traversing deeply nested directories, especially when implemented recursively or with an explicit stack. BFS may consume substantial memory if a directory contains many sibling folders.
For very large file contents, comparing raw content directly becomes expensive. Instead, we can first group files by file size, since files with different sizes cannot be duplicates. Then we compute hashes such as SHA-256 for files with matching sizes. This avoids loading huge files unnecessarily.
If files can only be read in 1 KB chunks, we compute the hash incrementally while streaming the file. Modern hash functions support chunked updates, so the entire file never needs to be loaded into memory.
In the modified large-file solution, the most time-consuming part becomes disk I/O and hashing, not traversal. Memory usage improves because files are streamed rather than fully loaded. To reduce false positives, especially from hash collisions, we can perform a final byte-by-byte comparison between files that share the same hash before declaring them duplicates.