LeetCode 1948 - Delete Duplicate Folders in System
The problem gives a hierarchical file system represented as a list of absolute paths, where each path is an array of folder names from the root to a leaf folder.
Difficulty: 🔴 Hard
Topics: Array, Hash Table, String, Trie, Hash Function
Solution
Problem Understanding
The problem gives a hierarchical file system represented as a list of absolute paths, where each path is an array of folder names from the root to a leaf folder. From these paths, we must reconstruct the implicit tree structure of folders, where each node represents a folder and edges represent parent-child relationships.
Two folders are considered identical if their entire subtree structures are identical, meaning they contain the same set of child folders with the same structure recursively. If two or more folders are identical, then all of them, including all folders in their subtrees, must be deleted. Importantly, deletion is based on the initial snapshot of the tree, meaning that once identical folders are identified and marked, deletion happens once, and we do not recompute identities after removal.
The output must be the remaining folder paths after deletion, reconstructed as lists of strings.
The constraints indicate that the input can be large, with up to 20,000 paths and a total character sum up to 200,000, so any solution must be close to linear or near-linear in total nodes. A naive subtree comparison approach would be too slow.
Important edge cases include trees with no duplicates, deeply nested single chains, multiple identical subtrees at different levels, and cases where deletion affects only part of a subtree but not its ancestors unless explicitly marked. It is also important that folder identity depends on structure, not just names at a single level.
A key guarantee is that no two paths lead to the same folder, and every non-root folder has its parent present, which ensures a valid tree can be constructed directly.
Approaches
The brute-force approach would build the full tree and, for each node, compare its subtree with every other node’s subtree by serializing or recursively checking structure equality. This leads to repeated traversal of large subtrees many times, resulting in exponential or quadratic behavior in the worst case.
The key insight for optimization is that subtree identity can be represented uniquely using serialization of structure, and identical subtrees will share the same serialization string. By computing a canonical signature for each subtree bottom-up and using a hash map to count occurrences, we can identify duplicate subtrees in a single traversal. Once duplicates are identified, a second traversal can mark nodes for deletion.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(N² · S) | O(N) | Compare each subtree against all others |
| Optimal (Trie + Serialization + Hashing) | O(N · S) | O(N · S) | Build tree, serialize subtrees, hash duplicates |
Here, N is number of nodes and S is average size of subtree representation.
Algorithm Walkthrough
The optimal solution relies on constructing a tree (trie) from the folder paths, then computing a unique structural signature for each subtree.
- First, we build a trie-like tree where each node represents a folder and edges represent subfolder relationships. Each node stores its children in a dictionary keyed by folder name. This structure allows us to reconstruct hierarchy efficiently from the path list.
- Once the tree is built, we perform a post-order traversal to compute a serialization string for each subtree. The serialization encodes both the structure and the identity of children. For example, a node with children "a" and "b" will have a signature based on sorted child signatures to ensure deterministic representation.
- During this traversal, we maintain a hash map that counts how many times each subtree serialization appears across the entire tree. This is crucial because only subtrees appearing more than once are considered duplicates.
- After computing all subtree signatures, we perform a second DFS traversal. Any node whose serialization has a frequency greater than 1 is marked as deleted. When a node is marked deleted, its entire subtree is excluded from output.
- Finally, we perform another traversal to collect all remaining valid paths from root to leaf nodes that are not marked deleted, building the output list.
Why it works: the serialization ensures that two subtrees are identical if and only if their structural representation is identical. Because we compute this bottom-up, every subtree is uniquely identified by the structure of its children, making duplicates directly detectable via hashing.
Python Solution
from typing import List, Dict
from collections import defaultdict
class Solution:
def deleteDuplicateFolder(self, paths: List[List[str]]) -> List[List[str]]:
class Node:
def __init__(self):
self.children: Dict[str, "Node"] = {}
self.deleted = False
self.sig = ""
root = Node()
for path in paths:
cur = root
for folder in path:
if folder not in cur.children:
cur.children[folder] = Node()
cur = cur.children[folder]
freq = defaultdict(int)
def build_sig(node: Node) -> str:
if not node.children:
node.sig = "()"
freq[node.sig] += 1
return node.sig
parts = []
for name in sorted(node.children.keys()):
child_sig = build_sig(node.children[name])
parts.append(name + child_sig)
node.sig = "(" + "".join(parts) + ")"
freq[node.sig] += 1
return node.sig
build_sig(root)
def mark(node: Node):
if node != root and freq[node.sig] > 1:
node.deleted = True
return
for child in node.children.values():
mark(child)
mark(root)
res = []
def collect(node: Node, path: List[str]):
if node.deleted:
return
if path:
res.append(path[:])
for name, child in node.children.items():
collect(child, path + [name])
collect(root, [])
return res
The code first builds a trie representing the folder structure. It then computes subtree signatures using a post-order DFS. Each signature is a string encoding of sorted children, ensuring structural uniqueness. A frequency map counts how many times each signature appears. Then a second DFS marks nodes with duplicate signatures as deleted. Finally, a third traversal collects remaining valid paths.
Go Solution
package main
import (
"sort"
"strings"
)
type Node struct {
children map[string]*Node
sig string
deleted bool
}
func newNode() *Node {
return &Node{
children: make(map[string]*Node),
}
}
func deleteDuplicateFolder(paths [][]string) [][]string {
root := newNode()
for _, path := range paths {
cur := root
for _, folder := range path {
if _, ok := cur.children[folder]; !ok {
cur.children[folder] = newNode()
}
cur = cur.children[folder]
}
}
freq := make(map[string]int)
var buildSig func(*Node) string
buildSig = func(node *Node) string {
if len(node.children) == 0 {
node.sig = "()"
freq[node.sig]++
return node.sig
}
keys := make([]string, 0, len(node.children))
for k := range node.children {
keys = append(keys, k)
}
sort.Strings(keys)
var sb strings.Builder
sb.WriteString("(")
for _, k := range keys {
childSig := buildSig(node.children[k])
sb.WriteString(k)
sb.WriteString(childSig)
}
sb.WriteString(")")
node.sig = sb.String()
freq[node.sig]++
return node.sig
}
buildSig(root)
var mark func(*Node)
mark = func(node *Node) {
if node != root && freq[node.sig] > 1 {
node.deleted = true
return
}
for _, child := range node.children {
mark(child)
}
}
mark(root)
var res [][]string
var collect func(*Node, []string)
collect = func(node *Node, path []string) {
if node.deleted {
return
}
if len(path) > 0 {
tmp := make([]string, len(path))
copy(tmp, path)
res = append(res, tmp)
}
for name, child := range node.children {
collect(child, append(path, name))
}
}
collect(root, []string{})
return res
}
In Go, we explicitly manage maps and sorting of keys because map iteration is nondeterministic. We use strings.Builder for efficient string concatenation. Unlike Python, we must manually copy slices when storing results to avoid mutation issues.
Worked Examples
Example 1
Input:
[["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]
Tree construction creates:
- a -> b
- c -> b
- d -> a
Subtree signatures:
- "b" appears under "a" and "c" → identical subtrees detected
- Therefore, a and c subtrees are deleted
Traversal marking:
- sig("b") = "()"
- sig("a") = "(b())"
- sig("c") = "(b())"
- freq("(b())") = 2 → delete a and c
Remaining:
- d
- d/a
Output:
[["d"],["d","a"]]
Example 2
We compute subtree signatures bottom-up. The key duplicate pattern is:
- subtree rooted at "/a/b/x" matches subtree rooted at "/w"
Thus both are marked.
After marking, recomputation is not performed, so newly identical structures are not deleted unless originally marked.
Remaining folders:
- c, c/b
- a, a/b
Example 3
All folders produce unique subtree signatures, so frequency of every signature is 1. No deletions occur, and all paths remain.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N · log N + N) | Sorting children per node plus DFS traversal |
| Space | O(N · S) | Storage of trie nodes and subtree signatures |
The dominant cost is building subtree signatures and sorting children at each node. Since each node is processed once and total characters are bounded, the solution stays efficient.
Test Cases
assert sorted(Solution().deleteDuplicateFolder([["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]])) == sorted([["d"],["d","a"]]) # basic duplicate deletion
assert sorted(Solution().deleteDuplicateFolder([["a"],["c"],["a","b"],["c","b"],["a","b","x"],["a","b","x","y"],["w"],["w","y"]])) == sorted([["c"],["c","b"],["a"],["a","b"]]) # nested duplicate deletion
assert sorted(Solution().deleteDuplicateFolder([["a","b"],["c","d"],["c"],["a"]])) == sorted([["a"],["a","b"],["c"],["c","d"]]) # no duplicates
assert Solution().deleteDuplicateFolder([["a"]]) == [["a"]] # single node
assert Solution().deleteDuplicateFolder([]) == [] # empty input edge case
assert sorted(Solution().deleteDuplicateFolder([["a"],["b"],["c"]])) == sorted([["a"],["b"],["c"]]) # all independent roots
| Test | Why |
|---|---|
| duplicate sibling subtrees | verifies detection of identical structure |
| nested duplicates | verifies deep subtree hashing |
| no duplicates | ensures no false positives |
| single node | minimal valid tree |
| empty input | robustness |
| independent roots | multiple root-level trees |
Edge Cases
One important edge case is when folders have identical names but different structures. The algorithm handles this correctly because identity is based on subtree serialization, not names alone.
Another edge case is deep nesting where only leaf nodes are identical. Since serialization is computed bottom-up, leaf nodes are compared first and propagate upward correctly, ensuring accurate detection.
A third edge case is when deletion of a subtree would make parent folders identical afterward. The problem explicitly states deletion happens only once based on initial marking, and this implementation respects that by separating marking from collection without recomputation.