LeetCode 1948 - Delete Duplicate Folders in System

LeetCode Problem 1948

Difficulty: 🔴 Hard
Topics: Array, Hash Table, String, Trie, Hash Function

Solution

Problem Understanding

The problem gives a hierarchical file system represented as a list of absolute paths, where each path is an array of folder names from the root to a leaf folder. From these paths, we must reconstruct the implicit tree structure of folders, where each node represents a folder and edges represent parent-child relationships.

Two folders are considered identical if their entire subtree structures are identical, meaning they contain the same set of child folders with the same structure recursively. If two or more folders are identical, then all of them, including all folders in their subtrees, must be deleted. Importantly, deletion is based on the initial snapshot of the tree, meaning that once identical folders are identified and marked, deletion happens once, and we do not recompute identities after removal.

The output must be the remaining folder paths after deletion, reconstructed as lists of strings.

The constraints indicate that the input can be large, with up to 20,000 paths and a total character sum up to 200,000, so any solution must be close to linear or near-linear in total nodes. A naive subtree comparison approach would be too slow.

Important edge cases include trees with no duplicates, deeply nested single chains, multiple identical subtrees at different levels, and cases where deletion affects only part of a subtree but not its ancestors unless explicitly marked. It is also important that folder identity depends on structure, not just names at a single level.

A key guarantee is that no two paths lead to the same folder, and every non-root folder has its parent present, which ensures a valid tree can be constructed directly.

Approaches

The brute-force approach would build the full tree and, for each node, compare its subtree with every other node’s subtree by serializing or recursively checking structure equality. This leads to repeated traversal of large subtrees many times, resulting in exponential or quadratic behavior in the worst case.

The key insight for optimization is that subtree identity can be represented uniquely using serialization of structure, and identical subtrees will share the same serialization string. By computing a canonical signature for each subtree bottom-up and using a hash map to count occurrences, we can identify duplicate subtrees in a single traversal. Once duplicates are identified, a second traversal can mark nodes for deletion.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(N² · S)	O(N)	Compare each subtree against all others
Optimal (Trie + Serialization + Hashing)	O(N · S)	O(N · S)	Build tree, serialize subtrees, hash duplicates

Here, N is number of nodes and S is average size of subtree representation.

Algorithm Walkthrough

The optimal solution relies on constructing a tree (trie) from the folder paths, then computing a unique structural signature for each subtree.

First, we build a trie-like tree where each node represents a folder and edges represent subfolder relationships. Each node stores its children in a dictionary keyed by folder name. This structure allows us to reconstruct hierarchy efficiently from the path list.
Once the tree is built, we perform a post-order traversal to compute a serialization string for each subtree. The serialization encodes both the structure and the identity of children. For example, a node with children "a" and "b" will have a signature based on sorted child signatures to ensure deterministic representation.
During this traversal, we maintain a hash map that counts how many times each subtree serialization appears across the entire tree. This is crucial because only subtrees appearing more than once are considered duplicates.
After computing all subtree signatures, we perform a second DFS traversal. Any node whose serialization has a frequency greater than 1 is marked as deleted. When a node is marked deleted, its entire subtree is excluded from output.
Finally, we perform another traversal to collect all remaining valid paths from root to leaf nodes that are not marked deleted, building the output list.

Why it works: the serialization ensures that two subtrees are identical if and only if their structural representation is identical. Because we compute this bottom-up, every subtree is uniquely identified by the structure of its children, making duplicates directly detectable via hashing.

Python Solution

from typing import List, Dict
from collections import defaultdict

class Solution:
    def deleteDuplicateFolder(self, paths: List[List[str]]) -> List[List[str]]:
        class Node:
            def __init__(self):
                self.children: Dict[str, "Node"] = {}
                self.deleted = False
                self.sig = ""

        root = Node()

        for path in paths:
            cur = root
            for folder in path:
                if folder not in cur.children:
                    cur.children[folder] = Node()
                cur = cur.children[folder]

        freq = defaultdict(int)

        def build_sig(node: Node) -> str:
            if not node.children:
                node.sig = "()"
                freq[node.sig] += 1
                return node.sig

            parts = []
            for name in sorted(node.children.keys()):
                child_sig = build_sig(node.children[name])
                parts.append(name + child_sig)

            node.sig = "(" + "".join(parts) + ")"
            freq[node.sig] += 1
            return node.sig

        build_sig(root)

        def mark(node: Node):
            if node != root and freq[node.sig] > 1:
                node.deleted = True
                return
            for child in node.children.values():
                mark(child)

        mark(root)

        res = []

        def collect(node: Node, path: List[str]):
            if node.deleted:
                return
            if path:
                res.append(path[:])
            for name, child in node.children.items():
                collect(child, path + [name])

        collect(root, [])

        return res

The code first builds a trie representing the folder structure. It then computes subtree signatures using a post-order DFS. Each signature is a string encoding of sorted children, ensuring structural uniqueness. A frequency map counts how many times each signature appears. Then a second DFS marks nodes with duplicate signatures as deleted. Finally, a third traversal collects remaining valid paths.

Go Solution

package main

import (
	"sort"
	"strings"
)

type Node struct {
	children map[string]*Node
	sig      string
	deleted  bool
}

func newNode() *Node {
	return &Node{
		children: make(map[string]*Node),
	}
}

func deleteDuplicateFolder(paths [][]string) [][]string {
	root := newNode()

	for _, path := range paths {
		cur := root
		for _, folder := range path {
			if _, ok := cur.children[folder]; !ok {
				cur.children[folder] = newNode()
			}
			cur = cur.children[folder]
		}
	}

	freq := make(map[string]int)

	var buildSig func(*Node) string
	buildSig = func(node *Node) string {
		if len(node.children) == 0 {
			node.sig = "()"
			freq[node.sig]++
			return node.sig
		}

		keys := make([]string, 0, len(node.children))
		for k := range node.children {
			keys = append(keys, k)
		}
		sort.Strings(keys)

		var sb strings.Builder
		sb.WriteString("(")
		for _, k := range keys {
			childSig := buildSig(node.children[k])
			sb.WriteString(k)
			sb.WriteString(childSig)
		}
		sb.WriteString(")")

		node.sig = sb.String()
		freq[node.sig]++
		return node.sig
	}

	buildSig(root)

	var mark func(*Node)
	mark = func(node *Node) {
		if node != root && freq[node.sig] > 1 {
			node.deleted = true
			return
		}
		for _, child := range node.children {
			mark(child)
		}
	}

	mark(root)

	var res [][]string
	var collect func(*Node, []string)
	collect = func(node *Node, path []string) {
		if node.deleted {
			return
		}
		if len(path) > 0 {
			tmp := make([]string, len(path))
			copy(tmp, path)
			res = append(res, tmp)
		}
		for name, child := range node.children {
			collect(child, append(path, name))
		}
	}

	collect(root, []string{})

	return res
}

In Go, we explicitly manage maps and sorting of keys because map iteration is nondeterministic. We use strings.Builder for efficient string concatenation. Unlike Python, we must manually copy slices when storing results to avoid mutation issues.

Worked Examples

Example 1

Input:

[["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]]

Tree construction creates:

a -> b
c -> b
d -> a

Subtree signatures:

"b" appears under "a" and "c" → identical subtrees detected
Therefore, a and c subtrees are deleted

Traversal marking:

sig("b") = "()"
sig("a") = "(b())"
sig("c") = "(b())"
freq("(b())") = 2 → delete a and c

Remaining:

Output:

[["d"],["d","a"]]

Example 2

We compute subtree signatures bottom-up. The key duplicate pattern is:

subtree rooted at "/a/b/x" matches subtree rooted at "/w"

Thus both are marked.

After marking, recomputation is not performed, so newly identical structures are not deleted unless originally marked.

Remaining folders:

c, c/b
a, a/b

Example 3

All folders produce unique subtree signatures, so frequency of every signature is 1. No deletions occur, and all paths remain.

Complexity Analysis

Measure	Complexity	Explanation
Time	O(N · log N + N)	Sorting children per node plus DFS traversal
Space	O(N · S)	Storage of trie nodes and subtree signatures

The dominant cost is building subtree signatures and sorting children at each node. Since each node is processed once and total characters are bounded, the solution stays efficient.

Test Cases

assert sorted(Solution().deleteDuplicateFolder([["a"],["c"],["d"],["a","b"],["c","b"],["d","a"]])) == sorted([["d"],["d","a"]])  # basic duplicate deletion

assert sorted(Solution().deleteDuplicateFolder([["a"],["c"],["a","b"],["c","b"],["a","b","x"],["a","b","x","y"],["w"],["w","y"]])) == sorted([["c"],["c","b"],["a"],["a","b"]])  # nested duplicate deletion

assert sorted(Solution().deleteDuplicateFolder([["a","b"],["c","d"],["c"],["a"]])) == sorted([["a"],["a","b"],["c"],["c","d"]])  # no duplicates

assert Solution().deleteDuplicateFolder([["a"]]) == [["a"]]  # single node

assert Solution().deleteDuplicateFolder([]) == []  # empty input edge case

assert sorted(Solution().deleteDuplicateFolder([["a"],["b"],["c"]])) == sorted([["a"],["b"],["c"]])  # all independent roots

Test	Why
duplicate sibling subtrees	verifies detection of identical structure
nested duplicates	verifies deep subtree hashing
no duplicates	ensures no false positives
single node	minimal valid tree
empty input	robustness
independent roots	multiple root-level trees

Edge Cases

One important edge case is when folders have identical names but different structures. The algorithm handles this correctly because identity is based on subtree serialization, not names alone.

Another edge case is deep nesting where only leaf nodes are identical. Since serialization is computed bottom-up, leaf nodes are compared first and propagate upward correctly, ensuring accurate detection.

A third edge case is when deletion of a subtree would make parent folders identical afterward. The problem explicitly states deletion happens only once based on initial marking, and this implementation respects that by separating marking from collection without recomputation.