LeetCode 1410 - HTML Entity Parser

Here is a complete, detailed technical guide for LeetCode 1410 following your formatting and style requirements. The pro

LeetCode Problem 1410

Difficulty: 🟡 Medium
Topics: Hash Table, String

Solution

Here is a complete, detailed technical guide for LeetCode 1410 following your formatting and style requirements.

Problem Understanding

The problem asks us to implement an HTML entity parser. We are given a string text which may contain HTML entities such as " for " or & for &. The goal is to replace all valid entities in the string with their corresponding special characters, leaving any invalid or unrecognized entities intact. For example, &ambassador; is not a valid HTML entity in this context, so it should remain as-is.

The input is a single string, and the output is a transformed string where recognized entities are replaced by the characters they represent. The constraints indicate that text can be as long as 100,000 characters and may include any ASCII character, so efficiency is important. Edge cases include entities at the start or end of the string, multiple consecutive entities, or partial/invalid entities that resemble valid ones but are not.

The key challenge is efficiently scanning and replacing entities without repeatedly searching the string, which could be costly for long inputs. Using a hash map (dictionary) for entity lookup allows for O(1) replacement checks. We also need to carefully handle overlapping or partially matched entities.

Approaches

A brute-force approach would iterate through the string, checking at each position if a substring matches any valid entity. This involves repeatedly comparing substrings and could be inefficient, especially for long strings, because each comparison might involve up to 7 characters (the longest entity ⁄ is 7 characters).

The optimal approach is to scan the string once from left to right, building a result string as we go. Whenever we encounter &, we attempt to match a valid entity starting at that position using a hash map. If a match is found, we append the corresponding character and advance the pointer by the length of the entity. Otherwise, we append the current character and move forward. This guarantees linear time complexity O(n), as each character is processed at most once.

Approach Time Complexity Space Complexity Notes
Brute Force O(n * m) O(1) Check each substring against all entities repeatedly; m is max entity length.
Optimal O(n) O(1) Scan once, use hash map for entity replacement, build result string incrementally.

Algorithm Walkthrough

  1. Initialize a hash map entity_map containing all HTML entities as keys and their corresponding characters as values.

  2. Initialize an empty list result to efficiently build the output string.

  3. Initialize a pointer i = 0 to scan the input string text.

  4. While i < len(text), check the current character:

  5. If text[i] != '&', append text[i] to result and increment i by 1.

  6. If text[i] == '&', attempt to match an entity:

  7. Iterate through all possible entities in entity_map.

  8. For each entity, check if text[i:i+len(entity)] == entity.

  9. If a match is found, append the mapped character to result, increment i by the entity length, and break the inner loop.

  10. If no entity matches, append & to result and increment i by 1.

  11. After the loop, join result into a string and return it.

Why it works: The algorithm maintains the invariant that result always contains the correctly parsed substring of text up to position i. By scanning left to right and using a hash map lookup for potential entities, we ensure each recognized entity is replaced exactly once, and unmatched characters remain unchanged.

Python Solution

class Solution:
    def entityParser(self, text: str) -> str:
        entity_map = {
            "&quot;": "\"",
            "&apos;": "'",
            "&amp;": "&",
            "&gt;": ">",
            "&lt;": "<",
            "&frasl;": "/"
        }
        
        result = []
        i = 0
        n = len(text)
        
        while i < n:
            if text[i] != '&':
                result.append(text[i])
                i += 1
            else:
                matched = False
                for entity, char in entity_map.items():
                    if text.startswith(entity, i):
                        result.append(char)
                        i += len(entity)
                        matched = True
                        break
                if not matched:
                    result.append('&')
                    i += 1
        
        return ''.join(result)

The Python solution uses a dictionary for O(1) entity lookups and a list to efficiently accumulate characters. The text.startswith(entity, i) function ensures we match entities correctly without slicing unnecessarily. If no entity is matched, the & is preserved, maintaining the correct output.

Go Solution

func entityParser(text string) string {
    entityMap := map[string]string{
        "&quot;": "\"",
        "&apos;": "'",
        "&amp;":  "&",
        "&gt;":   ">",
        "&lt;":   "<",
        "&frasl;": "/",
    }
    
    result := []byte{}
    i := 0
    n := len(text)
    
    for i < n {
        if text[i] != '&' {
            result = append(result, text[i])
            i++
        } else {
            matched := false
            for entity, char := range entityMap {
                if len(text[i:]) >= len(entity) && text[i:i+len(entity)] == entity {
                    result = append(result, char[0])
                    i += len(entity)
                    matched = true
                    break
                }
            }
            if !matched {
                result = append(result, '&')
                i++
            }
        }
    }
    
    return string(result)
}

In Go, we use a map for entity lookups and a byte slice for efficient string building. Since Go strings are immutable, a slice is preferable to repeatedly concatenating strings. The logic is otherwise identical to the Python solution.

Worked Examples

Example 1:

Input: "&amp; is an HTML entity but &ambassador; is not."

Step-by-step:

i text[i] action result
0 & matches "&" &
5 ' ' append &
... ... append rest "& is an HTML entity but &ambassador; is not."

Example 2:

Input: "and I quote: &quot;...&quot;"

Step-by-step:

i text[i] action result
0-12 'and I quote:' append "and I quote:"
13 & matches """ '"'
18 . append "and I quote: ""
21 & matches """ '"'

Output: "and I quote: \"...\""

Complexity Analysis

Measure Complexity Explanation
Time O(n) Each character is processed at most once; entity lookups are O(1) via hash map.
Space O(n) Result string/list accumulates n characters; hash map is constant size.

The algorithm is efficient for the maximum input length of 10^5 characters because it avoids repeated substring scanning.

Test Cases

s = Solution()

# Provided examples
assert s.entityParser("&amp; is an HTML entity but &ambassador; is not.") == "& is an HTML entity but &ambassador; is not."  # normal case
assert s.entityParser("and I quote: &quot;...&quot;") == "and I quote: \"...\""  # quotes

# Edge cases
assert s.entityParser("") == ""  # empty string
assert s.entityParser("No entities here!") == "No entities here!"  # no entities
assert s.entityParser("&lt;&gt;&amp;&quot;&apos;&frasl;") == "<>&\"'/"  # all entities together
assert s.entityParser("&unknown; &amp; &notanentity;") == "&unknown; & &notanentity;"  # mix of valid and invalid
assert s.entityParser("&&&&&") == "&&&&&"  # repeated '&' without valid entities
Test Why
"& is an HTML entity but &ambassador; is not." Valid entity replacement while leaving invalid entities
"and I quote: "..."" Correct parsing of quotes
"" Handles empty string correctly
"No entities here!" No replacement needed
"<>&"'⁄" All entities replaced in sequence
"&unknown; & &notanentity;" Mix of valid and invalid entities handled
"&&&&&" Consecutive '&' characters without valid entities

Edge Cases

1. Entities at the start or end of the string: If an entity appears at the very beginning or end, a naive approach that checks surrounding characters may fail. Our scan handles this naturally because we only look ahead as far as the entity length.

2. Overlapping or partial matches: For example, in "&am", a naive slice