LeetCode 1410 - HTML Entity Parser
Here is a complete, detailed technical guide for LeetCode 1410 following your formatting and style requirements. The pro
Difficulty: 🟡 Medium
Topics: Hash Table, String
Solution
Here is a complete, detailed technical guide for LeetCode 1410 following your formatting and style requirements.
Problem Understanding
The problem asks us to implement an HTML entity parser. We are given a string text which may contain HTML entities such as " for " or & for &. The goal is to replace all valid entities in the string with their corresponding special characters, leaving any invalid or unrecognized entities intact. For example, &ambassador; is not a valid HTML entity in this context, so it should remain as-is.
The input is a single string, and the output is a transformed string where recognized entities are replaced by the characters they represent. The constraints indicate that text can be as long as 100,000 characters and may include any ASCII character, so efficiency is important. Edge cases include entities at the start or end of the string, multiple consecutive entities, or partial/invalid entities that resemble valid ones but are not.
The key challenge is efficiently scanning and replacing entities without repeatedly searching the string, which could be costly for long inputs. Using a hash map (dictionary) for entity lookup allows for O(1) replacement checks. We also need to carefully handle overlapping or partially matched entities.
Approaches
A brute-force approach would iterate through the string, checking at each position if a substring matches any valid entity. This involves repeatedly comparing substrings and could be inefficient, especially for long strings, because each comparison might involve up to 7 characters (the longest entity ⁄ is 7 characters).
The optimal approach is to scan the string once from left to right, building a result string as we go. Whenever we encounter &, we attempt to match a valid entity starting at that position using a hash map. If a match is found, we append the corresponding character and advance the pointer by the length of the entity. Otherwise, we append the current character and move forward. This guarantees linear time complexity O(n), as each character is processed at most once.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n * m) | O(1) | Check each substring against all entities repeatedly; m is max entity length. |
| Optimal | O(n) | O(1) | Scan once, use hash map for entity replacement, build result string incrementally. |
Algorithm Walkthrough
-
Initialize a hash map
entity_mapcontaining all HTML entities as keys and their corresponding characters as values. -
Initialize an empty list
resultto efficiently build the output string. -
Initialize a pointer
i = 0to scan the input stringtext. -
While
i < len(text), check the current character: -
If
text[i] != '&', appendtext[i]toresultand incrementiby 1. -
If
text[i] == '&', attempt to match an entity: -
Iterate through all possible entities in
entity_map. -
For each entity, check if
text[i:i+len(entity)] == entity. -
If a match is found, append the mapped character to
result, incrementiby the entity length, and break the inner loop. -
If no entity matches, append
&toresultand incrementiby 1. -
After the loop, join
resultinto a string and return it.
Why it works: The algorithm maintains the invariant that result always contains the correctly parsed substring of text up to position i. By scanning left to right and using a hash map lookup for potential entities, we ensure each recognized entity is replaced exactly once, and unmatched characters remain unchanged.
Python Solution
class Solution:
def entityParser(self, text: str) -> str:
entity_map = {
""": "\"",
"'": "'",
"&": "&",
">": ">",
"<": "<",
"⁄": "/"
}
result = []
i = 0
n = len(text)
while i < n:
if text[i] != '&':
result.append(text[i])
i += 1
else:
matched = False
for entity, char in entity_map.items():
if text.startswith(entity, i):
result.append(char)
i += len(entity)
matched = True
break
if not matched:
result.append('&')
i += 1
return ''.join(result)
The Python solution uses a dictionary for O(1) entity lookups and a list to efficiently accumulate characters. The text.startswith(entity, i) function ensures we match entities correctly without slicing unnecessarily. If no entity is matched, the & is preserved, maintaining the correct output.
Go Solution
func entityParser(text string) string {
entityMap := map[string]string{
""": "\"",
"'": "'",
"&": "&",
">": ">",
"<": "<",
"⁄": "/",
}
result := []byte{}
i := 0
n := len(text)
for i < n {
if text[i] != '&' {
result = append(result, text[i])
i++
} else {
matched := false
for entity, char := range entityMap {
if len(text[i:]) >= len(entity) && text[i:i+len(entity)] == entity {
result = append(result, char[0])
i += len(entity)
matched = true
break
}
}
if !matched {
result = append(result, '&')
i++
}
}
}
return string(result)
}
In Go, we use a map for entity lookups and a byte slice for efficient string building. Since Go strings are immutable, a slice is preferable to repeatedly concatenating strings. The logic is otherwise identical to the Python solution.
Worked Examples
Example 1:
Input: "& is an HTML entity but &ambassador; is not."
Step-by-step:
| i | text[i] | action | result |
|---|---|---|---|
| 0 | & | matches "&" | & |
| 5 | ' ' | append | & |
| ... | ... | append rest | "& is an HTML entity but &ambassador; is not." |
Example 2:
Input: "and I quote: "...""
Step-by-step:
| i | text[i] | action | result |
|---|---|---|---|
| 0-12 | 'and I quote:' | append | "and I quote:" |
| 13 | & | matches """ | '"' |
| 18 | . | append | "and I quote: "" |
| 21 | & | matches """ | '"' |
Output: "and I quote: \"...\""
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n) | Each character is processed at most once; entity lookups are O(1) via hash map. |
| Space | O(n) | Result string/list accumulates n characters; hash map is constant size. |
The algorithm is efficient for the maximum input length of 10^5 characters because it avoids repeated substring scanning.
Test Cases
s = Solution()
# Provided examples
assert s.entityParser("& is an HTML entity but &ambassador; is not.") == "& is an HTML entity but &ambassador; is not." # normal case
assert s.entityParser("and I quote: "..."") == "and I quote: \"...\"" # quotes
# Edge cases
assert s.entityParser("") == "" # empty string
assert s.entityParser("No entities here!") == "No entities here!" # no entities
assert s.entityParser("<>&"'⁄") == "<>&\"'/" # all entities together
assert s.entityParser("&unknown; & ¬anentity;") == "&unknown; & ¬anentity;" # mix of valid and invalid
assert s.entityParser("&&&&&") == "&&&&&" # repeated '&' without valid entities
| Test | Why |
|---|---|
| "& is an HTML entity but &ambassador; is not." | Valid entity replacement while leaving invalid entities |
| "and I quote: "..."" | Correct parsing of quotes |
| "" | Handles empty string correctly |
| "No entities here!" | No replacement needed |
| "<>&"'⁄" | All entities replaced in sequence |
| "&unknown; & ¬anentity;" | Mix of valid and invalid entities handled |
| "&&&&&" | Consecutive '&' characters without valid entities |
Edge Cases
1. Entities at the start or end of the string: If an entity appears at the very beginning or end, a naive approach that checks surrounding characters may fail. Our scan handles this naturally because we only look ahead as far as the entity length.
2. Overlapping or partial matches: For example, in "&am", a naive slice