LeetCode 1236 - Web Crawler
The problem asks us to implement a simplified web crawler. We are given a starting URL and access to an HtmlParser interface that can retrieve all URLs linked from a given webpage.
Difficulty: 🟡 Medium
Topics: String, Depth-First Search, Breadth-First Search, Interactive
Solution
Problem Understanding
The problem asks us to implement a simplified web crawler. We are given a starting URL and access to an HtmlParser interface that can retrieve all URLs linked from a given webpage. Our task is to traverse the web graph beginning from startUrl, but with two important restrictions:
- We may only visit URLs that belong to the same hostname as
startUrl. - We must never crawl the same URL more than once.
The final result should contain every reachable URL under the same hostname, returned in any order.
A URL consists of several parts, but the only part relevant to this problem is the hostname. For example:
http://news.yahoo.com/news/topics/
has hostname:
news.yahoo.com
Two URLs are considered part of the same website only if their hostnames match exactly.
The HtmlParser.getUrls(url) method gives all outgoing links from a page. Conceptually, the URLs form a directed graph:
- Each URL is a node
- Each hyperlink is a directed edge
The crawler begins at startUrl and explores reachable nodes while filtering out URLs with different hostnames.
The constraints are relatively small:
- At most 1000 URLs exist
- URL length is at most 300 characters
These limits strongly suggest that graph traversal algorithms such as Depth-First Search (DFS) or Breadth-First Search (BFS) are appropriate. Since the graph may contain cycles, we must track visited URLs to avoid infinite loops.
An important detail is that URLs with and without a trailing slash are considered different:
http://example.com
http://example.com/
These must be treated as separate nodes.
Several edge cases are important:
- The starting URL may not link to any same-host pages
- The graph may contain cycles
- External URLs may appear frequently and must be ignored
- Multiple pages may link to the same URL
- The crawler may revisit a URL through different paths unless we properly track visited nodes
The problem guarantees that URLs are unique in the library, which simplifies duplicate handling.
Approaches
Brute Force Approach
A naive strategy would be to repeatedly crawl every discovered URL without tracking whether it has already been visited. Whenever we encounter a new page, we call getUrls() on it and continue recursively.
This approach is conceptually simple because it blindly explores every outgoing edge. However, web graphs frequently contain cycles:
A -> B
B -> C
C -> A
Without a visited set, the crawler would loop forever.
Even if we artificially limited recursion depth, the same URLs would be processed repeatedly, causing massive redundancy. In the worst case, the same page could be crawled exponentially many times through different paths.
Therefore, the brute-force approach is both inefficient and potentially non-terminating.
Optimal Approach
The key insight is that the URLs form a graph traversal problem. Since we only need to visit each valid URL once, we should use either DFS or BFS together with a hash set of visited URLs.
The algorithm works as follows:
- Extract the hostname from
startUrl - Begin traversal from
startUrl - For each discovered URL:
-
Skip it if already visited
-
Skip it if the hostname differs
-
Otherwise:
-
Mark it visited
-
Continue exploring its outgoing links
Using a hash set guarantees each URL is processed at most once.
Both DFS and BFS work equally well here. DFS is slightly simpler to implement recursively, while BFS avoids recursion depth concerns. Since the constraints are small, either is acceptable.
Approach Comparison
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | Exponential in worst case | Unbounded | Repeatedly revisits nodes, may loop forever |
| Optimal DFS/BFS | O(N + E) | O(N) | Visits each URL and edge at most once |
Here:
N= number of reachable URLsE= number of links between them
Algorithm Walkthrough
We will use Depth-First Search with a visited set.
Step 1: Extract the hostname
The hostname is the portion after "http://" and before the next "/".
For example:
http://news.yahoo.com/news
becomes:
news.yahoo.com
We store the hostname of startUrl because every crawled URL must match it.
Step 2: Initialize the visited set
We create a hash set named visited.
This serves two purposes:
- Prevents infinite loops in cyclic graphs
- Prevents redundant crawling of the same page
Hash sets provide average O(1) lookup and insertion.
Step 3: Start DFS traversal
We define a recursive DFS function that accepts a URL.
When DFS visits a URL:
- Add it to
visited - Retrieve all outgoing links using
htmlParser.getUrls(url)
Step 4: Filter URLs
For each discovered URL:
- Skip it if already visited
- Skip it if the hostname differs from the starting hostname
This ensures we only crawl valid pages exactly once.
Step 5: Continue recursion
If a URL passes both checks:
- Recursively call DFS on that URL
This continues until every reachable same-host URL has been explored.
Step 6: Return results
Once traversal finishes, all valid URLs are stored in visited.
Return them as a list.
Why it works
The algorithm maintains the invariant that every URL in visited:
- Has the same hostname as
startUrl - Has already been fully explored
Because each URL is visited only once, the traversal terminates even if cycles exist.
DFS systematically explores every reachable valid URL through outgoing links, so no eligible page can be missed.
Python Solution
from typing import List, Set
# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
# class HtmlParser(object):
# def getUrls(self, url):
# """
# :type url: str
# :rtype List[str]
# """
class Solution:
def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
def get_hostname(url: str) -> str:
return url.split("/")[2]
target_hostname = get_hostname(startUrl)
visited: Set[str] = set()
def dfs(url: str) -> None:
visited.add(url)
for next_url in htmlParser.getUrls(url):
if next_url in visited:
continue
if get_hostname(next_url) != target_hostname:
continue
dfs(next_url)
dfs(startUrl)
return list(visited)
The implementation begins with a helper function get_hostname() that extracts the hostname portion of a URL. Since all URLs use the http protocol and no ports are present, splitting by "/" is sufficient.
We then store the hostname of startUrl as target_hostname. Every crawled URL must match this value.
The visited set tracks all URLs that have already been processed. This prevents duplicate crawling and avoids infinite recursion in cyclic graphs.
The recursive dfs() function performs the graph traversal. When a URL is visited, it is immediately added to the set. Then we retrieve all neighboring URLs using htmlParser.getUrls(url).
For each neighbor:
- If already visited, we skip it
- If its hostname differs, we skip it
- Otherwise, we recursively explore it
Finally, after DFS completes, the visited URLs are returned as a list.
Go Solution
/**
* // This is HtmlParser's API interface.
* // You should not implement it, or speculate about its implementation
* type HtmlParser struct {
* func GetUrls(url string) []string {}
* }
*/
func crawl(startUrl string, htmlParser HtmlParser) []string {
getHostname := func(url string) string {
parts := strings.Split(url, "/")
return parts[2]
}
targetHostname := getHostname(startUrl)
visited := make(map[string]bool)
var dfs func(string)
dfs = func(url string) {
visited[url] = true
for _, nextURL := range htmlParser.GetUrls(url) {
if visited[nextURL] {
continue
}
if getHostname(nextURL) != targetHostname {
continue
}
dfs(nextURL)
}
}
dfs(startUrl)
result := make([]string, 0, len(visited))
for url := range visited {
result = append(result, url)
}
return result
}
The Go implementation follows the same logic as the Python version.
The main difference is that Go uses a map[string]bool instead of a hash set. A key exists in the map if the URL has been visited.
The DFS function is implemented as a recursive closure assigned to a variable, which allows self-recursion.
Go slices are dynamically sized, so we build the final result slice by iterating through the map keys.
One important detail is that the solution requires:
import "strings"
because hostname extraction uses strings.Split.
Worked Examples
Example 1
Input:
startUrl = "http://news.yahoo.com/news/topics/"
Target hostname:
news.yahoo.com
Traversal Process
| Step | Current URL | Discovered URLs | Visited Set |
|---|---|---|---|
| 1 | topics/ | homepage, news | {topics/} |
| 2 | homepage | us | {topics/, homepage} |
| 3 | us | none | {topics/, homepage, us} |
| 4 | news | none | {topics/, homepage, us, news} |
Final result:
[
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.yahoo.com/us"
]
The Google URL is ignored because its hostname differs.
Example 2
Input:
startUrl = "http://news.google.com"
Target hostname:
news.google.com
Traversal Process
| Step | Current URL | Discovered URLs | Action |
|---|---|---|---|
| 1 | google.com | yahoo URLs | Ignore all |
| 2 | done | none | traversal ends |
Final result:
["http://news.google.com"]
All neighboring URLs are filtered out because they belong to another hostname.
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(N + E) | Each URL and edge is processed at most once |
| Space | O(N) | Visited set and recursion stack store reachable URLs |
The DFS traversal visits each valid URL only once because of the visited set. Every outgoing edge is examined exactly once during traversal, giving total time complexity proportional to the number of nodes and edges in the reachable subgraph.
The space complexity comes primarily from the visited set. In the worst case, all reachable URLs belong to the same hostname and must be stored. The recursion stack can also grow to O(N) in a deeply connected chain.
Test Cases
from collections import defaultdict
from typing import List
class HtmlParser:
def __init__(self, graph):
self.graph = graph
def getUrls(self, url):
return self.graph[url]
class Solution:
def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
def get_hostname(url: str) -> str:
return url.split("/")[2]
target_hostname = get_hostname(startUrl)
visited = set()
def dfs(url):
visited.add(url)
for next_url in htmlParser.getUrls(url):
if next_url in visited:
continue
if get_hostname(next_url) != target_hostname:
continue
dfs(next_url)
dfs(startUrl)
return list(visited)
solution = Solution()
# Example 1
graph1 = {
"http://news.yahoo.com/news/topics/": [
"http://news.yahoo.com",
"http://news.yahoo.com/news",
],
"http://news.yahoo.com": [
"http://news.yahoo.com/us"
],
"http://news.yahoo.com/news": [],
"http://news.yahoo.com/us": [],
}
result1 = set(solution.crawl(
"http://news.yahoo.com/news/topics/",
HtmlParser(graph1)
))
assert result1 == {
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.yahoo.com/us"
}, "Example 1 test"
# Example 2
graph2 = {
"http://news.google.com": [
"http://news.yahoo.com",
"http://news.yahoo.com/news"
]
}
result2 = set(solution.crawl(
"http://news.google.com",
HtmlParser(graph2)
))
assert result2 == {
"http://news.google.com"
}, "Example 2 test"
# Single isolated page
graph3 = {
"http://example.com": []
}
result3 = set(solution.crawl(
"http://example.com",
HtmlParser(graph3)
))
assert result3 == {
"http://example.com"
}, "Single node graph"
# Cycle in graph
graph4 = {
"http://a.com": ["http://a.com/page1"],
"http://a.com/page1": ["http://a.com/page2"],
"http://a.com/page2": ["http://a.com"]
}
result4 = set(solution.crawl(
"http://a.com",
HtmlParser(graph4)
))
assert result4 == {
"http://a.com",
"http://a.com/page1",
"http://a.com/page2"
}, "Cycle handling"
# External links should be ignored
graph5 = {
"http://site.com": [
"http://site.com/page",
"http://other.com/page"
],
"http://site.com/page": [],
}
result5 = set(solution.crawl(
"http://site.com",
HtmlParser(graph5)
))
assert result5 == {
"http://site.com",
"http://site.com/page"
}, "Ignore external hostname"
# URLs with trailing slash are different
graph6 = {
"http://x.com": ["http://x.com/"],
"http://x.com/": []
}
result6 = set(solution.crawl(
"http://x.com",
HtmlParser(graph6)
))
assert result6 == {
"http://x.com",
"http://x.com/"
}, "Trailing slash distinction"
| Test | Why |
|---|---|
| Example 1 | Validates normal traversal across same hostname |
| Example 2 | Validates hostname filtering |
| Single isolated page | Ensures minimal input works |
| Cycle in graph | Confirms visited set prevents infinite loops |
| External links ignored | Verifies hostname filtering logic |
| Trailing slash distinction | Ensures URLs are treated as unique strings |
Edge Cases
Cyclic Links
A major source of bugs in graph traversal problems is cyclic references between pages. For example:
A -> B -> C -> A
Without tracking visited URLs, recursion would never terminate.
This implementation handles cycles safely by adding each URL to the visited set before exploring neighbors. Once a URL has been seen, it is skipped permanently.
External Hostnames
A page may link to many external websites. A naive crawler might continue traversing the entire internet graph.
For example:
http://news.yahoo.com
-> http://google.com
The solution prevents this by extracting hostnames and comparing them against the original hostname before recursion occurs.
Multiple Incoming Links
Different pages may link to the same destination:
A -> C
B -> C
Without a visited set, page C would be crawled multiple times.
The implementation guarantees each URL is processed once because membership in visited is checked before recursion.
Trailing Slash Differences
The problem explicitly states that:
http://example.com
http://example.com/
are different URLs.
Some implementations incorrectly normalize them into the same string. This solution does not modify URLs at all, it compares exact strings, so both are handled correctly as distinct pages.