LeetCode 1236 - Web Crawler

The problem asks us to implement a simplified web crawler. We are given a starting URL and access to an HtmlParser interface that can retrieve all URLs linked from a given webpage.

LeetCode Problem 1236

Difficulty: 🟡 Medium
Topics: String, Depth-First Search, Breadth-First Search, Interactive

Solution

Problem Understanding

The problem asks us to implement a simplified web crawler. We are given a starting URL and access to an HtmlParser interface that can retrieve all URLs linked from a given webpage. Our task is to traverse the web graph beginning from startUrl, but with two important restrictions:

  1. We may only visit URLs that belong to the same hostname as startUrl.
  2. We must never crawl the same URL more than once.

The final result should contain every reachable URL under the same hostname, returned in any order.

A URL consists of several parts, but the only part relevant to this problem is the hostname. For example:

http://news.yahoo.com/news/topics/

has hostname:

news.yahoo.com

Two URLs are considered part of the same website only if their hostnames match exactly.

The HtmlParser.getUrls(url) method gives all outgoing links from a page. Conceptually, the URLs form a directed graph:

  • Each URL is a node
  • Each hyperlink is a directed edge

The crawler begins at startUrl and explores reachable nodes while filtering out URLs with different hostnames.

The constraints are relatively small:

  • At most 1000 URLs exist
  • URL length is at most 300 characters

These limits strongly suggest that graph traversal algorithms such as Depth-First Search (DFS) or Breadth-First Search (BFS) are appropriate. Since the graph may contain cycles, we must track visited URLs to avoid infinite loops.

An important detail is that URLs with and without a trailing slash are considered different:

http://example.com
http://example.com/

These must be treated as separate nodes.

Several edge cases are important:

  • The starting URL may not link to any same-host pages
  • The graph may contain cycles
  • External URLs may appear frequently and must be ignored
  • Multiple pages may link to the same URL
  • The crawler may revisit a URL through different paths unless we properly track visited nodes

The problem guarantees that URLs are unique in the library, which simplifies duplicate handling.

Approaches

Brute Force Approach

A naive strategy would be to repeatedly crawl every discovered URL without tracking whether it has already been visited. Whenever we encounter a new page, we call getUrls() on it and continue recursively.

This approach is conceptually simple because it blindly explores every outgoing edge. However, web graphs frequently contain cycles:

A -> B
B -> C
C -> A

Without a visited set, the crawler would loop forever.

Even if we artificially limited recursion depth, the same URLs would be processed repeatedly, causing massive redundancy. In the worst case, the same page could be crawled exponentially many times through different paths.

Therefore, the brute-force approach is both inefficient and potentially non-terminating.

Optimal Approach

The key insight is that the URLs form a graph traversal problem. Since we only need to visit each valid URL once, we should use either DFS or BFS together with a hash set of visited URLs.

The algorithm works as follows:

  1. Extract the hostname from startUrl
  2. Begin traversal from startUrl
  3. For each discovered URL:
  • Skip it if already visited

  • Skip it if the hostname differs

  • Otherwise:

  • Mark it visited

  • Continue exploring its outgoing links

Using a hash set guarantees each URL is processed at most once.

Both DFS and BFS work equally well here. DFS is slightly simpler to implement recursively, while BFS avoids recursion depth concerns. Since the constraints are small, either is acceptable.

Approach Comparison

Approach Time Complexity Space Complexity Notes
Brute Force Exponential in worst case Unbounded Repeatedly revisits nodes, may loop forever
Optimal DFS/BFS O(N + E) O(N) Visits each URL and edge at most once

Here:

  • N = number of reachable URLs
  • E = number of links between them

Algorithm Walkthrough

We will use Depth-First Search with a visited set.

Step 1: Extract the hostname

The hostname is the portion after "http://" and before the next "/".

For example:

http://news.yahoo.com/news

becomes:

news.yahoo.com

We store the hostname of startUrl because every crawled URL must match it.

Step 2: Initialize the visited set

We create a hash set named visited.

This serves two purposes:

  • Prevents infinite loops in cyclic graphs
  • Prevents redundant crawling of the same page

Hash sets provide average O(1) lookup and insertion.

Step 3: Start DFS traversal

We define a recursive DFS function that accepts a URL.

When DFS visits a URL:

  • Add it to visited
  • Retrieve all outgoing links using htmlParser.getUrls(url)

Step 4: Filter URLs

For each discovered URL:

  • Skip it if already visited
  • Skip it if the hostname differs from the starting hostname

This ensures we only crawl valid pages exactly once.

Step 5: Continue recursion

If a URL passes both checks:

  • Recursively call DFS on that URL

This continues until every reachable same-host URL has been explored.

Step 6: Return results

Once traversal finishes, all valid URLs are stored in visited.

Return them as a list.

Why it works

The algorithm maintains the invariant that every URL in visited:

  1. Has the same hostname as startUrl
  2. Has already been fully explored

Because each URL is visited only once, the traversal terminates even if cycles exist.

DFS systematically explores every reachable valid URL through outgoing links, so no eligible page can be missed.

Python Solution

from typing import List, Set

# """
# This is HtmlParser's API interface.
# You should not implement it, or speculate about its implementation
# """
# class HtmlParser(object):
#     def getUrls(self, url):
#         """
#         :type url: str
#         :rtype List[str]
#         """

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url: str) -> str:
            return url.split("/")[2]

        target_hostname = get_hostname(startUrl)
        visited: Set[str] = set()

        def dfs(url: str) -> None:
            visited.add(url)

            for next_url in htmlParser.getUrls(url):
                if next_url in visited:
                    continue

                if get_hostname(next_url) != target_hostname:
                    continue

                dfs(next_url)

        dfs(startUrl)

        return list(visited)

The implementation begins with a helper function get_hostname() that extracts the hostname portion of a URL. Since all URLs use the http protocol and no ports are present, splitting by "/" is sufficient.

We then store the hostname of startUrl as target_hostname. Every crawled URL must match this value.

The visited set tracks all URLs that have already been processed. This prevents duplicate crawling and avoids infinite recursion in cyclic graphs.

The recursive dfs() function performs the graph traversal. When a URL is visited, it is immediately added to the set. Then we retrieve all neighboring URLs using htmlParser.getUrls(url).

For each neighbor:

  • If already visited, we skip it
  • If its hostname differs, we skip it
  • Otherwise, we recursively explore it

Finally, after DFS completes, the visited URLs are returned as a list.

Go Solution

/**
 * // This is HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * type HtmlParser struct {
 *     func GetUrls(url string) []string {}
 * }
 */

func crawl(startUrl string, htmlParser HtmlParser) []string {
	getHostname := func(url string) string {
		parts := strings.Split(url, "/")
		return parts[2]
	}

	targetHostname := getHostname(startUrl)

	visited := make(map[string]bool)

	var dfs func(string)

	dfs = func(url string) {
		visited[url] = true

		for _, nextURL := range htmlParser.GetUrls(url) {
			if visited[nextURL] {
				continue
			}

			if getHostname(nextURL) != targetHostname {
				continue
			}

			dfs(nextURL)
		}
	}

	dfs(startUrl)

	result := make([]string, 0, len(visited))

	for url := range visited {
		result = append(result, url)
	}

	return result
}

The Go implementation follows the same logic as the Python version.

The main difference is that Go uses a map[string]bool instead of a hash set. A key exists in the map if the URL has been visited.

The DFS function is implemented as a recursive closure assigned to a variable, which allows self-recursion.

Go slices are dynamically sized, so we build the final result slice by iterating through the map keys.

One important detail is that the solution requires:

import "strings"

because hostname extraction uses strings.Split.

Worked Examples

Example 1

Input:

startUrl = "http://news.yahoo.com/news/topics/"

Target hostname:

news.yahoo.com

Traversal Process

Step Current URL Discovered URLs Visited Set
1 topics/ homepage, news {topics/}
2 homepage us {topics/, homepage}
3 us none {topics/, homepage, us}
4 news none {topics/, homepage, us, news}

Final result:

[
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]

The Google URL is ignored because its hostname differs.

Example 2

Input:

startUrl = "http://news.google.com"

Target hostname:

news.google.com

Traversal Process

Step Current URL Discovered URLs Action
1 google.com yahoo URLs Ignore all
2 done none traversal ends

Final result:

["http://news.google.com"]

All neighboring URLs are filtered out because they belong to another hostname.

Complexity Analysis

Measure Complexity Explanation
Time O(N + E) Each URL and edge is processed at most once
Space O(N) Visited set and recursion stack store reachable URLs

The DFS traversal visits each valid URL only once because of the visited set. Every outgoing edge is examined exactly once during traversal, giving total time complexity proportional to the number of nodes and edges in the reachable subgraph.

The space complexity comes primarily from the visited set. In the worst case, all reachable URLs belong to the same hostname and must be stored. The recursion stack can also grow to O(N) in a deeply connected chain.

Test Cases

from collections import defaultdict
from typing import List

class HtmlParser:
    def __init__(self, graph):
        self.graph = graph

    def getUrls(self, url):
        return self.graph[url]

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def get_hostname(url: str) -> str:
            return url.split("/")[2]

        target_hostname = get_hostname(startUrl)
        visited = set()

        def dfs(url):
            visited.add(url)

            for next_url in htmlParser.getUrls(url):
                if next_url in visited:
                    continue

                if get_hostname(next_url) != target_hostname:
                    continue

                dfs(next_url)

        dfs(startUrl)

        return list(visited)

solution = Solution()

# Example 1
graph1 = {
    "http://news.yahoo.com/news/topics/": [
        "http://news.yahoo.com",
        "http://news.yahoo.com/news",
    ],
    "http://news.yahoo.com": [
        "http://news.yahoo.com/us"
    ],
    "http://news.yahoo.com/news": [],
    "http://news.yahoo.com/us": [],
}

result1 = set(solution.crawl(
    "http://news.yahoo.com/news/topics/",
    HtmlParser(graph1)
))

assert result1 == {
    "http://news.yahoo.com",
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/topics/",
    "http://news.yahoo.com/us"
}, "Example 1 test"

# Example 2
graph2 = {
    "http://news.google.com": [
        "http://news.yahoo.com",
        "http://news.yahoo.com/news"
    ]
}

result2 = set(solution.crawl(
    "http://news.google.com",
    HtmlParser(graph2)
))

assert result2 == {
    "http://news.google.com"
}, "Example 2 test"

# Single isolated page
graph3 = {
    "http://example.com": []
}

result3 = set(solution.crawl(
    "http://example.com",
    HtmlParser(graph3)
))

assert result3 == {
    "http://example.com"
}, "Single node graph"

# Cycle in graph
graph4 = {
    "http://a.com": ["http://a.com/page1"],
    "http://a.com/page1": ["http://a.com/page2"],
    "http://a.com/page2": ["http://a.com"]
}

result4 = set(solution.crawl(
    "http://a.com",
    HtmlParser(graph4)
))

assert result4 == {
    "http://a.com",
    "http://a.com/page1",
    "http://a.com/page2"
}, "Cycle handling"

# External links should be ignored
graph5 = {
    "http://site.com": [
        "http://site.com/page",
        "http://other.com/page"
    ],
    "http://site.com/page": [],
}

result5 = set(solution.crawl(
    "http://site.com",
    HtmlParser(graph5)
))

assert result5 == {
    "http://site.com",
    "http://site.com/page"
}, "Ignore external hostname"

# URLs with trailing slash are different
graph6 = {
    "http://x.com": ["http://x.com/"],
    "http://x.com/": []
}

result6 = set(solution.crawl(
    "http://x.com",
    HtmlParser(graph6)
))

assert result6 == {
    "http://x.com",
    "http://x.com/"
}, "Trailing slash distinction"
Test Why
Example 1 Validates normal traversal across same hostname
Example 2 Validates hostname filtering
Single isolated page Ensures minimal input works
Cycle in graph Confirms visited set prevents infinite loops
External links ignored Verifies hostname filtering logic
Trailing slash distinction Ensures URLs are treated as unique strings

Edge Cases

A major source of bugs in graph traversal problems is cyclic references between pages. For example:

A -> B -> C -> A

Without tracking visited URLs, recursion would never terminate.

This implementation handles cycles safely by adding each URL to the visited set before exploring neighbors. Once a URL has been seen, it is skipped permanently.

External Hostnames

A page may link to many external websites. A naive crawler might continue traversing the entire internet graph.

For example:

http://news.yahoo.com
    -> http://google.com

The solution prevents this by extracting hostnames and comparing them against the original hostname before recursion occurs.

Different pages may link to the same destination:

A -> C
B -> C

Without a visited set, page C would be crawled multiple times.

The implementation guarantees each URL is processed once because membership in visited is checked before recursion.

Trailing Slash Differences

The problem explicitly states that:

http://example.com
http://example.com/

are different URLs.

Some implementations incorrectly normalize them into the same string. This solution does not modify URLs at all, it compares exact strings, so both are handled correctly as distinct pages.