LeetCode 1242 - Web Crawler Multithreaded
The problem asks us to implement a multi-threaded web crawler that starts from a given URL startUrl and visits all URLs reachable from it, but only those that share the same hostname.
Difficulty: 🟡 Medium
Topics: Depth-First Search, Breadth-First Search, Concurrency
Solution
Problem Understanding
The problem asks us to implement a multi-threaded web crawler that starts from a given URL startUrl and visits all URLs reachable from it, but only those that share the same hostname. We are given an interface HtmlParser that can return all URLs on a page through its getUrls(url) method. The goal is to return all URLs discovered, without visiting any URL more than once.
In essence, the problem is asking us to perform a graph traversal where nodes are URLs and edges exist if a page links to another. However, unlike a standard traversal, we are constrained to nodes (URLs) under the same hostname as the starting URL. The challenge emphasizes multi-threading because getUrls is a blocking operation and sequential traversal may exceed time limits.
The input is a single string startUrl, and the output is a list of URLs (order does not matter). Constraints assure us that there are at most 1000 URLs, each URL is at most 300 characters, and there are no duplicates. Hostname rules are standard and simplify extracting the hostname from a URL.
Important edge cases include URLs that link to pages outside the hostname, pages with no outgoing links, and cycles in the link graph.
Approaches
Brute Force Approach
A naive solution would perform a single-threaded DFS or BFS, keeping a visited set to avoid duplicates. Starting from startUrl, we recursively fetch URLs using getUrls, filter by hostname, and continue. This works correctly but will be too slow because getUrls is blocking and each URL request takes time. With a single thread, the worst-case scenario of visiting 1000 URLs sequentially can exceed time limits if we assume each request is 15ms.
Optimal Approach
The key observation is that getUrls is I/O-bound. Multi-threading allows us to initiate multiple URL requests in parallel, significantly reducing total crawl time. By using a thread pool or concurrent task execution, we can launch multiple crawling tasks simultaneously. Each thread processes a URL, fetches all linked URLs, filters by hostname, and submits any unvisited URLs to the pool.
This approach guarantees correctness as long as we properly synchronize access to the visited set, ensuring no URL is crawled twice.
| Approach | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Brute Force | O(n + e) sequential | O(n) | DFS/BFS single-threaded, slow due to blocking getUrls |
| Optimal | O(n + e) parallel | O(n) | Multi-threaded BFS/DFS using concurrent visited set, much faster wall-clock time |
Algorithm Walkthrough
- Extract Hostname: Parse
startUrlto extract the hostname. This will be used to filter URLs. - Initialize Concurrent Structures: Create a thread-safe
visitedset and a queue (or task pool) to hold URLs to crawl. - Start Crawling Tasks: Add
startUrlto the queue and mark it as visited. - Concurrent Processing: Each worker thread takes a URL from the queue, calls
htmlParser.getUrls(url), and filters the results to only those URLs with the same hostname that have not been visited. - Submit New URLs: Add any new, unvisited URLs to the queue for further processing. Mark them as visited immediately to avoid duplicates.
- Wait for Completion: Continue until the queue is empty and all worker threads finish.
- Return Results: Convert the visited set to a list and return.
Why it works: Each URL under the same hostname is eventually added to the queue and visited exactly once. Multi-threading ensures that the blocking calls overlap, reducing wall-clock time while maintaining correctness via synchronization on the visited set.
Python Solution
from typing import List
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor
import threading
class Solution:
def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
hostname = urlparse(startUrl).hostname
visited = set()
visited_lock = threading.Lock()
def crawl_url(url: str):
urls = htmlParser.getUrls(url)
for u in urls:
if urlparse(u).hostname == hostname:
with visited_lock:
if u not in visited:
visited.add(u)
executor.submit(crawl_url, u)
with ThreadPoolExecutor(max_workers=8) as executor:
visited.add(startUrl)
executor.submit(crawl_url, startUrl)
executor.shutdown(wait=True)
return list(visited)
Explanation: We first extract the hostname from startUrl. The visited set is protected by a lock to ensure thread-safety. The crawl_url function fetches URLs, filters by hostname, and submits new unvisited URLs back to the thread pool. We start by submitting startUrl, wait for all threads to finish, and then return all visited URLs.
Go Solution
package main
import (
"net/url"
"sync"
)
func crawl(startUrl string, htmlParser *HtmlParser) []string {
parsedUrl, _ := url.Parse(startUrl)
hostname := parsedUrl.Hostname()
visited := make(map[string]struct{})
var mu sync.Mutex
var wg sync.WaitGroup
var crawlURL func(string)
crawlURL = func(current string) {
defer wg.Done()
urls := htmlParser.getUrls(current)
for _, u := range urls {
parsed, _ := url.Parse(u)
if parsed.Hostname() == hostname {
mu.Lock()
if _, ok := visited[u]; !ok {
visited[u] = struct{}{}
wg.Add(1)
go crawlURL(u)
}
mu.Unlock()
}
}
}
mu.Lock()
visited[startUrl] = struct{}{}
mu.Unlock()
wg.Add(1)
go crawlURL(startUrl)
wg.Wait()
result := make([]string, 0, len(visited))
for u := range visited {
result = append(result, u)
}
return result
}
Explanation: Similar to Python, we use Go routines to crawl URLs in parallel. A sync.Mutex protects the visited map, and a sync.WaitGroup ensures the main thread waits until all crawling goroutines finish. New URLs are only crawled if they are under the same hostname and not visited yet.
Worked Examples
Example 1:
Start URL: "http://news.yahoo.com/news/topics/"
Hostname: "news.yahoo.com"
- Visit
"http://news.yahoo.com/news/topics/". - Fetch URLs:
["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.google.com", "http://news.yahoo.com/us"] - Filter same hostname:
["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/us"] - Submit each unvisited URL to thread pool and repeat.
- Eventually,
visited = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.yahoo.com/us"]
Example 2:
Start URL: "http://news.google.com"
Hostname: "news.google.com"
- Visit
"http://news.google.com" - Fetch URLs: all other URLs are different hostnames.
- Result: only
["http://news.google.com"].
Complexity Analysis
| Measure | Complexity | Explanation |
|---|---|---|
| Time | O(n + e) | Each URL is visited once, each edge processed once. Wall-clock time is much faster due to parallelism. |
| Space | O(n) | Storing visited URLs and queue/task pool. |
Even though the theoretical time is O(n + e), multi-threading reduces the actual crawl time significantly.
Test Cases
# Provided examples
assert set(Solution().crawl("http://news.yahoo.com/news/topics/", htmlParser)) == set([
"http://news.yahoo.com",
"http://news.yahoo.com/news",
"http://news.yahoo.com/news/topics/",
"http://news.yahoo.com/us"
])
assert set(Solution().crawl("http://news.google.com", htmlParser)) == set([
"http://news.google.com"
])
# Edge cases
assert set(Solution().crawl("http://single.com", htmlParser)) == set([
"http://single.com"
]) # only one URL
assert set(Solution().crawl("http://loop.com/start", htmlParser)) == set([
"http://loop.com/start",
"http://loop.com/next"
]) # URL links back to itself
assert set(Solution().crawl("http://example.com", htmlParser)) == set([
"http://example.com",
"http://example.com/page1",
"http://example.com/page2"
]) # multiple URLs under same hostname
| Test | Why |
|---|---|
| Example 1 | Normal multi-page crawl |
| Example 2 | Only start URL is valid |
| Single URL | No links |
| Loop | Avoid infinite cycles |
| Multiple URLs | Ensure all under same hostname are crawled |
Edge Cases
- Self-referencing pages: Some URLs may link back to themselves, forming a cycle. The visited set prevents infinite recursion.
- External links: URLs pointing to a different hostname