LeetCode 1242 - Web Crawler Multithreaded

Difficulty: 🟡 Medium
Topics: Depth-First Search, Breadth-First Search, Concurrency

Solution

Problem Understanding

The problem asks us to implement a multi-threaded web crawler that starts from a given URL startUrl and visits all URLs reachable from it, but only those that share the same hostname. We are given an interface HtmlParser that can return all URLs on a page through its getUrls(url) method. The goal is to return all URLs discovered, without visiting any URL more than once.

In essence, the problem is asking us to perform a graph traversal where nodes are URLs and edges exist if a page links to another. However, unlike a standard traversal, we are constrained to nodes (URLs) under the same hostname as the starting URL. The challenge emphasizes multi-threading because getUrls is a blocking operation and sequential traversal may exceed time limits.

The input is a single string startUrl, and the output is a list of URLs (order does not matter). Constraints assure us that there are at most 1000 URLs, each URL is at most 300 characters, and there are no duplicates. Hostname rules are standard and simplify extracting the hostname from a URL.

Important edge cases include URLs that link to pages outside the hostname, pages with no outgoing links, and cycles in the link graph.

Approaches

Brute Force Approach

A naive solution would perform a single-threaded DFS or BFS, keeping a visited set to avoid duplicates. Starting from startUrl, we recursively fetch URLs using getUrls, filter by hostname, and continue. This works correctly but will be too slow because getUrls is blocking and each URL request takes time. With a single thread, the worst-case scenario of visiting 1000 URLs sequentially can exceed time limits if we assume each request is 15ms.

Optimal Approach

The key observation is that getUrls is I/O-bound. Multi-threading allows us to initiate multiple URL requests in parallel, significantly reducing total crawl time. By using a thread pool or concurrent task execution, we can launch multiple crawling tasks simultaneously. Each thread processes a URL, fetches all linked URLs, filters by hostname, and submits any unvisited URLs to the pool.

This approach guarantees correctness as long as we properly synchronize access to the visited set, ensuring no URL is crawled twice.

Approach	Time Complexity	Space Complexity	Notes
Brute Force	O(n + e) sequential	O(n)	DFS/BFS single-threaded, slow due to blocking `getUrls`
Optimal	O(n + e) parallel	O(n)	Multi-threaded BFS/DFS using concurrent visited set, much faster wall-clock time

Algorithm Walkthrough

Extract Hostname: Parse startUrl to extract the hostname. This will be used to filter URLs.
Initialize Concurrent Structures: Create a thread-safe visited set and a queue (or task pool) to hold URLs to crawl.
Start Crawling Tasks: Add startUrl to the queue and mark it as visited.
Concurrent Processing: Each worker thread takes a URL from the queue, calls htmlParser.getUrls(url), and filters the results to only those URLs with the same hostname that have not been visited.
Submit New URLs: Add any new, unvisited URLs to the queue for further processing. Mark them as visited immediately to avoid duplicates.
Wait for Completion: Continue until the queue is empty and all worker threads finish.
Return Results: Convert the visited set to a list and return.

Why it works: Each URL under the same hostname is eventually added to the queue and visited exactly once. Multi-threading ensures that the blocking calls overlap, reducing wall-clock time while maintaining correctness via synchronization on the visited set.

Python Solution

from typing import List
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor
import threading

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        hostname = urlparse(startUrl).hostname
        visited = set()
        visited_lock = threading.Lock()
        
        def crawl_url(url: str):
            urls = htmlParser.getUrls(url)
            for u in urls:
                if urlparse(u).hostname == hostname:
                    with visited_lock:
                        if u not in visited:
                            visited.add(u)
                            executor.submit(crawl_url, u)
        
        with ThreadPoolExecutor(max_workers=8) as executor:
            visited.add(startUrl)
            executor.submit(crawl_url, startUrl)
            executor.shutdown(wait=True)
        
        return list(visited)

Explanation: We first extract the hostname from startUrl. The visited set is protected by a lock to ensure thread-safety. The crawl_url function fetches URLs, filters by hostname, and submits new unvisited URLs back to the thread pool. We start by submitting startUrl, wait for all threads to finish, and then return all visited URLs.

Go Solution

package main

import (
	"net/url"
	"sync"
)

func crawl(startUrl string, htmlParser *HtmlParser) []string {
	parsedUrl, _ := url.Parse(startUrl)
	hostname := parsedUrl.Hostname()
	visited := make(map[string]struct{})
	var mu sync.Mutex
	var wg sync.WaitGroup

	var crawlURL func(string)
	crawlURL = func(current string) {
		defer wg.Done()
		urls := htmlParser.getUrls(current)
		for _, u := range urls {
			parsed, _ := url.Parse(u)
			if parsed.Hostname() == hostname {
				mu.Lock()
				if _, ok := visited[u]; !ok {
					visited[u] = struct{}{}
					wg.Add(1)
					go crawlURL(u)
				}
				mu.Unlock()
			}
		}
	}

	mu.Lock()
	visited[startUrl] = struct{}{}
	mu.Unlock()
	wg.Add(1)
	go crawlURL(startUrl)
	wg.Wait()

	result := make([]string, 0, len(visited))
	for u := range visited {
		result = append(result, u)
	}
	return result
}

Explanation: Similar to Python, we use Go routines to crawl URLs in parallel. A sync.Mutex protects the visited map, and a sync.WaitGroup ensures the main thread waits until all crawling goroutines finish. New URLs are only crawled if they are under the same hostname and not visited yet.

Worked Examples

Example 1:

Start URL: "http://news.yahoo.com/news/topics/"

Hostname: "news.yahoo.com"

Visit "http://news.yahoo.com/news/topics/".
Fetch URLs: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.google.com", "http://news.yahoo.com/us"]
Filter same hostname: ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/us"]
Submit each unvisited URL to thread pool and repeat.
Eventually, visited = ["http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.yahoo.com/us"]

Example 2:

Start URL: "http://news.google.com"

Hostname: "news.google.com"

Visit "http://news.google.com"
Fetch URLs: all other URLs are different hostnames.
Result: only ["http://news.google.com"].

Complexity Analysis

Measure	Complexity	Explanation
Time	O(n + e)	Each URL is visited once, each edge processed once. Wall-clock time is much faster due to parallelism.
Space	O(n)	Storing visited URLs and queue/task pool.

Even though the theoretical time is O(n + e), multi-threading reduces the actual crawl time significantly.

Test Cases

# Provided examples
assert set(Solution().crawl("http://news.yahoo.com/news/topics/", htmlParser)) == set([
    "http://news.yahoo.com",
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/topics/",
    "http://news.yahoo.com/us"
])

assert set(Solution().crawl("http://news.google.com", htmlParser)) == set([
    "http://news.google.com"
])

# Edge cases
assert set(Solution().crawl("http://single.com", htmlParser)) == set([
    "http://single.com"
]) # only one URL

assert set(Solution().crawl("http://loop.com/start", htmlParser)) == set([
    "http://loop.com/start",
    "http://loop.com/next"
]) # URL links back to itself

assert set(Solution().crawl("http://example.com", htmlParser)) == set([
    "http://example.com",
    "http://example.com/page1",
    "http://example.com/page2"
]) # multiple URLs under same hostname

Test	Why
Example 1	Normal multi-page crawl
Example 2	Only start URL is valid
Single URL	No links
Loop	Avoid infinite cycles
Multiple URLs	Ensure all under same hostname are crawled

Edge Cases

Self-referencing pages: Some URLs may link back to themselves, forming a cycle. The visited set prevents infinite recursion.
External links: URLs pointing to a different hostname