Question Details

No question body available.

Tags

design web-crawler

Answers (2)

October 20, 2025 Score: 1 Rep: 8,129 Quality: Low Completeness: 80%

This design question is generic across languages and technologies; the fact that you're using specific Python libraries isn't relevant.

A work table contains (domain, url) pairs, and at any instant only a single crawler machine may work on the entries of a given domain. Create a current-status crawl table with domain as its primary key; assume it holds something in the ballpark of a million rows.
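For concreteness, here is a minimal sketch of those two tables, assuming Postgres accessed via psycopg2; apart from the crawl.crawledat column discussed below, all the names are illustrative, not prescribed by the question.

```python
# Sketch only: hypothetical schema for the work and crawl tables (Postgres).
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS work (
    domain text NOT NULL,
    url    text NOT NULL,
    PRIMARY KEY (domain, url)
);

CREATE TABLE IF NOT EXISTS crawl (
    domain    text PRIMARY KEY,
    crawledat timestamptz      -- last time a worker claimed this domain
);
"""

with psycopg2.connect("dbname=crawler") as conn:   # connection string is an assumption
    with conn.cursor() as cur:
        cur.execute(DDL)
```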

The work is performed by N crawler machines or processes, the "workers".

naïve central approach

You're already using a scalable RDBMS: Postgres. Relying on its ACID guarantees, let workers race to COMMIT an UPDATE of the crawl.crawledat timestamp column. A worker who successfully updates it with a fresh stamp has permission to pull any of that domain's work entries and crawl them.
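A minimal sketch of that claim-by-UPDATE race, assuming the schema above; the one-hour staleness test is an arbitrary illustration of what "a fresh stamp" might mean, not something the question specifies.

```python
# Sketch of the "race to UPDATE" claim, assuming the crawl table above.
# The staleness interval is an arbitrary choice for illustration.
import psycopg2

def try_claim(conn, domain: str) -> bool:
    """Return True iff this worker won the race to claim `domain`."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE crawl
               SET crawledat = now()
             WHERE domain = %s
               AND (crawledat IS NULL OR crawledat < now() - interval '1 hour')
            """,
            (domain,),
        )
        claimed = cur.rowcount == 1
    conn.commit()   # the COMMIT is the moment the claim becomes visible to other workers
    return claimed
```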

Downside is that we're leaning pretty heavily on database scaling.

sharded approach

We don't need a central database to help us create an N-way partition of the set of domains. A hash function could do that as easily, with no communication overhead.

Add a hash column to the crawl table. Compute SHA3(domain) or some other convenient message digest. Store a prefix of that hash, perhaps the first 64 bits. Arrange for workers to elect a leader, so they know the set of current workers. Crucially, they know N, the size of the set of workers. Each worker has a name, from 0 .. N-1.

A worker may work on a domain's entries iff the worker's name matches the domain's hash modulo N.
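A minimal sketch of that ownership rule, assuming SHA3-256 and the 64-bit prefix suggested above:

```python
# Sketch of the ownership rule: a worker owns a domain iff the first 64 bits
# of SHA3(domain), interpreted as an integer, equal its name modulo N.
import hashlib

def domain_hash64(domain: str) -> int:
    digest = hashlib.sha3_256(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # first 64 bits of the digest

def owns(worker_name: int, n_workers: int, domain: str) -> bool:
    return domain_hash64(domain) % n_workers == worker_name
```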

From time to time workers come and go, so workers will eventually need to become aware of a new name and a new N. If this happens frequently, it may be convenient to identify workers with a (community, id) tuple, so we can shard first on the assigned community (maybe East Coast datacenter vs. West Coast datacenter) and then on a numeric ID within that community. The idea is that workers frequently come and go, but communities seldom do, so re-sharding for a given worker transition will only affect a single community. If there are four communities then the million domains form four shards of roughly 250 K domains each, and each community further partitions its 250 K domains so that just a single worker is responsible for any given domain.
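A sketch of that two-level variant, assuming four communities and an equal worker count per community; both numbers are illustrative.

```python
# Sketch of two-level sharding: hash once, take the community from one part
# of the hash and the worker within that community from the remainder.
import hashlib

N_COMMUNITIES = 4   # assumption: four communities, e.g. four datacenters

def two_level_shard(domain: str, workers_per_community: int) -> tuple[int, int]:
    h = int.from_bytes(hashlib.sha3_256(domain.encode()).digest()[:8], "big")
    community = h % N_COMMUNITIES                                # seldom changes
    worker_id = (h // N_COMMUNITIES) % workers_per_community     # re-sharded locally
    return community, worker_id
```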

Leader elections can be handled by Raft or Paxos, but they happen seldom enough that you can view it as just a DB problem or a Redis/Valkey problem.
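If you do treat it as a Redis problem, one common pattern is a lease key: whoever currently holds the key is leader and can publish the worker roster. A rough sketch, with the key name and TTL as assumptions (the renewal check is deliberately simplified and not atomic):

```python
# Sketch of leadership as "just a Redis problem": a short-lived lease key.
import redis

r = redis.Redis()

def try_become_leader(worker_name: str, ttl_seconds: int = 30) -> bool:
    # SET key value NX EX ttl -- succeeds only if no current leader lease exists.
    return bool(r.set("crawler:leader", worker_name, nx=True, ex=ttl_seconds))

def renew_leadership(worker_name: str, ttl_seconds: int = 30) -> bool:
    # Renew only if we still hold the lease (check-then-set kept simple here).
    if r.get("crawler:leader") == worker_name.encode():
        return bool(r.set("crawler:leader", worker_name, xx=True, ex=ttl_seconds))
    return False
```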


Using either setup, the crawled documents can be put in a local filestore or database, so there are few scaling challenges there. Combine the results into one unified view at your leisure.


randomize across domains

The OP's problem setup seems to suggest that a worker will focus on one domain at a time, fetching URLs from it sequentially, and then move on to the next domain. We might sleep() for a moment before each GET, to obey a configured rate limit.

Consider storing a prefix of each URL's hash in the work table, indexing on it, and doing SELECT ... ORDER BY hash. This effectively shuffles the deck, so URLs appear in an arbitrary (random?) order, and crucially the domains are similarly randomized. If a single domain like "reddit.com" accounts for half the work entries, this doesn't help much. But if no domain is especially dominant, then after crawling domain d there's a fair chance you will do work for several unrelated domains before returning to d, and those unrelated domains will consume enough time that the rate limit won't need a sleep() call. Use a local in-memory data structure like a pqueue to double-check this and to occasionally sleep(), if desired.
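A rough sketch of that shuffled scan, assuming a url_hash column in the work table and a 5-second per-domain limit; a plain dict of per-domain "not before" times stands in for the priority queue here.

```python
# Sketch of the shuffled scan: order work by a stored URL-hash prefix so that
# consecutive rows tend to hit different domains, and track per-domain
# "not before" times so the rate limit only forces a sleep() when needed.
# The url_hash column name and the 5-second limit are assumptions.
import time
import psycopg2
import requests

RATE_LIMIT_SECONDS = 5.0

def crawl_batch(conn, batch_size=1000):
    next_allowed = {}                      # domain -> earliest permitted fetch time
    with conn.cursor() as cur:
        cur.execute(
            "SELECT domain, url FROM work ORDER BY url_hash LIMIT %s",
            (batch_size,),
        )
        for domain, url in cur.fetchall():
            wait = next_allowed.get(domain, 0.0) - time.monotonic()
            if wait > 0:
                time.sleep(wait)           # rarely triggered if domains interleave well
            requests.get(url, timeout=10)
            next_allowed[domain] = time.monotonic() + RATE_LIMIT_SECONDS
```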

July 6, 2025 Score: -1 Rep: 107 Quality: Low Completeness: 70%

I don’t know much about Celery and Airflow, but the problem could be solved in this way.

You can have multiple crawler servers that make requests to the target websites. Every crawler will have a per-domain rate limiter, which can easily control the rate of requests per domain. But there are challenges associated with this design.

  1. You'll have to ensure that all requests for a particular domain go to the same crawler; otherwise you cannot control the request rate.

  2. The crawler needs to buffer a domain's URLs in memory while waiting out the 5-second delay. If the crawler restarts, those buffered jobs could all be lost. To avoid skipping them, we can record metadata about each URL in a DB/Redis, and after a restart fetch the unprocessed jobs from the DB again (see the sketch after this list).
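A rough sketch of such a per-domain limiter with restart recovery, assuming one Redis set per domain and a 5-second delay; the key names and the use of the requests library are my assumptions, not part of the design above.

```python
# Sketch of a per-domain rate limiter with restart recovery: URLs sit in a
# Redis set ("pending:<domain>") and are removed only after a successful
# fetch, so a restarted crawler can re-read whatever is still pending.
import time
import redis
import requests

r = redis.Redis()
DELAY_SECONDS = 5.0

def enqueue(domain, url):
    r.sadd(f"pending:{domain}", url)          # survives a crawler restart

def crawl_domain(domain):
    for url in r.smembers(f"pending:{domain}"):
        requests.get(url.decode(), timeout=10)
        r.srem(f"pending:{domain}", url)      # remove only after success
        time.sleep(DELAY_SECONDS)             # per-domain rate limit
```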


Let me know if this solves your problem or if you see any issue with this approach!