Here I was again, looking at my screen in complete disbelief.
This time, though, it was different: my distributed web crawler seemed to be slowing down over time. Adding more nodes only gave a temporary performance boost; the overall crawling speed gradually declined afterwards. Simply put, it couldn't scale. But why?
In this post, you'll find out what techniques and tools I used to diagnose scaling issues - and to an extent, more general performance issues - in my Python-based web crawler.
The problem
First off, I had designed my crawler as follows:
The main components were:
- A crawler dispatcher, responsible for dispatching URLs to be crawled to the N crawler supervisors.
- N crawler supervisors, responsible for supervising N child processes (crawlers). Those child processes performed the actual crawling.
- Two MongoDB servers. The first one stored the initial seed URLs as well as the crawled HTML data; the second one stored robots.txt files and timestamps indicating when a domain was last crawled.
Additionally, all interprocess communication across servers happened via ZeroMQ queues. So in theory, it would be easily scalable: I could add more supervisors and expect performance to increase linearly, or so I thought.
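To give an idea of what such a ZeroMQ link looks like, here's a minimal sketch using a PUSH/PULL pattern. The socket types, port, and message format are illustrative assumptions, not necessarily those of my crawler, and both ends run in one process here, whereas in reality they live on separate machines.

# Hypothetical sketch of the dispatcher-to-supervisor link over ZeroMQ.
import zmq

context = zmq.Context()

# Dispatcher side: fan out URLs to whichever supervisor is ready.
dispatcher = context.socket(zmq.PUSH)
dispatcher.bind("tcp://127.0.0.1:5557")  # port chosen arbitrarily for the example

# Supervisor side: pull the next URL and hand it to a child crawler process.
supervisor = context.socket(zmq.PULL)
supervisor.connect("tcp://127.0.0.1:5557")

dispatcher.send_json({"url": "http://www.perdu.com/"})
print(supervisor.recv_json())

With PUSH/PULL, messages are load-balanced across whichever supervisors are connected, which is what makes adding nodes look like a simple way to scale out.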
But as crawling was going on, I discovered a curious thing. The overall crawl rate was actually decreasing over time. And I noticed that the effect was even stronger when adding more nodes. Here's an overview with 2 supervisors:
And here's what it looked like with 4 supervisors:
What the hell! There was such a clear trend that it couldn't be just a coincidence. Something was putting the brakes on my crawler. Big time! And I absolutely needed to determine where the bottleneck was. Otherwise, my crawler would run forever, instead of finishing within a few days or weeks.
Attempt #1: Analyzing my application's CPU and memory use
So first of all, I looked at the CPU and memory use of the dispatcher node. That way, I would know whether a lack of resources was a contributing factor.
[acme@dispatcher ~]$ dstat -tcmdsn -N eth0 --top-cpu-adv --top-mem
----system---- ----total-cpu-usage---- ------memory-usage----- -dsk/total- ----swap--- --net/eth0- -------most-expensive-cpu-process------- --most-expensive-
time |usr sys idl wai hiq siq| used buff cach free| read writ| used free| recv send|process pid cpu read write| memory process
19-12 20:30:15| 25 4 71 0 0 0| 268M 5644k 197M 18.7M| 0 176k| 34M 466M| 18k 19k|dispatcher 5145 32%8355B 385k|dispatcher 155M
19-12 20:30:16| 22 2 75 0 0 0| 268M 5652k 197M 18.7M| 0 48k| 34M 466M|9622B 18k|dispatcher 5145 26%8013B 340k|dispatcher 155M
19-12 20:30:17| 28 4 68 0 0 0| 268M 5652k 197M 18.7M| 0 4096B| 34M 466M| 12k 21k|dispatcher 5145 34%9983B 283k|dispatcher 155M
19-12 20:30:18| 25 4 71 0 0 0| 268M 5652k 197M 18.7M| 0 4096B| 34M 466M| 15k 14k|dispatcher 5145 29% 21k 303k|dispatcher 155M
19-12 20:30:19| 28 3 68 0 0 1| 268M 5652k 197M 18.7M| 0 0 | 34M 466M| 27k 18k|dispatcher 5145 33%8955B 199k|dispatcher 155M
19-12 20:30:20| 29 4 60 5 0 1| 268M 5652k 197M 18.7M| 0 3800k| 34M 466M| 16k 17k|dispatcher 5145 30% 11k 334k|dispatcher 155M
19-12 20:30:21| 22 5 73 0 0 0| 268M 5660k 197M 18.7M| 0 48k| 34M 466M| 11k 16k|dispatcher 5145 29% 12k 322k|dispatcher 155M
19-12 20:30:22| 24 3 72 0 0 1| 268M 5660k 197M 18.7M| 0 0 | 34M 466M| 13k 15k|dispatcher 5145 28% 12k 245k|dispatcher 155M
19-12 20:30:23| 24 4 72 0 0 0| 268M 5660k 197M 18.7M| 0 4096B| 34M 466M| 10k 15k|dispatcher 5145 28%7097B 315k|dispatcher 155M
19-12 20:30:24| 26 4 71 0 0 0| 268M 5660k 197M 18.7M| 0 4096B| 34M 466M| 15k 18k|dispatcher 5145 32% 12k 320k|dispatcher 155M
...
And here are the same stats for one of the supervisor nodes:
[acme@supervisor-1 ~]$ dstat -tcmdsn -N eth0 --top-cpu-adv --top-mem
----system---- ----total-cpu-usage---- ------memory-usage----- -dsk/total- ----swap--- --net/eth0- -------most-expensive-cpu-process------- --most-expensive-
time |usr sys idl wai hiq siq| used buff cach free| read writ| used free| recv send|process pid cpu read write| memory process
20-12 20:33:35| 8 2 90 0 0 0| 466M 296k 15.5M 7956k| 0 40k| 22M 478M| 10k 5357B|crawlers/management/cr2638 9.0% 954B1860B|crawlers/man65.0M
20-12 20:33:36| 13 2 84 0 0 0| 465M 296k 15.5M 7988k| 0 0 | 22M 478M| 15k 10k|crawlers/management/cr2658 9.0%3578B4614B|crawlers/man65.0M
20-12 20:33:37| 7 2 90 0 0 1| 465M 296k 15.5M 7992k| 0 0 | 22M 478M|3761B 3350B|crawlers/management/cr2638 9.0% 8B1945B|crawlers/man65.0M
20-12 20:33:38| 10 3 87 0 0 0| 465M 296k 15.5M 7992k| 0 0 | 22M 478M|5596B 11k|crawlers/management/cr2638 9.0% 115B1454B|crawlers/man65.0M
20-12 20:33:39| 10 2 88 0 0 0| 465M 296k 15.5M 7992k| 0 40k| 22M 478M|6360B 6489B|crawlers/management/cr2638 9.0% 8B1802B|crawlers/man65.0M
20-12 20:33:40| 21 4 74 0 0 0| 466M 304k 15.5M 7812k| 0 44k| 22M 478M| 25k 19k|crawlers/management/cr2658 10%2856B3903B|crawlers/man65.0M
20-12 20:33:41| 39 2 59 0 0 0| 466M 304k 15.5M 7824k| 0 0 | 22M 478M| 55k 19k|crawlers/management/cr2658 18%9145B4954B|crawlers/man65.0M
20-12 20:33:42| 19 3 78 0 0 0| 466M 304k 15.6M 7840k| 0 0 | 22M 478M| 27k 14k|crawlers/management/cr2657 12%4302B4217B|crawlers/man65.0M
20-12 20:33:43| 14 2 84 0 0 0| 466M 304k 15.6M 7784k| 0 0 | 22M 478M| 20k 14k|crawlers/management/cr2638 10%1650B2963B|crawlers/man65.0M
20-12 20:33:44| 24 3 73 0 0 0| 466M 304k 15.6M 7796k| 0 0 | 22M 478M| 56k 11k|crawlers/management/cr2658 17% 37k3771B|crawlers/man65.0M
...
In both cases, CPU and memory use was under control, so a lack of resources couldn't explain the slowdown I was observing.
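If you'd rather collect the same figures from Python instead of dstat, psutil can do it. Here's a minimal sketch, not part of my original setup; the PID is the dispatcher's, taken from the dstat output above.

# Hypothetical psutil-based alternative to dstat for one process.
import psutil

proc = psutil.Process(5145)  # dispatcher PID from the dstat output

for _ in range(10):
    cpu = proc.cpu_percent(interval=1.0)       # % of one core over the last second
    rss = proc.memory_info().rss / 1024 ** 2   # resident set size in MiB
    print(f"cpu={cpu:.0f}% rss={rss:.0f}MiB")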
Attempt #2: Analyzing my application's logs
Then I looked at my application's logs1. Analyzing how often certain strategic lines appeared would reveal whether any particular operation was slowing down over time (see the counting sketch after the excerpt below).
Here's an excerpt of a log file generated by an individual crawler child process2:
2016-12-20 14:29:19,244 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.perdu.com
2016-12-20 14:29:20,086 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 200 None
2016-12-20 14:29:20,135 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://www.perdu.com/' will be crawled.
2016-12-20 14:29:20,136 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://www.perdu.com/'.
...
2016-12-20 14:29:22,239 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.theansweris42.com
2016-12-20 14:29:23,103 - requests.packages.urllib3.connectionpool - DEBUG - "GET /robots.txt HTTP/1.1" 404 315
2016-12-20 14:29:23,137 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://www.theansweris42.com/my' will be crawled.
2016-12-20 14:29:23,137 - crawlers.web.standalone_web_crawler - DEBUG - Crawling URL 'http://www.theansweris42.com/my'.
2016-12-20 14:29:23,139 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): www.theansweris42.com
2016-12-20 14:29:24,072 - requests.packages.urllib3.connectionpool - DEBUG - "HEAD /my HTTP/1.1" 404 0
2016-12-20 14:29:24,074 - crawlers.web.standalone_web_crawler - ERROR - Could not retrieve content for URL 'http://www.theansweris42.com/my'.
2016-12-20 14:29:24,099 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://maddox.xmission.com/' was postponed.
...
2016-12-20 14:29:31,368 - requests.packages.urllib3.connectionpool - INFO - Starting new HTTP connection (1): wikipedia.org
2016-12-20 14:29:31,439 - crawlers.web.throttling.url_action_manager - ERROR - Unable to determine if the URL 'http://wikipedia.org/' is allowed to be crawled.
...
2016-12-20 14:29:31,524 - crawlers.web.multi_worker_web_crawler - DEBUG - The URL 'http://wikipedia.org/' was ignored.
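To turn those logs into numbers, you can count how often a strategic line appears per minute. Here's a minimal sketch that assumes the timestamp format shown in the excerpt; the log file name and the line being tracked are placeholders.

# Hypothetical frequency analysis of one "strategic" log line.
from collections import Counter
from datetime import datetime

counts = Counter()
with open("crawler.log") as f:                  # placeholder log file name
    for line in f:
        if "will be crawled" not in line:       # the strategic line to track
            continue
        # Timestamps look like "2016-12-20 14:29:20,135"
        ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        counts[ts.replace(second=0)] += 1       # bucket per minute

for minute, n in sorted(counts.items()):
    print(minute, n)

If the count per minute drops steadily over time, the operation behind that line is a good candidate for the bottleneck.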