
python - How to stop the reactor while several scrapy spiders are running in the same process

Problem description:

I have read from here and here, and have managed to get multiple spiders running in the same process.

However, I don't know how to design a signal system to stop the reactor when all spiders are finished.

My code is quite similar to the following example:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from testspiders.spiders.followall import FollowAllSpider

def setup_crawler(domain):
    spider = FollowAllSpider(domain=domain)
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

for domain in ['scrapinghub.com', 'insophia.com']:
    setup_crawler(domain)
log.start()
reactor.run()

After all the crawlers stop, the reactor is still running.

If I add the statement

crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

to the setup_crawler function, the reactor stops when the first crawler closes.

Can anybody show me how to make the reactor stop when all the crawlers have finished?

Answer:

What I usually do in PySide (I use QNetworkAccessManager and many self-created workers for scraping) is to maintain a counter of how many workers have finished processing work from the queue. When this counter reaches the number of created workers, a signal is triggered to indicate that there is no more work to do and the application can do something else (like enabling an "export" button so the user can export the results to a file, etc.). Of course, this counter has to live inside a method that is called whenever the crawler/spider/worker emits its "finished" signal.

It might not be an elegant way of fixing your problem, but have you tried it anyway?
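As a rough, framework-agnostic illustration of that counter idea (all names here are made up for the sketch, not taken from PySide or Scrapy):

class FinishedCounter(object):
    """Invoke `on_all_done` once `total` workers have reported completion."""

    def __init__(self, total, on_all_done):
        self.total = total
        self.finished = 0
        self.on_all_done = on_all_done

    def worker_finished(self):
        # Connect this method to each worker's "finished" signal.
        self.finished += 1
        if self.finished == self.total:
            self.on_all_done()

In the Scrapy case, on_all_done would be reactor.stop and worker_finished would be connected to the spider_closed signal, which is essentially what the next answer does.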

Answer:

Further to shackra's answer, taking that route does work. You can create the signal receiver as a closure which retains state, meaning that it keeps a record of the number of spiders that have completed. Your code should know how many spiders you are running, so it is a simple matter of checking when all of them have run and then calling reactor.stop().

e.g.

Link the signal receiver to your crawler:

crawler.signals.connect(spider_finished, signal=signals.spider_closed)

Create the signal receiver:

from twisted.internet import reactor

def spider_finished_count():
    # The counter lives on the function object so the closure can update it
    # (this targets Python 2, where `nonlocal` is not available).
    spider_finished_count.count = 0

    def inc_count(spider, reason):
        # Called once per spider_closed signal; stop the reactor after the last one.
        spider_finished_count.count += 1
        if spider_finished_count.count == NUMBER_OF_SPIDERS:
            reactor.stop()
    return inc_count

spider_finished = spider_finished_count()

NUMBER_OF_SPIDERS being the total number of spiders you are running in this process.
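For completeness, a sketch of how the pieces might fit together with the question's script (where to put the connect call and deriving NUMBER_OF_SPIDERS from the domain list are my own choices, not the only arrangement):

from scrapy import signals  # plus the imports from the question's script

DOMAINS = ['scrapinghub.com', 'insophia.com']
NUMBER_OF_SPIDERS = len(DOMAINS)
spider_finished = spider_finished_count()  # the closure defined above

def setup_crawler(domain):
    spider = FollowAllSpider(domain=domain)
    crawler = Crawler(Settings())
    # Every crawler reports to the same counter when its spider closes;
    # reactor.stop() only runs after the last spider_closed signal.
    crawler.signals.connect(spider_finished, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

for domain in DOMAINS:
    setup_crawler(domain)
log.start()
reactor.run()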

Or you could do it the other way around and count down from the number of spiders running to 0. More complex solutions could involve keeping a dict of which spiders have and have not completed, etc. (see the sketch after the note below).

NB: inc_count is passed spider and reason, which we do not use in this example, but you may wish to use them: they are sent by the signal dispatcher and are the spider which closed and the reason (a string) for it closing.
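If you prefer the dict/set bookkeeping mentioned above, here is a minimal sketch of that variant (the helper name and the use of spider names as keys are my own choices, not from the answer):

from twisted.internet import reactor

def make_spider_tracker(expected_names):
    # expected_names: the names of the spiders you plan to run in this process
    remaining = set(expected_names)

    def spider_closed(spider, reason):
        # `spider` and `reason` come from the signal dispatcher, as noted above.
        remaining.discard(spider.name)
        if not remaining:
            reactor.stop()
    return spider_closed

Connect the returned callable to signals.spider_closed exactly as with spider_finished above.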

Scrapy version: v0.24.5
