I have read here and here, and managed to get multiple spiders running in the same process.
However, I don't know how to design a signal system to stop the reactor when all spiders are finished.
My code is quite similar to the following example:
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log
    from testspiders.spiders.followall import FollowAllSpider

    def setup_crawler(domain):
        spider = FollowAllSpider(domain=domain)
        crawler = Crawler(Settings())
        crawler.configure()
        crawler.crawl(spider)
        crawler.start()

    for domain in ['scrapinghub.com', 'insophia.com']:
        setup_crawler(domain)
    log.start()
    reactor.run()
After all the crawlers stop, the reactor is still running.
If I add the statement

    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)

to the setup_crawler function, the reactor stops when the first crawler closes.
Can anybody show me how to make the reactor stop when all the crawlers have finished?
What I usually do in PySide (I use QNetworkAccessManager and many self-created workers for scraping) is to maintain a counter of how many workers have finished processing work from the queue. When this counter reaches the number of created workers, a signal is triggered to indicate that there is no more work to do, and the application can do something else (like enabling an "export" button so the user can export the results to a file, etc). Of course, this counter has to live inside a method and has to be called when a signal is emitted by the crawler/spider/worker.
It might not be an elegant way of fixing your problem, but have you tried it anyway?
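Framework aside, that counter pattern can be sketched in plain Python. Here `FinishedCounter`, `worker_finished`, and `on_all_finished` are illustrative names I made up, not PySide or Scrapy API; the callback stands in for whatever should happen once every worker is done:

```python
# A finished-worker counter: each worker reports completion, and once
# every worker has reported, a single "all done" callback fires.

class FinishedCounter:
    def __init__(self, total_workers, on_all_finished):
        self.total = total_workers          # number of workers we expect
        self.count = 0                      # how many have finished so far
        self.on_all_finished = on_all_finished

    def worker_finished(self):
        # Connect this method to each worker's "finished" signal.
        self.count += 1
        if self.count == self.total:
            self.on_all_finished()

# Demo: three workers reporting in; the callback fires exactly once,
# after the last worker.
fired = []
counter = FinishedCounter(3, lambda: fired.append(True))
for _ in range(3):
    counter.worker_finished()
```

In PySide the same idea would hang off each worker's finished signal; in Scrapy, off the spider_closed signal.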
Further to shackra's answer, taking that route does work. You can create the signal receiver as a closure which retains state, which means that it keeps a record of the number of spiders that have completed. Your code should know how many spiders you are running, so it is a simple matter of checking when all have run and then calling reactor.stop().
Create the signal receiver:

    def spider_finished_count():
        spider_finished_count.count = 0

        def inc_count(spider, reason):
            spider_finished_count.count += 1
            if spider_finished_count.count == NUMBER_OF_SPIDERS:
                reactor.stop()

        return inc_count

    spider_finished = spider_finished_count()

Then link the signal receiver to your crawler:

    crawler.signals.connect(spider_finished, signal=signals.spider_closed)
NUMBER_OF_SPIDERS being the total number of spiders you are running in this process.
Or you could do it the other way around and count down from the number of spiders running to 0. More complex solutions could involve keeping a dict of which spiders have and have not completed, etc.
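For illustration, a sketch of that dict-based variant. The names and the `stop_callback` parameter are made up here; in real Scrapy code the receiver would be connected to spider_closed, receive a Spider object, and call reactor.stop():

```python
# Track which spiders have closed by name, rather than keeping a bare count.

def make_spider_tracker(spider_names, stop_callback):
    finished = {name: False for name in spider_names}

    def spider_closed(spider, reason):
        # Simplification: `spider` is the spider's name here; real Scrapy
        # passes a Spider object, so you would mark finished[spider.name].
        finished[spider] = True
        if all(finished.values()):
            stop_callback()

    return spider_closed

# Demo: the callback fires only after both spiders have closed.
stopped = []
tracker = make_spider_tracker(['scrapinghub.com', 'insophia.com'],
                              lambda: stopped.append(True))
tracker('scrapinghub.com', 'finished')
tracker('insophia.com', 'finished')
```

A dict like this also tells you *which* spiders are still running, which a bare counter cannot.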
NB: inc_count gets sent spider and reason, which we do not use in this example, but you may wish to use those variables: they are sent from the signal dispatcher and are the spider which closed and the reason (a str) for it closing.
Scrapy version: v0.24.5