
python - How to set a rule according to the current URL?

Question:

I'm using Scrapy and I want more control over the crawler. To do this I would like to set rules depending on the current URL being processed.

For example, if I am on example.com/a I want to apply a rule with LinkExtractor(restrict_xpaths='//div[@class="1"]'), and if I'm on example.com/b I want to use another Rule with a different LinkExtractor.

How do I accomplish this?

Answer:

I'd just code them in separate callbacks, instead of relying on the CrawlSpider rules.

import scrapy
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Default extractor; with no arguments it extracts every link on the page
    extractor = LinkExtractor()

    if 'example.com/a' in response.url:
        extractor = LinkExtractor(restrict_xpaths='//div[@class="1"]')

    for link in extractor.extract_links(response):
        yield scrapy.Request(link.url, callback=self.whatever)

This is better than trying to change the rules at runtime, because the rules are supposed to be the same for all callbacks.
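If more sections of the site need their own extractor, the URL check in the callback can be factored into a small lookup table. A minimal standard-library sketch (the helper name and the xpath for example.com/b are hypothetical; only the /a xpath comes from the question):

```python
from urllib.parse import urlparse

# Map URL path prefixes to the restrict_xpaths value to use on that page.
# The '/a' entry is from the question; '/b' is a hypothetical example.
XPATHS_BY_PATH = {
    '/a': '//div[@class="1"]',
    '/b': '//div[@class="2"]',
}

def pick_restrict_xpaths(url, default=None):
    """Return the restrict_xpaths string for the given URL, or a default."""
    path = urlparse(url).path
    for prefix, xpath in XPATHS_BY_PATH.items():
        if path.startswith(prefix):
            return xpath
    return default
```

The callback can then build its extractor from the lookup, e.g. LinkExtractor(restrict_xpaths=pick_restrict_xpaths(response.url)), keeping the per-URL logic in one place.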

In this case I've only used link extractors, but if you want to use full rules you can do much the same thing: mirror the loop that CrawlSpider._requests_to_follow uses to apply its rules.
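If you do go the rule-mirroring route, the per-URL part is just a matching step over the rule list. A sketch of that selection logic with hypothetical names (this is not Scrapy's internal API, only the idea of filtering rules by the current URL):

```python
import re

# Hypothetical: pair each rule's index with a URL pattern that enables it.
RULE_PATTERNS = [
    (r'/a($|/)', 0),  # rule 0 applies on example.com/a pages
    (r'/b($|/)', 1),  # rule 1 applies on example.com/b pages
]

def active_rule_indices(url):
    """Return the indices of the rules whose pattern matches the URL."""
    return [idx for pattern, idx in RULE_PATTERNS if re.search(pattern, url)]
```

A mirrored _requests_to_follow loop would then only run the extractors for the indices this helper returns, instead of every rule in self._rules.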
