I am using Scrapy to extract some data from a site, say "myproject.com". Here is the logic:
First, crawl the pages under categorylist; the links found on those pages are used to build the second wave of links. The second-wave links match patterns like wholesale/something/something/request or wholesale/pagenumber. I want to follow those patterns to keep crawling and, meanwhile, store the raw HTML in my item object.
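As a quick sanity check, the two URL shapes can be tested against sample paths with plain regular expressions. The regexes below are the allow patterns used in the spider's rules further down, and the sample URLs are just the placeholders from the description:

```python
import re

# The allow patterns from the spider's rules (see the pseudocode below).
category_re = re.compile(r'/categorylist/\w+')
pricing_re = re.compile(r'/wholesale/\w+/(?:wholesale|request)/\d+')

# Placeholder URLs from the description above.
assert category_re.search('http://www.myproject.com/categorylist/cat_a')
assert pricing_re.search('http://www.myproject.com/wholesale/cat_a/request/1')

# A category URL does not match the pricing pattern.
assert not pricing_re.search('http://www.myproject.com/categorylist/cat_a')
```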
I tested these two steps separately using the parse command, and they both worked.
First, I tried:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules
And I can see it built the outlinks successfully. Then I tested one of the built outlinks:
scrapy parse http://www.myproject.com/wholesale/cat_a/request/1 --spider myproject --rules
The rule seems correct: it generates an item with the HTML stored in it.
However, when I tried to link the two steps together with the depth argument, I saw that it crawled the outlinks but no items were generated:
scrapy parse http://www.myproject.com/categorylist/cat_a --spider myproject --rules --depth 2
Here is the pseudo code:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from bs4 import BeautifulSoup

class MyprojectSpider(CrawlSpider):
    name = "Myproject"
    allowed_domains = ["Myproject.com"]
    start_urls = ["http://www.Myproject.com/"]

    rules = (
        Rule(LinkExtractor(allow=(r'/categorylist/\w+',)), callback='parse_category', follow=True),
        Rule(LinkExtractor(allow=(r'/wholesale/\w+/(?:wholesale|request)/\d+',)), callback='parse_pricing', follow=True),
    )

    def parse_category(self, response):
        soup = BeautifulSoup(response.body)
        # ... build myurl1 and myurl2 from the soup ...
        my_request1 = Request(url=myurl1)
        yield my_request1
        my_request2 = Request(url=myurl2)
        yield my_request2

    def parse_pricing(self, response):
        item = MyprojectItem()
        item['myurl'] = response.url
        item['myhtml'] = response.body
        item['mystatus'] = 'fetched'  # or 'failed', depending on the outcome
        yield item
Thanks a lot for any suggestion!
I was assuming that the new Request objects I built would be run against the rules and then parsed by the corresponding callback defined in the Rule. However, after reading the documentation of Request, it turns out the callback is handled differently:
class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
callback (callable) – the function that will be called with the response of this request (once its downloaded) as its first parameter. For more information see Passing additional data to callback functions below. If a Request doesn’t specify a callback, the spider’s parse() method will be used. Note that if exceptions are raised during processing, errback is called instead.
So the fix is to set the callback explicitly when building the requests in parse_category:

...
my_request1 = Request(url=myurl1, callback=self.parse_pricing)
yield my_request1
my_request2 = Request(url=myurl2, callback=self.parse_pricing)
yield my_request2
...
In other words, even though the URLs I built match the second rule, they won't be passed to parse_pricing: a manually built Request uses only the callback given at construction time (or the spider's default parse() if none is given), and its URL is never matched against the rules to pick a callback. Hope this is helpful to other people.
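To make the dispatch behaviour concrete, here is a toy model in plain Python (not Scrapy's internals) of the documented rule: a response is handed to the request's own callback if one was set, otherwise to the spider's default parse(); the rule patterns are never consulted when choosing a callback for a request you build yourself:

```python
import re

class FakeRequest:
    """Minimal stand-in for scrapy.http.Request (illustration only)."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

class FakeSpider:
    # The second rule's allow pattern from the spider above.
    pricing_pattern = re.compile(r'/wholesale/\w+/(?:wholesale|request)/\d+')

    def parse(self, url):
        return 'default parse: ' + url

    def parse_pricing(self, url):
        return 'parse_pricing: ' + url

def dispatch(request, spider):
    # Mirrors the documented behaviour: use the request's own callback if
    # set, otherwise fall back to the spider's parse(). Note that the
    # rules play no part in this decision.
    handler = request.callback or spider.parse
    return handler(request.url)

spider = FakeSpider()
url = 'http://www.myproject.com/wholesale/cat_a/request/1'

assert FakeSpider.pricing_pattern.search(url)   # the URL matches the rule...
print(dispatch(FakeRequest(url), spider))       # ...but still goes to default parse
print(dispatch(FakeRequest(url, callback=spider.parse_pricing), spider))
```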