
web crawler - Scrapy: having problems crawling a .aspx page

Problem description:

I'm trying to crawl a .aspx page, but it redirects me to a page which doesn't exist.

To work around this, I tried setting 'dont_merge_cookies': True and 'dont_redirect': True and overriding my start_requests, but now I get the error "'Response' object has no attribute 'body_as_unicode'", and the response class type is 'scrapy.http.response.Response'.

Here's my code:

class Inon_Spider(BaseSpider):
    name = 'Inon'
    allowed_domains = ['www.shop.inonit.in']
    start_urls = ['http://www.shop.inonit.in/Products/Inonit-Men-Jackets/QUIRK-BOX/Toy-Factory-Jacket---Soldiers/pid-1177471.aspx?Rfs=&pgctl=713619&cid=CU00049295']
    # redirects to http://www.shop.inonit.in/Products/Inonit-Men-Jackets/QUIRK-BOX/Toy-Factory-Jacket---Soldiers/1177471

    def start_requests(self):
        start_urls = ['http://www.shop.inonit.in/Products/Inonit-Men-Jackets/QUIRK-BOX/Toy-Factory-Jacket---Soldiers/pid-1177471.aspx?Rfs=&pgctl=713619&cid=CU00049295']
        for i in start_urls:
            yield Request(i, meta={
                'dont_merge_cookies': True,
                'dont_redirect': True,
                'handle_httpstatus_list': [302]
            }, callback=self.parse)

    def parse(self, response):
        print "Response %s" % response.__class__
        resp = TextResponse
        item = DealspiderItem()
        hxs = HtmlXPathSelector(resp)
        title = hxs.select('//div[@class="aboutproduct"]/div[@class="container9"]/div[@class="ctl_aboutbrand"]/h1/text()').extract()
        price = hxs.select('//span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_spnWebPrice"]/span[@class="offer"]/span[@id="ctl00_ContentPlaceHolder1_Price_ctl00_lblOfferPrice"]/text()').extract()
        prc = price[0].replace("Rs. ", "")
        description = []
        item['price'] = prc
        item['title'] = title
        item['description'] = description
        item['url'] = response.url
        return item
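The error usually means the download arrived as a plain scrapy.http.Response: Scrapy only builds a TextResponse/HtmlResponse when it can recognise the Content-Type, and a bare 302 reply caught via handle_httpstatus_list often has none, while body_as_unicode is defined only on TextResponse and its subclasses. Note also that `resp = TextResponse` binds the class object itself, not an instance, so passing it to HtmlXPathSelector cannot work. A minimal sketch of the re-wrapping idea, using stand-in classes instead of Scrapy so it runs anywhere (in a real spider the equivalent would be `response.replace(cls=HtmlResponse)`):

```python
# Stand-in classes that mirror the relevant slice of Scrapy's response
# hierarchy; they are NOT the real Scrapy classes.
class Response:
    def __init__(self, url, body=b""):
        self.url = url
        self.body = body

    def replace(self, cls=None):
        # Scrapy's real Response.replace() also accepts a cls keyword;
        # this mimics that behaviour.
        cls = cls or self.__class__
        return cls(self.url, self.body)

class TextResponse(Response):
    def body_as_unicode(self):
        return self.body.decode("utf-8")

class HtmlResponse(TextResponse):
    pass

# A redirect handled via handle_httpstatus_list arrives as a plain Response:
response = Response("http://example.com/page.aspx", b"<html>ok</html>")
# response.body_as_unicode()  # AttributeError: 'Response' object has no
#                             # attribute 'body_as_unicode'

# Re-wrap it as an HTML-aware response so text/selector helpers work:
response = response.replace(cls=HtmlResponse)
print(response.body_as_unicode())  # <html>ok</html>
```

With the response re-wrapped this way, the selector in parse() should be built from the re-wrapped response rather than from the TextResponse class object.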
