
python - scrapy response.xpath only picking out the first item

Problem description:

I have the following HTML structure:

 <div class="column first">

<div class="detail">

<strong>Phone: </strong>

<span class="value"> 012-345-6789</span>

</div>

<div class="detail">

<span class="value">1 Street Address, Big Road, City, Country</span>

</div>

<div class="detail">

<h3 class="inline">Area:</h3>

<span class="value">Georgetown</span>

</div>

<div class="detail">

<h3 class="inline">Nearest Train:</h3>

<span class="value">Georgetown Station</span>

</div>

<div class="detail">

<h3 class="inline">Website:</h3>

<span class="value"><a href='http://www.website.com' target='_blank'>www.website.com</a></span>

</div>

</div>

When I run sel = response.xpath('//span[@class="value"]/text()') in scrapy shell I get what I expect back, which is:

[<Selector xpath='//span[@class="value"]/text()' data=u' 012-345-6789'>, <Selector xpath='//span[@class="value"]/text()' data=u'1 Street Address, Big Road, City, Country'>, <Selector xpath='//span[@class="value"]/text()' data=u'Georgetown Station'>, <Selector xpath='//span[@class="value"]/text()' data=u' '>, <Selector xpath='//span[@class="value"]/text()' data=u'January, 2016'>]

However, in the parse block of my scrapy spider, it only returns the first item:

def parse(self, response):
    def extract_with_xpath(query):
        return response.xpath(query).extract_first().strip()

    yield {
        'details': extract_with_xpath('//span[@class="value"]/text()')
    }

I realise I am using extract_first() but if I use extract() it breaks, even though I know extract() is a legitimate function.

What am I doing wrong? Do I need to loop through the extract_with_xpath('//span[@class="value"]/text()') part?

Thanks for any enlightenment!

Answer:

In items.py, specify:

from scrapy.item import Item, Field

class yourProjectNameItem(Item):
    # define the fields for your item here like:
    name = Field()
    details = Field()

In your scrapy spider, add the imports:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from yourProjectName.items import yourProjectNameItem
import re

and the parse function as follows:

def parse_item(self, response):
    hxs = HtmlXPathSelector(response)
    i = yourProjectNameItem()

    i['name'] = hxs.select('YourXPathHere').extract() 
    i['details'] = hxs.select('YourXPathHere').extract()

    return i

Hope this solves the issue. You can refer to my project on GitHub: https://github.com/omkar-dsd/SRMSE/tree/master/Scrapers/NasaScraper
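
Note: the answer above targets an older Scrapy release; HtmlXPathSelector and the scrapy.contrib modules have since been removed. The underlying cause of the question is that extract_first() returns only the first matching node, while extract() returns a list of strings, so calling .strip() directly on that list raises an AttributeError. Below is a minimal sketch, assuming a recent Scrapy version where .getall() is available, of how the original parse method could collect every span value; the spider name and start URL are placeholders, not part of the original question.

import scrapy


class DetailsSpider(scrapy.Spider):
    # Hypothetical spider name and start URL, used only for illustration.
    name = "details"
    start_urls = ["http://www.website.com"]

    def parse(self, response):
        def extract_all_with_xpath(query):
            # .getall() (the newer spelling of .extract()) returns a list of
            # strings, one per matching node, so each entry is stripped
            # individually. .extract_first() / .get() return only the first
            # match, which is why the original code yielded a single value,
            # and calling .strip() on the list from .extract() raises
            # AttributeError.
            return [value.strip() for value in response.xpath(query).getall()]

        yield {
            'details': extract_all_with_xpath('//span[@class="value"]/text()')
        }

If each value needs to stay paired with its label, looping over response.xpath('//div[@class="detail"]') and extracting the label and value per block is usually a better fit than one flat list.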
