
python - Nesting item data in Scrapy

Question:

I'm fairly new to Python and Scrapy and have issues wrapping my head around how to create nested JSON with the help of Scrapy.

Selecting the elements I want from HTML has not been a problem with the help of XPath Helper and some Googling. I am however not quite sure how I’m supposed to get the JSON structure that I want.

The JSON structure I desire would look like:

{"menu": {
    "Monday": {
        "alt1": "Item 1",
        "alt2": "Item 2",
        "alt3": "Item 3"
    },
    "Tuesday": {
        "alt1": "Item 1",
        "alt2": "Item 2",
        "alt3": "Item 3"
    }
}}

The HTML looks like:

<ul>
    <li class="title"><h2>Monday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
<ul>
    <li class="title"><h2>Tuesday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>

I did find http://stackoverflow.com/a/25096896/6856987, I was however not able to adapt this to fit my needs. I would greatly appreciate a nudge in the right direction on how I would accomplish this.

Edit: With the nudge provided by Padraic I managed to get one step closer to what I want to accomplish. I've come up with the following, which is a slight improvement over my previous situation. The JSON is still not quite where I want it.

Scrapy spider:

import scrapy
from dmoz.items import DmozItem

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = ['http://urlto.com']

    def parse(self, response):
        uls = response.xpath('//ul[position() >= 1 and position() < 6]')
        item = DmozItem()
        item['menu'] = {}
        item['menu'] = {"restaurant": "name"}
        for ul in uls:
            item['menu']['restaurant']['dayOfWeek'] = ul.xpath("li/h2/text()").extract()
            item['menu']['restaurant']['menuItem'] = ul.xpath("li/text()").extract()
            yield item

Resulting JSON:

[
    {
        "menu": {
            "dayOfWeek": [
                "Monday"
            ],
            "menuItem": [
                "Item 1",
                "Item 2",
                "Item 3"
            ]
        }
    },
    {
        "menu": {
            "dayOfWeek": [
                "Tuesday"
            ],
            "menuItem": [
                "Item 1",
                "Item 2",
                "Item 3"
            ]
        }
    }
]

It sure feels like I'm doing a thousand and one things wrong with this; hopefully someone more clever than me can point me the right way.

Answer:

You just need to find all the uls and then group the lis inside each one. Here is an example using lxml:

from lxml import html

h = """<ul>
    <li class="title"><h2>Monday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>
<ul>
    <li class="title"><h2>Tuesday</h2></li>
    <li>Item 1</li>
    <li>Item 2</li>
    <li>Item 3</li>
</ul>"""

tree = html.fromstring(h)

uls = tree.xpath("//ul")

data = {}
# iterate over all uls
for ul in uls:
    # extract the ul's li's
    lis = ul.xpath("li")
    # use the h2 text as the key and all the text from the remaining as values
    # with enumerate to add the alt logic
    data[lis[0].xpath("h2")[0].text] =  {"alt{}".format(i): node.text for i, node in enumerate(lis[1:], 1)}

print(data)

Which would give you:

{'Monday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'},
 'Tuesday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}}
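
To get from that dict to the JSON shown in the question, one option is to wrap it under a "menu" key and serialise it with the standard json module. This is only a sketch: the data literal is copied from the output above, and menu and menu_json are names made up for the example.

```python
import json

# Same shape as the dict printed above (values copied from that output).
data = {
    "Monday": {"alt1": "Item 1", "alt2": "Item 2", "alt3": "Item 3"},
    "Tuesday": {"alt1": "Item 1", "alt2": "Item 2", "alt3": "Item 3"},
}

# Wrap the per-day dict under a top-level "menu" key, as in the question.
menu = {"menu": data}

# indent=4 is only for readability; it does not change the structure.
menu_json = json.dumps(menu, indent=4)
print(menu_json)
```

In a Scrapy project the feed exporter normally does the serialising for you, so this step is mainly useful for checking the shape outside a spider.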

If you wanted to put it into a single comprehension:

data = {lis[0].xpath("h2")[0].text:
               {"alt{}".format(i): node.text for i, node in enumerate(lis[1:], 1)}
                    for lis in (ul.xpath("li") for ul in tree.xpath("//ul"))}

Working with your edited code in your question and following the same required output:

def parse(self, response):
    uls = response.xpath('//ul[position() >= 1 and position() < 6]')
    item = DmozItem()
    # just create an empty dict
    item['menu'] = {}
    for ul in uls:
        # for each ul, add a key/value pair {day: {alt_i: li text}}, skipping the title li
        day = ul.xpath("li/h2/text()").extract_first()
        # extract() returns plain strings, so use them directly
        texts = ul.xpath("li[position() > 1]/text()").extract()
        item['menu'][day] = {"alt{}".format(i): text for i, text in enumerate(texts, 1)}
    # yield once, outside the loop, so all days end up in a single item
    yield item

That will give you data in one dict like:

In [15]: d = {"menu":{'Monday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'},
                  'Tuesday': {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}}}

In [16]: d["menu"]["Tuesday"]
Out[16]: {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}

In [17]: d["menu"]["Monday"]
Out[17]: {'alt1': 'Item 1', 'alt2': 'Item 2', 'alt3': 'Item 3'}

In [18]: d["menu"]["Monday"]["alt1"]
Out[18]: 'Item 1'

That matches the expected output in your original question rather than the output in your edit; I see no advantage in the new logic that adds "dayOfWeek" etc.
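
If you already have the per-day records from the "Resulting JSON" in the question, they can also be folded into the nested shape after the fact. The following is a rough sketch in plain Python. The records list is copied from that output, nest_by_day is a hypothetical helper name, and it assumes every record carries exactly one day name.

```python
# Per-day records, copied from the "Resulting JSON" in the question.
records = [
    {"menu": {"dayOfWeek": ["Monday"], "menuItem": ["Item 1", "Item 2", "Item 3"]}},
    {"menu": {"dayOfWeek": ["Tuesday"], "menuItem": ["Item 1", "Item 2", "Item 3"]}},
]


def nest_by_day(records):
    """Fold per-day records into {"menu": {day: {"alt1": ..., ...}}}."""
    menu = {}
    for record in records:
        entry = record["menu"]
        # Assumption: each record holds exactly one day name.
        day = entry["dayOfWeek"][0]
        # Number the menu items from 1 to build the alt1, alt2, ... keys.
        menu[day] = {
            "alt{}".format(i): text
            for i, text in enumerate(entry["menuItem"], 1)
        }
    return {"menu": menu}


nested = nest_by_day(records)
print(nested["menu"]["Tuesday"]["alt2"])
```

Building the dict directly in parse, as above, avoids this extra pass.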
