
python - Removing HTML tags without /text().extract()

Question:

To start, I'm very new at all this so get ready for some jacked up code from me copying/pasting from all kinds of sources.

I'm looking to remove any HTML that Scrapy returns. I've got everything storing in MySQL with no issues, but the thing I can't get to work yet is removing a lot of <td> and other HTML tags. I initially just ran with /text().extract(), but every so often it would come across a cell that was formatted like the first one here:

<td> <span class="caps">TEXT</span> </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>
<td> Text </td>

There isn't a pattern to it that would let me choose between using /text() or not. I'm looking for the easiest way that a beginner can implement to strip all of that off.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
import html2text

from scraper.items import LivingSocialDeal


class CFBDVRB(BaseSpider):
    name = "cfbdvrb"
    allowed_domains = ["url"]
    start_urls = [
        "url",
    ]

    deals_list_xpath = '//table[@class="tbl data-table"]/tbody/tr'
    item_fields = {
        'title': './/td[1]',
        'link': './/td[2]',
        'location': './/td[3]',
        'original_price': './/td[4]',
        'price': './/td[5]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        for deal in selector.xpath(self.deals_list_xpath):
            loader = XPathItemLoader(LivingSocialDeal(), selector=deal)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)

            converter = html2text.HTML2Text()
            converter.ignore_links = True

            yield loader.load_item()

The converter = html2text lines were my last attempt at removing the tags. I'm not entirely sure I implemented it correctly, but it didn't work.

Thanks in advance for any help you would like to give and I also apologize if I'm missing something easy that a quick search could pull up.

Answer:

The authors of Scrapy keep a lot of this functionality in their w3lib package, which is installed along with Scrapy.

Based on your code, you're using a pretty dated version of Scrapy (pre-0.22). I'm not sure exactly what's available to you, so you may need to import from scrapy.utils.markup instead.

If you have the variable my_text that has your HTML text in it, do the following:

>>> from w3lib.html import remove_tags
>>> my_text
'<td>    <span class="caps">TEXT</span>  </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>\n<td>    Text    </td>'
>>> remove_tags(my_text)
u'    TEXT  \n    Text    \n    Text    \n    Text    \n    Text    '

There's a lot of additional functionality in w3lib for fixing up and converting HTML/markup (the code is available in the w3lib project).

As this is just a function, it will be pretty easy to incorporate into your item loader, and will be more lightweight than using BS4.

Answer:

Easiest way to do it is using BeautifulSoup. Even the Scrapy Documentation recommends it.

Imagine you have a variable called "html_text" with this html code inside:

<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>

Then you could use this to remove all the HTML tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_text, 'html.parser')
just_text = soup.get_text()

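Then the variable "just_text" will contain just the text, one cell per line (the padding inside each cell is kept, so call .strip() on each piece if you don't want it):

    TEXT
    Text
    Text
    Text
    Text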

I hope this solves your problem.

You can see more examples and the guide to install it (easier than Scrapy) at: BeautifulSoup

Good Luck!

EDIT:

Here you have a working example with the html you proposed:

from bs4 import BeautifulSoup


html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
<td>    Text    </td>
"""

soup = BeautifulSoup(html_text, 'html.parser')

list_of_tds = soup.find_all('td')

for td_element in list_of_tds:
    print(td_element.get_text())

Please note that you need to be using BeautifulSoup 4, which you can install following these instructions. If you have it, you can just copy and paste that code, see what it does to other HTML, and modify it to suit your needs.
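The loop above prints each cell with its padding intact. If you also want the surrounding whitespace gone, get_text accepts a separator and a strip flag. Here is a minimal sketch, assuming BeautifulSoup 4 on Python 3; the list name cell_texts is just an illustrative choice.

```python
from bs4 import BeautifulSoup

html_text = """
<td>    <span class="caps">TEXT</span>  </td>
<td>    Text    </td>
"""

soup = BeautifulSoup(html_text, "html.parser")

# strip=True trims each text fragment and skips the ones that end up empty;
# the first argument is the separator placed between the remaining fragments.
cell_texts = [td.get_text(" ", strip=True) for td in soup.find_all("td")]
print(cell_texts)
```

That gives one clean string per cell, which is usually the shape you want before storing the values in a database.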
