Parsing HTML Data with lxml

Source: reposted

HTML Data Parsing

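In scenarios such as web crawling, we need to parse the fetched HTML and extract the content we care about. The Python standard library offers two parsers for HTML, HTMLParser and SGMLParser (both from the Python 2 era; in Python 3 HTMLParser lives in html.parser and sgmllib has been removed). Both are implemented in a way that is hard to follow, and they are unfriendly for traversal and lookup. The third-party library lxml is much simpler and logically easier to understand, and it handles both HTML and XML.

In the words of the official documentation: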

“lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).”

Parsing HTML with lxml

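One way to extract the content of interest from HTML is regular expressions, but they are painful to write and are best left as a last resort. HTML can be viewed as a hierarchical language similar to XML: it can be parsed into a tree, and XPath can then be used to locate and extract the data.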
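To make the tree-plus-XPath idea concrete, here is a minimal sketch. The HTML fragment is made up for illustration and is not copied from any real site; only lxml.html.fromstring() and xpath() are real lxml calls.

```python
from lxml import html

# A made-up fragment for illustration only.
SNIPPET = """
<div class="quote">
  <span class="text">Stay hungry.</span>
  <small class="author">Someone</small>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/work/">work</a>
</div>
"""

# Parse the string into an element tree.
tree = html.fromstring(SNIPPET)

# text() selects text nodes, @href selects attribute values.
# Every xpath() call on a node-set expression returns a list.
quote_text = tree.xpath('//span[@class="text"]/text()')
tag_names = tree.xpath('//a[@class="tag"]/text()')
tag_links = tree.xpath('//a[@class="tag"]/@href')

print(quote_text)
print(tag_names)
print(tag_links)
```

Compared with a regular expression, the query describes where the data sits in the tree rather than what the surrounding characters look like, so it tends to survive whitespace and attribute-order changes.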

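Approach for a Small Crawler

The Example in the HTMLParser chapter of the Python Documentation uses a website to demonstrate how to parse HTML with HTMLParser, and I borrow the same site here. The site has 10 pages in total, each with a "Next" link at the bottom that leads to the following page. The content is a list of famous quotes, and the key pieces of information are the quote, the author, and the tags. I walk through the 10 pages and extract these three items.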

http://quotes.toscrape.com

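  • fetch the page behind a link with requests.get(url);
  • feed the fetched page, as a string, to lxml.html.fromstring();
  • locate and extract the content of interest with XPath;
  • write the data to MySQL.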

Code & Walk Through

# -*- coding: utf-8 -*-
'''
Created on 2017-07-03
@author: will
'''
import MySQLdb
from lxml import html
import requests


class Pipeline():
    '''
    Database connection. The database Locust was created on the MySQL server in advance.
    '''
    connDB = MySQLdb.connect(
        host='192.168.8.82',
        port=3306,
        user='willyan',
        passwd='will392891',
        db='Locust',
        charset='utf8'
    )
    cur = connDB.cursor()


class HtmlPar():
    '''
    Parse the HTML and extract the information of interest.
    '''
    def myPar(self, start_url):
        # urls holds the links of the pages to crawl. The start page is passed
        # in as start_url and stored first; the href of each page's "Next"
        # link is appended as it is found.
        urls = []
        urls.append(start_url)
        # Three lists for the extracted quotes, authors and tags
        text = []
        author = []
        tags = []
        # Loop: take the next link from urls, fetch it, and feed the result to
        # the parser as a string. If a "Next" element exists, append its href
        # to urls as the target of the next iteration and add all quote texts,
        # authors and tags on the page to the matching lists. If no "Next"
        # element can be found, the last page has been reached: leave the loop
        # and return the three lists.
        i = 0
        while True:
            # Fetch the next queued page and parse it
            page = requests.get(urls[i])
            tree = html.fromstring(page.text)
            # The "Next" link at the bottom decides whether to keep crawling
            nextPage = tree.xpath('//nav/ul/li[@class="next"]/a/@href')
            if nextPage != []:
                # Queue the page behind "Next" for the following iteration
                urls.append(urls[0] + nextPage[0])
                # Each XPath query returns a list; run each once per page and
                # copy the entries into text, author and tags by index.
                pageText = tree.xpath('//span[@itemprop="text"]/text()')
                pageAuthor = tree.xpath('//small[@itemprop="author"]/text()')
                pageTags = tree.xpath('//meta/@content')
                for x in range(len(pageText)):
                    text.append(pageText[x])
                    author.append(pageAuthor[x])
                    tags.append(pageTags[x])
            else:
                # Empty list: last page reached, return everything collected
                # so far (this last page itself is not extracted).
                return text, author, tags
            i += 1


if __name__ == '__main__':
    # Create the table quotes to hold the scraped data
    db = Pipeline()
    conn = db.connDB
    cur = db.cur
    dbCreateCMD = 'create table quotes(quoteID varchar(10), quoteText varchar(600), author varchar(20), tags varchar(20), primary key (quoteID), unique(quoteID)) ENGINE=InnoDB DEFAULT CHARSET=utf8'
    cur.execute(dbCreateCMD)
    # Start page of the crawl
    start_url = 'http://quotes.toscrape.com'
    quotes = HtmlPar()
    result = quotes.myPar(start_url)
    # result is a tuple of three lists: (text[...], author[...], tags[...])
    for y in range(len(result[0])):
        # Some quote texts are too long and exceed the MySQL character limit,
        # so the text column is not written.
        # %s placeholders let the driver quote the values instead of
        # concatenating them into the SQL string.
        cmd = 'insert ignore into quotes(quoteID, author, tags) values(%s, %s, %s)'
        cur.execute(cmd, (str(y + 1), result[1][y], result[2][y]))
    conn.commit()
    cur.close()
    conn.close()
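When values such as author names or quote texts go into an SQL statement, placeholders are generally safer than building the statement by hand, because the driver takes care of quoting. The sketch below illustrates the idea with the standard-library sqlite3 module as a stand-in, so it runs without a MySQL server. The sample rows and the TEXT column types are assumptions made for this illustration. sqlite3 uses ? as its placeholder, whereas MySQLdb uses %s.

```python
import sqlite3

# Sample rows invented for illustration. The apostrophes would break a
# statement assembled by plain string concatenation.
rows = [
    ("1", "It's a long quote " + "x" * 700, "O'Brien", "life,love"),
    ("2", "Short one", "Someone Else", ""),
]

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL connection
conn.execute(
    "CREATE TABLE quotes ("
    "quoteID TEXT PRIMARY KEY, quoteText TEXT, author TEXT, tags TEXT)"
)

# The values are passed separately and never become part of the SQL text.
conn.executemany(
    "INSERT OR IGNORE INTO quotes (quoteID, quoteText, author, tags) "
    "VALUES (?, ?, ?, ?)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT author, length(quoteText) FROM quotes ORDER BY quoteID"
).fetchall()
print(stored)
conn.close()
```

Both rows are stored, including the author name with an apostrophe and the quote text of more than 700 characters.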

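Results

Checking the Locust database shows 90 rows in total. The last page is not included, because the loop returns as soon as it finds no "Next" link.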

mysql> select * from quotes;
Empty set (0.01 sec)

mysql> select * from quotes;
+---------+-----------+----------------------+----------------------+
| quoteID | quoteText | author               | tags                 |
+---------+-----------+----------------------+----------------------+
| 1       | NULL      | Albert Einstein      | change,deep-thoughts |
| 10      | NULL      | Steve Martin         | humor,obvious,simile |
| 11      | NULL      | Marilyn Monroe       | friends,heartbreak,i |
| 12      | NULL      | J.K. Rowling         | courage,friends      |
| 13      | NULL      | Albert Einstein      | simplicity,understan |
| 14      | NULL      | Bob Marley           | love                 |
| 15      | NULL      | Dr. Seuss            | fantasy              |
| 16      | NULL      | Douglas Adams        | life,navigation      |
| 17      | NULL      | Elie Wiesel          | activism,apathy,hate |
| 18      | NULL      | Friedrich Nietzsche  | friendship,lack-of-f |
| 19      | NULL      | Mark Twain           | books,contentment,fr |
| 2       | NULL      | J.K. Rowling         | abilities,choices    |
| 20      | NULL      | Allen Saunders       | fate,life,misattribu |
| 21      | NULL      | Pablo Neruda         | love,poetry          |
| 22      | NULL      | Ralph Waldo Emerson  | happiness            |
| 23      | NULL      | Mother Teresa        | attributed-no-source |
| 24      | NULL      | Garrison Keillor     | humor,religion       |
| 25      | NULL      | Jim Henson           | humor                |
| 26      | NULL      | Dr. Seuss            | comedy,life,yourself |
| 27      | NULL      | Albert Einstein      | children,fairy-tales |
| 28      | NULL      | J.K. Rowling         |                      |
| 29      | NULL      | Albert Einstein      | imagination          |
| 3       | NULL      | Albert Einstein      | inspirational,life,l |
| 30      | NULL      | Bob Marley           | music                |
| 31      | NULL      | Dr. Seuss            | learning,reading,seu |
| 32      | NULL      | J.K. Rowling         | dumbledore           |
| 33      | NULL      | Bob Marley           | friendship           |
| 34      | NULL      | Mother Teresa        | misattributed-to-mot |
| 35      | NULL      | J.K. Rowling         | death,inspirational  |
| 36      | NULL      | Charles M. Schulz    | chocolate,food,humor |
| 37      | NULL      | William Nicholson    | misattributed-to-c-s |
| 38      | NULL      | Albert Einstein      | knowledge,learning,u |
| 39      | NULL      | Jorge Luis Borges    | books,library        |
| 4       | NULL      | Jane Austen          | aliteracy,books,clas |
| 40      | NULL      | George Eliot         | inspirational        |
| 41      | NULL      | George R.R. Martin   | read,readers,reading |
| 42      | NULL      | C.S. Lewis           | books,inspirational, |
| 43      | NULL      | Marilyn Monroe       |                      |
| 44      | NULL      | Marilyn Monroe       | girls,love           |
| 45      | NULL      | Albert Einstein      | life,simile          |
| 46      | NULL      | Marilyn Monroe       | love                 |
| 47      | NULL      | Marilyn Monroe       | attributed-no-source |
| 48      | NULL      | Martin Luther King J | hope,inspirational   |
| 49      | NULL      | J.K. Rowling         | dumbledore           |
| 5       | NULL      | Marilyn Monroe       | be-yourself,inspirat |
| 50      | NULL      | James Baldwin        | love                 |
| 51      | NULL      | Jane Austen          | friendship,love      |
| 52      | NULL      | Eleanor Roosevelt    | attributed,fear,insp |
| 53      | NULL      | Marilyn Monroe       | attributed-no-source |
| 54      | NULL      | Albert Einstein      | music                |
| 55      | NULL      | Haruki Murakami      | books,thought        |
| 56      | NULL      | Alexandre Dumas fils | misattributed-to-ein |
| 57      | NULL      | Stephenie Meyer      | drug,romance,simile  |
| 58      | NULL      | Ernest Hemingway     | books,friends,noveli |
| 59      | NULL      | Helen Keller         | inspirational        |
| 6       | NULL      | Albert Einstein      | adulthood,success,va |
| 60      | NULL      | George Bernard Shaw  | inspirational,life,y |
| 61      | NULL      | Charles Bukowski     | alcohol              |
| 62      | NULL      | Suzanne Collins      | the-hunger-games     |
| 63      | NULL      | Suzanne Collins      | humor                |
| 64      | NULL      | C.S. Lewis           | love                 |
| 65      | NULL      | J.R.R. Tolkien       | bilbo,journey,lost,q |
| 66      | NULL      | J.K. Rowling         | live-death-love      |
| 67      | NULL      | Ernest Hemingway     | good,writing         |
| 68      | NULL      | Ralph Waldo Emerson  | life,regrets         |
| 69      | NULL      | Mark Twain           | education            |
| 7       | NULL      | André Gide           | life,love            |
| 70      | NULL      | Dr. Seuss            | troubles             |
| 71      | NULL      | Alfred Tennyson      | friendship,love      |
| 72      | NULL      | Charles Bukowski     | humor                |
| 73      | NULL      | Terry Pratchett      | humor,open-mind,thin |
| 74      | NULL      | Dr. Seuss            | humor,philosophy     |
| 75      | NULL      | J.D. Salinger        | authors,books,litera |
| 76      | NULL      | George Carlin        | humor,insanity,lies, |
| 77      | NULL      | John Lennon          | beatles,connection,d |
| 78      | NULL      | W.C. Fields          | humor,sinister       |
| 79      | NULL      | Ayn Rand             |                      |
| 8       | NULL      | Thomas A. Edison     | edison,failure,inspi |
| 80      | NULL      | Mark Twain           | books,classic,readin |
| 81      | NULL      | Albert Einstein      | mistakes             |
| 82      | NULL      | Jane Austen          | humor,love,romantic, |
| 83      | NULL      | J.K. Rowling         | integrity            |
| 84      | NULL      | Jane Austen          | books,library,readin |
| 85      | NULL      | Jane Austen          | elizabeth-bennet,jan |
| 86      | NULL      | C.S. Lewis           | age,fairytales,growi |
| 87      | NULL      | C.S. Lewis           | god                  |
| 88      | NULL      | Mark Twain           | death,life           |
| 89      | NULL      | Mark Twain           | misattributed-mark-t |
| 9       | NULL      | Eleanor Roosevelt    | misattributed-eleano |
| 90      | NULL      | C.S. Lewis           | christianity,faith,r |
+---------+-----------+----------------------+----------------------+
90 rows in set (0.01 sec)

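Multithreading Optimization

For sites with many pages, consider running the crawl in parallel, for example with the multiprocessing library (it provides process pools and, through multiprocessing.dummy, thread pools with the same interface). First collect the links of all pages, then fetch the pages and extract the data in parallel to improve crawler throughput.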
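A rough sketch of the "collect the links first, then fetch in parallel" idea follows. It uses concurrent.futures.ThreadPoolExecutor from the standard library rather than multiprocessing, which is a substitution made here because page fetching is mostly waiting on the network. The function name crawl_pages, the fetch_and_parse callable, and the /page/N/ URL pattern are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(urls, fetch_and_parse, workers=4):
    """Apply fetch_and_parse to every URL using a pool of worker threads.

    pool.map returns the results in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_and_parse, urls))


# Assumption: the ten pages follow a /page/N/ pattern.
page_urls = ["http://quotes.toscrape.com/page/%d/" % n for n in range(1, 11)]
```

Any function that takes a URL and returns the extracted data can be passed as fetch_and_parse, for example one that calls requests.get() and then runs the XPath queries. A small worker count keeps the load on the target site modest.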
