
How to parse div tag from alexa.com and show results in table in django

Question:

I have built a web app with HTMLParser and urllib2 that fetches the first 20 websites from www.alexa.com/topsites/global and puts the results in an HTML table. My problem is that I can't apply the same approach to <div class="count"> and <div class="description">.

Can anybody help me with a snippet for this that doesn't use BS4?

My code so far:

urlparse.py

import HTMLParser, urllib

class MyHTMLParser(HTMLParser.HTMLParser):
    site_list = []

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.in_a = False
        self.next_link_text_pair = None

    def handle_starttag(self, tag, attrs):
        if tag=='a':
            for name, value in attrs:
                if name=='href':
                    self.next_link_text_pair = [value, '']
                    self.in_a = True
                    break

    def handle_data(self, data):
        if self.in_a: self.next_link_text_pair[1] += data

    def handle_endtag(self, tag):
        if tag=='a':
            if self.next_link_text_pair is not None:
                if self.next_link_text_pair[0].startswith('/siteinfo/'):
                    self.site_list.append(self.next_link_text_pair[1])
            self.next_link_text_pair = None
            self.in_a = False

if __name__=='__main__':
    p = MyHTMLParser()
    p.feed(urllib.urlopen('http://www.alexa.com/topsites/global').read())
    print p.site_list[:20]

urls.py

urlpatterns = patterns('',
    url(r'^$', 'myapp.views.top_urls', name='home'),
    url(r'^admin/', include(admin.site.urls)),
)

views.py

def top_urls(request):
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
    urls = p.site_list[:20]
    print urls
    return render(request, 'top_urls.html', {'urls': urls})

top_urls.html

...
<tbody>
    {% for url in urls %}
        <tr>
            <td>Something</td><!--here should be {{rank}}-->
            <td>{{ url }}</td>
            <td>something</td><!--here should be {{description}}-->
        </tr>
    {% endfor %}
</tbody>
...

Answer:

The idea is to build a kind of state machine. In the start-tag event we decide which field of the site info should be populated later in the data event. The decision is based on the class attribute of the <div>/<span> tag and the ATTR_FIELDS map.

For example, when a <div class="count"> tag starts, we will populate the rank field of the current self.site dictionary.

import HTMLParser  # Python 2 module name

class MyHTMLParser(HTMLParser.HTMLParser):

    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = self.data_field = False

    def reset(self):
        HTMLParser.HTMLParser.reset(self)
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        class_attr = dict(attrs).get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a':
                if class_attr != 'moreDesc':
                    self.site['url'] = dict(attrs)['href'].replace(
                                                             '/siteinfo/', '')
            elif tag in ['div', 'span']:
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None
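
To try the state machine without hitting the network, here is a minimal sketch of the same parser ported to Python 3, where the module is named html.parser. The class name SiteListParser and the SAMPLE_HTML markup are made up for illustration; the markup only mimics the class names the parser looks for and is not the real Alexa page. The sketch also uses .get('href', '') so that an <a> tag without href does not raise, which is a small deviation from the answer's code.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class SiteListParser(HTMLParser):
    """Python 3 port of the state-machine parser (hypothetical class name)."""

    ATTR_FIELDS = {'count': 'rank',
                   'description': 'description', 'remainder': 'description'}

    def reset_site(self):
        self.site = {'rank': '', 'url': '', 'description': ''}
        self.in_site_listing = False
        self.data_field = None

    def reset(self):
        # reset() is called by HTMLParser.__init__, so state is ready after construction
        super().reset()
        self.reset_site()
        self.site_list = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        class_attr = attrs.get('class')
        if tag == 'li' and class_attr == 'site-listing':
            self.in_site_listing = True
        elif self.in_site_listing:
            if tag == 'a' and class_attr != 'moreDesc':
                self.site['url'] = attrs.get('href', '').replace('/siteinfo/', '')
            elif tag in ('div', 'span'):
                self.data_field = self.ATTR_FIELDS.get(class_attr)

    def handle_data(self, data):
        if self.data_field:
            self.site[self.data_field] += data

    def handle_endtag(self, tag):
        if tag == 'li' and self.in_site_listing:
            self.site_list.append(self.site)
            self.reset_site()
        self.data_field = None


# Made-up markup (assumption): it only imitates the classes used above.
SAMPLE_HTML = (
    '<ul>'
    '<li class="site-listing">'
    '<div class="count">1</div>'
    '<a href="/siteinfo/example.com">example.com</a>'
    '<div class="description">An example site</div>'
    '</li>'
    '<li class="site-listing">'
    '<div class="count">2</div>'
    '<a href="/siteinfo/example.org">example.org</a>'
    '<div class="description">Another example</div>'
    '</li>'
    '</ul>'
)

parser = SiteListParser()
parser.feed(SAMPLE_HTML)
parser.close()
sites = parser.site_list
for site in sites:
    print(site['rank'], site['url'], site['description'])
```

Each entry of sites is a plain dictionary, which is what lets a template address the fields as site.rank, site.url and site.description.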

And then change the view and template:

views.py

import urllib2
from django.shortcuts import render

def top_urls(request):
    p = MyHTMLParser()
    p.feed(urllib2.urlopen('http://www.alexa.com/topsites/global').read())
    sites = p.site_list[:20]
    return render(request, 'top_urls.html', {'sites': sites})

top_urls.html

...
<tbody>
    {% for site in sites %}
        <tr>
            <td>{{ site.rank }}</td>
            <td>{{ site.url }}</td>
            <td>{{ site.description }}</td>
        </tr>
    {% endfor %}
</tbody>
...

EXPLANATION UPDATE:

Variables used:

  • self.site - current site info
  • self.in_site_listing - flag set to True while we are inside the <li class="site-listing"> tag
  • self.data_field - key in the site info to add the data
  • ATTR_FIELDS - a map of the <div>/<span> classes to the site info keys

The key method is handle_starttag():

def handle_starttag(self, tag, attrs):
    # get the tag `class` attribute if any
    class_attr = dict(attrs).get('class')
    # if the tag is `<li class="site-listing">` then set the flag that we
    # should populate the site info
    if tag == 'li' and class_attr == 'site-listing':
        self.in_site_listing = True
    # we are in the site population mode
    elif self.in_site_listing:
        if tag == 'a':
            # `<li class="site-listing">` contains two `<a>` tags. We should
            # use the tag without the `class="moreDesc"` attribute to set the url
            if class_attr != 'moreDesc':
                self.site['url'] = dict(attrs)['href'].replace(
                                                         '/siteinfo/', '')
        elif tag in ['div', 'span']:
            # we are in the `<div>` or `<span>` tag. Get the `class` attribute
            # of the tag and decide which field of the site info we will
            # populate in the `handle_data()` method
            self.data_field = self.ATTR_FIELDS.get(class_attr)
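
The dict(attrs).get('class') call works because the parser passes attrs as a list of (name, value) tuples. A small sketch that just records what handle_starttag() receives may make this concrete. It is written for Python 3 (html.parser), and the AttrEcho class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class AttrEcho(HTMLParser):
    """Hypothetical helper: records the arguments handle_starttag() receives."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; dict() makes lookups easy,
        # and .get() returns None when the tag has no class attribute
        self.seen.append((tag, attrs, dict(attrs).get('class')))


echo = AttrEcho()
echo.feed('<div class="count">1</div><a href="/siteinfo/example.com">x</a>')
echo.close()
seen = echo.seen
for tag, attrs, class_attr in seen:
    print(tag, attrs, class_attr)
```

The <a> tag has no class attribute, so its class_attr is None, which is why the comparison against 'moreDesc' is safe for links without a class.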

So handle_data() is pretty simple:

def handle_data(self, data):
    # if we know which field of site info should be populated
    if self.data_field:
        # append the data to this field. The site description is spread over
        # several tags, which is why we append instead of simply assigning.
        self.site[self.data_field] += data
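
To see why appending matters, the following sketch records every chunk that handle_data() receives when a description is split by a nested tag. It assumes Python 3 (html.parser); the ChunkRecorder class and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class ChunkRecorder(HTMLParser):
    """Hypothetical helper: collects each text chunk passed to handle_data()."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


recorder = ChunkRecorder()
recorder.feed('<div class="description">Search engine'
              '<span class="remainder"> and more</span></div>')
recorder.close()
chunks = recorder.chunks

# Joining the chunks rebuilds the full text, which is what `+=` does field by field
description = ''.join(chunks)
print(chunks)
print(description)
```

The text arrives as two separate calls, one per text run, so a plain assignment would keep only the last piece.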