python - Access to a specific table in html tag

Question:

I want to use BeautifulSoup to find the table defined in the “Content Logical Definition” section of the following pages:

1) https://www.hl7.org/fhir/valueset-account-status.html

2) https://www.hl7.org/fhir/valueset-activity-reason.html

3) https://www.hl7.org/fhir/valueset-age-units.html

Each page may contain several tables. The one I want is located under the <h2> tag whose text is “Content Logical Definition”. Some pages have no table in that section, and in that case I want the result to be null. I have tried several solutions so far, but each of them returns the wrong table for some of the pages.

The most recent solution, offered by alecxe, is this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

This solution returns null when the “Content Logical Definition” section contains no table. But for the second URL, which does have a table in that section, it returns the wrong one: a table at the end of the page.

How can I edit this code so that it returns the table defined directly after the tag with the text “Content Logical Definition”, and returns null if that section has no table?

Answer:

It looks like the problem with alecxe's code is that it only considers tables that are direct siblings of the h2, but the one you want is actually nested inside a div (which is the h2's sibling). This worked for me:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-account-status.html',
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]


def extract_table(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
    div = h2.find_next_sibling('div')
    return div.find('table')


for url in urls:
    print(extract_table(url))
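
If a page lacks the heading, or the section has no div before the next h2, the lookups above can hit None or reach into a later section. Below is a minimal sketch of a more defensive variant, not a tested fix for the live pages. The function name find_section_table, the case-insensitive heading match, the use of html.parser, and the sample HTML are all assumptions made for illustration; the real hl7.org markup may be nested differently.

```python
from bs4 import BeautifulSoup


def find_section_table(html, heading="content logical definition"):
    """Return the first <table> in the section under the matching <h2>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(
        lambda elm: elm.name == "h2" and heading in elm.get_text().lower()
    )
    if h2 is None:
        return None  # the page has no such section

    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            break  # reached the next section without finding a table
        if sibling.name == "table":
            return sibling  # table is a direct sibling of the heading
        nested = sibling.find("table")
        if nested is not None:
            return nested  # table is wrapped in a <div> or similar
    return None


# Made-up HTML for illustration, not copied from hl7.org.
sample = """
<h2>Content Logical Definition</h2>
<div><table id="wanted"><tr><td>a</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>b</td></tr></table>
"""
empty_section = """
<h2>Content Logical Definition</h2>
<p>No table here.</p>
<h2>Expansion</h2>
<table id="other"><tr><td>b</td></tr></table>
"""

print(find_section_table(sample)["id"])
print(find_section_table(empty_section))
```

To use it on the real pages, pass requests.get(url).content as the html argument. Stopping at the next h2 is what keeps a table from a later section from being returned when the target section is empty.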