I want to use BeautifulSoup to find a table that is defined in the “content logical definition” section of the pages at the following links:
Several tables may be defined on each page. The table I want is located under the <h2> tag with the text “content logical definition”. Some of the pages may lack a table in the “content logical definition” section, and in that case I want the result to be null. So far I have tried several solutions, but each of them returns the wrong table for some of the pages.
The last solution that was offered by alecxe is this:
    import requests
    from bs4 import BeautifulSoup

    urls = [
        # URLs of the pages to scrape
    ]

    for url in urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        # locate the heading that opens the section
        h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
        table = None
        for sibling in h2.find_next_siblings():
            if sibling.name == "table":
                table = sibling
                break
            if sibling.name == "h2":
                # the next section starts here, so stop searching
                break
This solution returns null if there is no table in the “content logical definition” section, but for the second URL, which does have a table in that section, it returns the wrong table: one at the end of the page.
How can I edit this code so that it returns the table defined directly after the tag with the text “content logical definition”, and returns null if there is no table in that section?
It looks like the problem with alecxe's code is that it only accepts a table that is a direct sibling of the h2, whereas the one you want is nested inside a div (which is the h2's sibling). This worked for me:
    import requests
    from bs4 import BeautifulSoup

    urls = [
        'https://www.hl7.org/fhir/valueset-account-status.html',
        'https://www.hl7.org/fhir/valueset-activity-reason.html',
        'https://www.hl7.org/fhir/valueset-age-units.html'
    ]

    def extract_table(url):
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'lxml')
        # locate the heading that opens the section
        h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
        # the table sits inside the first div that follows the heading
        div = h2.find_next_sibling('div')
        # find() returns None when the div contains no table
        return div.find('table')

    for url in urls:
        print(extract_table(url))
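If you want to keep the "stop at the next h2" guard from alecxe's loop while also accepting a table nested inside a div, the two ideas can be combined. The following is only a minimal sketch, not code from either answer: the helper name find_section_table and the inline sample markup are hypothetical, and it uses the built-in html.parser instead of lxml so that it runs on a plain bs4 install. It walks the tag siblings of the heading, stops at the next h2, and returns None when the section has no table or the heading is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real pages.
SAMPLE_WITH_TABLE = """
<h2>Content Logical Definition</h2>
<div><table id="wanted"><tr><td>code</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""

SAMPLE_WITHOUT_TABLE = """
<h2>Content Logical Definition</h2>
<p>No table in this section.</p>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""


def find_section_table(html, heading="Content Logical Definition"):
    """Return the first table between the matching h2 and the next h2, else None."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(lambda elm: elm.name == "h2" and heading in elm.get_text())
    if h2 is None:
        return None  # the page has no such section

    # Passing True restricts the siblings to tags (text nodes are skipped).
    for sibling in h2.find_next_siblings(True):
        if sibling.name == "h2":
            break  # the next section starts here, so stop searching
        if sibling.name == "table":
            return sibling  # table is a direct sibling of the heading
        nested = sibling.find("table")
        if nested is not None:
            return nested  # table is wrapped in a div or similar element
    return None


for sample in (SAMPLE_WITH_TABLE, SAMPLE_WITHOUT_TABLE):
    table = find_section_table(sample)
    print(table["id"] if table is not None else None)
```

For the real pages you would pass r.content from requests.get(url) as the html argument. Whether the markup on every page matches this sibling layout is an assumption you would need to check against the pages themselves.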