In Python, how would I go about grabbing all the titles and plain text from a Wikipedia article such as https://en.wikipedia.org/wiki/Amadeus_(film)? My current code is this:
    from bs4 import BeautifulSoup

    # ---- Definitions ---- #
    # Number of documents
    amount_of_documents = 1
    # Directory of raw HTML documents
    directory_of_raw_documents = "raw_documents/"
    # Directory of parsed documents
    directory_of_parsed_documents = "parsed_documents/"

    # ---- Code ---- #
    for i in range(1, amount_of_documents + 1):
        with open(directory_of_raw_documents + str(i), "r") as document:
            html = document.read()
        soup = BeautifulSoup(html, "html.parser")
        body = soup.find('div', id='bodyContent')
        for element in body.find_all('p'):
            # The loop body is missing from the post; printing the text is a placeholder
            print(element.get_text())
I am loading a downloaded HTML file, then using BeautifulSoup to grab all the content between the
<p> tags. My goal is to grab all titles and plain text content of this article. How would I go about doing this?
In the example posted above, my desired output would contain:
You might be interested in a specialized Wikipedia page parser such as the
wikipedia package. With it, you can get the contents easily:
    In : import wikipedia
    In : page = wikipedia.page("Amadeus (film)")
    In : page.summary
    Out: u"Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer's stage play Amadeus (1979). The story, set in Vienna, Austria, during the latter half of the 18th century, is a fictionalized biography of Wolfgang Amadeus Mozart. Mozart's music is heard extensively in the soundtrack of the movie. Its central thesis is that Antonio Salieri, an Italian contemporary of Mozart is so driven by jealousy of the latter and his success as a composer that he plans to kill him and to pass off a Requiem, which he secretly commissioned from Mozart as his own, to be premiered at Mozart's funeral. Historically, the Requiem which was never finished was commissioned by Count von Walsegg and Salieri, far from being jealous of Mozart, was on good terms with him and even tutored his son after Mozart's death.\nThe film was nominated for 53 awards and received 40, which included eight Academy Awards (including Best Picture), four BAFTA Awards, four Golden Globes, and a Directors Guild of America (DGA) award. As of 2016, it is the most recent film to have more than one nomination in the Academy Award for Best Actor category. In 1998, the American Film Institute ranked Amadeus 53rd on its 100 Years... 100 Movies list."
    In : page.content
    Out: u'Amadeus is a 1984 American period drama film directed by Milo\u0161 Forman, written by Peter Shaffer, and adapted from Shaffer\'s s ... Amadeus Filming locations at Movieloci.com'
As for the titles, here is some sample code that gets them with requests and BeautifulSoup:
    In : import requests
    In : from bs4 import BeautifulSoup
    In : url = "https://en.wikipedia.org/wiki/Amadeus_(film)"
    In : response = requests.get(url)
    In : soup = BeautifulSoup(response.content, "html.parser")
    In : [item.get_text() for item in soup.select("h2 .mw-headline")]
    Out: [u'Plot', u'Cast', u'Production', u'Reception', u'Alternative versions', u'Music', u'Awards and nominations', u'References', u'External links']
h2 .mw-headline is a CSS selector that matches any element with the
mw-headline class nested anywhere inside an
h2 element.
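If you want the titles and the paragraph text together, in the order they appear on the page, something like the following might work. It is only a sketch: SAMPLE_HTML is a hand-written stand-in for a downloaded article, and it assumes the same structure as above (a div with id bodyContent, and headings wrapped in a span with the mw-headline class). The function name extract_titles_and_text is made up for illustration, and real pages may use different markup.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written HTML that mimics the structure assumed above.
SAMPLE_HTML = """
<div id="bodyContent">
  <p>Amadeus is a 1984 film.</p>
  <h2><span class="mw-headline">Plot</span><span class="mw-editsection">[edit]</span></h2>
  <p>An elderly Salieri tells his story.</p>
  <h2><span class="mw-headline">Cast</span></h2>
  <ul><li>Not a paragraph</li></ul>
  <p>F. Murray Abraham plays Salieri.</p>
</div>
"""


def extract_titles_and_text(html):
    """Return (kind, text) tuples for h2 titles and paragraphs, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="bodyContent")
    results = []
    # Passing a list of tag names to find_all yields matches in document order.
    for element in body.find_all(["h2", "p"]):
        if element.name == "h2":
            # Prefer the mw-headline span so the "[edit]" link text is skipped;
            # fall back to the whole heading if the span is absent.
            headline = element.select_one(".mw-headline")
            target = headline if headline is not None else element
            results.append(("title", target.get_text().strip()))
        else:
            text = element.get_text().strip()
            if text:  # ignore empty paragraphs
                results.append(("text", text))
    return results


sections = extract_titles_and_text(SAMPLE_HTML)
titles = [text for kind, text in sections if kind == "title"]

for kind, text in sections:
    print(kind + ": " + text)
```

To use it on your files, pass the html string you already read from raw_documents/ instead of SAMPLE_HTML.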