I would like to extract a random poem from this book.
Using BeautifulSoup, I have been able to find the title and prose.
print soup.find('div', class_="pre_poem").text
print soup.find('table', class_="poem").text
But I would like to find all the poems and pick one.
Should I use a regex and match all between
Use an HTML document parser instead. It's safer in terms of unintended consequences.
Parsing HTML with regex is widely discouraged because the markup of a page is not static, especially if your source HTML is a live webpage: a small change in attributes, quoting, or whitespace can silently break the pattern. Regex is better suited to plain strings.
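As a small illustration of that fragility, here is a sketch using two hypothetical fragments that a browser would render identically. Only the class name `poem` comes from the question; the fragments and the pattern are made up for the example.

```python
import re

# Two hypothetical fragments with the same meaning but different markup details.
variant_a = '<table class="poem"><tr><td>I shot an arrow into the air,</td></tr></table>'
variant_b = "<table id='p1' class='poem'><tr><td>I shot an arrow into the air,</td></tr></table>"

# A pattern written against the exact markup of the first fragment.
pattern = re.compile(r'<table class="poem">(.*?)</table>', re.DOTALL)

def matches(fragment):
    """Return True if the hard-coded pattern finds a poem table."""
    return pattern.search(fragment) is not None

print(matches(variant_a))  # True
print(matches(variant_b))  # False: extra attribute and single quotes
```

A parser that matches on the tag name and class, such as `soup.find('table', class_="poem")`, would treat both fragments the same way.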
Use regex at your own risk.
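A parser-based way to collect every poem and pick one could look like the sketch below. It assumes each poem sits in its own `<table class="poem">`, as the `soup.find` call in the question suggests; the sample HTML and the helper name `all_poems` are hypothetical stand-ins, not the real page.

```python
import random
from bs4 import BeautifulSoup

# Hypothetical stand-in for the book's HTML; only the class names
# "pre_poem" and "poem" are taken from the question.
SAMPLE_HTML = """
<div class="pre_poem">Intro to the first poem.</div>
<table class="poem"><tr><td>First poem, line one.</td></tr></table>
<div class="pre_poem">Intro to the second poem.</div>
<table class='poem' id="p2"><tr><td>Second poem, line one.</td></tr></table>
"""

def all_poems(soup):
    """Return the text of every <table class="poem"> in document order."""
    return [table.get_text(strip=True)
            for table in soup.find_all('table', class_='poem')]

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
poems = all_poems(soup)

# Pick one poem at random from the collected list.
print(random.choice(poems))
```

On the real page the prose in `pre_poem` would need to be paired with its poem separately; this sketch only covers the "find all and pick one" step.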
Assuming you already have a suitable soup object to work with, the following might help you get started:
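import random

# Collect the href of every link in the table(s) of contents.
poem_ids = []

for section in soup.find_all('ol', class_="TOC"):
    poem_ids.extend(li.find('a').get('href') for li in section.find_all('li'))

# Strip the leading '#' from each href; the last TOC entry is skipped.
poem_ids = [href[1:] for href in poem_ids[:-1] if href]
poem_id = random.choice(poem_ids)

# Find the anchor for the chosen poem and walk forward from it.
poem_start = soup.find('a', id=poem_id)
poem = poem_start.find_next()
poem_text = []

while True:
    poem = poem.next_element

    # Stop at the next heading (or at the end of the document).
    if poem is None or poem.name == 'h3':
        break

    # Text nodes have no tag name.
    if poem.name is None:
        poem_text.append(poem.string)

print('\n'.join(poem_text).replace('\n\n\n', '\n'))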
This first extracts the list of poem links from the table of contents at the top of the page. Each link contains the unique ID of one poem. A random ID is then chosen, and the matching poem is extracted based on that ID.
For example, if the first poem was selected, you would see the following output:
"The Arrow and the Song," by Longfellow (1807-82), is placed first in this volume out of respect to a little girl of six years who used to love to recite it to me. She knew many poems, but this was her favourite.

I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.

I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong
That it can follow the flight of song?

Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.

Henry W. Longfellow.
This is done by using BeautifulSoup to extract all of the text from each element until the next <h3> tag is found, and then removing any extra line breaks.
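To try this walk without the real page, here is a minimal sketch on a hypothetical fragment. The layout (a TOC of `#id` links, `<a id=...>` anchors inside headings, each poem ending at the next `<h3>`) is an assumption modelled on the description above, and the helper names `toc_ids` and `poem_text_for` are invented for the example. Unlike the code above, it joins the text fragments directly and strips surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the assumed page layout.
SAMPLE_HTML = """
<ol class="TOC">
<li><a href="#poem1">First</a></li>
<li><a href="#poem2">Second</a></li>
</ol>
<h3><a id="poem1"></a>First</h3>
<p>Line one of the first poem.</p>
<h3><a id="poem2"></a>Second</h3>
<p>Line one of the second poem.</p>
"""

def toc_ids(soup):
    """Return the poem IDs referenced by the TOC links, without the '#'."""
    ids = []
    for section in soup.find_all('ol', class_='TOC'):
        for li in section.find_all('li'):
            link = li.find('a')
            if link is not None and link.get('href'):
                ids.append(link['href'].lstrip('#'))
    return ids

def poem_text_for(soup, poem_id):
    """Collect the text nodes between the poem's anchor and the next <h3>."""
    start = soup.find('a', id=poem_id)
    node = start.find_next()  # first tag after the anchor
    parts = []
    while True:
        node = node.next_element
        # Stop at the next heading, or when the document runs out.
        if node is None or node.name == 'h3':
            break
        if node.name is None:  # text nodes have no tag name
            parts.append(str(node))
    return ''.join(parts).strip()

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
for poem_id in toc_ids(soup):
    print(poem_id, '->', poem_text_for(soup, poem_id))
```

The `node is None` check matters for the last poem on a page, where there is no following `<h3>` to stop at.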