Python Regex Match Dates

Question:

I'm scraping and saving (as a comma-delimited text file) information on roll call votes in the US House of Representatives.

Each line in the resulting file takes the following form:

Roll Call Number, Bill, Date, Representative, Vote, Total Yeas, Total Nays

I'm running into trouble scraping the dates from 1-Nov-2001 (roll call 414) onward. Instead of matching 1-Nov-2001, the regex either matches the wrong text or fails outright. When it matches the wrong text, it returns the string '-AND-'. The page text does change between #414 and #415 to include the string 'YEAS-AND-NAYS'.

I'm betting I've written the regex wrong, but I'm not seeing it. What might I need to change to match the date instead? The relevant code is below.

import urllib2, datetime, sys, re, string
import xml.etree.ElementTree as ET

for i in range(414,514):
    if i < 10:
        num_string = "00"+str(i)
    elif i < 100:
        num_string = "0"+str(i)
    elif i > 100:
        num_string = str(i)
    print num_string, datetime.datetime.now()
    url = "http://clerk.house.gov/evs/2001/roll"+num_string+".xml"
    text = urllib2.urlopen(url).read()
    tree = ET.fromstring(text)
    notags = ET.tostring(tree, encoding="utf8", method="text")
    dte = re.search(r'[0-9]*-[A-Za-z]*-[0-9]*', notags).group()
    print dte

Answer:

Using a regular expression against an XML document is never a good idea (seriously).

You can achieve the desired result without any regular expressions by extracting the date from the relevant XML element (I've used lxml.etree instead of xml.etree.ElementTree, but the principle will be the same).

Also, I've added an easier way to generate a 3-digit number (leading 0 if necessary).

import urllib2, datetime, sys, string
import lxml.etree

for i in range(414,416):
    num_string = '{:03d}'.format(i)
    print num_string, datetime.datetime.now()
    url = "http://clerk.house.gov/evs/2001/roll"+num_string+".xml"
    xml = lxml.etree.parse(urllib2.urlopen(url))
    root = xml.getroot()
    actdate = root.xpath('//action-date')[0]
    dte = actdate.text.strip()
    print dte
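
For anyone who would rather stay with xml.etree.ElementTree (which the question already imports), the same element lookup might look roughly like the sketch below. It is written for Python 3 and runs against a hand-written XML fragment rather than a live download. The fragment is an assumption that only mimics the action-date element queried above, the other element names are illustrative, and extract_action_date is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a roll call document (an assumption, not data
# fetched from clerk.house.gov). Only <action-date> is taken from the answer
# above; the other element names are illustrative.
SAMPLE_XML = """\
<rollcall-vote>
  <vote-metadata>
    <rollcall-num>415</rollcall-num>
    <vote-type>YEAS-AND-NAYS</vote-type>
    <action-date> 1-Nov-2001 </action-date>
  </vote-metadata>
</rollcall-vote>
"""


def extract_action_date(xml_text):
    """Return the stripped text of the first <action-date> element, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset; './/tag' searches all descendants.
    node = root.find(".//action-date")
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(extract_action_date(SAMPLE_XML))
```

Because the lookup targets one named element, the 'YEAS-AND-NAYS' text elsewhere in the document cannot be mistaken for the date.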

If you insist on using a regular expression, then [0-9]+-[A-Za-z]+-[0-9]+ would be better. The original pattern uses *, which also accepts zero characters, so the '-AND-' inside 'YEAS-AND-NAYS' satisfies it with no digits at all. Using + guarantees at least one digit, followed by a dash, at least one letter, another dash, and at least one digit (as holdenweb mentions in his comment).
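
The difference between the two quantifiers can be checked on a short string. This is a minimal Python 3 sketch; SAMPLE_TEXT is an assumed stand-in for the tag-stripped page text, and the TIGHT pattern is an extra illustration that assumes dates always look like 1-Nov-2001 (one or two digits, a three-letter month, a four-digit year).

```python
import re

# Assumed stand-in for the tag-stripped text: the vote type precedes the date,
# as the question describes for roll call 415 onward.
SAMPLE_TEXT = "YEAS-AND-NAYS 1-Nov-2001"

LOOSE = re.compile(r"[0-9]*-[A-Za-z]*-[0-9]*")   # pattern from the question
STRICT = re.compile(r"[0-9]+-[A-Za-z]+-[0-9]+")  # pattern suggested above
# Illustrative tighter variant (an assumption about the date layout).
TIGHT = re.compile(r"\b[0-9]{1,2}-[A-Za-z]{3}-[0-9]{4}\b")

loose_match = LOOSE.search(SAMPLE_TEXT).group()
strict_match = STRICT.search(SAMPLE_TEXT).group()
tight_match = TIGHT.search(SAMPLE_TEXT).group()

print(repr(loose_match))   # '-AND-' : zero digits are allowed on both sides
print(repr(strict_match))  # '1-Nov-2001'
print(repr(tight_match))   # '1-Nov-2001'
```

re.search returns the leftmost match, so the loose pattern stops at '-AND-' before it ever reaches the date.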
