I'm writing a Python snippet to fix the casing of titles in HTML code. So far, I've come up with this code:
import re
pattern = re.compile("<h1>(.*)</h1>|<h2>(.*)</h2>|<h3>(.*)</h3>|<h4>(.*)</h4>|<h5>(.*)</h5>|<h6>(.*)</h6>")
def replace(m):
    contents = m.group(1)
    replacement = contents[0].upper() + contents[1:].lower()
    return replacement
Then, given a line, the transformation I use is line = pattern.sub(replace, line).
This doesn't work, because m.group(1) is always None, whereas I'd like it to be the match corresponding to whichever of the alternatives in my regex matched. Since groups can't share a name in Python, I'm somewhat at a loss.
An obvious solution is to merge all the alternatives I used into a single pattern with one group, but then <h1>bla</h2> would be recognized. That's not good, since <h1><a href="...">Bla</a></h1> <h2>Bla</h2> should yield two matches (<a href="...">Bla</a>, and Bla).
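One way to stay with a regex while still requiring the closing tag to match the opening one is a backreference: capture the tag name once and refer to it as \1 in the closing tag. The following is only a minimal sketch, assuming headings carry no attributes and do not span several lines; the names HEADING, fix_case and fix_line are made up for illustration.

```python
import re

# Group 1 captures the tag name, group 2 the contents. The backreference \1
# forces the closing tag to equal the opening one, so <h1>bla</h2> is skipped.
HEADING = re.compile(r"<(h[1-6])>(.*?)</\1>")

def fix_case(m):
    tag, contents = m.group(1), m.group(2)
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    return "<%s>%s</%s>" % (tag, contents.capitalize(), tag)

def fix_line(line):
    return HEADING.sub(fix_case, line)

print(fix_line("<h1>hello WORLD</h1> <h2>bla</h2>"))
# prints: <h1>Hello world</h1> <h2>Bla</h2>
```

Note that when the contents themselves contain markup, such as the <a href="..."> example, capitalize() would also lower-case the attribute values, which is one reason a real parser tends to be more robust here.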
From what I understand, you just want to capitalize all of the headings. You can use lxml, which would make this fairly painless:
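import lxml.html

doc = lxml.html.parse(your_html)  # your_html: a filename, URL, or file-like object
for i in range(1, 7):
    for h in doc.xpath('//h%d' % i):
        if h.text:  # h.text is None when the heading starts with a child element
            h.text = h.text.capitalize()
print(lxml.html.tostring(doc))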
Why do you care about that? HTML tags are not case sensitive. If you need a proper solution, then use a tool like BeautifulSoup. Parsing HTML using regular expressions is nonsense and never recommendable (discussed often enough).
You may want to have a look at this question and all the tons of comments and answers to it. :-)
to parse html.
The following XPath expression selects all the wanted text nodes:
//*[starts-with(name(),'h') and substring(name(),2) >= 1 and not(substring(name(),2) > 6)]//text()
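As a rough sketch of how that expression could be evaluated from Python, assuming lxml is installed (the sample markup and the heading_texts helper are made up for illustration):

```python
import lxml.html

# The XPath expression from above: elements named h1..h6, then all of their
# descendant text nodes.
HEADING_TEXT = ("//*[starts-with(name(),'h') and substring(name(),2) >= 1 "
                "and not(substring(name(),2) > 6)]//text()")

def heading_texts(html):
    root = lxml.html.fromstring(html)
    # lxml returns each text node as a str subclass; str() gives plain strings.
    return [str(t) for t in root.xpath(HEADING_TEXT)]

sample = '<div><h1><a href="#">Bla</a></h1> <h2>Bla</h2><p>other</p><hr></div>'
print(heading_texts(sample))
```

Elements such as hr or html also start with 'h', but the rest of their name is not a number, so the numeric comparison rejects them. To change the text rather than just read it, each result's getparent() method gives the owning element, so a new value can be assigned to that element's text or tail.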