
web scraping - Extracting a URL from an HTML page with Beautiful Soup/Python

Question:

I am trying to fetch an href based on the arguments I pass, for example test.py arg1 arg2 ..., where arg1 is a school name such as "south carolina". The script has to retrieve the score for the school given in the argument.

Here is a small snippet from the prettified source, which I saved using urlopen and BeautifulSoup.

 <a data-ylk="lt:s;sec:mod-sch;slk:game;itc:0;ltxt:;tar:sports.yahoo.com;"
    href="/ncaaf/south-carolina-gamecocks-georgia-bulldogs-201309070068/">
   <span class="away "> 30 </span>
   -
   <span class="home winner"> 41 </span>
 </a>

arg1 should now be matched against the href so that I can retrieve the score. I used:

bs.find('a', href="/ncaaf/south-carolina-gamecocks-georgia-bulldogs-201309070068/")

But what if I have to match my argument, such as south carolina, against the href? Something like href="/ncaaf/south-carolina-*, so that I can fetch the whole href just by matching argument 1 (in which I will replace the spaces with hyphens). Also, if I give "georgia", is it possible to retrieve the href just by matching the argument, regardless of where the string appears after /ncaaf/?

I'm not good with regex, so this is a bit complicated for me.

Answer:

You'd indeed have to match that with a regular expression.

If your command-line argument is of the form south-carolina in sys.argv[1], use:

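 import re
 import sys

 # School slug from the command line, e.g. "south-carolina"
 school_name = sys.argv[1]
 # Escape the slug so it is matched literally right after /ncaaf/
 url_pattern = re.compile(r'/ncaaf/{}-'.format(re.escape(school_name)))

 # soup is the parsed BeautifulSoup document (called bs in the question)
 matching_links = soup.find_all('a', href=url_pattern)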

re.escape() makes sure that any characters in the input that could be interpreted as regular expression meta-characters are properly escaped.
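
To illustrate why the escaping matters, here is a minimal sketch comparing an escaped and an unescaped pattern. The input s.carolina and the look-alike href are made up for the demonstration; they do not come from the page in the question.

```python
import re

# Hypothetical input containing a regex meta-character (the dot).
name = "s.carolina"

# Same pattern shape as in the answer, once without and once with re.escape().
unescaped = re.compile(r'/ncaaf/{}-'.format(name))
escaped = re.compile(r'/ncaaf/{}-'.format(re.escape(name)))

# Made-up href that differs from the input where the dot is.
lookalike = "/ncaaf/sxcarolina-team-201309070000/"

print(bool(unescaped.search(lookalike)))  # True: "." matches any character, here "x"
print(bool(escaped.search(lookalike)))    # False: the dot must now be a literal dot
```

Without the escape, the dot acts as a wildcard and the pattern accepts hrefs the user never asked for.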

For south-carolina that'd result in the pattern /ncaaf/south-carolina-, which matches any href containing the literal text /ncaaf/south-carolina-. You don't need to include any wildcard characters, because the match is done with re.search(), for which text containment is enough.
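
The pattern above only matches a school that comes first in the href, directly after /ncaaf/. The question also asks about matching a school such as georgia wherever it appears in the slug. One possible sketch, not part of the original answer, is below. The helper name school_pattern and the second href are hypothetical, and the sketch assumes the slug always consists of hyphen-separated lowercase words.

```python
import re

def school_pattern(school_name):
    """Build a regex that finds the school slug anywhere in an /ncaaf/ href."""
    # "South Carolina" -> "south-carolina", mirroring the URL slug style.
    slug = "-".join(school_name.lower().split())
    # (?:[^/]*-)? optionally skips earlier hyphen-terminated words in the
    # same path segment, so the slug may follow another team's name.
    return re.compile(r'/ncaaf/(?:[^/]*-)?{}-'.format(re.escape(slug)))

hrefs = [
    "/ncaaf/south-carolina-gamecocks-georgia-bulldogs-201309070068/",
    "/ncaaf/some-other-game-201309070001/",  # made-up href for contrast
]

pattern = school_pattern("Georgia")
print([h for h in hrefs if pattern.search(h)])  # only the first href matches
```

The compiled pattern can be passed to find_all('a', href=...) in the same way as url_pattern. Note that this looser match also accepts partial names: carolina would match the south-carolina link, and georgia would match any other school whose slug contains the word georgia followed by a hyphen.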
