Creating a dictionary using python and a .txt file

Question:

I have downloaded the following dictionary from Project Gutenberg http://www.gutenberg.org/cache/epub/29765/pg29765.txt (it is 25 MB so if you're on a slow connection avoid clicking the link)

In the file, the keywords I am looking for are in uppercase, for instance HALLUCINATION. Each keyword is followed by some lines dedicated to the pronunciation, which are irrelevant to me.

What I want to extract is the definition, indicated by "Defn:", and then print its lines. I have come up with this rather ugly 'solution':

def lookup(search):
    find = search.upper()  # transforms our search parameter to all upper letters
    output = []  # empty dummy list
    infile = open('webster.txt', 'r')  # opening the webster file for reading
    for line in infile:
        for part in line.split():
            if (find == part):
                for line in infile:
                    if (line.find("Defn:") == 0):  # ugly I know, but my only guess so far
                        output.append(line[6:])
                        print(output)  # uncertain about how to proceed
                        break

Now this of course only prints the first line that comes right after "Defn:". I am new to manipulating .txt files in Python and therefore clueless about how to proceed. I did read a line into a tuple and noticed that there are special newline characters.

So I suppose I want to tell Python to keep reading until it runs out of newline characters, except that the last line, which has none, still has to be read.

Could someone please point me to useful functions I could use to solve this problem? A minimal example would be appreciated.


Example of desired output:

lookup("hallucinate")
out: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

lookup("hallucination")
out: The perception of objects which have no reality, or of \r\n
sensations which have no corresponding external cause, arising from \r\n
disorder or the nervous system, as in delirium tremens; delusion.\r\n
Hallucinations are always evidence of cerebral derangement and are\r\n
common phenomena of insanity. W. A. Hammond.


From the text:

HALLUCINATE
Hal*lu"ci*nate, v. i. Etym: [L. hallucinatus, alucinatus, p. p. of
hallucinari, alucinari, to wander in mind, talk idly, dream.]

Defn: To wander; to go astray; to err; to blunder; -- used of mental
processes. [R.] Byron.

HALLUCINATION
Hal*lu`ci*na"tion, n. Etym: [L. hallucinatio cf. F. hallucination.]

1. The act of hallucinating; a wandering of the mind; error; mistake;
a blunder.
This must have been the hallucination of the transcriber. Addison.

2. (Med.)

Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

HALLUCINATOR
Hal*lu"ci*na`tor, n. Etym: [L.]

Answer:

Here is a function that returns the first definition:

def lookup(word):
    word_upper = word.upper()
    found_word = False
    found_def = False
    defn = ''
    with open('dict.txt', 'r') as file:
        for line in file:
            l = line.strip()
            if not found_word and l == word_upper:
                found_word = True
            elif found_word and not found_def and l.startswith("Defn:"):
                found_def = True
                defn = l[6:]
            elif found_def and l != '':
                defn += ' ' + l
            elif found_def and l == '':
                return defn
    return False

print(lookup('hallucination'))

Explanation: There are four different cases we have to consider.

  • We haven't found the word yet. We compare the current line to the word we are looking for, converted to uppercase. If they are equal, we have found the word.
  • We have found the word, but haven't found the start of the definition. We therefore look for a line that starts with Defn:. If we find it, we add the line to the definition (excluding the first six characters, "Defn: ").
  • We have already found the start of the definition and the current line is not empty. In that case, we just append the line to the definition.
  • We have already found the start of the definition and the current line is empty. The definition is complete and we return it.

If we found nothing, we return False.

Note: There are certain entries, e.g. CRANE, that have multiple definitions. The above code is not able to handle that. It will just return the first definition. However, it is far from easy to code a perfect solution considering the format of the file.
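
For entries with several definitions, one possible extension is to keep collecting Defn: paragraphs until the next headword appears. The following is only a minimal sketch, not a complete parser: lookup_all and is_headword are hypothetical names, and it assumes that a headword line contains letters but no lowercase letters, which may not hold for every line of the file.

```python
def is_headword(line):
    # Assumption: a headword line has at least one letter and no lowercase letters.
    return any(c.isalpha() for c in line) and line == line.upper()


def lookup_all(lines, word):
    """Return every 'Defn:' paragraph found in the entry (or entries) for word."""
    target = word.upper()
    definitions = []   # finished definitions
    current = None     # lines of the definition being read, or None
    in_entry = False   # True once the headword line has been seen

    for raw in lines:
        line = raw.strip()
        if not in_entry:
            in_entry = (line == target)
            continue
        if current is not None:
            if line:
                current.append(line)      # definition continues
            else:
                definitions.append(" ".join(current).strip())
                current = None            # a blank line ends the definition
            continue
        if line.startswith("Defn:"):
            current = [line[len("Defn:"):].strip()]
        elif is_headword(line) and line != target:
            break                         # a different headword: the entry is over

    if current is not None:               # file ended in the middle of a definition
        definitions.append(" ".join(current).strip())
    return definitions
```

It accepts any iterable of lines, so an open file can be passed directly, for example lookup_all(open('dict.txt'), 'crane'). A headword that is repeated (the same word listed again as a different part of speech) is treated as part of the same entry.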

Answer:

You can read the whole file, find the index of the search word, split the text that follows into paragraphs, and return the first paragraph that starts with Defn:

import re

def find_def(path, word):
    # newline="" keeps the file's \r\n line endings instead of translating them to \n (Python 3)
    with open(path, newline="") as f:
        text = f.read()
    try:
        start = text.index("{}\r\n".format(word))  # find where our search word is
    except ValueError:
        return "Cannot find search term"
    # split into paragraphs; maxsplit=10, as no definition groups more paragraphs than that
    paras = re.split(r"\s+\r\n", text[start:], maxsplit=10)
    for para in paras:
        if para.startswith("Defn:"):  # if para starts with Defn: we have what we need
            return para

print(find_def("in.txt", "HALLUCINATION"))

Using the whole file returns:

In [5]: print find_def("gutt.txt","VACCINATOR")
Defn: One who, or that which, vaccinates.

In [6]: print find_def("gutt.txt","HALLUCINATION")
Defn: The perception of objects which have no reality, or of
sensations which have no corresponding external cause, arising from
disorder or the nervous system, as in delirium tremens; delusion.
Hallucinations are always evidence of cerebral derangement and are
common phenomena of insanity. W. A. Hammond.

A slightly shorter version:

import re

def find_def(path, word):
    with open(path, newline="") as f:  # newline="" keeps the \r\n line endings (Python 3)
        text = f.read()
    try:
        start = text.index("{}\r\n".format(word))
    except ValueError:
        return "Cannot find search term"
    defn = text.index("Defn:", start)  # first Defn: after the search word
    return re.split(r"\s+\r\n", text[defn:], maxsplit=1)[0]

Answer:

From here I learned an easy way to deal with memory mapped files and use them as if they were strings. Then you can use something like this to get the first definition for a term.

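import mmap

def lookup(search):
    term = search.upper().encode('utf-8')  # mmap searches bytes, not str (Python 3)
    with open('webster.txt', 'rb') as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as s:
        # the headword stands alone on a line that follows a blank line
        index = s.find(b'\r\n\r\n' + term + b'\r\n')
        if index == -1:
            return None
        definition = s.find(b'Defn:', index) + len(b'Defn:') + 1  # skip "Defn: "
        endline = s.find(b'\r\n\r\n', definition)  # the definition ends at the next blank line
        # assumes the file is UTF-8 encoded; change the codec if your copy differs
        return s[definition:endline].decode('utf-8', errors='replace')

print(lookup('hallucination'))
print(lookup('hallucinate'))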

Assumptions:

  • There is at least one definition per term
  • If there is more than one, only the first is returned
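
Each function in this thread rescans the file on every call. If many words are to be looked up, one alternative is to read the text once and build an actual Python dict from headword to definitions. This is a rough sketch under assumptions rather than a tested parser for the real file: build_index is a hypothetical name, paragraphs are assumed to be separated by blank lines, and an entry is assumed to start with a line that contains letters but no lowercase letters.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each headword to the list of its 'Defn:' paragraphs."""
    index = defaultdict(list)
    headword = None
    # Normalise line endings, then split on blank lines to get paragraphs.
    for para in re.split(r"\n\s*\n", text.replace("\r\n", "\n")):
        lines = [ln.strip() for ln in para.strip().split("\n")]
        first = lines[0]
        # Assumption: a paragraph whose first line has letters and no
        # lowercase letters starts a new entry.
        if any(c.isalpha() for c in first) and first == first.upper():
            headword = first
        elif headword is not None and first.startswith("Defn:"):
            body = " ".join(lines)[len("Defn:"):].strip()
            index[headword].append(body)
    return dict(index)


# Possible usage (reads the whole file into memory):
# with open('webster.txt', newline='') as f:
#     index = build_index(f.read())
# print(index.get('HALLUCINATION'))
```

Headwords without any Defn: paragraph simply do not appear as keys, so index.get(word) returns None for them, and entries with several definitions keep all of them in order.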