
Efficiently parsing a large text file in Python?

Problem description:

I have a series of large, flat text files that I need to parse in order to insert them into a SQL database. Each record spans multiple lines and consists of about a hundred fixed-length fields. I am trying to figure out how to parse them efficiently without loading the entire file into memory.

Each record starts with a numeric "1" as the first character on a new line (though not every line that starts with "1" is a new record) and terminates many lines later with a series of 20 spaces. While each field is fixed-width, each record is variable-length because it may or may not contain several optional fields. So I've been using "...20 spaces...\n1" as a record delimiter.

I've been trying to work with something like this to process 1 KB at a time:

def read_in_chunks(file_object, chunk_size):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

file = open('test.txt')
for piece in read_in_chunks(file, chunk_size=1024):
    # Do stuff

However, the problem I'm running into is that a single record can span multiple chunks. Am I overlooking an obvious design pattern? This problem would seem to be fairly common. Thanks!
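For reference, here is a sketch of one way to carry a partial record across chunk boundaries: keep whatever trails the last complete delimiter in a buffer and prepend it to the next chunk. The helper name records_from_chunks, the chunk size, and the boundary regex are only assumptions based on the "...20 spaces...\n1" framing described above, not a tested implementation against the real files.

import re

# The record boundary described above: a line ending in 20 spaces, followed by
# a line that begins with "1". The lookahead keeps the "1" with the next record.
BOUNDARY = re.compile(r' {20}\n(?=1)')

def records_from_chunks(file_object, chunk_size=1024):
    buffer = ''
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        buffer += data
        start = 0
        for match in BOUNDARY.finditer(buffer):
            # Everything up to and including the 20-space terminator is one record.
            yield buffer[start:match.end()]
            start = match.end()
        # Keep the unfinished tail; it gets completed by the next chunk.
        buffer = buffer[start:]
    if buffer:
        yield buffer

with open('test.txt') as f:
    for record in records_from_chunks(f, chunk_size=1024):
        pass  # Do stuff

Because a delimiter that straddles a chunk boundary simply fails to match until the next chunk arrives, it stays in the buffer and no record is ever split.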

Answer:
def recordsFromFile(inputFile):
    # A record ends with a line of 20 trailing spaces; the next record then
    # starts with a line whose first character is "1". Lines read from the
    # file keep their trailing "\n", so include it in the terminator check.
    terminator = ' ' * 20 + '\n'
    record = ''
    for line in inputFile:
        if line.startswith('1') and record.endswith(terminator):
            yield record
            record = ''
        record += line
    if record:
        yield record

with open('test.txt') as inputFile:
    for record in recordsFromFile(inputFile):
        # Do stuff

BTW, file is a built-in name; it's bad style to shadow it with your own variable.
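Since the end goal is inserting the parsed records into a SQL database, here is a minimal sketch of that downstream step built on recordsFromFile above, using sqlite3 and executemany so records stream straight from the generator. The field offsets, column names, table, and database file are made-up placeholders; the real layout has roughly a hundred fixed-width fields, so in practice the FIELDS list would come from the file specification.

import sqlite3

# Hypothetical field layout: (column name, start offset, end offset) within a
# record. These three fields and the table/database names are placeholders
# for illustration only.
FIELDS = [
    ('record_type', 0, 1),
    ('account_id', 1, 11),
    ('amount', 11, 21),
]

def parse_fields(record):
    # Slice each fixed-width field out of the raw record and strip the padding.
    return tuple(record[start:end].strip() for _, start, end in FIELDS)

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE IF NOT EXISTS records (record_type TEXT, account_id TEXT, amount TEXT)')

with open('test.txt') as inputFile:
    conn.executemany(
        'INSERT INTO records VALUES (?, ?, ?)',
        (parse_fields(record) for record in recordsFromFile(inputFile)),
    )
conn.commit()
conn.close()

Streaming the generator into executemany keeps memory flat, since neither the records nor the parsed rows are ever collected into a list.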
