
shell - speed up adding data from a text file to PostgreSQL using a Django Python script

Question:

I am working with a server whose configuration is:

RAM - 56GB

Processor - 2.6 GHz x 16 cores

How can I do parallel processing using the shell? How can I utilize all the cores of the processor?

I have to load data from text files that contain millions of entries; for example, one file contains half a million lines of data.

I am using a Django Python script to load the data into a PostgreSQL database.

But it takes a lot of time to add the data to the database even though I have such a well-configured server. I don't know how to utilize the server resources in parallel so that it takes less time to process the data.

Yesterday I loaded only 15,000 lines of data from a text file into PostgreSQL, and it took nearly 12 hours.

My Django Python script is below:

import re
import collections

def SystemType():
    filename = raw_input("Enter file Name:")
    in_file = file(filename, "r")
    out_file = file("SystemType.txt", "w+")
    for line in in_file:
        line = line.decode("unicode_escape")
        line = line.encode("ascii", "ignore")
        values = line.split("\t")
        if values[1]:
            for list in values[1].strip("wordnetyagowikicategory"):
                out_file.write(re.sub("[^\ a-zA-Z()<>\n""]", " ", list))

# Eliminate duplicate entries from extracted data using a regular expression
def FSystemType():
    lines_seen = set()
    outfile = open("Output.txt", "w+")
    infile = open("SystemType.txt", "r+")
    for line in infile:
        if line not in lines_seen:
            l = line.lstrip()
            # The regex below is used to handle camel case.
            outfile.write(re.sub(r'((?<=[a-z])[A-Z]|(?<!\A)[A-Z](?=[a-z]))', r' \1', l).lower())
            lines_seen.add(line)
    infile.close()
    outfile.close()

sylist = []

def create_system_type(stname):
    syslist = Systemtype.objects.all()
    for i in syslist:
        sylist.append(str(i.title))
    if not stname in sylist:
        slu = slugify(stname)
        st = Systemtype()
        st.title = stname
        st.slug = slu
        # st.sites = Site.objects.all()[0]
        st.save()
        print "one ST added."

Answer:

If you could express your requirements without the code (not every shell programmer can really read Python), we could possibly help here.

For example, your report of 12 hours for 15,000 lines suggests you have a too-busy "for" loop somewhere, and I'd suspect the nested one:

for list in values[1]....

What are you trying to strip? Individual characters, or whole words? ...

Then I'd suggest "awk".
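The nested loop is indeed a likely culprit: in Python, `str.strip(chars)` treats its argument as a *set of characters* to trim from both ends of the string, not as a word to remove, and iterating over the resulting string yields one character at a time, so the `out_file.write(...)` call in the question's script runs once per character. A small sketch (the sample string here is made up for illustration):

```python
# str.strip(chars) trims any of the listed CHARACTERS from both ends --
# it does not remove the word "wordnetyagowikicategory" as a whole.
s = "wordnet_organization"
stripped = s.strip("wordnetyagowikicategory")
print(stripped)  # → _organiz

# Iterating over a string yields single characters, so a loop like
# "for list in values[1].strip(...)" performs one write() per character.
chars = [c for c in stripped]
print(chars)  # → ['_', 'o', 'r', 'g', 'a', 'n', 'i', 'z']
```

Note that the result is mangled data as well as wasted time: characters are trimmed from both ends wherever they happen to match, which is probably not what the script intends.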

Answer:

If you are able to work out the precise data structure required by Django, you can load the database tables directly using the psql "copy" command. You could do this by preparing a CSV file to load into the db.

There are any number of reasons why loading is slow using your approach. First of all, Django has a lot of transactional overhead. Secondly, it is not clear in what way you are running the Django code -- is this via the internal testing server? If so, you may have to deal with the slowness of that. Finally, what makes a database fast is not normally the CPU, but rather fast I/O and lots of memory.
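A minimal sketch of the CSV-then-COPY approach (the table name, column names, and file names below are assumptions for illustration, not taken from the question): deduplicate and slugify in Python, write one CSV file, then load it in a single statement with psql, which avoids per-row ORM and transaction overhead.

```python
import csv
import re

def slugify(title):
    # Minimal stand-in for Django's slugify, just for this sketch.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

titles = ["File System", "Operating System", "File System"]  # sample input with a duplicate

seen = set()
with open("systemtype.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for t in titles:
        if t not in seen:  # drop duplicates before they reach the database
            seen.add(t)
            writer.writerow([t, slugify(t)])

# Then load everything in one shot (hypothetical database/table names):
#   psql -d mydb -c "\copy app_systemtype(title, slug) FROM 'systemtype.csv' WITH (FORMAT csv)"
```

Half a million rows loaded this way typically takes seconds rather than hours, because COPY streams the file server-side instead of issuing one INSERT per row.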
