I have a tab-delimited data file with just over 2 million rows and 19 columns. You can find it as US.zip at http://download.geonames.org/export/dump/ .
I started out running this with for l in f.readlines(), but I understand that simply iterating over the file object is supposed to be more efficient, so that's the version I'm posting below. Even with that small optimization, though, the process is using 30% of my memory and has only gotten through about 6.5% of the records, so it looks like it will still run out of memory at this pace, just as it did before. The function is also very slow. Is there anything obvious I can do to speed it up? Would del-ing the objects on each pass of the loop help (sketched after the code below)?
def run():
    from geonames.models import POI
    f = file('data/US.txt')
    for l in f:
        li = l.split('\t')
        try:
            p = POI()
            p.geonameid = li[0]
            p.name = li[1]
            p.asciiname = li[2]
            p.alternatenames = li[3]
            p.point = "POINT(%s %s)" % (li[5], li[4])
            p.feature_class = li[6]
            p.feature_code = li[7]
            p.country_code = li[8]
            p.ccs2 = li[9]
            p.admin1_code = li[10]
            p.admin2_code = li[11]
            p.admin3_code = li[12]
            p.admin4_code = li[13]
            p.population = li[14]
            p.elevation = li[15]
            p.gtopo30 = li[16]
            p.timezone = li[17]
            p.modification_date = li[18]
            p.save()
        except IndexError:
            pass

if __name__ == "__main__":
    run()
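(For clarity, the del idea I'm asking about is just dropping the references at the bottom of the loop, something like this; I don't know whether it actually frees anything meaningful, since the names get rebound on the next iteration anyway:)

            p.save()
        except IndexError:
            pass
        # the "del with each pass" idea: explicitly drop the row's objects
        del p
        del li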
EDIT, more info (apparently important):
Memory usage goes up as the script runs and saves more rows. The .save() method is a lightly overridden Django model method with a unique_slug snippet, writing to a PostgreSQL/PostGIS database.
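(The unique_slug part follows the usual pattern of slugifying a field and appending a counter until there is no collision; roughly like the sketch below, with illustrative field names rather than my exact code:)

from django.contrib.gis.db import models
from django.template.defaultfilters import slugify

class POI(models.Model):
    name = models.CharField(max_length=200)
    slug = models.SlugField(unique=True)
    point = models.PointField()
    # ... the rest of the geonames columns ...

    def save(self, *args, **kwargs):
        # build a unique slug by appending a counter until no other row uses it
        if not self.slug:
            base = slugify(self.name)
            slug, n = base, 2
            while POI.objects.filter(slug=slug).exclude(pk=self.pk).exists():
                slug = "%s-%d" % (base, n)
                n += 1
            self.slug = slug
        super(POI, self).save(*args, **kwargs)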
SOLVED: Django's DEBUG database query logging eats the memory.
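(In other words: with DEBUG = True, Django appends every executed query to django.db.connection.queries, which grows without bound over a couple of million saves. Setting DEBUG = False in settings.py fixes it; alternatively the query log can be cleared periodically inside the loop, roughly like this:)

from django import db

def run():
    from geonames.models import POI
    f = open('data/US.txt')
    for i, l in enumerate(f):
        li = l.split('\t')
        try:
            p = POI()
            # ... field assignments as in the original loop ...
            p.save()
        except IndexError:
            pass
        if i % 10000 == 0:
            db.reset_queries()  # drop the query log that DEBUG mode accumulates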