Shelve dictionary size is >100 GB for a 2 GB text file

I create a sequence shelf file from a genomic FASTA file:

    # Import necessary libraries
    import shelve
    from Bio import SeqIO

    # Create dictionary of genomic sequences
    genome = {}
    with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
        for record in SeqIO.parse(handle, "fasta"):
            genome[str(record.id)] = str(record.seq)

    # Shelve genome sequences
    myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
    myShelve.update(genome)
    myShelve.close()

The FASTA file itself is 2.6 GB, but when I try to shelve it a file of more than 100 GB is created, and my computer also complains about running out of memory and the startup disk filling up. This only happens under OS X Yosemite; on Ubuntu it works as expected. Any suggestions why this is not working? I am using Python 3.4.2.

+5
2 answers

Check which interface is used by dbm: import dbm; print(dbm.whichdb('your_file.db')). The file format used for the database depends on the best installed binary package available on your system and its interfaces. The newest is gdbm, dumb is a fallback solution if no binary package is found, and ndbm is something in between.

https://docs.python.org/3/library/shelve.html
https://docs.python.org/3/library/dbm.html
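
For reference, here is a small sketch of my own (not part of the original answer) that reports which dbm backends are importable on the current interpreter and which format an existing shelf file actually uses on disk; the .db file name is simply the one from the question:

    # Report which dbm backends can be imported and what format an
    # existing shelf file uses on disk.
    import dbm
    import importlib

    for name in ("dbm.gnu", "dbm.ndbm", "dbm.dumb"):
        try:
            importlib.import_module(name)
            print(name, "is available")
        except ImportError:
            print(name, "is NOT available")

    # whichdb() returns the backend module name, '' if the format is not
    # recognized, or None if the file cannot be read.
    print(dbm.whichdb("Mus_musculus.GRCm38.dna.primary_assembly.db"))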

It is pointless to hold all the data in memory, because you then lose that memory for the file system cache. Updating in smaller blocks is better. I don't even see a slowdown if items are updated one by one.

    import shelve
    from Bio import SeqIO

    myShelve = shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db")
    with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
        # Write each record to the shelf as it is parsed, instead of
        # building the whole dictionary in memory first.
        for record in SeqIO.parse(handle, "fasta"):
            myShelve.update([(str(record.id), str(record.seq))])
    myShelve.close()

It is known that dbm databases become fragmented if the application crashed after updates without closing the database. I think this was your case. Now you probably don't have important data in the big file yet, but in the future you can defragment a database with gdbm.reorganize().
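
One way to guarantee the database is always closed cleanly is to open the shelf with a with statement, which shelve supports since Python 3.4, so close() runs even if the parsing loop raises an exception. This is my own sketch, not part of the code above:

    import shelve
    from Bio import SeqIO

    # The with statement closes the shelf even when an exception is raised
    # inside the loop, so an interrupted run does not leave the shelf
    # un-closed after partial updates.
    with shelve.open("Mus_musculus.GRCm38.dna.primary_assembly.db") as db:
        with open("Mus_musculus.GRCm38.dna.primary_assembly.fa") as handle:
            for record in SeqIO.parse(handle, "fasta"):
                db[str(record.id)] = str(record.seq)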

+2

I had the same problem: on a macOS system, a shelf holding about 4 megabytes of data grew to a huge 29 gigabytes on disk! This evidently happened because I updated the same key-value pairs in the shelf over and over again.

Since my shelf was based on GNU dbm, I was able to use its reorganize() facility. Here is the code that brought my shelf file back to its normal size within seconds:

    import dbm

    db = dbm.open(shelfFileName, 'w')
    db.reorganize()
    db.close()

I am not sure whether this method works for other (non-GNU) dbm implementations. To check your dbm system, recall the code shown by @hynekcer:

    import dbm

    print(dbm.whichdb(shelfFileName))

If your system uses GNU dbm, this should print 'dbm.gnu' (which is the new name for the old gdbm).
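
Putting the two answers together, here is a small sketch of my own (the shelf file name is hypothetical) that calls reorganize() only when whichdb() confirms the file really is GNU dbm:

    import dbm

    shelfFileName = "my_shelf.db"  # hypothetical path to your shelf file

    if dbm.whichdb(shelfFileName) == "dbm.gnu":
        db = dbm.open(shelfFileName, 'w')
        try:
            db.reorganize()  # compacts the fragmented file in place
        finally:
            db.close()
    else:
        print("Not a GNU dbm file; reorganize() is not available here.")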

0

Source: https://habr.com/ru/post/1207518/

