Saving a huge bigram dictionary to a file using pickle

My friend wrote this little program. The text file is 1.2 GB in size (seven years of a newspaper). He manages to build the dictionary, but he cannot write it to a file using pickle (the program freezes).

    import sys
    import string
    import cPickle as pickle

    biGramDict = {}

    textFile = open(str(sys.argv[1]), 'r')
    biGramDictFile = open(str(sys.argv[2]), 'w')

    for line in textFile:
        if line.find('<s>') != -1:            # start of a sentence
            old = None
            for line2 in textFile:
                if line2.find('</s>') != -1:  # end of the sentence
                    break
                else:
                    line2 = line2.strip()
                    if line2 not in string.punctuation:
                        if old is not None:
                            if old not in biGramDict:
                                biGramDict[old] = {}
                            if line2 not in biGramDict[old]:
                                biGramDict[old][line2] = 0
                            biGramDict[old][line2] += 1
                        old = line2

    textFile.close()

    print "going to pickle..."
    pickle.dump(biGramDict, biGramDictFile, 2)
    print "pickle done. now load it..."

    biGramDictFile.close()
    biGramDictFile = open(str(sys.argv[2]), 'r')
    newBiGramDict = pickle.load(biGramDictFile)

Thanks in advance.

EDIT
For everyone who is interested, I will briefly explain what this program does. Suppose you have a file formatted something like this:

    <s>
    Hello
    ,
    World
    !
    </s>
    <s>
    Hello
    ,
    munde
    !
    </s>
    <s>
    World
    domination
    .
    </s>
    <s>
    Total
    World
    domination
    !
    </s>
  • <s> and </s> are sentence delimiters.
  • there is one word per line.

A bigram dictionary is generated for later use. It looks something like this:

 { "Hello": {"World": 1, "munde": 1}, "World": {"domination": 2}, "Total": {"World": 1}, } 

Hope this helps. The strategy has now changed to using MySQL, because SQLite just didn't work (possibly due to the size).

+4
source
4 answers

Pickle is only meant for writing complete (small) objects in one go. Your dictionary is a bit too large even to hold in memory; you are better off using a database, so that you can store and retrieve entries one by one instead of all at once.

Some good and easily integrated single-file database formats you can use from Python are SQLite and the various DBM variants. The latter act just like a dictionary (i.e. you can read and write key/value pairs) but use the disk as storage, rather than 1.2 GB of memory.
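
A minimal sketch of the DBM idea in Python 2, using the standard anydbm module; the file name and the tab-joined key scheme are assumptions of this example, not part of the answer:

    import anydbm

    # open (or create, with 'c') a DBM file on disk;
    # it behaves like a dict of strings backed by the disk
    db = anydbm.open('bigrams.db', 'c')

    def add_bigram(old, new):
        # DBM keys and values must be strings, so the two words are
        # joined with a tab and the count is stored as a string
        key = old + '\t' + new
        if db.has_key(key):
            db[key] = str(int(db[key]) + 1)
        else:
            db[key] = '1'

    add_bigram('Hello', 'World')
    add_bigram('Hello', 'World')
    print db['Hello\tWorld']  # prints: 2
    db.close()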

+10
source

Do you really need all the data in memory at once? You could split it naively, say one file per year or per month, if you want to keep the dictionary/pickle approach; see the sketch below.
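
A minimal sketch of that idea (the per-year file naming is assumed for illustration, not part of the answer):

    import cPickle as pickle

    def dump_per_year(dicts_by_year):
        # dicts_by_year maps a year to the bigram dict built from that year
        for year, bigrams in dicts_by_year.items():
            f = open('bigrams-%d.pickle' % year, 'wb')
            pickle.dump(bigrams, f, 2)
            f.close()

    def load_year(year):
        # load only the one year you actually need
        f = open('bigrams-%d.pickle' % year, 'rb')
        bigrams = pickle.load(f)
        f.close()
        return bigrams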

Also, remember that dictionaries are not sorted; you may run into trouble if you ever need to sort or search that amount of data.

In any case, I think the database approach suggested in the earlier answer is the most flexible one, especially in the long run...

+1
source

One solution is to use buzhug instead of pickle. It is a pure-Python solution and retains a very Pythonic syntax. I think of it as the next step up from shelve and the like. It will handle the data sizes you are talking about: its limit is 2 GB per field (each field is stored in a separate file).
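
A minimal sketch of what that could look like, based on buzhug's documented Base API; the field names (word1, word2, count) are made up for this example:

    from buzhug import Base

    # create an on-disk base with one row per bigram
    # (an already existing base would be opened with db.open() instead)
    db = Base('bigrams')
    db.create(('word1', str), ('word2', str), ('count', int))

    db.insert(word1='Hello', word2='World', count=1)
    db.insert(word1='World', word2='domination', count=2)

    # select by field equality; records expose fields as attributes
    for rec in db.select(['word1', 'word2', 'count'], word1='Hello'):
        print rec.word1, rec.word2, rec.count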

+1
source

If you really want dictionary-like semantics, try SQLAlchemy's associationproxy. The following (rather long) piece of code translates your dictionary into key/value pairs in an entries table. I do not know how SQLAlchemy copes with your big dictionary, but SQLite should handle it well.

    from sqlalchemy import create_engine, MetaData
    from sqlalchemy import Table, Column, Integer, ForeignKey, Unicode, UnicodeText
    from sqlalchemy.orm import mapper, sessionmaker, scoped_session, Query, relation
    from sqlalchemy.orm.collections import column_mapped_collection
    from sqlalchemy.ext.associationproxy import association_proxy
    from sqlalchemy.schema import UniqueConstraint

    engine = create_engine('sqlite:///newspapers.db')

    metadata = MetaData()
    metadata.bind = engine

    Session = scoped_session(sessionmaker(engine))
    session = Session()

    newspapers = Table('newspapers', metadata,
        Column('newspaper_id', Integer, primary_key=True),
        Column('newspaper_name', Unicode(128)),
    )
    entries = Table('entries', metadata,
        Column('entry_id', Integer, primary_key=True),
        Column('newspaper_id', Integer, ForeignKey('newspapers.newspaper_id')),
        Column('entry_key', Unicode(255)),
        Column('entry_value', UnicodeText),
        UniqueConstraint('entry_key', 'entry_value', name="pair"),
    )

    class Base(object):
        def __init__(self, **kw):
            for key, value in kw.items():
                setattr(self, key, value)
        query = Session.query_property(Query)

    def create_entry(key, value):
        return Entry(entry_key=key, entry_value=value)

    class Newspaper(Base):
        entries = association_proxy('entry_dict', 'entry_value',
                                    creator=create_entry)

    class Entry(Base):
        pass

    mapper(Newspaper, newspapers, properties={
        'entry_dict': relation(Entry,
            collection_class=column_mapped_collection(entries.c.entry_key)),
    })
    mapper(Entry, entries)

    metadata.create_all()

    dictionary = {
        u'foo': u'bar',
        u'baz': u'quux',
    }

    roll = Newspaper(newspaper_name=u"The Toilet Roll")
    session.add(roll)
    session.flush()

    roll.entries = dictionary
    session.flush()

    for entry in Entry.query.all():
        print entry.entry_key, entry.entry_value

    session.commit()
    session.expire_all()

    print Newspaper.query.filter_by(newspaper_id=1).one().entries

gives

    foo bar
    baz quux
    {u'foo': u'bar', u'baz': u'quux'}
0
source
