I assume that several copies of your dictionary are stored in memory simultaneously (in different formats). For example, the line:
tweets = map(json.loads, tweet_file)
creates a new copy (+400 to ~1000 MB, dictionary included), while your original tweet_file stays in memory as well. Why such big numbers? If you work with Unicode strings, each Unicode character takes 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most characters take only 1 byte. If you work with plain byte strings in Python 2, the size of a string in memory should be roughly the same as its size on disk, so in that case you will have to look for another explanation.
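To make the two effects concrete, here is a minimal Python 2 sketch with a made-up one-line tweet (the exact byte counts depend on your build):

import sys
import json

line = '{"text": "Un exemple de tweet"}'   # hypothetical byte string, as read from a UTF-8 file
parsed = json.loads(line)                  # a brand-new dict whose string values are unicode objects

print sys.getsizeof(line)             # raw bytes plus ~40 bytes of str overhead
print sys.getsizeof(parsed["text"])   # 2 or 4 bytes per character plus ~52 bytes of unicode overhead
print sys.getsizeof(parsed)           # the dict object itself comes on top of that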
EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42
As you can see, it seems that each character is encoded as one byte. But:
>>> sys.getsizeof("à") 42
Not for "French" accented characters. And:
>>> sys.getsizeof("世") 43 >>> sys.getsizeof("世界") 46
For Japanese characters, we get 3 bytes per character.
The results above are system-dependent; they are explained by the fact that my system uses UTF-8 as its default encoding. The "string size" computed above is actually the size of the byte string that represents the given text in that encoding.
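A quick way to confirm that these byte strings hold encoded bytes rather than characters (assuming a UTF-8 terminal, as on my system):

>>> len("à")      # 2 bytes in UTF-8
2
>>> len(u"à")     # 1 character
1
>>> len("世")     # 3 bytes in UTF-8
3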
If "json.load" uses "unicode" strings, the result is somehow different:
>>> sys.getsizeof(u"") 52 >>> sys.getsizeof(u"a") 56 >>> sys.getsizeof(u"ab") 60 >>> sys.getsizeof(u"世") 56 >>> sys.getsizeof(u"世界") 60
In this case, as you can see, each additional character adds 4 bytes.
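This is easy to check on a longer string; on my system (a wide Unicode build) the difference below is exactly 4 bytes per character, while a narrow build would give 2:

>>> sys.getsizeof(u"a" * 1000) - sys.getsizeof(u"")
4000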
Maybe the file object is caching some data? If you want to force an object to be collected explicitly, try setting the references to it to None:
tweets = map(json.loads, tweet_file)
tweet_file = None
When an object is no longer referenced, Python will deallocate it and thus free the corresponding memory (from the Python heap; I don't think that memory is returned to the system).
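Putting it together, a minimal sketch (with a hypothetical tweets.json file name) that drops both references once the data is no longer needed; in CPython 2 the memory is reclaimed as soon as the last reference goes away, and gc.collect() is only needed to break reference cycles:

import gc
import json

tweet_file = open("tweets.json")        # hypothetical file name
tweets = map(json.loads, tweet_file)    # builds the full list of parsed dicts in memory
tweet_file.close()
tweet_file = None                       # drop the reference to the file object

# ... work with tweets here ...

tweets = None                           # drop the last reference to the parsed list
gc.collect()                            # optional: only reclaims reference cycles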