Python OrderedDict stuttering versus dict()

This completely puzzles me.

    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            # row = collections.OrderedDict([("computer_name", key_host), ("id", index), ("hist_item", hist_item)])
            row = {"computer_name": key_host, "id": index, "hist_item": hist_item}
            asset_hist.append(row)

This code works great with the OrderedDict line commented out, as shown. However, when I comment out the `row = dict` line and uncomment the OrderedDict line, things get very strange. About 4 million of these rows are generated and appended to asset_hist.

So, when I use `row = dict`, the whole loop finishes in about 10 milliseconds, lightning fast. When I use the OrderedDict, I waited more than 10 minutes and it still had not finished. Now, I know OrderedDict should be somewhat slower than dict, but I would expect about 10x slower at worst, and by my math it is about 100,000 times slower in this function.
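(For reference, the constant-factor gap is easy to measure in isolation. Below is a minimal micro-benchmark sketch, not from the original post; the key names just mirror the rows above. It only shows construction cost, so it will not reproduce the swap-thrashing diagnosed in the answer below.)

    # Hypothetical micro-benchmark: build many small 3-key mappings both ways.
    import timeit

    setup = "from collections import OrderedDict"
    dict_stmt = "{'computer_name': 'host1', 'id': 1, 'hist_item': 'string1'}"
    od_stmt = ("OrderedDict([('computer_name', 'host1'), ('id', 1),"
               " ('hist_item', 'string1')])")

    n = 100000
    print("dict:       ", timeit.timeit(dict_stmt, setup, number=n))
    print("OrderedDict:", timeit.timeit(od_stmt, setup, number=n))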

I decided to print the index in the inner loop to see what was happening. Interestingly enough, I noticed stuttering in the console output. The index would print very quickly on screen, then stop for about 3-5 seconds before continuing.

am_output.asset_history is a dictionary whose keys are hosts and whose values are lists of strings. For instance:

    am_output.asset_history = {"host1": ["string1", "string2", ...], "host2": ["string1", "string2", ...], ...}

EDIT: Stutter analysis using OrderedDict

Total memory on this VM server: only 8 GB ... need to get more provisioned.

LOOP NUM

184796 (~5 second pause, ~60% memory usage)
634481 (~5 second pause, ~65% memory usage)
1197564 (~5 second pause, ~70% memory usage)
1899247 (~5 second pause, ~75% memory usage)
2777296 (~5 second pause, ~80% memory usage)
3873730 (LONG WAIT ... gave up after 20 minutes!, 88.3% memory usage, process still running)

Where the wait occurs varies with each run.

EDIT: Ran it again; this time it hung at 3873333, close to where it stopped before. It hung after forming the row, while trying to append it ... I hadn't noticed that on the previous attempt, but it was true there too ... it's the append line that is the problem, not the row-creation line ... I'm still baffled. Here's the row it produced right before the long hang (I added the row to the print statement) ... the hostname has been changed to protect the innocent:

    3873333: OrderedDict([('computer_name', 'bg-fd5612ea'), ('id', 1), ('hist_item', "sys1 Normalizer (sys1-4): the domain name cannot be determined from the sys1 name 'bg-fd5612ea'.")])

1 answer

As your own tests prove, you are running out of memory. Even on CPython 3.6 (where the plain dict is in fact ordered, though not yet as a language guarantee), OrderedDict has significant memory overhead compared to dict; it is still implemented with a doubly linked list on the side to preserve order and to support easy iteration, reordering with move_to_end, etc. You can tell simply by checking with sys.getsizeof (exact results will differ with Python version and build bitness, 32- vs. 64-bit):

    >>> import sys
    >>> from collections import OrderedDict
    >>> od = OrderedDict([("a", 1), ("b", 2), ("c", 3)])
    >>> d = {**od}
    >>> sys.getsizeof(od)
    464   # On 3.5 x64 it's 512
    >>> sys.getsizeof(d)
    240   # On 3.5 x64 it's 288

Ignoring the cost of the data stored, the overhead of the OrderedDict here is nearly twice that of the plain dict. If you are making 4 million of these rows, on my machine that would add header overhead of a bit over 850 MB (on both 3.5 and 3.6).
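(Back-of-the-envelope check of that figure, a sketch using the 3.6 x64 sizes from the getsizeof output above and assuming they hold for every row:)

    # Extra header bytes per row when using OrderedDict instead of dict (3.6 x64)
    per_row_delta = 464 - 240
    rows = 4000000
    print(per_row_delta * rows / 2**20)  # ~854 MiB of added overhead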

Odds are, the combination of all the other programs on your system plus your Python program is exceeding the RAM available to your machine, and you are stuck swap-thrashing. In particular, whenever asset_hist has to expand for new entries, it likely needs to page large parts of itself back in (they got paged out for lack of use), and whenever a cyclic garbage collection run triggers (a full GC occurs roughly every 70,000 allocations and deallocations by default), all the OrderedDicts get paged back in so the collector can check whether they are still referenced outside of cycles (you could verify whether the GC runs are the main problem by disabling cyclic GC via gc.disable()).
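A minimal sketch of that experiment, wrapping the asker's own loop (only the gc calls are new):

    import collections
    import gc

    gc.disable()  # suspend cyclic GC while building the big list
    try:
        asset_hist = []
        for key_host, val_hist_list in am_output.asset_history.items():
            for index, hist_item in enumerate(val_hist_list):
                row = collections.OrderedDict([("computer_name", key_host),
                                               ("id", index),
                                               ("hist_item", hist_item)])
                asset_hist.append(row)
    finally:
        gc.enable()  # restore GC even if the loop raises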

Given your specific use case, I would strongly recommend avoiding both dict and OrderedDict. The overhead of even a dict, even the cheaper form on Python 3.6, is pretty extreme when you have a set of three fixed keys over and over again. Instead, use collections.namedtuple, which is designed for lightweight objects referenceable by either name or index (they act like regular tuples, but also allow accessing each value as a named attribute), which would massively reduce the memory cost of your program (and likely speed it up as well, even when memory is not an issue).

For instance:

    from collections import namedtuple

    ComputerInfo = namedtuple('ComputerInfo', ['computer_name', 'id', 'hist_item'])

    asset_hist = []
    for key_host, val_hist_list in am_output.asset_history.items():
        for index, hist_item in enumerate(val_hist_list):
            asset_hist.append(ComputerInfo(key_host, index, hist_item))

The only difference in usage is that you replace row['computer_name'] with row.computer_name, or, if you need all the values, you can unpack it like a regular tuple, e.g. comphost, idx, hist = row. If you temporarily need a true OrderedDict (don't store them for everything), you can call row._asdict() to get an OrderedDict with the same mapping as the namedtuple, but it is usually not needed. The memory savings are meaningful; on my system, the three-element namedtuple drops the per-item overhead to 72 bytes, less than a third of the 3.6 dict and less than a sixth of the 3.6 OrderedDict (and the three-element namedtuple stays at 72 bytes on 3.5, where dict/OrderedDict are larger pre-3.6). It may save even more than that: tuples (and namedtuple by extension) are allocated as a single contiguous C struct, while dict and company are at least two allocations (one for the object structure, one or more for the dynamically resizable parts of the structure), each of which can pay allocator round-off and alignment costs.
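A quick illustration of those access patterns, reusing the ComputerInfo type from the example above (the values are made up):

    row = ComputerInfo('bg-fd5612ea', 1, 'some history string')

    print(row.computer_name)    # attribute access replaces row['computer_name']
    comphost, idx, hist = row   # unpacks like a plain tuple
    print(row._asdict())        # OrderedDict on 3.5/3.6 (a plain dict on 3.8+)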

In any case, for your four-million-row scenario, using namedtuple would mean paying overhead (beyond the cost of the values) totaling about 275 MB, versus 915 (3.6) - 1100 (3.5) MB for dict and 1770 (3.6) - 1950 (3.5) MB for OrderedDict. When you're talking about an 8 GB system, shaving 1.5 GB off your overhead is a major improvement.
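(The same getsizeof-based arithmetic reproduces those totals; a sketch using the 3.6 x64 per-object sizes quoted earlier, with namedtuple assumed at the 72 bytes measured above:)

    rows = 4000000
    for name, per_row in [("namedtuple", 72), ("dict (3.6)", 240),
                          ("OrderedDict (3.6)", 464)]:
        print(name, per_row * rows / 2**20, "MiB")
    # prints roughly 275, 915, and 1770 MiB respectively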


Source: https://habr.com/ru/post/1270069/

