Memory usage of a probabilistic parser

I am writing a CKY parser for a Range Concatenation Grammar. I want to use a treebank as the grammar, so the grammar will be large. I wrote a prototype in Python and it seems to work well when I emulate a treebank of a few dozen sentences, but the memory usage is unacceptable. I tried writing it in C++, but so far that has been very frustrating, as I have never used C++ before. Here is some data (n is the number of sentences the grammar is based on):

    n    mem
    9    173M
    18   486M
    36   836M

This growth pattern is to be expected given the best-first algorithm, but the amount of overhead is what concerns me. Memory usage according to heapy is a factor of ten less than these numbers, and valgrind reported something similar. What causes this discrepancy, and is there anything I can do about it in Python (or Cython)? Perhaps it is due to fragmentation? Or maybe it is the overhead of Python dictionaries?
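As a rough way to compare what heapy reports against the per-object cost of the data structures, here is a minimal sketch (the deep_sizeof helper is illustrative, not part of the parser) that sums sys.getsizeof over a nested edge tuple like the ones described below:

    import sys

    def deep_sizeof(obj, seen=None):
        """Rough recursive size in bytes of nested tuples and their elements."""
        if seen is None:
            seen = set()
        if id(obj) in seen:        # count shared objects only once
            return 0
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, tuple):
            size += sum(deep_sizeof(item, seen) for item in obj)
        return size

    # one edge in the tuple representation described below
    edge = (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001))
    print deep_sizeof(edge)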

Some background: two important data structures are the agenda, which maps edges to probabilities, and the chart, which is a dictionary mapping nonterminals and positions to edges. The agenda is implemented with a heapdict (which internally uses a dict and heapq), the chart with a dictionary mapping nonterminals and positions to edges. The agenda gets frequent insertions and deletions, while the chart only receives insertions and lookups. I represent edges with tuples as follows:

 (("S", 111), ("NP", 010), ("VP", 100, 001)) 

The strings are nonterminal labels from the grammar; the positions are encoded as bitmasks. When a constituent is discontinuous, it has multiple positions. So this edge could stand for an analysis in which the words corresponding to "eat" and "happy" both belong to the VP, with the NP "Mary" in between. The chart dictionary is indexed by the first element of this edge, ("S", 111) in this case. In a new version I tried transposing this representation, in the hope that it would save memory through reuse:

 (("S", "NP", "VP), (111, 100, 011)) 

I expected that Python would store the first part (the label tuple) only once when it occurs in combination with different positions, although I am not actually sure that is true. In any case, it did not seem to make a difference.
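To make the representation concrete, here is a small sketch (helper and variable names are mine, not the parser's) of decoding the bitmasks and of how the agenda and chart described above might be used, assuming the heapdict package:

    from heapdict import heapdict

    def positions(vec):
        """Decode a position bitmask into the list of covered word indices."""
        covered, i = [], 0
        while vec:
            if vec & 1:
                covered.append(i)
            vec >>= 1
            i += 1
        return covered

    print positions(0b100 | 0b001)   # -> [0, 2]: a discontinuous VP

    # agenda: frequent inserts, decrease-keys and deletions; heapdict pops the
    # item with the smallest priority, so e.g. negative log probabilities work
    agenda = heapdict()
    agenda[("S", 0b111)] = 2.3
    agenda[("S", 0b111)] = 1.7       # decrease-key
    item, priority = agenda.popitem()

    # chart: inserts and lookups only, keyed by (nonterminal, bitmask)
    chart = {}
    chart.setdefault(("S", 0b111), []).append(
        (("S", 0b111), ("NP", 0b010), ("VP", 0b100, 0b001)))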

So basically I am wondering whether it is worth pursuing my Python implementation any further, including doing things with Cython and different data structures, or whether writing it from scratch in C++ is the only viable option.

UPDATE: after some improvements I no longer have problems with memory usage. I am working on an optimized Cython version. I will award the bounty to the most useful suggestion for improving the code's performance. There is an annotated version at http://student.science.uva.nl/~acranenb/plcfrs_cython.html

1 https://github.com/andreasvc/disco-dop/ - run test.py to parse a few sentences. Requires Python 2.6, NLTK and heapdict

3 answers

I expected that Python would store the first part only once when it occurs in combination with different positions

Not necessarily:

 >>> ("S", "NP", "VP") is ("S", "NP", "VP") False 

You might want to intern all the strings that refer to non-terminals, since you seem to be creating a lot of them in rcgrules.py. If you want to intern a tuple, turn it into a string first:

 >>> intern("S NP VP") is intern(' '.join('S', 'NP', 'VP')) True 

Otherwise, you would have to "copy" the tuples (reuse an existing equal tuple) rather than constructing them anew.
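A minimal sketch of both ideas, interning the label strings and "copying" tuples through a cache; the uniq helper is hypothetical, not part of the code being discussed (in Python 2, intern is a builtin):

    # intern every nonterminal label once, e.g. while reading the grammar
    label = intern("NP")      # later intern("NP") calls return the same object

    # hypothetical cache that makes structurally equal tuples share one object
    _tuple_cache = {}
    def uniq(tup):
        """Return a canonical instance of tup, so repeated tuples are stored once."""
        return _tuple_cache.setdefault(tup, tup)

    a = uniq(("S", "NP", "VP"))
    b = uniq(("S", "NP", "VP"))
    assert a is b             # one tuple object instead of two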

(If you are new to C++, rewriting such an algorithm in it is unlikely to bring much memory benefit. You would first have to evaluate various hash table implementations and learn about the copying behaviour of their containers. I have found boost::unordered_map to be quite wasteful with lots of small hash tables.)


Have you tried running the application with PyPy rather than CPython?

PyPy is a lot cleverer than CPython about noticing commonalities and avoiding the memory overhead of duplicating things unnecessarily.

Anyway, it's worth a try: http://pypy.org/


The first thing to do in these cases is always to profile:

         ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      15147/297    0.032    0.000    0.041    0.000  tree.py:102(__eq__)
      15400/200    0.031    0.000    0.106    0.001  tree.py:399(convert)
              1    0.023    0.023    0.129    0.129  plcfrs_cython.pyx:52(parse)
      6701/1143    0.022    0.000    0.043    0.000  heapdict.py:45(_min_heapify)
          18212    0.017    0.000    0.023    0.000  plcfrs_cython.pyx:38(__richcmp__)
    10975/10875    0.017    0.000    0.035    0.000  tree.py:75(__init__)
           5772    0.016    0.000    0.050    0.000  tree.py:665(__init__)
            960    0.016    0.000    0.025    0.000  plcfrs_cython.pyx:118(deduced_from)
          46938    0.014    0.000    0.014    0.000  tree.py:708(_get_node)
     25220/2190    0.014    0.000    0.016    0.000  tree.py:231(subtrees)
          10975    0.013    0.000    0.023    0.000  tree.py:60(__new__)
          49441    0.013    0.000    0.013    0.000  {isinstance}
          16748    0.008    0.000    0.015    0.000  {hasattr}
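For reference, output like the above can be produced with the standard cProfile and pstats modules, sorted by internal time (a sketch; the main() entry point stands in for whatever test.py actually calls):

    import cProfile
    import pstats

    # profile a single parsing run and list the most expensive functions first
    cProfile.run('main()', 'parse.prof')
    pstats.Stats('parse.prof').sort_stats('time').print_stats(15)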

The first thing I notice is that very few functions are from the Cython module itself. Most of them come from the tree.py module, and that may well be the bottleneck.

Focusing on the Cython side, I see the __richcmp__ function:

We can optimize it simply by adding type declarations to the method signature:

    def __richcmp__(ChartItem self, ChartItem other, int op):
        ....

This lowers the numbers:

    ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
    ....
     18212    0.011    0.000    0.015    0.000  plcfrs_cython.pyx:38(__richcmp__)

Using the elif syntax instead of a chain of single ifs also enables Cython's switch optimization:

    if op == 0:
        return self.label < other.label or self.vec < other.vec
    elif op == 1:
        return self.label <= other.label or self.vec <= other.vec
    elif op == 2:
        return self.label == other.label and self.vec == other.vec
    elif op == 3:
        return self.label != other.label or self.vec != other.vec
    elif op == 4:
        return self.label > other.label or self.vec > other.vec
    elif op == 5:
        return self.label >= other.label or self.vec >= other.vec

obtaining:

 17963 0.002 0.000 0.002 0.000 plcfrs_cython.pyx:38(__richcmp__) 

Trying to figure out where the tree.py:399 convert time comes from, I found that this function inside dopg.py takes all of it:

    def removeids(tree):
        """ remove unique IDs introduced by the Goodman reduction """
        result = Tree.convert(tree)
        for a in result.subtrees(lambda t: '@' in t.node):
            a.node = a.node.rsplit('@', 1)[0]
        if isinstance(tree, ImmutableTree):
            return result.freeze()
        return result

Now I am not sure whether every node in the tree is a ChartItem, or whether the getitem value is used elsewhere, but adding these changes:

    cdef class ChartItem:
        cdef public str label
        cdef public str root
        cdef public long vec
        cdef int _hash
        __slots__ = ("label", "vec", "_hash")

        def __init__(ChartItem self, label, int vec):
            self.label = intern(label) #.rsplit('@', 1)[0])
            self.root = intern(label.rsplit('@', 1)[0])
            self.vec = vec
            self._hash = hash((self.label, self.vec))

        def __hash__(self):
            return self._hash

        def __richcmp__(ChartItem self, ChartItem other, int op):
            if op == 0:
                return self.label < other.label or self.vec < other.vec
            elif op == 1:
                return self.label <= other.label or self.vec <= other.vec
            elif op == 2:
                return self.label == other.label and self.vec == other.vec
            elif op == 3:
                return self.label != other.label or self.vec != other.vec
            elif op == 4:
                return self.label > other.label or self.vec > other.vec
            elif op == 5:
                return self.label >= other.label or self.vec >= other.vec

        def __getitem__(ChartItem self, int n):
            if n == 0:
                return self.root
            elif n == 1:
                return self.vec

        def __repr__(self):
            # would need bitlen for proper padding
            return "%s[%s]" % (self.label, bin(self.vec)[2:][::-1])

and inside mostprobableparse itself:

    from libc.math cimport pow

    def mostprobableparse(...):
        ...
        cdef dict parsetrees = <dict>defaultdict(float)
        cdef float prob
        m = 0
        for n, (a, prob) in enumerate(derivations):
            parsetrees[a] += pow(e, prob)
            m += 1

I get:

    189345 function calls (173785 primitive calls) in 0.162 seconds

    Ordered by: internal time

         ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
      6701/1143    0.025    0.000    0.037    0.000  heapdict.py:45(_min_heapify)
              1    0.023    0.023    0.120    0.120  plcfrs_cython.pyx:54(parse)
            960    0.018    0.000    0.030    0.000  plcfrs_cython.pyx:122(deduced_from)
       5190/198    0.011    0.000    0.015    0.000  tree.py:102(__eq__)
           6619    0.006    0.000    0.006    0.000  heapdict.py:67(_swap)
           9678    0.006    0.000    0.008    0.000  plcfrs_cython.pyx:137(concat)

So the next steps are to optimize heapify and deduced_from.

deduced_from can be optimized a bit more:

    cdef inline deduced_from(ChartItem Ih, double x, pyCx, pyunary, pylbinary, pyrbinary, int bitlen):
        cdef str I = Ih.label
        cdef int Ir = Ih.vec
        cdef list result = []
        cdef dict Cx = <dict>pyCx
        cdef dict unary = <dict>pyunary
        cdef dict lbinary = <dict>pylbinary
        cdef dict rbinary = <dict>pyrbinary
        cdef ChartItem Ilh
        cdef double z
        cdef double y
        cdef ChartItem I1h
        for rule, z in unary[I]:
            result.append((ChartItem(rule[0][0], Ir), ((x+z, z), (Ih,))))
        for rule, z in lbinary[I]:
            for I1h, y in Cx[rule[0][2]].items():
                if concat(rule[1], Ir, I1h.vec, bitlen):
                    result.append((ChartItem(rule[0][0], Ir ^ I1h.vec), ((x+y+z, z), (Ih, I1h))))
        for rule, z in rbinary[I]:
            for I1h, y in Cx[rule[0][1]].items():
                if concat(rule[1], I1h.vec, Ir, bitlen):
                    result.append((ChartItem(rule[0][0], I1h.vec ^ Ir), ((x+y+z, z), (I1h, Ih))))
        return result

I will stop here, although I am confident that we could keep optimizing as we gain more insight into the problem.

A series of unit tests would be useful to assert that each optimization does not introduce any subtle errors.
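For example, a minimal hypothetical test pinning down the ChartItem comparison semantics before and after the __richcmp__ changes might look like this (the import path is an assumption):

    import unittest
    from plcfrs_cython import ChartItem   # assumes the compiled module is importable

    class TestChartItem(unittest.TestCase):
        def test_equality_and_hash(self):
            a = ChartItem("NP", 0b010)
            b = ChartItem("NP", 0b010)
            c = ChartItem("VP", 0b101)
            self.assertEqual(a, b)              # same label and bit vector
            self.assertEqual(hash(a), hash(b))  # equal items must hash equally
            self.assertNotEqual(a, c)

    if __name__ == '__main__':
        unittest.main()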

As a side note, try to use spaces instead of tabs.


Source: https://habr.com/ru/post/1344617/

