In general you will need to store so many sequences that you quickly run out of memory. A data structure that is better suited for storing frequency information about sequences is the count (or frequency) tree.
The count tree is just an ordinary tree augmented with frequency information in each node. Each node could in principle have one child per symbol in your alphabet, but allocating room for all of them would be rather wasteful, so instead each node keeps a simple linked list of key-value pairs, where the next symbol in the sequence is the key and the child node is the value. The root of the tree represents the empty sequence. This lets us compute P(C | A, B) as a basic MLE estimate by following the path A, B, C in the tree and dividing the frequency of A, B, C by the frequency of A, B.
(defstruct lm-tree-node
  (total 0)                 ; frequency of the sequence represented by this node
  (children (make-lash)))   ; lash mapping the next symbol -> child node

(defun add-sequence (seq lm-root &optional (count 1))
  "Add SEQ to the count tree rooted at LM-ROOT, incrementing every node along its path by COUNT."
  (let ((children (lm-tree-node-children lm-root)))
    (incf (lm-tree-node-total lm-root) count)
    (let ((child (get-or-add-lash (first seq) children (make-lm-tree-node))))
      (if (rest seq)
          (add-sequence (rest seq) child count)
          (incf (lm-tree-node-total child) count)))))
To populate the tree, take each of your sequences, start at the root, and add 1 to the frequency of every node along the sequence's path. The code above is from a project where I built a language model for a hidden Markov model decoder; it is on github if anyone wants to have a look. A "lash" table is simply a generic key-value list that can transform itself into a hash table when necessary. Since a count tree can contain a very large number of nodes, it is important that each node has a small memory footprint, so a hash table in every node is out of the question. Typically only the top nodes in the count tree have many children (hundreds or so), so it is useful if those can switch over to a hash table.
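As a sketch of how that MLE estimate could be read back out of the tree, the functions below walk the path of a sequence and divide the two totals. They assume a plain lookup function for the lash structure, here called get-lash (analogous to get-or-add-lash above, returning the child stored under a key or nil); that accessor name is an assumption of mine, not something shown in the code above.

(defun seq-total (seq lm-root)
  "Return the frequency at the node reached by following SEQ from LM-ROOT, or 0 if the path is absent."
  (if (null seq)
      (lm-tree-node-total lm-root)
      ;; get-lash is assumed to be the lookup counterpart of get-or-add-lash
      (let ((child (get-lash (first seq) (lm-tree-node-children lm-root))))
        (if child
            (seq-total (rest seq) child)
            0))))

(defun mle-probability (seq lm-root)
  "MLE estimate of P(last word of SEQ | the preceding words)."
  (let ((context-total (seq-total (butlast seq) lm-root)))
    (if (zerop context-total)
        0
        (/ (seq-total seq lm-root) context-total))))

For P(C | A, B) this divides the frequency found at the path A, B, C by the frequency found at A, B, exactly as described above.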
If you want all sequences up to n-gram order for language modeling, you have to jump through some extra hoops: break the stream of tokens into overlapping n-grams, but also make sure to add the trailing (n-1)-grams, (n-2)-grams, and so on at the end of each sentence.
(defun sentence-to-n-grams (sentence n lm-root)
  "Add all N-grams of SENTENCE (a list of word ids) to the count tree at LM-ROOT."
  (let ((buffer (make-fast-queue :size n :buffer (make-array n))))
    (loop for word fixnum in sentence
          ;; slide a window of at most N words over the sentence
          for seq = (fast-queue-to-list (fast-enqueue buffer word))
          when (>= (length seq) n)
          do (add-sequence seq lm-root)
          ;; at the end of the sentence, also add the trailing lower-order grams
          finally (loop for x on (rest seq)
                        do (add-sequence x lm-root)))))
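A minimal usage sketch, assuming the words have already been mapped to integer ids (the fixnum declaration above suggests that encoding) and using the mle-probability helper sketched earlier; the ids are made up for illustration:

(let ((lm-root (make-lm-tree-node)))
  ;; build a trigram count tree from two toy sentences
  (dolist (sentence '((1 2 3 4) (1 2 3 5)))
    (sentence-to-n-grams sentence 3 lm-root))
  ;; P(4 | 2 3) = count(2 3 4) / count(2 3) = 1/2
  (mle-probability '(2 3 4) lm-root))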