In general you will need to store so many sequences that you quickly run out of memory. A data structure that is better suited for storing frequency information about sequences is the count (or frequency) tree.
The count tree is just an ordinary tree augmented with frequency information in each node. Each node could in principle have one child per symbol in your alphabet, but allocating room for all of them would be rather wasteful, so instead each node keeps a simple linked list of key-value pairs, where the next symbol in the sequence is the key and the child node is the value. The root of the tree represents the empty sequence. This lets us compute P(C | A, B) as a basic MLE estimate by following the path A, B, C in the tree and dividing the frequency of A, B, C by the frequency of A, B.
(defstruct lm-tree-node
  (total 0)                 ; frequency of the sequence represented by this node
  (children (make-lash)))   ; lash mapping the next symbol -> child node

(defun add-sequence (seq lm-root &optional (count 1))
  "Add SEQ to the count tree rooted at LM-ROOT, incrementing every node along its path by COUNT."
  (let ((children (lm-tree-node-children lm-root)))
    (incf (lm-tree-node-total lm-root) count)
    (let ((child (get-or-add-lash (first seq) children (make-lm-tree-node))))
      (if (rest seq)
          (add-sequence (rest seq) child count)
          (incf (lm-tree-node-total child) count)))))
To populate the tree, take each of your sequences, start at the root, and add 1 to the frequency of every node along the sequence's path. The code above is from a project where I built a language model for a hidden Markov model decoder; it is on github if anyone wants to have a look. A "lash" table is simply a generic key-value list that can transform itself into a hash table when necessary. Since a count tree can contain a very large number of nodes, it is important that each node has a small memory footprint, so a hash table in every node is out of the question. Typically only the top nodes in the count tree have many children (hundreds or so), so it is useful if those can switch over to a hash table.
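As a sketch of how that MLE estimate could be read back out of the tree, the functions below walk the path of a sequence and divide the two totals. They assume a plain lookup function for the lash structure, here called get-lash (analogous to get-or-add-lash above, returning the child stored under a key or nil); that accessor name is an assumption of mine, not something shown in the code above.

(defun seq-total (seq lm-root)
  "Return the frequency at the node reached by following SEQ from LM-ROOT, or 0 if the path is absent."
  (if (null seq)
      (lm-tree-node-total lm-root)
      ;; get-lash is assumed to be the lookup counterpart of get-or-add-lash
      (let ((child (get-lash (first seq) (lm-tree-node-children lm-root))))
        (if child
            (seq-total (rest seq) child)
            0))))

(defun mle-probability (seq lm-root)
  "MLE estimate of P(last word of SEQ | the preceding words)."
  (let ((context-total (seq-total (butlast seq) lm-root)))
    (if (zerop context-total)
        0
        (/ (seq-total seq lm-root) context-total))))

For P(C | A, B) this divides the frequency found at the path A, B, C by the frequency found at A, B, exactly as described above.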
If you want all sequences up to n-gram order for language modeling, you have to jump through some extra hoops: break the stream of tokens into overlapping n-grams, but also make sure to add the trailing (n-1)-grams, (n-2)-grams, and so on at the end of each sentence.
(defun sentence-to-n-grams (sentence n lm-root)
  "Add all N-grams of SENTENCE (a list of word ids) to the count tree at LM-ROOT."
  (let ((buffer (make-fast-queue :size n :buffer (make-array n))))
    (loop for word fixnum in sentence
          ;; slide a window of at most N words over the sentence
          for seq = (fast-queue-to-list (fast-enqueue buffer word))
          when (>= (length seq) n)
          do (add-sequence seq lm-root)
          ;; at the end of the sentence, also add the trailing lower-order grams
          finally (loop for x on (rest seq)
                        do (add-sequence x lm-root)))))
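A minimal usage sketch, assuming the words have already been mapped to integer ids (the fixnum declaration above suggests that encoding) and using the mle-probability helper sketched earlier; the ids are made up for illustration:

(let ((lm-root (make-lm-tree-node)))
  ;; build a trigram count tree from two toy sentences
  (dolist (sentence '((1 2 3 4) (1 2 3 5)))
    (sentence-to-n-grams sentence 3 lm-root))
  ;; P(4 | 2 3) = count(2 3 4) / count(2 3) = 1/2
  (mle-probability '(2 3 4) lm-root))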