A fast way to estimate which values occur more than a given threshold number of times? A probabilistic data structure?

I have large lists of values drawn from the range 0 to 100,000 (represented here as letters for clarity). Each input can contain several thousand elements.

[a a a a b b b b c f d b c f ... ]

I want to find the counts of the values that occur more than a threshold number of times. For example, if the threshold is 3, the answer is {a: 4, b: 5}.

The obvious way to do this is to group the values, count each group, and then filter.

This is a language-agnostic question, but here it is in Clojure (don't be put off if you don't know Clojure!):

(filter (fn [[k cnt]] (> cnt threshold)) (frequencies input))

This function runs over a very large number of inputs, and each input is very large, so the grouping and filtering is an expensive operation. I want to find some kind of guard function that returns early if the input can never produce any output above the given threshold, or that otherwise partitions the problem space. For example, the most simplistic guard: if the size of the input is less than the threshold, return nil.
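To make that concrete, the most simplistic guard might look something like this (a sketch with an illustrative name, not code from the original question):

(defn counts-over-threshold
  "Skips the expensive grouping when the input is too small: with at most
   threshold elements, no value can occur more than threshold times."
  [input threshold]
  (when (> (count input) threshold)
    (->> (frequencies input)
         (filter (fn [[_ cnt]] (> cnt threshold)))
         (into {}))))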

I am looking for a better guard function that skips the computation whenever an input cannot produce any output, or for a faster way to produce the output.

Obviously, it has to be cheaper than doing the grouping itself. One promising solution involved counting the input with a separate set of the distinct values, but in the end it was as expensive as the grouping ...

I have a feeling that probabilistic data structures might hold the key. Any ideas?

(hyperloglog)


You can do this in a single pass over the list, keeping the counts in a hash map. Something like this, in pseudocode:

hashmap <- initHashmap()
for element in list:
    if element < threshold:
        if hashmap.get(element) != null:
            hashmap.set(element, hashmap.get(element) + 1)
        else:
            hashmap.set(element, 1)

The same thing in Lisp (sorry, I don't know Clojure, but the idea carries over directly):

;; assumes the input values are in the variable LIST
(defparameter threshold 3)
(defparameter hashmap (make-hash-table))
(dotimes (i (length list))
  (let ((element (elt list i)))
    (when (< element threshold)
      (if (gethash element hashmap)
          (incf (gethash element hashmap))
          (setf (gethash element hashmap) 1)))))

Hash-map access is O(1), so the whole pass is O(n).
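For completeness, a rough Clojure version of the same single-pass idea (my sketch, not part of the original answer): build the counts in one reduce, then apply the threshold filter from the question.

(defn single-pass-counts-over
  "One pass to build the counts, then keep only those above the threshold."
  [coll threshold]
  (->> coll
       (reduce (fn [m x] (update m x (fnil inc 0))) {})
       (filter (fn [[_ cnt]] (> cnt threshold)))
       (into {})))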


Take a look at Narrator, a library designed for analyzing and aggregating streams of data.

Its query-seq function gives you the per-value counts, for example:

(require '[narrator.query :refer [query-seq query-stream]])
(require '[narrator.operators :as n])

(def my-seq [:a :a :b :b :b :b :c :a :b :c])
(query-seq (n/group-by identity n/rate) my-seq)
==> {:a 3, :b 5, :c 2}
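
Combining that with the threshold filter from the question (my addition, not part of the original answer):

(->> (query-seq (n/group-by identity n/rate) my-seq)
     (filter (fn [[_ cnt]] (> cnt 3)))
     (into {}))
==> {:b 5}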

It also has several probabilistic operators that may be closer to what you're after.

quasi-cardinality gives the approximate number of distinct values in the sequence (its cardinality), using HyperLogLog:

(query-seq (n/quasi-cardinality) my-seq)
==> 3
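
One way to turn that estimate into the guard the question asks for (my sketch, not something this answer spells out): with n total elements and roughly k distinct values, the most frequent value can occur at most n - k + 1 times, so if that bound does not exceed the threshold the grouping can be skipped. Since HyperLogLog only estimates k, the guard can occasionally misfire.

(defn worth-grouping?
  "True when some value could still occur more than threshold times,
   judging by the (approximate) number of distinct values."
  [s threshold]
  (let [total          (count s)
        distinct-count (query-seq (n/quasi-cardinality) s)]
    (> (inc (- total distinct-count)) threshold)))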

quasi-frequency-by gives approximate per-value frequencies:

(defn freq-in-seq
  "returns a function that, when given a value, returns the frequency of that value in the sequence s
   e.g. ((freq-in-seq [:a :a :b :c]) :a)  ==> 2"
  [s]
  (query-seq (n/quasi-frequency-by identity) s))

((freq-in-seq my-seq) :a) ==> 3

and quasi-distinct-by gives you the (approximately) distinct values themselves:

(query-seq (n/quasi-distinct-by identity) my-seq)
==> [:a :b :c]

Narrator can also process data incrementally with query-stream.

- , "":

(s/stream->seq 
  (->> my-seq
       (map #(hash-map :timestamp %1 :value %2) (range))
       (query-stream (n/group-by identity n/rate) 
                     {:value :value :timestamp :timestamp :period 3})))
==> ({:timestamp 3, :value {:a 2, :b 1}} {:timestamp 6, :value {:b 3}} {:timestamp 9, :value {:a 1, :b 1, :c 1}} {:timestamp 12, :value {:c 1}})

This gives the counts for each period of 3 timestamps (here, every 3 values read).

If you want a running total rather than per-period counts, you can accumulate the periodic results lazily (and stop reading the input once you have seen enough), for example:

(defn lazy-value-accum
  "Lazily turns a seq of periodic results into a seq of running totals,
   merging each period's :value map into the accumulated counts."
  ([s] (lazy-value-accum s {}))
  ([s m]
   (when-not (empty? s)
     (lazy-seq
      (let [new-map (merge-with + m (:value (first s)))]
        (cons new-map
              (lazy-value-accum (rest s) new-map)))))))


(lazy-value-accum
  (s/stream->seq 
    (->> my-seq
         (map #(hash-map :timestamp %1 :value %2) (range))
         (query-stream (n/group-by identity n/rate) 
                       {:value :value :timestamp :timestamp :period 3}))))
==> ({:a 2, :b 1} {:a 2, :b 4} {:a 3, :b 5, :c 1} {:a 3, :b 5, :c 2})

This lets you inspect the accumulated counts after each period and stop consuming the input as soon as you have the answer you need.
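
To tie this back to the question's threshold (again my sketch, not part of the original answer), you can walk that lazy accumulation and stop as soon as some running count exceeds the threshold; some short-circuits, so the rest of the stream is never read.

(defn first-counts-over
  "Returns the first running-count map in which some value exceeds threshold."
  [accumulated threshold]
  (some (fn [counts]
          (let [hits (into {} (filter (fn [[_ c]] (> c threshold)) counts))]
            (when (seq hits) hits)))
        accumulated))

With the accumulation shown above and a threshold of 3, this returns {:b 4} after reading only the first two periods.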


How about splitting the collection into chunks of n elements with partition-all, counting each chunk lazily, merging the counts, and then filtering?

(defn lazy-count-and-filter
  [coll n threshold]
  (filter #(< threshold (val %))
          (apply (partial merge-with +) 
                 (map frequencies 
                      (partition-all n coll)))))

For example:

(lazy-count-and-filter [:a :c :b :c :a :d :a] 2 1)
==> ([:a 3] [:c 2])

To speed this up on a single node, you could parallelize the work with core.async.

If you need to scale beyond a single node, look at distributed stream-processing frameworks like Storm or Onyx.

Actually, it sounds like reducers would give you the most benefit for the least amount of work. With all the options I listed, the more powerful / flexible / faster solutions require more time to understand. In order from the simplest to the most powerful, they are: reducers, core.async, Storm, Onyx.
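
As a rough illustration of the reducers option (my sketch, not code from this answer), counting in parallel with clojure.core.reducers/fold and then applying the question's threshold filter could look like:

(require '[clojure.core.reducers :as r])

(defn parallel-counts-over
  "Counts the input in parallel chunks with r/fold, then keeps only the
   values that occur more than threshold times. The input should be a
   vector (or another foldable collection) for fold to run in parallel."
  [input threshold]
  (->> input
       (r/fold (fn ([] {})
                 ([a b] (merge-with + a b)))
               (fn [counts x]
                 (update counts x (fnil inc 0))))
       (filter (fn [[_ cnt]] (> cnt threshold)))
       (into {})))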


Source: https://habr.com/ru/post/1611195/

