Maintain top-k set in Java

I can't come up with a neat way to do this in Java:

I am streaming a set of lines from a file, line by line.

s1 s2 s3
s4 s5
s6 s7 s8 s9 s10
...

I load each line into a TreeSet, do some analysis, discard it, and move on to the next line. I can fit the contents of any single line in memory, but not all of them.

Now I want to maintain the five largest sets of strings encountered during the scan (without keeping anything else).

I'm thinking of a PriorityQueue with a SetSizeComparator, and add/poll once the queue reaches size 5. Has anyone got a tidier solution?

(I can't brain today. I have the dumb ...)

+3
4 answers
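  • Create a small class, LineTuple, that holds a line's set together with its size.

  • Build a min heap of LineTuples, ordered by size.

  • Insert the first k lines into the heap.

  • For each subsequent line (the (k + 1)-th onward), compare it with the heap's minimum:

    • If it is larger, remove the minimum (deletion is O( lg k )).
    • Then insert the new line. (insertion is also O( lg k ))
  • After all lines are processed, the heap holds the k largest.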

I'm not familiar with the Java specifics, but that is the general idea.
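The heap approach above might look roughly like this in Java. This is only a sketch: the class name TopKSets and the method largestK are made up for illustration, and it stores the sets directly with a size comparator instead of wrapping them in a LineTuple. It relies on java.util.PriorityQueue, which is a min heap under the supplied comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;
import java.util.TreeSet;

public class TopKSets {
    // Hypothetical helper: returns the k largest sets, biggest first.
    static <T> List<Set<T>> largestK(Iterable<? extends Set<T>> sets, int k) {
        List<Set<T>> result = new ArrayList<>();
        if (k <= 0) {
            return result;
        }
        Comparator<Set<T>> bySize = Comparator.comparingInt(Set::size);
        // Min heap: the smallest of the retained sets is always at the root.
        PriorityQueue<Set<T>> heap = new PriorityQueue<>(k, bySize);
        for (Set<T> s : sets) {
            if (heap.size() < k) {
                heap.add(s);                      // still filling the first k slots
            } else if (s.size() > heap.peek().size()) {
                heap.poll();                      // evict the current smallest, O(lg k)
                heap.add(s);                      // insert the newcomer, O(lg k)
            }                                     // otherwise the set is discarded
        }
        result.addAll(heap);
        result.sort(bySize.reversed());
        return result;
    }

    public static void main(String[] args) {
        List<Set<String>> lines = new ArrayList<>();
        lines.add(new TreeSet<>(Arrays.asList("s1", "s2", "s3")));
        lines.add(new TreeSet<>(Arrays.asList("s4", "s5")));
        lines.add(new TreeSet<>(Arrays.asList("s6", "s7", "s8", "s9", "s10")));
        // Prints the two largest sets, the five-element one first.
        System.out.println(largestK(lines, 2));
    }
}
```

Ties are resolved in favour of the set seen first, because a newcomer must be strictly larger than the heap's minimum to get in. In the streaming case, the Iterable would be replaced by the loop that reads the file line by line.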

+1

How about something like this?

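import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

// Keeps the k smallest items under cmp, sorted ascending.
// Pass a reversed comparator to keep the k largest instead.
@SuppressWarnings("unchecked")
<T> T[] topK(Iterator<? extends T> items, int k, Class<T> clazz, Comparator<? super T> cmp) {
  T[] topK = (T[]) Array.newInstance(clazz, k);
  if (k == 0) { return topK; }
  int n = 0;
  while (n < k && items.hasNext()) {
    topK[n++] = items.next();
  }
  // TODO: what is the appropriate output when there are fewer than k inputs?
  // One option, used here: truncate the array to the items actually seen.
  if (n < k) { topK = Arrays.copyOf(topK, n); }
  Arrays.sort(topK, cmp);
  while (items.hasNext()) {  // only reached when the array is full (n == k)
    T item = items.next();
    if (cmp.compare(item, topK[k - 1]) < 0) {
      // Find the insertion point, shift the tail right by one (dropping the last), insert.
      int pos = Arrays.binarySearch(topK, item, cmp);
      if (pos < 0) { pos = ~pos; }
      System.arraycopy(topK, pos, topK, pos + 1, k - (pos + 1));
      topK[pos] = item;
    }
  }
  return topK;
}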

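Each insertion is O(k) because of the array shift, although finding the position in topK is only O(log k); for small k this should be comparable to a PriorityQueue.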

+1

Selecting k random elements from a stream:

from random import randint

def rand_k(a, k):
  ret = []
  n = 0
  for e in a:
    n += 1
    if len(ret) < k:
      ret.append(e)
    else:
      if randint(1, n) <= k:
        ret[randint(0, k-1)] = e
  return ret

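Each element ends up in the result with probability k / n, where n is the number of elements. This takes O(n) time and O(k) space.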

For element i (counting from 1) with i > k, the probability of being in the final result is:

(k / i) * (1 - (k/(i+1))*(1/k)) * ... * (1 - (k/n)*(1/k))

that is, the i-th element is selected and then is not replaced by any later element. This simplifies to:

= (k / i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n)

which telescopes to:

= k / n

The case i <= k is similar.
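As a sketch, the i <= k case can be written out the same way: such an element is in the result with probability 1 after the first k steps, and only has to survive the possible replacements at steps k+1 through n. The index j below is introduced just for this derivation.

```latex
\[
P(\text{element } i \text{ kept})
= \prod_{j=k+1}^{n} \left(1 - \frac{k}{j}\cdot\frac{1}{k}\right)
= \prod_{j=k+1}^{n} \frac{j-1}{j}
= \frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{n-1}{n}
= \frac{k}{n}
\]
```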

+1

Source: https://habr.com/ru/post/1766881/

