Java: linear algorithm but non-linear performance drop, where does it come from?

I am currently hitting serious performance problems in an application I am developing that does natural language processing. Basically, given some texts, it gathers various statistics and does a bit of number crunching.

And the work is EXACTLY the same for every sentence. The algorithms used to gather the statistics do not depend on the data read so far, so the cost per sentence should stay constant.

The problem is that the processing time does not grow linearly: 1 minute for 10 thousand sentences, 1 hour for 100 thousand, and days for 1 million...

I tried everything I could, from reimplementing basic data structures to pooling objects to reuse instances. The behavior does not change: I get a non-linear increase in time that I cannot explain by hash map collisions, lock waits, or anything else! Java just becomes sluggish as the data grows, and I feel completely helpless.

If you want an example, just try this: count the number of occurrences of each word in a large file. The code is below. On 100 thousand sentences it takes 3 seconds; on 1.6 million it takes 326 seconds, a factor of 110 instead of 16. As the data grows, it only gets worse...

Here is the sample code. Note that I compare strings by reference (for performance reasons); this is valid thanks to String.intern(), which returns one unique reference per distinct string. And for the numbers above, the map is never re-hashed during the whole run.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class DataGathering
{
 SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

 private void makeCounts(String path) throws IOException
 {

  BufferedReader file_src = new BufferedReader(new FileReader(path));

  String line_src;

  int n = 0;
  while (file_src.ready())
  {
   n++;

   if (n % 10000 == 0)
    System.out.print(".");

   if (n % 100000 == 0)
    System.out.println("");

   line_src = file_src.readLine();

   String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

   for (int i = 0; i < src_tokens.length; i++)
   {
    String src = src_tokens[i].intern();
    counts.bump(src);
   }

  }
  file_src.close();
 }

 public static void main(String[] args) throws IOException
 {
  String path = "some_big_file.txt";

  long timestamp = System.currentTimeMillis();

  DataGathering dg = new DataGathering();
  dg.makeCounts(path);


  long time = (System.currentTimeMillis() - timestamp) / 1000;
  System.out.println("\nElapsed time: " + time + "s.");
 }
}

public class SimpleRefCounter<K>
{
 static final double GROW_FACTOR = 2;
 static final double LOAD_FACTOR = 0.5;

 private int capacity;
 private int key_count = 0; // number of distinct keys stored
 private long total = 0;    // sum of all counts

 private Object[] keys;
 private int[] counts;


 public SimpleRefCounter()
 {
  this(1000);
 }

 public SimpleRefCounter(int capacity)
 { 
  this.capacity = capacity;
  keys = new Object[capacity];
  counts = new int[capacity];
 }



 public synchronized int increase(K key, int n)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // slot occupied by another key: linear probe to the next one
   id = (id + 1) % capacity;


  if (keys[id] == null)
  {
   key_count++;
   keys[id] = key;

   if (key_count > LOAD_FACTOR * capacity)
   {
    resize((int) (GROW_FACTOR * capacity));
   }
  }


  counts[id] += n;

  total += n;

  return counts[id];
 }



 public synchronized void resize(int capacity)
 {
  System.out.println("Resizing counters: " + this);

  this.capacity = capacity;

  Object[] new_keys = new Object[capacity];
  int[] new_counts = new int[capacity];

  for (int i = 0; i < keys.length; i++)
  {
   Object key = keys[i];

   if (key == null) // skip empty slots when re-inserting
    continue;

   int count = counts[i];

   int id = System.identityHashCode(key) % capacity;

   while (new_keys[id] != null && new_keys[id] != key) // slot occupied by another key: linear probe to the next one
    id = (id + 1) % capacity;

   new_keys[id] = key;
   new_counts[id] = count;
  }

  this.keys = new_keys;
  this.counts = new_counts;
 }


 public int bump(K key)
 {
  return increase(key, 1);
 }

 public int get(K key)
 {
  int id = System.identityHashCode(key) % capacity;

  while (keys[id] != null && keys[id] != key) // slot occupied by another key: linear probe to the next one
   id = (id + 1) % capacity;


  if (keys[id] == null)
   return 0;
  else
   return counts[id];
 }
}
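In case it helps, here is a quick standalone check of the reference-comparison assumption behind the code (the class name is just for illustration, not part of the program):

```java
public class InternDemo
{
 public static void main(String[] args)
 {
  String a = new String("word");
  String b = new String("word");

  // Distinct objects: reference comparison fails even though contents are equal.
  System.out.println(a == b);                   // false
  System.out.println(a.equals(b));              // true

  // intern() returns one canonical instance per distinct string value,
  // so reference comparison becomes valid after interning.
  System.out.println(a.intern() == b.intern()); // true
 }
}
```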

Any explanation? Ideas? Suggestions?

... and, as stated at the beginning, the point is not this toy example in particular but the more general case. The same exploding behavior occurs, for no apparent reason, in a larger and more complex program.

+3

9 answers

Instead of feeling helpless, use a profiler! It will tell you exactly where in your code all that time is being spent.

+4

This could be a memory locality effect as the working set grows, e.g. translation lookaside buffer (TLB) misses.

String.intern also tends to get slower as the pool of interned strings grows.

Finally, be careful with System.identityHashCode: the value is not guaranteed to be non-negative, and a negative value fed into your modulo would produce negative indices and ArrayIndexOutOfBoundsExceptions. Consider String.hashCode instead.

+2
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");

This line compiles a fresh Pattern (inside String.split()) on every call. Why pay that price again for every single line?

Compile the pattern once, as a constant:

final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");

and split with it instead:

String[] src_tokens = TOKEN_PATTERN.split(line_src);

Mind you, this is a constant-factor saving per line, so by itself it does not explain the non-linear growth you observe.

+2

Your get has the same linear-probing loop as increase, so lookups degrade together with inserts as the table fills up and clusters form.

My suggestion: throw the custom structure away and just use a HashMap, or wrap a HashMap if you want the counting interface. It is well tested and handles hashing and resizing for you.

+1

Could it be garbage collection? Have you tried logging GC activity with -Xloggc?
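Besides -Xloggc, you can read the collector counters from inside the process; a sketch (class name is mine) that you could run before and after the workload and compare:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats
{
 public static void main(String[] args)
 {
  // Cumulative GC activity of this JVM so far; the difference between two
  // snapshots tells you how much time the workload spent in collection.
  for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
  {
   System.out.println(gc.getName() + ": " + gc.getCollectionCount()
     + " collections, " + gc.getCollectionTime() + " ms");
  }
 }
}
```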

+1

Hard to say without more information about your setup. Have you profiled it? Which JVM and options are you running with?

0


As others have said, do not guess any further, run it under a profiler and see where the time really goes. PS: I would also be suspicious of the "String.intern()" trick; the pool of interned strings itself grows with your vocabulary, and maintaining it is not free.

0

Source: https://habr.com/ru/post/1733056/

