How to find duplicates inside a row?

I want to find out whether a comma-separated string contains only one repeated value:

test,asd,123,test
test,test,test

Here the second line contains only the word "test". I would like to identify these lines.

Since I want to iterate over 100 GB, performance matters a lot.

What is the fastest way to get a boolean that tells whether the string contains only one value, repeated?

    public static boolean stringHasOneValue(String string) {
        String value = null;
        for (String split : string.split(",")) {
            if (value == null) {
                value = split; // remember the first value
            } else if (!value.equals(split)) {
                return false; // found a different value
            }
        }
        return true;
    }
+5
3 answers

There is no need to split the string at all; in fact, no string manipulation is needed.

  • Find the first word (indexOf the comma).
  • Check that the remaining string length is an exact multiple of the word length plus one separator character (i.e. (length - foundLength) % (foundLength + 1) == 0).
  • Walk through the rest of the line, checking each part of the line against the found word. Just keep two indexes into the same string and move them along it. Make sure you also check the commas (i.e. bob,bob,bob matches, but bob,bobabob does not).
  • As assylias pointed out, there is no need to reset the pointers; just let them run along the line and compare the 1st word with the 2nd, the 2nd with the 3rd, and so on.

Example loop; you will need to set startPos so that it points to the first character after the first comma:

    for (int i = startPos; i < str.length(); i++) {
        if (str.charAt(i) != str.charAt(i - startPos)) {
            return false;
        }
    }
    return true;

Given the format in which the data arrives, you cannot do much better than this: it is a single linear scan. The length check immediately eliminates many non-matching lines, so it is a simple optimization.
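Putting the steps of this answer together, a minimal sketch might look like the following. The class name SingleValueScan and the method name hasOneValue are made up for illustration, and treating a line without any comma as a match is an assumption, not something stated in the question.

```java
final class SingleValueScan {

    // Returns true if every comma-separated field of the line equals the first one.
    static boolean hasOneValue(String line) {
        int firstComma = line.indexOf(',');
        if (firstComma < 0) {
            return true; // assumption: a line with a single value counts as a match
        }
        int startPos = firstComma + 1; // first character after the first comma

        // Length pre-check: the line plus one imaginary trailing comma must be
        // made of whole "word," blocks.
        if ((line.length() + 1) % startPos != 0) {
            return false;
        }

        // Compare every character with the one exactly one block earlier.
        // Commas are compared as well, so "bob,bobabob" is rejected.
        for (int i = startPos; i < line.length(); i++) {
            if (line.charAt(i) != line.charAt(i - startPos)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(hasOneValue("test,test,test"));    // true
        System.out.println(hasOneValue("test,asd,123,test")); // false
        System.out.println(hasOneValue("bob,bobabob"));       // false
    }
}
```

No substring or array is allocated per line, which is the point of the approach when scanning a very large input.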

+12

A split call can be expensive, especially over 100 GB of data.

Consider something like the code below (NOT tested and the index values may need some adjusting, but I think you will get the idea):

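    public static boolean stringHasOneValue(String string) {
        String separator = ",";
        int firstSeparator = string.indexOf(separator); // index of the first separator, i.e. the comma
        if (firstSeparator < 0) {
            return true; // no separator: the line holds a single value
        }
        String firstValue = string.substring(0, firstSeparator); // first value of the comma-separated string
        int lengthOfIncrement = firstValue.length() + 1; // the value plus one to account for the comma
        // The line plus one trailing comma must consist of whole "value," blocks.
        if ((string.length() + 1) % lengthOfIncrement != 0) {
            return false;
        }
        // "<=" so that an empty trailing value is checked too.
        for (int i = lengthOfIncrement; i <= string.length(); i += lengthOfIncrement) {
            String currentValue = string.substring(i, i + firstValue.length());
            // each block must follow a comma and repeat the first value
            if (string.charAt(i - 1) != ',' || !firstValue.equals(currentValue)) {
                return false;
            }
        }
        return true;
    }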

The complexity is O(n), provided that the Java substring implementation is efficient. If it is not, you can write your own substring method that takes the required number of characters from the string.
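As one possible alternative to a hand-written substring method, String.regionMatches compares a range of the string in place without creating a new String per field. The sketch below is untested against real data; the class name RegionMatchCheck is made up for illustration, and a line without a comma is assumed to count as a match.

```java
final class RegionMatchCheck {

    // Returns true if every comma-separated field of the line equals the first one.
    static boolean stringHasOneValue(String line) {
        int wordLength = line.indexOf(',');
        if (wordLength < 0) {
            return true; // assumption: a line with a single value counts as a match
        }
        int step = wordLength + 1; // the word plus its separator

        // The line plus one imaginary trailing comma must be made of whole blocks.
        if ((line.length() + 1) % step != 0) {
            return false;
        }

        // "<=" so that an empty trailing value is checked too.
        for (int i = step; i <= line.length(); i += step) {
            boolean followsComma = line.charAt(i - 1) == ',';
            // Compare line[i, i + wordLength) with line[0, wordLength) in place.
            if (!followsComma || !line.regionMatches(i, line, 0, wordLength)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(stringHasOneValue("test,test,test"));    // true
        System.out.println(stringHasOneValue("test,asd,123,test")); // false
    }
}
```

Whether this is measurably faster than substring depends on the JVM and the data, so it is worth benchmarking on a sample of the real input.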

+1

Just for fun, a one-liner:

(@Tim's answer is more efficient)

 System.out.println((new HashSet<String>(Arrays.asList("test,test,test".split(","))).size()==1)); 
0

Source: https://habr.com/ru/post/1235712/

