How to sort a list of strings and find the 1000 most common values ​​in java

In java (either using external libraries or not) I need to take a list of approximately 500,000 values ​​and find the most common (mode) 1000. Do everything possible to minimize complexity.

What I have tried so far is to make a hash, but I can’t, because it must be the opposite of key = count value = string, otherwise, when I get the top 1000, my complexity will be garbage. and the return path doesn’t actually work perfectly, because I will have terrible difficulty to insert when I search where my line should delete it and paste into it above ...

I tried using a binary search tree, but this had the same problem as the sorting data, either by account or by line. If it is on the line, then getting the score for the top 1000 is bad, and vice versa the insert is bad.

I could sort the list first (by line) and then iterate over the list and save the score until it changes the lines. but what data structure should I use to track the top 1000?

thanks

+5
source share
5 answers

First I created a Map<String, Long> to store the frequency of each word. Then I sort this card by value in descending order, and finally I would save the first 1000 records.

In code:

 List<String> top1000Words = listOfWords.stream() .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())) .entrySet().stream() .sorted(Map.Entry.comparingByValue().reversed()) .limit(1000) .map(Map.Entry::getKey) .collect(Collectors.toList()); 

You can find it cleaner to separate the above into 2 steps: first collect on a frequency map, and then sort your entries by value and save the first 1000 entries.

+9
source

I would divide it into three stages:

  • Enter the number of words (for example, using HashMap<String, Integer> )
  • Sorting results (for example, converting a map to a list of records and sorting in descending order)
  • Print the first 1000 records of sorted results

Sorting will be slow if the counts are small (for example, if you really got 500,000 individual words), but if you expect a lot of repeated words, everything should be fine.

+3
source

I had an open-ended question for a few days, and I decided to rebel against Federico's elegant Java answer and present a minimal answer to Java 8.

The following code uses a helper class that associates a count with a string.

 public class TopOccurringValues { static HashMap<String, StringCount> stringCounts = new HashMap<>(); // set low for demo. Change to 1000 (or whatever) static final int TOP_NUMBER_TO_COLLECT = 10; public static void main(String[] args) { // load your strings in here List<String> strings = loadStrings(); // tally up string occurrences for (String string: strings) { StringCount stringCount = stringCounts.get(string); if (stringCount == null) { stringCount = new StringCount(string); } stringCount.increment(); stringCounts.put(string, stringCount); } // sort which have most ArrayList<StringCount> sortedCounts = new ArrayList<>(stringCounts.values()); Collections.sort(sortedCounts); // collect the top occurring strings ArrayList<String> topCollection = new ArrayList<>(); int upperBound = Math.min(TOP_NUMBER_TO_COLLECT, sortedCounts.size()); System.out.println("string\tcount"); for (int i = 0; i < upperBound; i++) { StringCount stringCount = sortedCounts.get(i); topCollection.add(stringCount.string); System.out.println(stringCount.string + "\t" + stringCount.count); } } // in this demo, strings are randomly generated numbers. private static List<String> loadStrings() { Random random = new Random(1); ArrayList<String> randomStrings = new ArrayList<>(); for (int i = 0; i < 5000000; i++) { randomStrings.add(String.valueOf(Math.round(random.nextGaussian() * 1000))); } return randomStrings; } static class StringCount implements Comparable<StringCount> { int count = 0; String string; StringCount(String string) {this.string = string;} void increment() {count++;} @Override public int compareTo(StringCount o) {return o.count - count;} } } 

55 lines of code! It looks like golf is the other way around. The String generator creates 5 million lines instead of 500,000, because: why not?

 string count -89 2108 70 2107 77 2085 -4 2077 36 2077 65 2072 -154 2067 -172 2064 194 2063 -143 2062 

Randomly generated strings can have values ​​between -999 and 999, but since we get Gaussian values, we will see numbers with higher ratings that are closer to 0.

+1
source

You can do this using the java thread API:

 List<String> input = Arrays.asList(new String[]{"aa", "bb", "cc", "bb", "bb", "aa"}); // First we compute a map of word -> occurrences final Map<String, Long> collect = input.stream() .collect(Collectors.groupingBy(Function.identity(), Collectors.counting())); // Here we sort the map and collect the first 1000 entries final List<Map.Entry<String, Long>> entries = new ArrayList<>(collect.entrySet()); final List<Map.Entry<String, Long>> result = entries.stream() .sorted(Comparator.comparing(Map.Entry::getValue, Comparator.reverseOrder())) .limit(1000) .collect(Collectors.toList()); result.forEach(System.out::println); 
0
source

The solution I chose was to first create a hash map with pairs of key values. I got the score by sorting through a linked list and inserting a pair of key values. Before inserting, I would check the existence and increase the score. This part was pretty straight forward.

In the next part, where I needed to sort it according to its value, I used the guava library published by google, and it was very easy to sort by value instead of a key, using what they called a multimap. where they, in a sense, reverse the hash and allow you to display multiple values ​​on one key, so that I can have all of my first 1000, the opposite of some of the solutions mentioned above, which would not allow this, and would make me just get one value for each key.

The last step was to iterate over the multimap (back) to get the 1000 most common cases.

Look at the function code if you are interested.

 private static void FindNMostFrequentOccurences(ArrayList profileName,int n) { HashMap<String, Integer> hmap = new HashMap<String, Integer>(); //iterate through our data for(int i = 0; i< profileName.size(); i++){ String current_id = profileName.get(i).toString(); if(hmap.get(current_id) == null){ hmap.put(current_id, 1); } else { int current_count = hmap.get(current_id); current_count += 1; hmap.put(current_id, current_count); } } ListMultimap<Integer, String> multimap = ArrayListMultimap.create(); hmap.entrySet().forEach(entry -> { multimap.put(entry.getValue(), entry.getKey()); }); for (int i = 0; i < n; i++){ if (!multimap.isEmpty()){ int lastKey = Iterables.getLast(multimap.keys()); String lastValue = Iterables.getLast(multimap.values()); multimap.remove(lastKey, lastValue); System.out.println(i+1+": "+lastValue+", Occurences: "+lastKey); } } } 
0
source

Source: https://habr.com/ru/post/1269989/


All Articles