Faster string / iteration matching method?

The program I'm working on now has one part that takes a little time. Basically, I have a list of lines and one target phrase. For example, suppose the target phrase is “finished goods inventory”. Now, after filtering the word stop (of), I want to extract all the lines from a list containing one of three words: "inventory", "finished" and "product". Now I implemented this idea as follows:

String[] targetWords; // contains "inventory", "finished", and "goods" ArrayList<String> extractedStrings = new ArrayList<String>(); for (int i = 0; i < listOfWords.size(); i++) { String[] words = listOfWords.get(i).split(" "); outerloop: for (int j = 0; j < words.length; j++) { for (int k = 0; k < targetWords.length; k++) { if (words[j].equalsIgnoreCase(targetWords[k])) { extractedStrings.add(listOfWords.get(i)); break outerloop; } } } } 

The list contains more than 100 thousand words, and it takes from 4 to 0.8 seconds to complete the task for each target phrase. The fact is, I have a lot of these target phrases to process, and the seconds really add up. So I was wondering if anyone knows a more efficient way to accomplish this task? Thanks for the help in advance!

+6
source share
5 answers

You go through each of the elements from targetWords , instead of checking all the words from the target words at the same time. In addition, you split your list of words at each iteration without needing it, creating overhead.

I would suggest that you combine your targetWords into one (compiled) regular expression :

 (?xi) # turn on comments, use case insensitive matching \b # word boundary, ie start/end of string, whitespace ( # begin of group containing 'inventory' or 'finished' or 'goods' inventory|finished|goods # bar separates alternatives ) # end of group \b # word boundary 

Remember to specify spaces in the regex line twice.

 import java.util.regex.*; ... Pattern targetPattern = Pattern.compile("(?xi)\\b(inventory|finished|goods)\\b"); for (String singleString : listOfWords) { if (targetPattern.matcher(singleString).find()) { extractedStrings.add(singleString); } } 

If you are not comfortable with the speed of regular expressions, although the mechanisms of regular expressions are usually optimized for performance, you need to turn your own high-speed multi-line search. Aho-Corasick's string matching algorithm is optimized for finding multiple fixed lines in text, but of course, implementing this algorithm is a lot of effort compared to just creating a Template.

+1
source

Your list of 100 thousand words can be added (once) to the HashSet. Instead of repeating this list through your list, use wordSet.contains() - HashSet provides constant performance for this, therefore it does not depend on the size of the list.

+6
source

You can take your giant list of words and add them to the hash map, and then when your phrase appears, simply flip the words in your phrase and check against the hash map. You are currently doing a linear search, and what I suggest will reduce it to a constant time search.

The key minimizes the search. Using this method, you would effectively index your giant list of words for quick searches.

+2
source

I'm a little confused if you want the whole phrase or just single words from listOfWords. If you are trying to get a string from listOfWords, if one of your target words is in the string, this should work for you.

  String[] targetWords= new String[]{"inventory", "finished", "goods"}; List<String> listOfWords = new ArrayList<String>(); // build lookup map Map<String, ArrayList<String>> lookupMap = new HashMap<String, ArrayList<String>>(); for(String words : listOfWords) { for(String word : words.split(" ")) { if(lookupMap.get(word) == null) lookupMap.put(word, new ArrayList<String>()); lookupMap.get(word).add(words); } } // find phrases Set<String> extractedStrings = new HashSet<String>(); for(String target : targetWords) { if(lookupMap.containsKey(target)) extractedStrings.addAll(lookupMap.get(target)); } 
+1
source

I would try to implement it using ExecutorService to parallelize the search for each word. http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ExecutorService.html

For example, with a fixed thread pool size:

 Executors.newFixedThreadPool(20); 
0
source

Source: https://habr.com/ru/post/951361/


All Articles