What is the best method for parsing strings for multiple word combinations?

Question

I am writing a program that tries to extract meaning from natural language. The program accepts a String and checks whether it contains certain combinations of words. See the following code snippet for an example:

    if (phrase.contains("turn")) { // turn something on/off
        if (phrase.contains("on") && !phrase.contains("off")) { // turn something ON
            if (phrase.contains("pc") || phrase.contains("computer")) // turn on computer
                turnOnComputer();
            else if (phrase.contains("light") || phrase.contains("lamp")) // turn on lights
                turnOnLights();
            else
                badPhrase();
        } else if (phrase.contains("off") && !phrase.contains("on")) { // turn something OFF
            if (phrase.contains("pc") || phrase.contains("computer")) // turn off computer
                turnOffComputer();
            else if (phrase.contains("light") || phrase.contains("lamp")) // turn off lights
                turnOffLights();
            else
                badPhrase();
        } else {
            badPhrase();
        }
    } else {
        badPhrase();
    }

As you can see, this quickly becomes an unmanageable mess of code if I want to interpret more than a few values. How can I handle this better?

+4

java string parsing

BLuFeNiX May 08 '13 at 6:33

5 answers

Apache OpenNLP is a set of machine learning tools for processing natural language text.

It includes a sentence detector, a tokenizer, a part-of-speech (POS) tagger, and a parser.

Guide for NLP

Download

Hope this helps ;)

+3

Asier aranbarri May 08 '13 at 8:33

Keyword spotting is, of course, only manageable for a very small set of words and/or a very restricted input language. Well, perhaps also if the surrounding text doesn't matter.

However, for this kind of natural language parsing you need a more sophisticated approach, such as tokenizing the text and then trying to find syntactic relationships between the words (start with direct neighbors and widen the range later). Finally, use the syntactic relationships you found to drive your decisions.

Regular expressions are most likely not the answer here, as they require very strictly structured input. Consider the following sentence:

Do not turn off the light, but turn it on.

Neither regular expressions nor your original approach will give you a reasonable result for it. Also, do not forget about syntax or grammar errors in the input.
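To make the "tokenize, then look at neighbors" idea concrete, here is a minimal sketch. It is only an illustration under strong assumptions: the class name NeighborParser, the clause split on punctuation, the three-token window, and the tiny negation list are all invented for this example and are no substitute for a real NLP library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: finds "turn on/off" commands clause by clause.
final class NeighborParser {

    /** Returns one entry per clause containing a command, e.g. "NOT turn off". */
    static List<String> parse(String sentence) {
        List<String> commands = new ArrayList<>();
        // Crude clause split on punctuation (an assumption, not real grammar).
        for (String clause : sentence.toLowerCase(Locale.ROOT).split("[,.;!?]")) {
            List<String> tokens = Arrays.asList(clause.trim().split("\\s+"));
            int verb = tokens.indexOf("turn");
            if (verb < 0) {
                continue; // no "turn" in this clause
            }
            // Search up to three tokens after the verb for its particle.
            String particle = null;
            for (int i = verb + 1; i < tokens.size() && i <= verb + 3; i++) {
                String t = tokens.get(i);
                if (t.equals("on") || t.equals("off")) {
                    particle = t;
                    break;
                }
            }
            if (particle == null) {
                continue; // "turn" without on/off nearby
            }
            // Negation is only recognized directly before the verb.
            boolean negated = verb > 0
                    && (tokens.get(verb - 1).equals("not")
                        || tokens.get(verb - 1).equals("don't"));
            commands.add((negated ? "NOT " : "") + "turn " + particle);
        }
        return commands;
    }

    public static void main(String[] args) {
        // prints [NOT turn off, turn on]
        System.out.println(parse("Do not turn off the light, but turn it on."));
    }
}
```

Even this toy version separates the negated "turn off" from the positive "turn on" in the example sentence, which a flat contains() check cannot do. It still does not resolve what "it" refers to, which is exactly where a proper parser becomes necessary.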

+2

Mike lischke May 08 '13 at 7:52

Use a regex to achieve what you want, since a regular expression can match combinations of strings.
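As a rough illustration of what such a regex could look like, the sketch below combines lookaheads so that word order does not matter. The class name WordComboRegex and the exact pattern are assumptions made for this example; the pattern covers only the "turn on computer" case from the question.

```java
import java.util.regex.Pattern;

// Hypothetical example class; not taken from any library.
final class WordComboRegex {

    // Requires the whole words "turn" and "on" plus one of "pc"/"computer",
    // in any order, and rejects input that contains the whole word "off".
    static final Pattern TURN_ON_COMPUTER = Pattern.compile(
            "^(?=.*\\bturn\\b)(?=.*\\bon\\b)(?=.*\\b(?:pc|computer)\\b)(?!.*\\boff\\b).*$",
            Pattern.CASE_INSENSITIVE);

    static boolean isTurnOnComputer(String phrase) {
        return TURN_ON_COMPUTER.matcher(phrase).matches();
    }

    public static void main(String[] args) {
        System.out.println(isTurnOnComputer("Please turn on the computer"));     // true
        System.out.println(isTurnOnComputer("turn the PC off, then on"));        // false
        System.out.println(isTurnOnComputer("press the button to turn the pc")); // false
    }
}
```

The word boundaries (\b) are the main gain over String.contains: "button" contains the letters "on" but does not count as the word "on". The pattern still knows nothing about word order or negation.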

+1

Bhushan bhangale May 08 '13 at 6:35

This is the fixed code from the answer provided by @Oak

    import java.util.HashMap;
    import java.util.Map;

    class StringFinder {
        private final String phrase;
        private final Map<String, Boolean> cache = new HashMap<String, Boolean>();

        public StringFinder(String phrase) {
            this.phrase = phrase;
        }

        public StringFinder containsAll(String... strings) {
            for (String string : strings) {
                if (contains(string) == false) return new FailedStringFinder(phrase);
            }
            return this;
        }

        public StringFinder andOneOf(String... strings) {
            for (String string : strings) {
                if (contains(string)) return this;
            }
            return new FailedStringFinder(phrase);
        }

        public StringFinder andNot(String... strings) {
            for (String string : strings) {
                if (contains(string)) return new FailedStringFinder(phrase);
            }
            return this;
        }

        public boolean matches() {
            return true;
        }

        private boolean contains(String s) {
            Boolean cached = cache.get(s);
            if (cached == null) {
                cached = phrase.contains(s);
                cache.put(s, cached);
            }
            return cached;
        }
    }

    class FailedStringFinder extends StringFinder {
        public FailedStringFinder(String phrase) {
            super(phrase);
        }

        public boolean matches() {
            return false;
        }

        // The below are actually optional, but save on performance:
        public StringFinder containsAll(String... strings) { return this; }
        public StringFinder andOneOf(String... strings) { return this; }
        public StringFinder andNot(String... strings) { return this; }
    }

0

BLuFeNiX May 08 '13 at 16:36

Oak · Accepted Answer · 2013-05-08T08:26:18+0000

First of all, I'm not sure how well your approach fits natural language processing. Also, aren't there existing libraries for NLP? In NLP specifically, I know that word order and part of speech sometimes matter a great deal, and this approach is not very robust to word variations.

However, if you want to stick with your approach, one idea to make it more readable and more maintainable (see the fuller pros and cons below) looks something like this:

    StringFinder finder = new StringFinder(phrase);
    if (finder.containsAll("turn", "on").andOneOf("computer", "pc").andNot("off").matches()) {
        turnOnComputer();
        return;
    } else if (finder.containsAll("turn", "off").andOneOf("computer", "pc").andNot("on").matches()) {
        turnOffComputer();
        return;
    } else if (finder.containsAll("turn", "on").andOneOf("light", "lamp").andNot("off").matches()) {
        ...
    } else if (finder.containsAll("turn").matches()) {
        // If we reached this point
        badPhrase();
    } else if (...

With something like:

    import java.util.HashMap;
    import java.util.Map;

    class StringFinder {
        private final String phrase;
        // Remembers the result of each substring check for this phrase.
        private final Map<String, Boolean> cache = new HashMap<String, Boolean>();

        public StringFinder(String phrase) {
            this.phrase = phrase;
        }

        // Fails unless every given string occurs in the phrase.
        public StringFinder containsAll(String... strings) {
            for (String string : strings) {
                if (!contains(string)) return new FailedStringFinder(phrase);
            }
            return this;
        }

        // Fails unless at least one of the given strings occurs in the phrase.
        public StringFinder andOneOf(String... strings) {
            for (String string : strings) {
                if (contains(string)) return this;
            }
            return new FailedStringFinder(phrase);
        }

        // Fails if any of the given strings occurs in the phrase.
        public StringFinder andNot(String... strings) {
            for (String string : strings) {
                if (contains(string)) return new FailedStringFinder(phrase);
            }
            return this;
        }

        public boolean matches() {
            return true;
        }

        private boolean contains(String s) {
            Boolean cached = cache.get(s);
            if (cached == null) {
                cached = phrase.contains(s);
                cache.put(s, cached);
            }
            return cached;
        }
    }

    // Returned once any check in the chain has failed; it never matches.
    class FailedStringFinder extends StringFinder {
        public FailedStringFinder(String phrase) {
            super(phrase);
        }

        public boolean matches() {
            return false;
        }

        // The below are actually optional, but save on performance:
        public StringFinder containsAll(String... strings) { return this; }
        public StringFinder andOneOf(String... strings) { return this; }
        public StringFinder andNot(String... strings) { return this; }
    }

Disadvantages:

Repetition of checks: "turn" is checked several times.
Repetition of patterns (but see benefits below).

Benefits:

Relatively short code.
Checks are repeated but cached, so performance remains high.
Each condition sits right next to its operation, which makes the code very readable.
Non-nested conditions let you change the condition required for a particular operation without restructuring the code, which makes the code much more maintainable.
It is easy to change the order in which conditions and operations appear in order to control priorities.
The lack of nesting makes it easier to parallelize in the future.
Flexible condition checking, for example:
- You can add methods to StringFinder for recurring checks, such as public StringFinder containsOnAndNotOff() { return containsAll("on").andNot("off"); }, or for any exotic conditions you need, such as andAtLeast3Of(String... strings) {...}.
- The cache can also be extended to remember not only whether words appear, but also whether whole patterns appear.
- You can also add a final condition, andMatches(Pattern p) (with a regex pattern). In fact, you can probably model many of the other checks with a regular expression. That would also simplify caching: instead of using a string as the key, use the pattern.
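The extension ideas in this list could be sketched roughly as follows. This is an assumption-laden illustration, not Oak's code: the class PatternFinder, the generalized andAtLeast(int, String...) method, and the choice to key the cache on the regex source plus flags (because java.util.regex.Pattern does not override equals) are all invented for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical variant of StringFinder whose checks are all regex based.
final class PatternFinder {
    private final String phrase;
    // Cache shared along the chain: "regex source/flags" -> found in phrase?
    private final Map<String, Boolean> cache;
    private final boolean failed;

    PatternFinder(String phrase) {
        this(phrase, new HashMap<>(), false);
    }

    private PatternFinder(String phrase, Map<String, Boolean> cache, boolean failed) {
        this.phrase = phrase;
        this.cache = cache;
        this.failed = failed;
    }

    /** Stays successful only if the pattern occurs somewhere in the phrase. */
    PatternFinder andMatches(Pattern p) {
        return (failed || found(p)) ? this : fail();
    }

    /** Stays successful only if at least n of the given whole words occur. */
    PatternFinder andAtLeast(int n, String... words) {
        if (failed) {
            return this;
        }
        int hits = 0;
        for (String w : words) {
            if (found(Pattern.compile("\\b" + Pattern.quote(w) + "\\b"))) {
                hits++;
            }
        }
        return hits >= n ? this : fail();
    }

    boolean matches() {
        return !failed;
    }

    // A failed finder is a new object, so the original stays reusable.
    private PatternFinder fail() {
        return new PatternFinder(phrase, cache, true);
    }

    private boolean found(Pattern p) {
        String key = p.pattern() + "/" + p.flags();
        return cache.computeIfAbsent(key, k -> p.matcher(phrase).find());
    }
}
```

Usage follows the same chained style, for example new PatternFinder(phrase).andMatches(Pattern.compile("\\bturn\\b")).andAtLeast(2, "lamp", "light", "pc").matches(). Compiling the word patterns on every call is wasteful; a fuller version would probably cache the compiled patterns as well.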
