Need a highly efficient algorithm to check whether a string contains English text

I have a lot of lines. All of them contain only letters, and the words are not separated from each other by spaces. Some of the characters form English words and others are just gibberish. A line may not contain a whole sentence.

I need to find out which lines are written in real English, meaning that the string can be built by concatenating correctly spelled English words. I know I could check words against a dictionary, but the words are not delimited, so verifying every possible combination of words may take a long time.

I am looking for an algorithm or high-performance method that checks whether a string is built from English words. Alternatively, something that gives me the probability that a line contains English text would also work.

Do you know a method or algorithm that could help me? Would something like Sphinx help?

+3
6 answers

This is called the word segmentation problem.

There is no trivial way to solve this problem. What I can suggest, based on my assumption about your level of knowledge, is to build a trie from your dictionary and, as soon as you find a possible word, assume that it is a word and continue from there.
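As a rough illustration of the trie idea, here is a minimal Python sketch. It is not exactly the greedy approach described above: instead of committing to the first word found, it records every position reachable by a chain of dictionary words (a small dynamic program), so a wrong early guess cannot cause a false negative. The names `build_trie`, `can_segment` and `END`, and the tiny word list in the usage note, are made up for the example; a real dictionary would be loaded from a word list of your choice.

```python
END = None  # sentinel key marking "a word ends here"; never equal to a character


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def can_segment(text, trie):
    """Return True if text is a concatenation of words stored in trie.

    reachable[i] is True when text[:i] can be split into dictionary words.
    Note: an empty string counts as segmentable under this definition.
    """
    n = len(text)
    reachable = [False] * (n + 1)
    reachable[0] = True
    for start in range(n):
        if not reachable[start]:
            continue
        node = trie
        # Walk the trie along text[start:], marking every word end we pass.
        for end in range(start, n):
            node = node.get(text[end])
            if node is None:
                break  # no dictionary word continues with this character
            if END in node:
                reachable[end + 1] = True
    return reachable[n]
```

For example, with `trie = build_trie(["the", "cat", "sat", "on", "mat"])`, the call `can_segment("thecatsatonthemat", trie)` returns `True` and `can_segment("thecatxyz", trie)` returns `False`. Each starting position walks the trie for at most the length of the longest dictionary word, so the cost per line is roughly the line length times that maximum word length.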

, - , , , - , .

+2

bufflegab , - , bigram, .. - ( N-). , .

+2

, , . Rabin-Karp. , . , . , , , , , .

0

Trie. Trie - . , .

0

It depends on what precision you want, how efficient it needs to be, and what kind of text you are processing.

0

Source: https://habr.com/ru/post/1708977/

