Algorithm for determining the probable language of a text

I am looking for a simple algorithm or an open source library (PHP) to estimate whether a text is written mostly in a specific language. I found an answer regarding Python, which probably leads in the right direction. But something that works out of the box for PHP would be ideal.

Of course, something like n-gram evaluation would not be too difficult to implement, but it also requires a reference database.

The real problem is this. I run a WordPress blog that is currently flooded with spam. The blog is in German, and almost all of the trackback spam is in English. My idea is to immediately mark as spam every trackback that appears to be English. However, I cannot use marker words, because I do not want to flag typos or quotations as spam.

My solution:

Using the answers to this question, I implemented a solution that detects German using a simple stop-word ratio. Any comment that contains a link must consist of at least 25% German stop words. That way, you can still post a comment like "cool article", which has no stop words at all, but if you include a link, you have to make the effort to write in the right language.

Unfortunately, the stop words from NLTK are incorrect. The list contains words that do not exist in German. Therefore, I used the Snowball list. Using the Perl regexp optimizer, I condensed the entire list into one regular expression and counted the stop words using preg_match_all(). The entire filter has 25 lines, a third of the Perl code needed to create the regular expression from the list. Let's see how it works in the wild.
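For readers who want to try the idea, here is a minimal sketch of the stop-word ratio filter described above. It is written in Python rather than the PHP actually used, and it rests on several assumptions: the word list is a tiny illustrative subset (not the Snowball list), the names `german_stop_word_ratio` and `is_probable_spam` are hypothetical, a "link" is taken to be anything starting with `http://` or `https://`, and links are stripped before counting words.

```python
import re

# Tiny illustrative subset of German stop words (assumption: NOT the Snowball list).
GERMAN_STOP_WORDS = [
    "der", "die", "das", "und", "ist", "nicht", "ich", "ein",
    "eine", "zu", "mit", "auf", "für", "den", "es", "sehr",
]

# One alternation with word boundaries, similar in spirit to a single condensed regex.
STOP_WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, GERMAN_STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)
WORD_RE = re.compile(r"\w+")
LINK_RE = re.compile(r"https?://\S+", re.IGNORECASE)  # assumed definition of a "link"


def german_stop_word_ratio(text):
    """Return the fraction of words in text that are German stop words."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return len(STOP_WORD_RE.findall(text)) / len(words)


def is_probable_spam(comment, threshold=0.25):
    """Flag a comment only if it contains a link and too few German stop words."""
    if not LINK_RE.search(comment):
        return False  # comments without links are always accepted
    text_without_links = LINK_RE.sub(" ", comment)
    return german_stop_word_ratio(text_without_links) < threshold
```

In PHP, the counting step maps onto `preg_match_all()`, which returns the number of matches. The 25% threshold is the one from the post; it is worth tuning against real comments.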

Thank you for your help.

2 answers

I agree with @Thomas that what you are looking for is a spam classifier, not a language detection algorithm. Nevertheless, I think this approach to language detection is as simple and accessible as you want. In principle, if you count the stop words of several languages in the document and choose the language with the highest count, you have a simple but very effective language classifier.

The best part is that you hardly need to code anything yourself, since you can use standard stop-word lists and text processing packages such as nltk. Here you have an example of how to implement it from scratch using Python and nltk.
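As a rough illustration of the counting idea (not the linked example, and deliberately without nltk), a minimal sketch could look like the following. The stop-word sets are small hand-picked samples chosen for this example, and `stop_word_counts` and `guess_language` are hypothetical names; a real filter would load full lists such as the ones nltk or Snowball provide.

```python
import re

# Small hand-picked samples (assumption); real stop-word lists are much longer.
STOP_WORDS = {
    "german": {"der", "die", "das", "und", "ist", "nicht", "ich", "ein", "mit", "auf"},
    "english": {"the", "and", "is", "not", "of", "to", "in", "a", "that", "with"},
    "french": {"le", "la", "les", "et", "est", "pas", "je", "un", "une", "avec"},
}


def stop_word_counts(text):
    """Count, per language, how many tokens of text are stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOP_WORDS.items()
    }


def guess_language(text):
    """Return the language with the most stop-word hits, or None if there are none."""
    counts = stop_word_counts(text)
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

Returning `None` when no stop word matches keeps very short comments from being assigned a language arbitrarily.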

Hope this helps.


If all you want to do is recognize English, there is a very easy hack. If you just check the letters in the message, English is one of the only languages that stays entirely within the pure ASCII range. It's hacky, but it's a decent simplification of what is otherwise a very difficult problem, I think.

My hunch about its accuracy, based on a few quick back-of-the-envelope calculations on a couple of French and German blogs, is ~85%, which is not foolproof but pretty good for how simple it is, I think.
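A minimal sketch of that ASCII check, assuming that only letters matter (digits, punctuation and whitespace are ignored) and that a message without any letters is not counted as English; `looks_english` is a hypothetical name:

```python
def looks_english(text):
    """Return True if every letter in text is in the plain ASCII range."""
    letters = [ch for ch in text if ch.isalpha()]
    # With no letters there is no evidence either way; treat as "not English".
    if not letters:
        return False
    return all(ord(ch) < 128 for ch in letters)
```

Note that a German sentence without umlauts or ß, such as "Das ist ein Test", also passes this check, which is exactly the kind of miss that keeps the heuristic from being foolproof.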


Source: https://habr.com/ru/post/1486096/

