I am looking for a simple algorithm or an open source library (PHP) to determine whether a text is written mostly in a specific language. I found the following answer for Python, which probably leads in the right direction, but something that works out of the box in PHP would be ideal.
Of course, something like n-gram evaluation would not be too difficult to implement, but it also requires a reference corpus for each language.
The actual problem is this: I run a WordPress blog that is currently being flooded with spam. The blog is in German, and almost all of the trackback spam is in English. My idea is to automatically mark every trackback that appears to be English as spam. However, I cannot rely on marker words, because I do not want to flag legitimate comments just because they contain typos or quotations.
My solution:
Using the answers to this question, I implemented a solution that detects German using a simple stop word ratio. Any comment that contains a link must consist of at least 25% German stop words. That way, you can still post something like "cool article", which contains no stop words at all, but if you include a link, you have to make the effort to write in the correct language.
Unfortunately, the German stop word list from NLTK is unreliable: it contains words that do not exist in German. Therefore, I used the Snowball list instead. Using the Perl regexp optimizer, I condensed the entire list into a single regular expression and counted the stop words with preg_match_all(). The entire filter is 25 lines long, plus a short piece of Perl code that builds the regular expression from the list. Let's see how it performs in the wild.
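For illustration, the idea can be sketched roughly as follows. This is a Python sketch, not the PHP filter itself: the stop word list is a tiny hand-picked subset standing in for the full Snowball list, the names (stop_word_ratio, is_probably_spam, MIN_STOP_WORD_RATIO) are hypothetical, and treating "contains a link" as "contains an http(s) URL" is an assumption. In PHP, the findall calls would correspond to preg_match_all().

```python
import re

# Assumption: a tiny illustrative subset of German stop words.
# A real filter would load the full Snowball list instead.
GERMAN_STOP_WORDS = [
    "aber", "auch", "das", "der", "ein", "eine", "für",
    "ich", "ist", "mit", "nicht", "sehr", "und",
]

# One alternation over all stop words, anchored on word boundaries.
STOP_WORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, GERMAN_STOP_WORDS)) + r")\b",
    re.IGNORECASE,
)
WORD_RE = re.compile(r"[^\W\d_]+")    # runs of letters, including umlauts
URL_RE = re.compile(r"https?://\S+")  # assumption: a "link" is an http(s) URL

MIN_STOP_WORD_RATIO = 0.25  # the 25% threshold described above


def stop_word_ratio(text):
    """Return the share of words in `text` that are German stop words."""
    text = URL_RE.sub(" ", text)  # do not count URL fragments as words
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return len(STOP_WORD_RE.findall(text)) / len(words)


def is_probably_spam(comment):
    """Flag a comment only if it has a link and too few German stop words."""
    if not URL_RE.search(comment):
        return False  # link-free comments such as "cool article" pass
    return stop_word_ratio(comment) < MIN_STOP_WORD_RATIO
```

With this subset, a link-free "Cool article" passes, while an English comment that includes a URL is flagged because none of its words are in the list.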
Thank you for your help.