In Python, how can I distinguish between a human-readable word and a random string?

Examples of words:

  • ball
  • Encyklopedia
  • Tablo

Examples of random strings:

  • qxbogsac
  • jgaynj
  • rnnfdwpm

Of course, it may happen that a random string is actually a word in some language, or looks like one. But in principle a person can tell whether something looks “random” or not, essentially by checking whether it is pronounceable.

I tried to calculate the entropy to distinguish between the two, but this is far from ideal. Do you have any other ideas, algorithms that work?

However, there is one important requirement: I cannot use heavy libraries such as nltk or use dictionaries. Basically I need a simple and fast heuristic that works in most cases.
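By “entropy” I mean character-level Shannon entropy, something along these lines (a sketch; choosing a threshold is exactly the hard part):

```python
import math
from collections import Counter

def char_entropy(s):
    """Shannon entropy of the character distribution of s, in bits/char."""
    counts = Counter(s.lower())
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Short words and short random strings produce similar values, which is why this separates the two classes poorly.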

+4
5 answers

Caveat: I am not a natural-language expert.

Assuming the phenomenon you linked to (“if yuo cna raed tihs yuo msut be raelly smrat”: a word stays readable as long as its first and last letters are in place) is genuine, a simple approach would be:

  • Take an English word list (I am assuming English here)
  • Build a Python dict keyed on the first and last characters of each word:

        from collections import defaultdict

        words = defaultdict(list)
        with open("your_dict.txt") as fin:
            for word in fin:
                word = word.strip()
                words[word[0] + word[-1]].append(word)
  • Now, for any given needle, look up the candidates (remember that the key is the needle's first and last character):

        for match in words[needle[0] + needle[-1]]:
  • Check whether the letters of each candidate match the letters of your needle:

        for match in words[needle[0] + needle[-1]]:
            if sorted(match) == sorted(needle):
                print("Human-readable word")

Using difflib.get_close_matches(word, possibilities[, n][, cutoff]) would also work, but it is comparatively slower.
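The steps above can be put together into a runnable sketch (the inline WORD_LIST and the looks_like_word name are illustrative stand-ins for the contents of your_dict.txt):

```python
from collections import defaultdict

# Tiny inline word list standing in for "your_dict.txt".
WORD_LIST = ["ball", "tablo", "encyklopedia", "lab", "cat"]

# Index words by their first and last character.
words = defaultdict(list)
for w in WORD_LIST:
    words[w[0] + w[-1]].append(w)

def looks_like_word(needle):
    """True if some listed word shares the needle's first/last letters
    and has the same multiset of letters overall."""
    needle = needle.lower()
    for match in words[needle[0] + needle[-1]]:
        if sorted(match) == sorted(needle):
            return True
    return False

print(looks_like_word("ball"))      # True
print(looks_like_word("qxbogsac"))  # False
```

Note the limitation: any anagram with the same first and last letters (e.g. "blal" for "ball") also passes.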

+2

If by “random” you really mean “not pronounceable”, you are in the realm of phonotactics: the sequences of sounds a language allows. As @ChrisPosser notes in his comment on your question, these allowed sound sequences are language-specific.

This question only makes sense in a particular language.

Whatever language you choose, you may have luck with an n-gram model trained on the letters themselves (as opposed to words, which is the usual approach). You can then compute a score for a particular string and set a threshold below which a string counts as random and above which it counts as word-like.

EDIT: Someone has already done this and implemented it: fooobar.com/questions/102321/...
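A minimal sketch of the letter-level n-gram idea, using bigrams with add-one smoothing (the training list here is a toy stand-in; a real model needs a large corpus of lowercase words and a threshold tuned on labeled examples):

```python
import math
from collections import Counter

def train_bigrams(corpus_words):
    """Letter-bigram log-probabilities with add-one smoothing.
    Assumes lowercase ASCII words; '^'/'$' mark word boundaries."""
    counts = Counter()
    for w in corpus_words:
        w = "^" + w.lower() + "$"
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    alphabet = "abcdefghijklmnopqrstuvwxyz^$"
    total = sum(counts.values()) + len(alphabet) ** 2
    return {a + b: math.log((counts[a + b] + 1) / total)
            for a in alphabet for b in alphabet}

def score(word, logprobs):
    """Average bigram log-probability; higher means more word-like."""
    w = "^" + word.lower() + "$"
    bigrams = [w[i:i + 2] for i in range(len(w) - 1)]
    return sum(logprobs[b] for b in bigrams) / len(bigrams)

# Toy training set; unseen bigrams like "qx" get only the smoothing mass.
logprobs = train_bigrams(["ball", "tall", "call", "hello", "yellow"])
print(score("ball", logprobs) > score("qxzj", logprobs))  # True
```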

+1

Works well for me:

    VOWELS = "aeiou"
    PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']

    def isWord(word):
        if word:
            consecutiveVowels = 0
            consecutiveConsonants = 0
            for idx, letter in enumerate(word.lower()):
                vowel = letter in VOWELS
                if idx:
                    prev = word[idx - 1]
                    prevVowel = prev in VOWELS
                    # Treat 'y' after a consonant as a vowel.
                    if not vowel and letter == 'y' and not prevVowel:
                        vowel = True
                    # Reset the run counters on a vowel/consonant transition.
                    if prevVowel != vowel:
                        consecutiveVowels = 0
                        consecutiveConsonants = 0
                if vowel:
                    consecutiveVowels += 1
                else:
                    consecutiveConsonants += 1
                if consecutiveVowels >= 3 or consecutiveConsonants > 3:
                    return False
                if consecutiveConsonants == 3:
                    # Allow 3-consonant runs that contain a known digraph/trigraph.
                    subStr = word[idx - 2:idx + 1]
                    if any(phone in subStr for phone in PHONES):
                        consecutiveConsonants -= 1
                        continue
                    return False
            return True
0

Use PyDictionary. You can install it with pip:

     pip install PyDictionary

Now in the code:

    from PyDictionary import PyDictionary

    dictionary = PyDictionary()
    a = ['ball', 'asdfg']
    for item in a:
        x = dictionary.meaning(item)
        if x is None:
            print(item + ': Not a valid word')
        else:
            print(item + ': Valid')

As far as I know, PyDictionary can also be used for some languages other than English.

0

I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted from source code are class/function/variable identifiers or random gibberish. It does not use a dictionary, but it does contain a fairly large table of n-gram frequencies to support its probabilistic scoring of text strings. (I am not sure whether that qualifies as a “dictionary”.) The approach does not test pronounceability, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, it may be useful to the OP or to someone else trying to solve a similar problem.

Example: the following code,

    from nostril import nonsense

    real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
                 'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
    junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

    for s in real_test + junk_test:
        print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following result:

    bunchofwords: real
    getint: real
    xywinlist: real
    ioFlXFndrInfo: real
    DMEcalPreshowerDigis: real
    httpredaksikatakamiwordpresscom: real
    faiwtlwexu: nonsense
    asfgtqwafazfyiur: nonsense
    zxcvbnmlkjhgfdsaqwerty: nonsense
0
source

Source: https://habr.com/ru/post/1501450/

