In Python, how can I distinguish between a human-readable word and a random string?

Examples of words:

  • ball
  • Encyklopedia
  • Tablo

Examples of random strings:

  • qxbogsac
  • jgaynj
  • rnnfdwpm

Of course, it may happen that a random string is actually a word in some language, or looks like one. But in principle a person can tell whether something looks “random” or not, essentially by checking whether it is pronounceable.

I tried to calculate the entropy to distinguish between the two, but this is far from ideal. Do you have any other ideas, algorithms that work?

However, there is one important requirement: I cannot use heavy libraries such as nltk or use dictionaries. Basically I need a simple and fast heuristic that works in most cases.
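By “entropy” I mean character-level Shannon entropy, something along these lines (a sketch; choosing a threshold is exactly the hard part):

```python
import math
from collections import Counter

def char_entropy(s):
    """Shannon entropy of the character distribution of s, in bits/char."""
    counts = Counter(s.lower())
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Short words and short random strings produce similar values, which is why this separates the two classes poorly.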

+4
5 answers

Caveat: I am not a natural-language expert.

Assuming the phenomenon you linked to (“if yuo cna raed tihs yuo msut be raelly smrat”: a word stays readable as long as its first and last letters are in place) is genuine, a simple approach would be:

  • Take an English word list (I am assuming English here)
  • Build a Python dict keyed on the first and last characters of each word:

        from collections import defaultdict

        words = defaultdict(list)
        with open("your_dict.txt") as fin:
            for word in fin:
                word = word.strip()
                words[word[0] + word[-1]].append(word)
  • Now, for any given needle, look up the candidates (remember that the key is the needle's first and last character):

        for match in words[needle[0] + needle[-1]]:
  • Check whether the letters of each candidate match the letters of your needle:

        for match in words[needle[0] + needle[-1]]:
            if sorted(match) == sorted(needle):
                print("Human-readable word")

Using difflib.get_close_matches(word, possibilities[, n][, cutoff]) would also work, but it is comparatively slower.
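The steps above can be put together into a runnable sketch (the inline WORD_LIST and the looks_like_word name are illustrative stand-ins for the contents of your_dict.txt):

```python
from collections import defaultdict

# Tiny inline word list standing in for "your_dict.txt".
WORD_LIST = ["ball", "tablo", "encyklopedia", "lab", "cat"]

# Index words by their first and last character.
words = defaultdict(list)
for w in WORD_LIST:
    words[w[0] + w[-1]].append(w)

def looks_like_word(needle):
    """True if some listed word shares the needle's first/last letters
    and has the same multiset of letters overall."""
    needle = needle.lower()
    for match in words[needle[0] + needle[-1]]:
        if sorted(match) == sorted(needle):
            return True
    return False

print(looks_like_word("ball"))      # True
print(looks_like_word("qxbogsac"))  # False
```

Note the limitation: any anagram with the same first and last letters (e.g. "blal" for "ball") also passes.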

+2

If by “random” you really mean “not pronounceable”, you are in the realm of phonotactics: the sequences of sounds a language allows. As @ChrisPosser notes in his comment on your question, these allowed sound sequences are language-specific.

This question only makes sense in a particular language.

Whatever language you choose, you may have luck with an n-gram model trained on the letters themselves (as opposed to words, which is the usual approach). You can then compute a score for a particular string and set a threshold below which a string counts as random and above which it counts as word-like.

EDIT: Someone has already done this and implemented it: fooobar.com/questions/102321/...
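A minimal sketch of the letter-level n-gram idea, using bigrams with add-one smoothing (the training list here is a toy stand-in; a real model needs a large corpus of lowercase words and a threshold tuned on labeled examples):

```python
import math
from collections import Counter

def train_bigrams(corpus_words):
    """Letter-bigram log-probabilities with add-one smoothing.
    Assumes lowercase ASCII words; '^'/'$' mark word boundaries."""
    counts = Counter()
    for w in corpus_words:
        w = "^" + w.lower() + "$"
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    alphabet = "abcdefghijklmnopqrstuvwxyz^$"
    total = sum(counts.values()) + len(alphabet) ** 2
    return {a + b: math.log((counts[a + b] + 1) / total)
            for a in alphabet for b in alphabet}

def score(word, logprobs):
    """Average bigram log-probability; higher means more word-like."""
    w = "^" + word.lower() + "$"
    bigrams = [w[i:i + 2] for i in range(len(w) - 1)]
    return sum(logprobs[b] for b in bigrams) / len(bigrams)

# Toy training set; unseen bigrams like "qx" get only the smoothing mass.
logprobs = train_bigrams(["ball", "tall", "call", "hello", "yellow"])
print(score("ball", logprobs) > score("qxzj", logprobs))  # True
```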

+1

Works well for me:

    VOWELS = "aeiou"
    PHONES = ['sh', 'ch', 'ph', 'sz', 'cz', 'sch', 'rz', 'dz']

    def isWord(word):
        if word:
            consecutiveVowels = 0
            consecutiveConsonants = 0
            for idx, letter in enumerate(word.lower()):
                vowel = letter in VOWELS
                if idx:
                    prev = word[idx - 1]
                    prevVowel = prev in VOWELS
                    # Treat 'y' after a consonant as a vowel.
                    if not vowel and letter == 'y' and not prevVowel:
                        vowel = True
                    # Reset the run counters on a vowel/consonant transition.
                    if prevVowel != vowel:
                        consecutiveVowels = 0
                        consecutiveConsonants = 0
                if vowel:
                    consecutiveVowels += 1
                else:
                    consecutiveConsonants += 1
                if consecutiveVowels >= 3 or consecutiveConsonants > 3:
                    return False
                if consecutiveConsonants == 3:
                    # Allow 3-consonant runs that contain a known digraph/trigraph.
                    subStr = word[idx - 2:idx + 1]
                    if any(phone in subStr for phone in PHONES):
                        consecutiveConsonants -= 1
                        continue
                    return False
            return True
0

Use PyDictionary. You can install it with pip:

     pip install PyDictionary

Now in the code:

    from PyDictionary import PyDictionary

    dictionary = PyDictionary()
    a = ['ball', 'asdfg']
    for item in a:
        x = dictionary.meaning(item)
        if x is None:
            print(item + ': Not a valid word')
        else:
            print(item + ': Valid')

As far as I know, PyDictionary can also be used for some languages other than English.

0

I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted from source code are class/function/variable identifiers or random gibberish. It does not use a dictionary, but it does contain a fairly large table of n-gram frequencies to support its probabilistic scoring of text strings. (I am not sure whether that qualifies as a “dictionary”.) The approach does not test pronounceability, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, it may be useful to the OP or to someone else trying to solve a similar problem.

Example: the following code,

    from nostril import nonsense

    real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
                 'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
    junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

    for s in real_test + junk_test:
        print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))

will produce the following result:

    bunchofwords: real
    getint: real
    xywinlist: real
    ioFlXFndrInfo: real
    DMEcalPreshowerDigis: real
    httpredaksikatakamiwordpresscom: real
    faiwtlwexu: nonsense
    asfgtqwafazfyiur: nonsense
    zxcvbnmlkjhgfdsaqwerty: nonsense
0
source

Source: https://habr.com/ru/post/1501450/

