I developed a Python 3 package called Nostril for a problem closely related to what the OP asked: deciding whether text strings extracted from source code are class/function/variable/etc. identifiers or random gibberish. It does not use a dictionary, but it does incorporate a fairly large table of n-gram frequencies to support its probabilistic assessment of text strings. (I'm not sure if that qualifies as a "dictionary.") The approach does not check pronounceability, and its specialization may make it unsuitable for general word/nonword detection; nevertheless, it may be useful either for the OP or for someone else looking to solve a similar problem.
Example: the following code,
    from nostril import nonsense

    real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
                 'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
    junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']

    for s in real_test + junk_test:
        print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following result:
    bunchofwords: real
    getint: real
    xywinlist: real
    ioFlXFndrInfo: real
    DMEcalPreshowerDigis: real
    httpredaksikatakamiwordpresscom: real
    faiwtlwexu: nonsense
    asfgtqwafazfyiur: nonsense
    zxcvbnmlkjhgfdsaqwerty: nonsense
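For readers curious about what "a table of n-gram frequencies" buys you, here is a toy sketch of the general idea. It is not Nostril's actual algorithm or data: the tiny `CORPUS`, the choice of bigrams, the add-one smoothing, and the names `bigrams` and `avg_log_prob` are all assumptions made up for illustration. It only shows that strings built from commonly seen letter pairs score higher than strings built from unseen ones.

```python
import math
from collections import Counter

# Hypothetical miniature training corpus; a real frequency table would be
# built from a far larger body of text.
CORPUS = ["get", "set", "list", "info", "string", "integer", "window",
          "words", "bunch", "print", "format", "result", "number", "index",
          "count", "value", "file", "name", "text", "find"]


def bigrams(s):
    """Return the overlapping two-letter sequences of s, lowercased."""
    s = s.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


# Frequency table: how often each bigram occurs in the corpus.
COUNTS = Counter(bg for word in CORPUS for bg in bigrams(word))
TOTAL = sum(COUNTS.values())
VOCAB = 26 * 26  # number of possible bigrams, assuming letters a-z only


def avg_log_prob(s):
    """Average log-probability of the bigrams in s, with add-one smoothing.

    Higher (less negative) values mean s looks more like the corpus.
    """
    grams = bigrams(s)
    if not grams:
        raise ValueError("need at least two characters")
    logs = [math.log((COUNTS[g] + 1) / (TOTAL + VOCAB)) for g in grams]
    return sum(logs) / len(logs)


for s in ["getlist", "zxqvkj"]:
    print("{}: {:.3f}".format(s, avg_log_prob(s)))
```

With this corpus, `getlist` scores higher than `zxqvkj`, because most of its bigrams appear in the table while none of the bigrams in `zxqvkj` do. Turning such a score into a real/nonsense decision requires choosing a threshold, and that takes a much larger table and some tuning.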