Building a more realistic random word generator?

I have seen many examples of using Markov chains to generate random words from source data, but the results often seem overly mechanical and unnatural to me. I am trying to develop something better.

I believe part of the problem is that they rely entirely on the overall statistical frequency of letter pairs and ignore the tendency of words to begin and end in certain ways. For example, if you use the top 1000 baby names as input, the letter J is relatively rare overall, but it is the second most common first letter of a name. Or, with Latin source data, endings such as -um and -us will be very common word endings, but not nearly as common if you treat all letter pairs the same.
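To make the first-letter skew concrete, here is a minimal sketch (in Python rather than PHP, and using a tiny made-up sample, not the real top-1000 list) of counting initial letters:

```python
from collections import Counter

# Illustrative sample only -- not the actual top-1000 baby-name data
names = ["James", "John", "Jacob", "Olivia", "Emma", "Joshua", "Liam", "Jayden"]

# Tally the first letter of each name; in real name data, "J" ranks
# far higher as an initial letter than its overall frequency suggests
first_letters = Counter(name[0] for name in names)
print(first_letters.most_common(3))
```

A plain pair-frequency model never sees this statistic, because it has no notion of "first position."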

So, basically, I'm trying to put together a Markov-chain-based word generator that takes into account how the words in the source data begin and end.

Conceptually this makes sense to me, but I can't figure out how to implement it in software. I'm trying to build a small PHP tool that lets you feed it source data (for example, a list of 1000 words), from which it will generate lots of random words with realistic beginnings, middles, and endings. (Unlike most Markov word generators, which rely only on the overall statistical frequency of letter pairs.)

I would also like the word lengths to be driven by the source data, if possible; that is, the length distribution of the randomly generated words should be roughly the same as the length distribution of the source data.

Any ideas would be greatly appreciated! Thanks.

1 answer

The claim that Markov chains ignore common beginnings and endings isn't really true if you treat the "space between words" as a symbol in its own right: common beginnings will have high frequencies following the word-boundary symbol, and common endings will have high frequencies preceding it. Word length also falls out more or less naturally: the average number of letters emitted before the chain transitions back to the boundary symbol should equal the average number of letters per word in the training data, although something tells me the shape of the length distribution may still be off.
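A minimal sketch of the boundary-symbol idea (in Python rather than PHP; the `START`/`END` markers and function names are my own choices, not anything from the question):

```python
import random
from collections import defaultdict

START, END = "^", "$"  # sentinel symbols standing in for the "space between words"

def train(words):
    """Count character-pair transitions, treating the word boundary as a symbol."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        chars = [START] + list(w.lower()) + [END]
        for a, b in zip(chars, chars[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, rng=None):
    """Walk the chain from START until END, so beginnings, endings,
    and word length all come from the training data."""
    rng = rng or random.Random()
    out, cur = [], START
    while True:
        followers = counts[cur]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == END:
            return "".join(out)
        out.append(nxt)
        cur = nxt

model = train(["anna", "hannah", "john", "joan"])
print(generate(model))
```

Because generation always starts from `START`, the first letter of every output word is drawn from the observed first-letter distribution, and words only terminate on transitions that were actually observed at the ends of training words.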


Source: https://habr.com/ru/post/888177/

