I have seen many examples of using Markov chains to generate random words from source data, but the results often seem a bit too mechanical and artificial to me. I am trying to develop a better one.
I believe part of the problem is that they rely entirely on the overall statistical frequency of letter pairs and ignore the tendency of words to start and end in particular ways. For example, if you use the top 1000 baby names as input, the letter J is relatively rare overall, but it is the second most common letter for names to begin with. Or, if you use Latin source data, -um and -us are very common word endings, but they would not come out nearly as often if all pairs are treated the same regardless of position.
So, I'm basically trying to put together a Markov-chain-based word generator that takes into account how words begin and end in the source data.
Conceptually this makes sense to me, but I can't figure out how to implement it in software. I am trying to build a small PHP tool that lets you dump in source data (for example, a list of 1000 words), from which it will generate a lot of random words with realistic beginnings, middles, and endings. (Unlike most Markov word generators, which are based only on the overall statistical frequency of letter pairs.)
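To make the idea concrete, here is a rough sketch of what I have in mind, written in Python only because it is compact; the real tool would be PHP. It assumes the word boundaries can be treated as ordinary symbols in the chain, so that "starts a word" and "ends a word" get their own statistics. The marker characters `^` and `$`, the function names `build_model` and `generate`, and the `order` and `max_len` parameters are all my own placeholders, not part of any existing library.

```python
import random
from collections import defaultdict

# Hypothetical boundary markers; they must not occur in the source words.
START, END = "^", "$"

def build_model(words, order=2):
    """Count which symbol follows each context of `order` symbols.

    Every word is padded with START markers and closed with END, so the
    counts record how words begin and end, not just which letters pair up.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        if not word:
            continue  # skip empty entries in the source list
        symbols = [START] * order + list(word.lower()) + [END]
        for i in range(order, len(symbols)):
            context = tuple(symbols[i - order:i])
            counts[context][symbols[i]] += 1
    return counts

def generate(model, order=2, rng=random, max_len=20):
    """Walk the chain from the start context until END is drawn.

    `order` must match the value used in build_model. `max_len` is only a
    safety cap on the number of letters.
    """
    context = (START,) * order
    letters = []
    while len(letters) < max_len:
        choices = model[context]
        nxt = rng.choices(list(choices), weights=list(choices.values()))[0]
        if nxt == END:
            break
        letters.append(nxt)
        context = context[1:] + (nxt,)
    return "".join(letters)

model = build_model(["julius", "marcus", "forum"])
print(generate(model))
```

With `order=2` the ending is chosen based on the last two letters, which is what should make endings like -us and -um come out as often as they do in the source. With `order=1` only the final letter would influence where a word stops. I am not sure this is the best way to do it, which is part of what I am asking.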
I would also like the word lengths to be determined by the source data, if possible; that is, the length distribution of the randomly generated words should be approximately the same as the length distribution of the source data.
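For the length requirement, the only approach I can think of is rejection sampling: pick a target length at random according to how often each length occurs in the source, then keep generating candidates until one has that length. A minimal sketch of that, again in Python and again with made-up names (`length_distribution`, `generate_with_source_lengths`, and `make_word`, which stands for whatever word generator is plugged in):

```python
import random
from collections import Counter

def length_distribution(words):
    """Map each word length to how many source words have that length."""
    return Counter(len(w) for w in words if w)

def generate_with_source_lengths(source_words, make_word, rng=random, max_tries=1000):
    """Return a generated word whose length was drawn from the source lengths.

    `make_word` is any zero-argument function returning a candidate word
    (an assumption; plug in your own generator). Returns None if no
    candidate of the target length turns up within `max_tries` attempts.
    """
    lengths = length_distribution(source_words)
    target = rng.choices(list(lengths), weights=list(lengths.values()))[0]
    for _ in range(max_tries):
        word = make_word()
        if len(word) == target:
            return word
    return None
```

This seems wasteful when the generator rarely produces the rarer lengths, so I would be glad to hear of a way to build the length into the chain itself.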
Any ideas would be greatly appreciated! Thanks.