- Take all your texts and build a list of words. The easy way: take all the words. The hard way: keep only the relevant ones (e.g., in English, "the" is never a relevant word because it is used too often). Let's say you have V words in your vocabulary.
- For each text, build an adjacency matrix A of size V * V. Row A(i) tells how close the words of your vocabulary are to the i-th word V(i). For example, if V(i) = "ski", then A(i, j) is how close the word V(j) is to the word "ski". You will want a small vocabulary, since the matrix has V * V entries!
Technical details: For the vocabulary, there are several ways to get a good one. Unfortunately, I can't remember their names. One of them consists of removing words that appear often and everywhere. Conversely, you should keep rare words that appear in several texts. However, there is no point in keeping words that appear in exactly one text, since they cannot help compare two texts.
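The filtering described above could be sketched as follows. This is only one possible reading: the function names `tokenize` and `build_vocabulary`, and the thresholds `min_docs` and `max_doc_ratio`, are assumptions made for illustration, not an established method.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    # [^\W\d_] matches any Unicode letter, so Greek text works as well.
    return re.findall(r"[^\W\d_]+", text.lower())

def build_vocabulary(texts, min_docs=2, max_doc_ratio=0.8):
    """Keep words found in at least `min_docs` texts and in at most
    `max_doc_ratio` of all texts. Both thresholds are illustrative guesses."""
    doc_freq = Counter()
    for text in texts:
        # Use a set so each word counts once per text.
        doc_freq.update(set(tokenize(text)))
    max_docs = max_doc_ratio * len(texts)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= max_docs)
```

Raising `min_docs` or lowering `max_doc_ratio` shrinks the vocabulary, which keeps the V * V matrix manageable.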
For the adjacency matrix, adjacency is measured by counting how far apart the words you are considering are (counting the number of words separating them). For example, let me use your own text =)
One method of comparing style is to search for similar phrases. If I find "fishing, skiing and hiking" a couple of times in one book, and "fishing, hiking and skiing" in another, the similarity in style points to one author. I also need to be able to find "fishing and even skiing or hiking". Ideally, I would also find variants of the phrase that use synonyms, but since these are non-English texts (Koine Greek), synonyms are harder to handle, and this aspect is not vital.
These are purely illustrative values:
A(method, comparing) += 1.0
A(method, similar) += 0.5
A(method, Greek) += 0.0
You basically need a "typical distance". For example, you can say that beyond 20 words of separation, words can no longer be considered adjacent.
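A minimal sketch of how such a matrix might be filled in is given below. The linear decay `1 - gap / max_gap` is an assumption chosen for simplicity (any decreasing function would do), and the names `adjacency_matrix` and `max_gap` are hypothetical.

```python
def adjacency_matrix(tokens, vocab, max_gap=20):
    """Accumulate a V x V proximity score for one tokenized text.

    Assumed weighting: a pair separated by `gap` words contributes
    1 - gap / max_gap, and nothing once gap reaches max_gap.
    """
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    A = [[0.0] * size for _ in range(size)]
    for p, word in enumerate(tokens):
        i = index.get(word)
        if i is None:
            continue  # word is not in the vocabulary
        # Look ahead only as far as the cutoff allows.
        for q in range(p + 1, min(p + max_gap + 1, len(tokens))):
            j = index.get(tokens[q])
            if j is None:
                continue
            gap = q - p - 1  # number of words separating the pair
            weight = 1.0 - gap / max_gap
            A[i][j] += weight
            if i != j:
                A[j][i] += weight  # keep the matrix symmetric
    return A
```

With `max_gap=20`, two words separated by a single word score 0.95, and the score falls to zero for words 20 or more apart.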
After some normalization, just compute the L2 distance between the adjacency matrices of the two texts to see how close they are. You can do fancier things afterwards, but this should give acceptable results.
Now, if you have synonyms, you can update the adjacency in a nice way. For example, if the input contains "beautiful maiden", then:
A(beautiful, maiden) += 1.0
A(gorgeous, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, maiden) += 0.8
...
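The comparison and the synonym update could look roughly like this. It is a sketch under stated assumptions: normalizing by the sum of all entries is just one possible normalization, and the `synonyms` table (a word mapped to weighted alternatives) is a hypothetical structure you would have to supply yourself.

```python
import math

def normalize(A):
    """Scale a matrix so its entries sum to 1 (an assumed normalization,
    meant to make texts of different lengths comparable)."""
    total = sum(sum(row) for row in A)
    if total == 0:
        return [row[:] for row in A]
    return [[x / total for x in row] for row in A]

def l2_distance(A, B):
    """Entrywise L2 (Frobenius) distance between two same-sized matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def add_with_synonyms(A, index, first, second, synonyms, weight=1.0):
    """Credit the pair (first, second), then credit each known synonym of
    `first` paired with `second`, scaled by its similarity score."""
    j = index.get(second)
    if j is None:
        return
    variants = {first: 1.0}
    variants.update(synonyms.get(first, {}))
    for word, similarity in variants.items():
        i = index.get(word)
        if i is not None:
            A[i][j] += weight * similarity
```

A smaller `l2_distance(normalize(A1), normalize(A2))` then suggests that the two texts place the same words close together more often.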