Phrase Detection and Comparison Algorithm

Question

I have some non-English texts. I would like to make stylistic comparisons on them.

One way to compare style is to look for similar phrases. If I find “fishing, skiing and hiking” a couple of times in one book, and “fishing, hiking and skiing” in another, the similarity in style points to a single author. I also need to be able to find “fishing and even skiing or hiking.” Ideally, I would also find “angling, hiking and skiing,” but since these are non-English texts (Koine Greek), synonyms are more difficult to resolve, and this aspect is not vital.

What is the best way (1) to find such phrases and then (2) to search for them in the other texts in a way that is not too strict (so that it also finds “fishing and even skiing or hiking”)?

+6

language-agnostic algorithm semantics nlp

jcuenod Jun 30 '11 at 11:30

3 answers

You should probably use some measure of string similarity, such as Jaccard, Dice, or cosine similarity. You can apply them to words, to word-level or character-level n-grams, or to lemmas. (For a highly inflected language such as Koine Greek, I would suggest using lemmas if you have a good lemmatizer for it.)
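As a rough illustration of these three measures, here is a minimal Python sketch. The function names (`word_ngrams`, `char_ngrams`, `jaccard`, `dice`, `cosine`) and the regex tokenizer are assumptions made for this example, not part of any particular library, and no lemmatization is attempted.

```python
import re
from collections import Counter
from math import sqrt

def word_ngrams(text, n=1):
    """Word n-grams (as tuples) of a lower-cased text, punctuation dropped."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=3):
    """Character n-grams of a lower-cased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| on the sets of items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def dice(a, b):
    """2 |A ∩ B| / (|A| + |B|) on the sets of items."""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

def cosine(a, b):
    """Cosine of the angle between the item-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[k] * cb[k] for k in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0
```

On single words, “fishing, skiing and hiking” and “fishing, hiking and skiing” are identical under all three measures, while on word bigrams they share nothing, so the choice of unit decides how much word order matters.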

Capturing synonyms is difficult if you don't have something like WordNet that groups synonyms together.

+2

Fred foo Jun 30 '11 at 11:42

I would follow two guidelines:

Beware of premature optimization in the matching algorithm. Start with a broad approach, and then refine it as needed (i.e. check whether a simple proximity test gives roughly the right result on a dataset for which you know the answer, and if not, tune it until it does). In many cases, you will find that a highly optimized solution does not produce results that differ significantly from your first rough attempt.
Use some kind of self-learning algorithm. That way, you could feed the AI a few texts that make it smarter. Taking inspiration from your example: before trying to compare the two target texts, I would feed it texts about outdoor life. That way, the AI will most likely learn by itself that angling is a very close match for fishing.

As a self-learning AI, I would use (at least to start with) a neural network. There is a simple and fully working example (in Python) that can be found here and is aimed specifically at “data mining.” Of course, you can implement it in some other language.

About your two specific questions:

What is the best way to find these phrases

Other answers to your question describe this in detail (and their authors seem to know more than I do!), but again: I would start simply, with a neural network that tells you how close two terms are. Then I would proceed in “waves” of optimization until you are satisfied with the result: for example, using only the root of each word (if it were an English text), adjusting the score according to other metadata of the text such as the year, the author, or the geographical origin, or even changing the matching algorithm completely.

What is the best way to search for them in a way that is not too strict in other texts (to find "fishing and even skiing or hiking")

I would say that this is equivalent to asking the AI to return all phrases whose "proximity score" exceeds a given threshold.
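The thresholding idea can be sketched as follows. This is only an illustration under stated assumptions: a plain Jaccard word overlap stands in for the learned "proximity score" (it is not a neural network), and `find_similar`, `threshold`, and `slack` are hypothetical names chosen for this example.

```python
import re

def tokens(text):
    """Lower-cased word tokens, punctuation dropped."""
    return re.findall(r"\w+", text.lower())

def proximity(a, b):
    """Stand-in proximity score: Jaccard overlap of the two word sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def find_similar(phrase, text, threshold=0.6, slack=2):
    """Return (score, window) for each window of the text scoring at or above threshold.

    The window is `slack` words longer than the phrase, so that a few
    inserted words ("and even", "or") do not prevent a match.
    """
    target = tokens(phrase)
    words = tokens(text)
    size = len(target) + slack
    hits = []
    for start in range(max(1, len(words) - size + 1)):
        window = words[start:start + size]
        score = proximity(target, window)
        if score >= threshold:
            hits.append((score, " ".join(window)))
    return hits
```

Calling `find_similar("fishing, skiing and hiking", other_text)` returns every window of `other_text` that is close enough; lowering `threshold` makes the search more permissive.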

HTH!

+1

mac Jun 30 '11 at 12:08

Fezvez · Accepted Answer · 2011-06-30T11:46:46+0000

Take all your texts and build a list of words. The easy way: take all the words. The hard way: take only the relevant ones (e.g. in English, “the” is never a relevant word because it is used too often). Let's say you have V words in your vocabulary.
For each text, build an adjacency matrix A of size V × V. Row A(i) indicates how close the words of your vocabulary are to the i-th word V(i). For example, if V(i) = "ski", then A(i, j) is how close the word V(j) is to the word "ski". (You will want a small vocabulary!)

Technical details: for the vocabulary, there are several techniques for selecting good words. Unfortunately, I can’t remember their names. One of them consists of removing words that occur often and everywhere. Conversely, you should keep rare words that occur in several texts. However, it makes no sense to keep words that occur in exactly one text.
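One possible reading of this filtering, sketched in Python: count in how many texts each word occurs, then drop words that occur in only one text or in nearly all of them. The function `build_vocabulary` and the cut-offs `min_docs` and `max_doc_ratio` are assumptions for this example, not the named techniques the answer alludes to.

```python
import re
from collections import Counter

def build_vocabulary(texts, min_docs=2, max_doc_ratio=0.8):
    """Keep words found in at least `min_docs` texts and in at most
    `max_doc_ratio` of all texts (assumed cut-offs, tune as needed)."""
    doc_freq = Counter()
    for text in texts:
        # Count each word once per text (document frequency).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    limit = max_doc_ratio * len(texts)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)
```

With these defaults, a word present in every text (such as an article) is dropped, and so is a word that appears in a single text only.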

For the adjacency matrix, adjacency is measured by how far apart the two words under consideration are (counting the number of words separating them). For example, let me use your own text =)

One method of comparing style is to search for similar phrases. If I find “fishing, skiing and hiking” a couple of times in one book, and “fishing, hiking and skiing” in another, the similarity in style points to one author. I also need to find “fishing and even skiing or hiking.” Ideally, I would also find “angling, hiking and skiing,” but since these are non-English texts (Koine Greek), synonyms are more difficult to resolve, and this aspect is not vital.

These are completely made-up values:
A(method, comparing) += 1.0
A(method, similarity) += 0.5
A(method, Greek) += 0.0

You basically need a "typical distance". For example, you can say that beyond 20 words of separation, two words can no longer be considered adjacent.
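A minimal sketch of such a matrix, assuming a linear decay: two neighbouring words add 1.0, and the contribution falls to 0 once `max_dist` words separate them. The decay formula, the symmetric update, and the name `adjacency_matrix` are choices made for this example; the answer does not prescribe them.

```python
import re

def adjacency_matrix(text, vocab, max_dist=20):
    """Build a V x V matrix where A[i][j] accumulates a weight each time
    vocab[i] and vocab[j] occur fewer than `max_dist` words apart."""
    index = {w: i for i, w in enumerate(vocab)}
    size = len(vocab)
    A = [[0.0] * size for _ in range(size)]
    words = re.findall(r"\w+", text.lower())
    for p, w in enumerate(words):
        if w not in index:
            continue
        i = index[w]
        # Look ahead at most max_dist positions.
        for q in range(p + 1, min(p + max_dist + 1, len(words))):
            j = index.get(words[q])
            if j is None or j == i:
                continue
            separating = q - p - 1               # words between the two
            weight = 1.0 - separating / max_dist  # assumed linear decay
            A[i][j] += weight
            A[j][i] += weight                     # keep the matrix symmetric
    return A
```

For the phrase “One method of comparing style” with `vocab = ["method", "comparing", "style"]` and `max_dist=4`, the pair (comparing, style) receives the full weight because the words are neighbours, while (method, style) receives less.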

After some normalization, just compute the L2 distance between the adjacency matrices of the two texts to see how close they are. You can do fancier things afterwards, but this should already give acceptable results. Now, if you have synonyms, you can update the adjacency in a nice way. For example, if your input contains “beautiful maiden”, then:
A(beautiful, maiden) += 1.0
A(gorgeous, maiden) += 0.9
A(fair, maiden) += 0.8
A(sublime, damsel) += 0.8
...
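The comparison step itself can be sketched as below. Since the answer only says "some normalization", dividing each matrix by its Frobenius norm is an assumption made here, as are the names `normalize` and `l2_distance`; the synonym update is not covered by this sketch.

```python
from math import sqrt

def normalize(A):
    """Scale a matrix (list of rows) to unit Frobenius norm (assumed normalization)."""
    norm = sqrt(sum(x * x for row in A for x in row))
    if norm == 0:
        return [row[:] for row in A]
    return [[x / norm for x in row] for row in A]

def l2_distance(A, B):
    """L2 (Frobenius) distance between two normalized matrices of equal size."""
    A, B = normalize(A), normalize(B)
    return sqrt(sum((a - b) ** 2
                    for row_a, row_b in zip(A, B)
                    for a, b in zip(row_a, row_b)))
```

Both matrices must be built over the same vocabulary, in the same order. With this normalization the distance ranges from 0 (same adjacency pattern, regardless of text length) to 2, and for matrices with non-negative entries it cannot exceed the square root of 2.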
