PHP find relevance

Let's say I have a collection of 100,000 articles on 10 different topics. I don't know which article belongs to which topic, but I have the full text of every news article (so they can be analyzed by keywords). I would like to group these articles by topic. Any idea how I would do this? Any engine (Sphinx, Lucene) is fine.

+3
7 answers

In machine learning and data mining, this kind of problem is called a classification problem. The easiest approach is to use past (already labeled) data to predict labels for new data, i.e. statistical classification: http://en.wikipedia.org/wiki/Statistical_classification . You could start with a Naive Bayes classifier (commonly used for spam detection).

I would suggest reading this book (although it is written for Python): Programming Collective Intelligence ( http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325 ). It has a good example.
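To make the Naive Bayes suggestion concrete, here is a minimal sketch in plain Python (the language of the book above). It is not taken from the book. The class name `NaiveBayes`, the `tokenize` helper, and the tiny training set are all made up for illustration, and it assumes you can hand-label a small sample of articles for each topic first.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Illustrative multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> training articles seen
        self.vocab = set()                       # every word seen in training

    def train(self, text, topic):
        words = tokenize(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, float("-inf")
        for topic in self.doc_counts:
            # Log prior: share of training articles carrying this topic.
            score = math.log(self.doc_counts[topic] / total_docs)
            total_words = sum(self.word_counts[topic].values())
            denom = total_words + len(self.vocab)
            for w in words:
                # Add-one smoothing so unseen words do not zero out the score.
                score += math.log((self.word_counts[topic][w] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Hypothetical hand-labeled sample; real training data would be far larger.
clf = NaiveBayes()
clf.train("the team won the football match with a late goal", "sports")
clf.train("the striker scored twice in the cup final", "sports")
clf.train("parliament passed the new tax bill after a long debate", "politics")
clf.train("the minister announced an election date", "politics")

print(clf.classify("a goal in the final match"))
print(clf.classify("the tax debate before the election"))
```

With 100,000 articles you would label perhaps a few hundred per topic by hand, train on those, and let the classifier assign the rest. A library implementation is likely a better choice than this sketch for real use.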

+2

, apache, maschine, Mahout. :

[...] , . . , () . [...]

Mahout http://mahout.apache.org/

, ;-), , . , , , , , , . , : -)

+2

:

N 100K 10 . , , .

Lucene/Sphinx 10 , . , .

, , OR. 10 . Lucene/Sphinx , "" .

, , Naive Bayes. , Google WEKA MALLET, .

+1

7 " " (Manning 2009):

" , . , , - Google News.

, 7 , .

+1

sphinix 10 , , ..

0

" " . , .

0

, . , ?

.

Then I would make a list of topics, assign to each topic the words and phrases that belong to it, and match those against the tags. The problem is that a single article can match more than one topic.
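A rough sketch of that keyword-list idea in Python follows. The topic names, the keyword sets, and the `match_topics` function are hypothetical, and for simplicity it matches single words only, not phrases. It returns every matching topic with its hit count, which makes the "more than one topic" problem visible.

```python
import re

# Hypothetical keyword lists; in practice these would be curated per topic.
TOPIC_KEYWORDS = {
    "sports": {"football", "goal", "match", "striker"},
    "politics": {"election", "parliament", "minister", "tax"},
}


def match_topics(text, topic_keywords=TOPIC_KEYWORDS, min_hits=1):
    """Return (topic, hits) pairs for topics whose keywords appear in text.

    Pairs are sorted by number of distinct keyword hits, highest first,
    with ties broken alphabetically by topic name.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {topic: len(words & kws) for topic, kws in topic_keywords.items()}
    matched = [(t, s) for t, s in scores.items() if s >= min_hits]
    return sorted(matched, key=lambda pair: (-pair[1], pair[0]))


print(match_topics("The minister watched the football match"))
```

Taking only the first pair gives a single best topic. Keeping the whole list lets an article belong to several topics.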

Perhaps the best way would be to use some form of Bayesian classifier to determine which topic best describes each article. This requires training the system first on articles whose topic is already known.

The same method is used to decide whether a message is spam.

This article may help.

0

Source: https://habr.com/ru/post/1758906/
