How can I classify text documents using SVM and KNN?

Almost all examples are based on numbers. In text documents, I have words instead of numbers.

So, can you show me simple examples of using these algorithms to classify text documents?

I don't need sample code, just the logic.

Pseudocode would help significantly.

+4
3 answers

The general approach is to use a bag-of-words model (http://en.wikipedia.org/wiki/Bag_of_words_model), where the classifier works from the presence of words in the text. It's simple, but it works surprisingly well.
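
As a rough illustration of that idea, here is a minimal sketch in plain Python. It turns each document into a 0/1 "word present" vector and trains a toy linear SVM on it by subgradient descent on the hinge loss. The function names, the tiny training set, and the hyperparameters (epochs, learning_rate, reg) are all assumptions made up for this example. In practice you would more likely use a library implementation such as scikit-learn's LinearSVC.

```python
def build_vocabulary(documents):
    """Collect every distinct word in the training documents, in sorted order."""
    return sorted({word for doc in documents for word in doc.lower().split()})


def presence_vector(text, vocabulary):
    """Bag of words: 1.0 if the vocabulary term occurs in the text, else 0.0."""
    words = set(text.lower().split())
    return [1.0 if term in words else 0.0 for term in vocabulary]


def train_linear_svm(vectors, labels, epochs=100, learning_rate=0.1, reg=0.01):
    """Toy linear SVM: subgradient descent on the L2-regularized hinge loss.

    Labels must be +1 or -1. The hyperparameter values are illustrative only.
    """
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Regularization step: shrink the weights slightly.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            # Hinge-loss step: only examples inside the margin push the weights.
            if margin < 1.0:
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(text, vocabulary, weights, bias):
    """Return +1 or -1 depending on which side of the hyperplane the text falls."""
    x = presence_vector(text, vocabulary)
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical training data: +1 = positive review, -1 = negative review.
train_docs = ["good great film", "great fun", "bad boring film", "boring plot"]
train_labels = [1, 1, -1, -1]

vocabulary = build_vocabulary(train_docs)
vectors = [presence_vector(doc, vocabulary) for doc in train_docs]
weights, bias = train_linear_svm(vectors, train_labels)

print(predict("great fun film", vocabulary, weights, bias))
print(predict("boring bad plot", vocabulary, weights, bias))
```

Words that never appeared in training are simply ignored by presence_vector, which is the usual behaviour for a fixed vocabulary.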

A similar question has also been asked here: “Prepare data for text classification using Scikit Learn SVM”.

+9

You represent the terms that appear in the documents as weights in a vector, where each index position holds the “weight” of one term. For example, take the document “hello world”, associate position 0 with the importance of “hello” and position 1 with the importance of “world”, and measure importance as the number of times the term appears. The document is then represented as d = (1, 1).

Likewise, a document that says only “hello” would be (1, 0).
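
The mapping from a document to a count vector can be sketched in a few lines of Python. The fixed two-term vocabulary and the function name to_count_vector are assumptions chosen to mirror the example. A real vocabulary would be built from the whole training collection.

```python
from collections import Counter

# Position 0 holds the weight of "hello", position 1 the weight of "world".
VOCABULARY = ["hello", "world"]


def to_count_vector(text, vocabulary=VOCABULARY):
    """Represent a document as a tuple of term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that do not occur in the text.
    return tuple(counts[term] for term in vocabulary)


print(to_count_vector("hello world"))  # (1, 1)
print(to_count_vector("hello"))        # (1, 0)
```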

This representation can be based on any measure of the importance of terms in documents; term frequencies (as suggested by @Pedrom) are the simplest option. The most common technique, and still a fairly simple one, is TF-IDF, which combines how common a term is in the document with how rare it is in the collection.
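
To make the TF-IDF weighting concrete, here is a small sketch using only the standard library. It assumes one common variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df) with no smoothing. Libraries often use slightly different formulas, so treat the exact numbers as illustrative. The function name tf_idf and the sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / number of docs containing term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = ["hello world", "hello there", "hello again world"]
weights = tf_idf(docs)

for doc, doc_weights in zip(docs, weights):
    print(doc, doc_weights)
```

Here "hello" appears in every document, so its weight is 0 everywhere, while "there" appears in only one document and gets the largest weight.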

I hope this helps.

+3

In a bag-of-words model, you can use the term frequencies and assign weights to them depending on their occurrence in the new document and in the training documents. After that, you can use a similarity function to calculate the similarity between the training documents and the test document.
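
This is essentially how KNN works on text. A minimal sketch, assuming raw term counts as weights and cosine similarity as the similarity function (other weightings and similarity measures are possible), could look like the following. The training set, the labels, and the function names are invented for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-frequency vector: {term: count}."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training, k=3):
    """Label the text by majority vote among its k most similar training documents."""
    query = vectorize(text)
    scored = sorted(
        ((cosine_similarity(query, vectorize(doc)), label) for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches now", "spam"),
    ("limited offer buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes attached", "ham"),
    ("lunch on monday", "ham"),
]

print(knn_classify("buy cheap pills", training))         # spam
print(knn_classify("monday project meeting", training))  # ham
```

KNN has no training step: all the work happens at query time, when the test document is compared against every training document.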

0

Source: https://habr.com/ru/post/1482179/

