Which is the best example of an SVM that classifies plain text input?

I have tested various SVM classification tools, mainly SVMlight, PySVMLight, LIBSVM, and scikit-learn's SVM classifier.

Each one accepts its input and test files in a different format, e.g.:

pysvmlight:

 [(0, [(13.0, 1.0), (14.0, 1.0), (173.0, 1.0), (174.0, 1.0)]), (0, [(9.0, 1.0), (10.0, 1.0), (11.0, 1.0), (12.0, 1.0), (16.0, 1.0), (19.0, 1.0), (20.0, 1.0), (21.0, 1.0), (22.0, 1.0), (56.0, 1.0)])]

svmlight:

 +1 6:0.0342598670723747 26:0.148286149621374 27:0.0570037235976456 31:0.0373086482671729 33:0.0270832794680822 63:0.0317368459004657 67:0.138424991237843 75:0.0297571881179897 96:0.0303237495966756 142:0.0241139382095992 144:0.0581948804675796 185:0.0285004985793364 199:0.0228776475252599 208:0.0366675566391316 274:0.0528930062061687 308:0.0361623318128513 337:0.0374174808347037 351:0.0347329937800643 387:0.0690970538458777 408:0.0288195477724883 423:0.0741629177979597 480:0.0719961218888683 565:0.0520577748209694 580:0.0442849093862884 593:0.329982711875242 598:0.0517245325094578 613:0.0452655621746453 641:0.0387269206869957 643:0.0398205809532254 644:0.0466353065571088 657:0.0508331832990127 717:0.0495981406619795 727:0.104798994968809 764:0.0452655621746453 827:0.0418050310923008 1027:0.05114477444793 1281:0.0633241153685135 1340:0.0657101916402099 1395:0.0522617631894159 1433:0.0471872599750513 1502:0.840963375098259 1506:0.0686138465829187 1558:0.0589627036028818 1598:0.0512079697459134 1726:0.0660884976719923 1836:0.0521934221969394 1943:0.0587388821544177 2433:0.0666767220421155 2646:0.0729483627336339 2731:0.071437898589286 2771:0.0706069752753547 3553:0.0783933439550538 3589:0.0774668403369963 

http://svm.chibi.ubc.ca//sample.test.matrix.txt

 corner      feature_1  feature_2  feature_3  feature_4
 example_11  -0.18      0.14       -0.06      0.54
 example_12  0.16       -0.25      0.26       0.33
 example_13  0.06       0.0        -0.2       -0.22
 example_14  -0.12      -0.22      0.29       -0.01
 example_15  -0.20      -0.23      -0.1       -0.71

Is there any SVM classifier that accepts plain text input directly and returns a classification result for it?

+5
2 answers

My answer is twofold:

There are implementations of SVMs that work directly on text data, for example https://github.com/timshenkao/StringKernelSVM . You can also use LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#libsvm_for_string_data . The key to applying an SVM directly to text data is a so-called string kernel. The kernel is what the SVM uses to measure the similarity (or distance) between data points, which here are text documents. One example of a string kernel is based on the edit distance between text documents; cf. http://www.jmlr.org/papers/volume2/lodhi02a/lodhi02a.pdf
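To make this concrete, here is a minimal sketch (not taken from the linked projects) of plugging a string kernel into scikit-learn via a precomputed kernel matrix. The edit-distance-based similarity exp(-d) below is a toy choice for illustration only and is not guaranteed to be a valid (positive semi-definite) kernel; the documents and labels are made up:

 import numpy as np
 from sklearn.svm import SVC

 def edit_distance(a, b):
     # Classic dynamic-programming Levenshtein distance between two strings.
     dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
     dp[:, 0] = np.arange(len(a) + 1)
     dp[0, :] = np.arange(len(b) + 1)
     for i in range(1, len(a) + 1):
         for j in range(1, len(b) + 1):
             cost = 0 if a[i - 1] == b[j - 1] else 1
             dp[i, j] = min(dp[i - 1, j] + 1,        # deletion
                            dp[i, j - 1] + 1,        # insertion
                            dp[i - 1, j - 1] + cost) # substitution
     return dp[len(a), len(b)]

 def string_kernel(docs_a, docs_b):
     # Toy similarity from edit distance; not necessarily positive semi-definite.
     return np.exp(-np.array([[edit_distance(a, b) for b in docs_b]
                              for a in docs_a], dtype=float))

 train_docs = ["good movie", "great film", "awful movie", "terrible film"]
 train_labels = [1, 1, 0, 0]

 clf = SVC(kernel="precomputed")
 clf.fit(string_kernel(train_docs, train_docs), train_labels)

 # Prediction needs the kernel between the test documents and the training set.
 print(clf.predict(string_kernel(["great movie"], train_docs)))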

The question is whether it is a good idea to use a string kernel for text classification.

Simplifying a bit, a support vector machine is a function

 f(x) = sgn( <w,phi(x)> +b) 

What usually happens is that you take your input documents, compute their bag-of-words representations, and then use a standard kernel such as the linear one. So something like:

 f(x) = sgn( <w,phi(bag-of-words(x))> +b) 
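As an illustration (toy documents and labels invented for this sketch), a linear SVM fitted on bag-of-words features in scikit-learn lets you evaluate exactly that expression by hand:

 import numpy as np
 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.svm import LinearSVC

 docs = ["good movie", "great film", "awful movie", "terrible film"]
 labels = [1, 1, 0, 0]

 vectorizer = CountVectorizer()           # phi(x) = bag-of-words(x)
 X = vectorizer.fit_transform(docs)
 clf = LinearSVC().fit(X, labels)

 # Evaluate f(x) = sgn( <w, phi(bag-of-words(x))> + b ) manually:
 x = vectorizer.transform(["great movie"])
 decision = x @ clf.coef_.T + clf.intercept_  # <w, phi(x)> + b
 print(np.sign(decision))                     # +1 here means class 1
 print(clf.predict(x))                        # same class, via the library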

What you most likely want is an SVM with a kernel that combines bag-of-words with a linear kernel. This is reasonably simple to implement, but it has drawbacks:

  • Bags of words are very compact compared to the raw text documents
  • You cannot normalize raw text documents for length, but you can normalize the bag-of-words features
  • Without separating these steps, your code is harder to reuse

The bottom line of both parts: this is not a job for a string-kernel SVM.

+5

Yes, you can do it in scikit-learn.

First, use the CountVectorizer to convert your text documents into a document-term matrix. (This is called the bag-of-words representation and is one way to extract features from text.) The document-term matrix is then used as input to a support vector machine or any other classification model.
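A minimal sketch of that workflow (the documents and labels below are invented for illustration):

 from sklearn.feature_extraction.text import CountVectorizer
 from sklearn.pipeline import make_pipeline
 from sklearn.svm import LinearSVC

 # Plain text goes in, class labels come out.
 docs = ["cheap pills online now",
         "meeting rescheduled to noon",
         "win money fast, click here",
         "lunch with the project team"]
 labels = ["spam", "ham", "spam", "ham"]

 # CountVectorizer builds the document-term matrix; LinearSVC classifies it.
 model = make_pipeline(CountVectorizer(), LinearSVC())
 model.fit(docs, labels)
 print(model.predict(["free money now", "project meeting at noon"]))
 # e.g. ['spam' 'ham'] -- the exact output depends on the toy data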

Here is a brief description of the document-term matrix from the scikit-learn documentation:

In this scheme, features and samples are defined as follows: each individual token occurrence frequency (normalized or not) is treated as a feature, and the vector of all token frequencies for a given document is considered a multivariate sample.

However, using a support vector machine (SVM) might not be the best idea in this case. From the scikit-learn documentation:

If the number of features is much greater than the number of samples, the method is likely to give poor performance.

As a rule, a document-term matrix has far more features (unique terms) than samples (documents), which is why SVMs are often not the best choice for this type of problem.
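A quick way to see that imbalance is to print the shape of a document-term matrix. The snippet below uses one category of the 20 newsgroups corpus, fetched via scikit-learn's dataset loader (downloaded on first use; the exact numbers will vary):

 from sklearn.datasets import fetch_20newsgroups
 from sklearn.feature_extraction.text import CountVectorizer

 # Even one category of a modest corpus has far more unique terms than documents.
 docs = fetch_20newsgroups(subset="train", categories=["sci.space"]).data
 X = CountVectorizer().fit_transform(docs)
 print(X.shape)  # (n_documents, n_unique_terms), on the order of (600, 30000)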

Here is a tutorial explaining and demonstrating this whole process in scikit-learn, although it uses a different classification model (Naive Bayes).

+1
