How do you do dimensionality reduction effectively in NLP?

In NLP, the feature dimension is almost always very large. For example, in one of my projects the feature dimension is almost 20 thousand (p = 20,000), and each feature is a 0/1 indicator of whether a particular word or bigram occurs in the document (one document is a data point $x \in \mathbb{R}^{p}$).

I know the redundancy among the features is huge, so dimensionality reduction seems necessary. I have three questions:

1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to reduce the dimension? The matrix $X \in \mathbb{R}^{n \times p}$ is so large that both PCA (or SVD; truncated SVD would be fine, although I don't think SVD is a good way to reduce the dimension of binary features) and Bag of Words (or k-means) are difficult to run directly on $X$ (which is, of course, sparse). I don't have a server, I just use my own computer :-(.
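For concreteness, this is the kind of truncated SVD I mean; it is only a minimal sketch on synthetic data, and scikit-learn's TruncatedSVD does accept sparse input. The density and component count below are arbitrary placeholders, not values from my project:

```python
# Minimal sketch: truncated SVD on a sparse 0/1 matrix of roughly the size described above.
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Synthetic stand-in for the real document-term matrix (density chosen arbitrarily).
X = sparse_random(10_000, 10_000, density=0.001, format="csr", random_state=0)
X.data[:] = 1.0  # make it a binary 0/1 matrix like the features described above

svd = TruncatedSVD(n_components=100, random_state=0)  # 100 components is an arbitrary choice
X_reduced = svd.fit_transform(X)                      # dense array of shape (10_000, 100)
print(X_reduced.shape, svd.explained_variance_ratio_.sum())
```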

2) How should I judge the similarity or distance between two data points? I think Euclidean distance may not work well for binary features. How about the L0 norm? What do you use in practice?

3) If I want to use an SVM (or other kernel methods) for classification, which kernel should I use?

Thank you very much!

1 answer

1) You do not need to reduce the dimension. If you really want to, you can use an L1-penalized linear classifier to keep only the most useful features.
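A minimal sketch of that idea with scikit-learn (the synthetic data, labels, and the value C=0.1 are placeholders, not recommendations): an L1-penalized LinearSVC drives most weights to zero, and SelectFromModel keeps only the columns with non-zero weights.

```python
# Sketch: L1-penalized linear classifier as a feature selector.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = sparse_random(1_000, 10_000, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0                         # binary features, as in the question
y = rng.integers(0, 2, size=1_000)      # random labels, for illustration only

l1_svc = LinearSVC(penalty="l1", dual=False, C=0.1).fit(X, y)
selector = SelectFromModel(l1_svc, prefit=True)
X_selected = selector.transform(X)      # keeps only features with non-zero weights
print(X.shape, "->", X_selected.shape)
```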

2) Cosine similarity, or cosine similarity on TF-IDF-weighted vectors, is commonly used.
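For example (the three toy documents below are made up purely for illustration):

```python
# Sketch: TF-IDF weighting followed by pairwise cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse TF-IDF matrix
print(cosine_similarity(tfidf))                 # 3 x 3 pairwise similarity matrix
```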

3) Linear SVMs tend to work best with this many features.
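A small sketch of that setup, with invented toy documents and labels standing in for real data: binary bag-of-words features fed straight into a linear SVM, no kernel needed.

```python
# Sketch: binary bag-of-words features + linear SVM in a single pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["good movie", "great film", "terrible plot", "awful acting"]
labels = [1, 1, 0, 0]                           # toy sentiment labels

model = make_pipeline(CountVectorizer(binary=True), LinearSVC())
model.fit(docs, labels)
print(model.predict(["great plot", "awful movie"]))
```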

There is a good Python tutorial on how to do this kind of classification here: http://scikit-learn.org/dev/tutorial/text_analytics/working_with_text_data.html

