In NLP it is very common for the feature dimensionality to be very large. For example, in one of my projects the number of features is almost 20 thousand (p = 20,000), and each feature is a 0-1 indicator of whether a particular word or bigram appears in the document (one document is a data point $x \in \mathbb{R}^{p}$).
I know that the redundancy among the features is huge, so I need to reduce the dimensionality. I have three questions:
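To make the representation concrete, this is roughly how the features are built (a minimal sketch assuming scikit-learn's CountVectorizer; the documents below are just placeholders, not my real data):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder documents, just to illustrate the representation described above.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# binary=True gives 0-1 indicators; ngram_range=(1, 2) covers both words and bigrams.
vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)  # sparse 0-1 matrix of shape (n_docs, p)
print(X.shape)
print(vectorizer.get_feature_names_out())
```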
1) I have 10 thousand data points (n = 10,000), and each data point has 10 thousand features (p = 10,000). What is an efficient way to reduce the dimensionality? The matrix $X \in \mathbb{R}^{n \times p}$ is so large that both PCA (or SVD; truncated SVD is fine, but I don't think SVD is a good way to reduce the dimension for binary features) and bag-of-words (or K-means) are hard to run directly on $X$ (it is sparse, of course). I don't have a server, I just use my own computer :-(.
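For reference, this is the kind of approach I am considering for question 1: a minimal sketch assuming scikit-learn's TruncatedSVD applied directly to a scipy sparse matrix (the random matrix below just stands in for my data):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse 0-1 matrix standing in for my data: n = 10,000 documents,
# p = 10,000 binary features, kept in CSR format so it fits in laptop memory.
X = sparse.random(10_000, 10_000, density=0.001, format="csr", random_state=0)
X.data[:] = 1.0  # binarize the nonzero entries

# TruncatedSVD works directly on the sparse matrix (no dense conversion),
# which is why I am considering it despite my doubts about SVD for binary data.
svd = TruncatedSVD(n_components=100, random_state=0)
X_reduced = svd.fit_transform(X)  # dense (10000, 100) array, small enough to keep
print(X_reduced.shape, svd.explained_variance_ratio_.sum())
```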
2) How should I judge the similarity or distance between two data points? I think the Euclidean distance may not work well for binary features. How about the L0 norm? What do you use?
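To clarify what I mean by these distances, here is a small sketch assuming scipy's implementations (the two vectors are made-up examples):

```python
import numpy as np
from scipy.spatial.distance import hamming, jaccard

# Two made-up binary documents (word-indicator vectors).
a = np.array([1, 0, 1, 1, 0, 0], dtype=bool)
b = np.array([1, 1, 0, 1, 0, 0], dtype=bool)

# Hamming counts the fraction of positions that differ (essentially the L0 norm
# of a - b, rescaled); Jaccard ignores shared zeros, which seems relevant when
# most features are 0.
print("Hamming:", hamming(a, b))  # 2/6 differing positions
print("Jaccard:", jaccard(a, b))  # 1 - |intersection| / |union| of the 1s
```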
3) If I want to use an SVM (or other kernel methods) for classification, which kernel should I use?
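And for question 3, this is the kind of comparison I would run: a sketch assuming scikit-learn's LinearSVC and SVC, where the data and labels are random placeholders rather than my real corpus:

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Random placeholders for documents and labels (not my real data).
X = sparse.random(500, 2_000, density=0.01, format="csr", random_state=0)
X.data[:] = 1.0
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)

# Linear kernel: the usual default for high-dimensional sparse text features.
print("linear:", cross_val_score(LinearSVC(C=1.0), X, y, cv=3).mean())

# RBF kernel: the alternative I would compare it against.
print("rbf:", cross_val_score(SVC(kernel="rbf", gamma="scale", C=1.0), X, y, cv=3).mean())
```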
Thank you very much!