CountVectorizer and Out-Of-Vocabulary (OOV) Tokens?

I am currently using CountVectorizer to extract features. However, I need to count the words that were not seen during fitting.

During transform, the default behavior of CountVectorizer is to ignore words that were not seen during fitting. But I need to keep track of how many times this happens!

How can I do this?

Thanks!

+4
1 answer

There is no built-in way to do this in scikit-learn; you need to write some additional code. However, you can use the vocabulary_ attribute of CountVectorizer to achieve this.

  • Cache the current vocabulary
  • Call fit_transform on the new documents
  • Compute the diff between the new vocabulary and the cached one
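
The steps above could be sketched roughly as follows. This is only an illustration, not a built-in scikit-learn feature: the sample documents and the names train_docs, new_docs, cached_vocab, and new_vectorizer are made up for the example, and it assumes a second CountVectorizer with the same (default) settings is fitted on the new documents so the fitted one stays untouched.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data, for illustration only.
train_docs = ["the cat sat", "the dog barked"]
new_docs = ["the cat meowed", "a bird sang and sang"]

# Step 1: fit on the training documents and cache the learned vocabulary.
vectorizer = CountVectorizer()
vectorizer.fit(train_docs)
cached_vocab = set(vectorizer.vocabulary_)

# Step 2: call fit_transform on the new documents, using a separate
# vectorizer so the original fitted vocabulary is not overwritten.
new_vectorizer = CountVectorizer()
new_counts = new_vectorizer.fit_transform(new_docs)

# Step 3: diff the new vocabulary against the cached one.
oov_terms = sorted(set(new_vectorizer.vocabulary_) - cached_vocab)

# Sum the counts of the out-of-vocabulary columns, per document and overall.
oov_columns = np.array([new_vectorizer.vocabulary_[t] for t in oov_terms], dtype=int)
oov_per_doc = np.asarray(new_counts[:, oov_columns].sum(axis=1)).ravel()
oov_total = int(oov_per_doc.sum())

print("OOV terms:", oov_terms)
print("OOV count per document:", oov_per_doc.tolist())
print("Total OOV count:", oov_total)
```

Note that both vectorizers must share the same tokenization settings; otherwise the diff would mix real out-of-vocabulary words with differences in preprocessing.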
+1

Source: https://habr.com/ru/post/1658757/

