Scikit Learn - calculation of TF-IDF from the body of function arrays, but not from the composition of unprocessed documents

Question

Scikit Learn - calculation of TF-IDF from the body of function arrays, but not from the composition of unprocessed documents

Scikit-Learn TfidfVectorizer converts a set of raw documents into a matrix of TF-IDF functions. Instead of raw documents, I would like to convert the matrix of function names into TF-IDF functions.

The body you are feeding fit_transform()should be an array of unprocessed documents, but instead I want it to be able to supply it (or a comparable function) with an array of attribute arrays per document. For example:

corpus = [
    ['orange', 'red', 'blue'],
    ['orange', 'yellow', 'red'],
    ['orange', 'green', 'purple (if you believe in purple)'],
    ['orange', 'reddish orange', 'black and blue']
]

... unlike a one-dimensional array of strings.

I know that I can define my own dictionary for TfidfVectorizer to use, so I could easily make a set of unique functions in my body and their indices in function vectors. But the function is still expecting raw documents, and since my functions have different lengths and sometimes overlap (for example, “orange” and “reddish orange”), I cannot just combine my functions into single lines and use ngrams.

Is there another Scikit-Learn function that I can use for this that I cannot find? Is there a way to use TfidfVectorizer that I don't see? Or should I create my own TF-IDF function for this?

+4

python scikit-learn machine-learning tf-idf

Andrew LaPrise 15 sept. '15 at 16:57

source share

1

Andrew LaPrise · Accepted Answer · 2015-09-15T17:32:23+0000

, .

:

. , ( ) , , , . HTML, ..
Tokenizer - , , .

( ). , . :

tfidf = TfidfVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)
tfidf_matrix = tfidf.fit_transform(corpus)

, , lambda x: x. , TF-IDF "" .

Scikit Learn - calculation of TF-IDF from the body of function arrays, but not from the composition of unprocessed documents

More articles: