Huge sparse DataFrame to a scipy sparse matrix, without a dense transformation

I have data with more than 1 million rows and 30 columns; one of the columns is user_id (more than 1500 distinct users). I want to one-hot encode this column and use the data in ML algorithms (xgboost, FFM, scikit-learn). But given the huge number of rows and unique user values, the result will be ~1 million x 1500, so I need to do this in a sparse format (otherwise a dense representation eats all the RAM).

For me, a convenient way to work with the data is a pandas DataFrame, which now also supports a sparse format:

df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)

It works quite quickly and takes up little RAM. But to work with scikit-learn algorithms and xgboost, the DataFrame has to be converted to a sparse matrix.

Is there a way to do this without iterating over the columns and hstack-ing them into a single scipy sparse matrix? I tried df.as_matrix() and df.values, but both first convert the data to dense, which causes a MemoryError :(

P.S. The same applies to DMatrix for xgboost.
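For reference, newer pandas versions (0.25+) expose a sparse accessor that does this conversion in one call, without any per-column loop. A minimal sketch under that assumption; the tiny DataFrame here is just a stand-in for the real data:

```python
import pandas as pd
from scipy import sparse

# Stand-in for the real data: a few rows with a high-cardinality user_id column.
df = pd.DataFrame({"user_id": ["u1", "u2", "u1", "u3"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# One-hot encode into sparse columns without materializing a dense matrix.
one_hot = pd.get_dummies(df, columns=["user_id"], sparse=True)

# Give every column a common sparse float dtype, then convert to scipy COO.
all_sparse = one_hot.astype(pd.SparseDtype("float", 0.0))
coo = all_sparse.sparse.to_coo()
csr = coo.tocsr()  # CSR is the format scikit-learn and xgboost expect
```

A scipy CSR matrix can then be passed directly to scikit-learn estimators and to xgboost's DMatrix constructor, so no dense intermediate is ever needed.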

UPDATE:

So, I arrived at the following solution (I would be grateful for optimization suggestions):

def sparse_df_to_sparse_matrix(sparse_df):
    index_list = sparse_df.index.values.tolist()
    matrix_columns = []
    sparse_matrix = None

    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        curr_sps_column, rows, cols = sps_series.to_coo()
        if sparse_matrix is not None:
            sparse_matrix = sparse.hstack([sparse_matrix, curr_sps_column])
        else:
            sparse_matrix = curr_sps_column
        matrix_columns.extend(cols)

    return sparse_matrix, index_list, matrix_columns
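One optimization suggestion for the function above: calling sparse.hstack inside the loop copies the whole accumulated matrix on every iteration; collecting the per-column blocks in a list and hstack-ing once at the end keeps peak memory much lower. A sketch of that variant, written against the Series.sparse.to_coo accessor from newer pandas (SparseSeries was removed in pandas 1.0):

```python
import pandas as pd
from scipy import sparse

def sparse_df_to_sparse_matrix(sparse_df):
    """Convert a DataFrame of sparse columns to one scipy matrix, hstack-ing once."""
    index_list = sparse_df.index.tolist()
    matrix_columns = []
    blocks = []
    for column in sparse_df.columns:
        sps_series = sparse_df[column]
        sps_series.index = pd.MultiIndex.from_product([index_list, [column]])
        block, _rows, cols = sps_series.sparse.to_coo()
        blocks.append(block)
        matrix_columns.extend(cols)
    # A single hstack instead of one per column avoids repeated full copies.
    return sparse.hstack(blocks).tocsr(), index_list, matrix_columns
```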

And the following code allows you to get a sparse data frame:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
full_sparse_df = one_hot_df.to_sparse(fill_value=0)

I created a sparse matrix of 1.1 million rows x 1150 columns. But creating it still uses a significant amount of RAM (~10 GB, right at the edge of my 12 GB).

I don't know why, because the resulting sparse matrix uses only 300 MB (after loading it back from the hard drive). Any ideas?
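On the "only 300 MB after loading from disk" observation: the peak during construction includes intermediates (the stacked MultiIndex, the per-column copies made by hstack), while the finished matrix is just three compact arrays. If the matrix only needs to be built once, persisting it with scipy's save_npz/load_npz and reloading it in later runs sidesteps the expensive construction entirely. A sketch with random stand-in data:

```python
import os
import tempfile
from scipy import sparse

# Stand-in for the real 1.1M x 1150 one-hot matrix.
m = sparse.random(1000, 50, density=0.01, format="csr", random_state=0)

path = os.path.join(tempfile.mkdtemp(), "one_hot.npz")
sparse.save_npz(path, m)    # compact on-disk representation

m2 = sparse.load_npz(path)  # cheap to reload; no rebuild needed
```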


You can use the .to_coo() method from pandas [1]:

one_hot_df = pd.get_dummies(df, columns=['user_id', 'type'], sparse=True)
one_hot_df, idx_rows, idx_cols = one_hot_df.stack().to_sparse().to_coo()

A DataFrame (rows/columns) first has to be reshaped into a Series with a MultiIndex, which is what .stack() does, because .to_coo() is defined for a SparseSeries with a MultiIndex, not for a SparseDataFrame. And since .stack() returns a plain Series rather than a SparseSeries, the .to_sparse() call is needed before .to_coo().

One caveat: in the Series produced by .stack(), only values recognized as missing become sparse "holes" when converting to a SparseSeries (i.e. the values end up as np.float with np.nan as the fill value).
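A note for current pandas: .to_sparse() and SparseSeries were removed in pandas 1.0, so with a recent version the same pipeline goes through astype and the .sparse accessor instead. A small sketch (the tiny DataFrame is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"user_id": ["u1", "u2"], "type": ["a", "b"]})
one_hot = pd.get_dummies(df, columns=["user_id", "type"], sparse=True)

# .stack() reshapes the (row, column) DataFrame into a Series with a MultiIndex.
stacked = one_hot.stack()

# astype to a sparse dtype replaces the removed .to_sparse() call.
sparse_stacked = stacked.astype(pd.SparseDtype("float", 0.0))

# Level 0 of the MultiIndex becomes rows, level 1 becomes columns.
coo, idx_rows, idx_cols = sparse_stacked.sparse.to_coo(row_levels=[0],
                                                       column_levels=[1])
```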


Source: https://habr.com/ru/post/1613995/
