What is feature hashing (the hashing trick)?

I know that feature hashing (the hashing trick) is used to reduce dimensionality and handle sparse bit vectors, but I don't understand how it works. Can someone explain this to me? Is there any Python library for feature hashing?

Thanks.

+6
3 answers

In Pandas, you can use something like this:

import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
data = pd.DataFrame(data)

def hash_col(df, col, N):
    # One indicator column per hash bucket; note that Python 3's built-in
    # hash() is salted per process, so buckets can differ between runs
    cols = [col + "_" + str(i) for i in range(N)]
    def xform(x):
        tmp = [0 for i in range(N)]
        tmp[hash(x) % N] = 1
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

print(hash_col(data, 'state', 4))

The output will look like this (the exact buckets depend on the hash function):

   pop  year  state_0  state_1  state_2  state_3
0  1.5  2000        0        1        0        0
1  1.7  2001        0        1        0        0
2  3.6  2002        0        1        0        0
3  2.4  2001        0        0        0        1
4  2.9  2002        0        0        0        1
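Since Python 3 randomizes the built-in hash() per process, the bucket columns above can change between runs. A deterministic alternative (a sketch, not part of the original answer) is to derive the bucket from a stable digest such as MD5:

```python
import hashlib

def stable_bucket(value, n_buckets):
    # MD5 gives the same digest in every process, unlike the salted built-in hash()
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# The same value always lands in the same bucket
print(stable_bucket("Ohio", 4))
print(stable_bucket("Nevada", 4))
```

Swapping `hash(x)` for `stable_bucket(x, N)` in `hash_col` makes the encoded columns reproducible across runs and machines.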

You could also do it at the row level:

import numpy as np
import pandas as pd

def hash_col(df, col, N):
    # Row-level version: df is a Series, col is an index label
    df = df.replace('', np.nan)
    cols = [col + "_" + str(i) for i in range(N)]
    tmp = [0 for i in range(N)]
    tmp[hash(df[col]) % N] = 1   # .ix is removed in modern pandas; use label access
    res = pd.concat([df, pd.Series(tmp, index=cols)])
    return res.drop(col)

a = pd.Series(['new york', 30, ''], index=['city', 'age', 'test'])
b = pd.Series(['boston', 30, ''], index=['city', 'age', 'test'])
print(hash_col(a, 'city', 10))
print(hash_col(b, 'city', 10))

This works on a single Series, where the column name is a pandas index label. It also replaces empty strings with NaN.

age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      0
city_6      0
city_7      1
city_8      0
city_9      0
dtype: object
age        30
test      NaN
city_0      0
city_1      0
city_2      0
city_3      0
city_4      0
city_5      1
city_6      0
city_7      0
city_8      0
city_9      0
dtype: object

If, however, you have a fixed vocabulary and just want one-hot encoding, you can use

import pandas as pd
import scipy.sparse as sps

def hash_col(df, col, vocab):
    # One-hot encode against a known vocabulary (no hashing needed)
    cols = [col + "=" + str(v) for v in vocab]
    def xform(x):
        tmp = [0 for i in range(len(vocab))]
        tmp[vocab.index(x)] = 1
        return pd.Series(tmp, index=cols)
    df[cols] = df[col].apply(xform)
    return df.drop(col, axis=1)

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)
df2 = hash_col(df, 'state', ['Ohio', 'Nevada'])
print(sps.csr_matrix(df2))

which will give (df2 before the sparse conversion):

   pop  year  state=Ohio  state=Nevada
0  1.5  2000           1             0
1  1.7  2001           1             0
2  3.6  2002           1             0
3  2.4  2001           0             1
4  2.9  2002           0             1

I also added sparsification of the final dataframe. In an incremental setting, where we might not have seen all the values in advance (but somehow obtained a list of all possible values), the approach above can be used. Incremental ML methods need the same number of features in each batch, so one-hot encoding must produce the same number of columns for every batch.
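As for the library part of the question: scikit-learn ships FeatureHasher, which implements the hashing trick directly and returns a SciPy sparse matrix (a sketch assuming scikit-learn is installed; n_features=8 is an arbitrary choice):

```python
from sklearn.feature_extraction import FeatureHasher

# Each sample is a list of string features; the hasher maps each feature
# to one of n_features columns, with a signed value to reduce collision bias
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["Ohio"], ["Nevada"], ["Ohio"]])
print(X.shape)     # (3, 8)
print(X.toarray())
```

Unlike the vocabulary approach, the output width is fixed by n_features, so every batch automatically gets the same number of columns.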

+5

Here (sorry, I can't add this as a comment for some reason). Also, the first paper, "Feature Hashing for Large Scale Multitask Learning", explains it beautifully.

+2

A large sparse feature space can arise from interactions, e.g. U as users and X as emails, so the dimension of U x X is memory-intensive. Typically, a task like spam filtering also has a time constraint.

The hashing trick, like any other hash function, maps features to a finite set of indices, which makes large-scale training possible. Theoretically, a longer hash length improves performance, as shown in the original paper.

It assigns the original features to buckets in a finite-length space while preserving their semantics, even when a spammer uses typos to fly under the radar. Although hashing introduces some distortion error, the hashed representation of a misspelled word remains close to that of the original.

For instance,

"the quick brown fox" is converted to:

h(the)   mod 5 = 0
h(quick) mod 5 = 1
h(brown) mod 5 = 1
h(fox)   mod 5 = 3

Using the index instead of the text value saves space.
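The example above can be sketched in Python; MD5 is used here as a stand-in for the unspecified hash function h, so the bucket indices will not necessarily match the ones shown:

```python
import hashlib

def h(token, n=5):
    # Stable stand-in for the answer's hash function h
    return int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % n

def hashed_bow(text, n=5):
    # Bag of words over n hash buckets; colliding tokens simply add up
    vec = [0] * n
    for token in text.split():
        vec[h(token, n)] += 1
    return vec

print(hashed_bow("the quick brown fox"))
```

The vector length stays n no matter how large the vocabulary grows; that is the trade-off the collision error buys.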

To summarize some applications:

  • dimensionality reduction for high-dimensional feature vectors

    • e.g. text in email classification, collaborative spam filtering
  • sparsification

  • bag of words on the fly

  • cross-product features

  • multitask learning
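For cross-product features in particular, hashing avoids materializing the full U x X space: the combined feature string is hashed straight into a fixed-size index space (a sketch; the user and token names are made up for illustration):

```python
import hashlib

N = 2 ** 20  # fixed feature space, independent of how many users or tokens exist

def bucket(feature, n=N):
    # Deterministic index for an arbitrary feature string
    return int(hashlib.md5(feature.encode("utf-8")).hexdigest(), 16) % n

# A user x token cross feature gets one index without enumerating U x X
idx = bucket("user=alice^token=discount")
print(idx)
```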

Link:

  • Original paper:

Feature Hashing for Large Scale Multitask Learning

    • Shi, Q., Petterson, J., Dror, G., Langford, J., Smola, A., Strehl, A. and Vishwanathan, V. (2009). Hash kernels

  • What is the hashing trick?

  • Quora

  • Gionis, A., Indyk, P., and Motwani, R. (1999). Similarity Search in High Dimensions via Hashing

Implementation:

  • Langford, J., Li, L., and Strehl, A. (2007). Vowpal Wabbit online learning project (technical report). http://hunch.net/?p=309
+1

Source: https://habr.com/ru/post/904701/

