Convert a GraphLab SFrame to a dictionary {key: values}

Given an SFrame like this:

    +------+-----------+-----------+-----------+-----------+-----------+-----------+
    | X1   | X2        | X3        | X4        | X5        | X6        | X7        |
    +------+-----------+-----------+-----------+-----------+-----------+-----------+
    | the  | -0.060292 | 0.06763   | -0.036891 | 0.066684  | 0.024045  | 0.099091  |
    | ,    | 0.026625  | 0.073101  | -0.027073 | -0.019504 | 0.04173   | 0.038811  |
    | .    | -0.005893 | 0.093791  | 0.015333  | 0.046226  | 0.032791  | 0.110069  |
    | of   | -0.050371 | 0.031452  | 0.04091   | 0.033255  | -0.009195 | 0.061086  |
    | and  | 0.005456  | 0.063237  | -0.075793 | -0.000819 | 0.003407  | 0.053554  |
    | to   | 0.01347   | 0.043712  | -0.087122 | 0.015258  | 0.08834   | 0.139644  |
    | in   | -0.019466 | 0.077509  | -0.102543 | 0.034337  | 0.130886  | 0.032195  |
    | a    | -0.072288 | -0.017494 | -0.018383 | 0.001857  | -0.04645  | 0.133424  |
    | is   | 0.052726  | 0.041903  | 0.163781  | 0.006887  | -0.07533  | 0.108394  |
    | for  | -0.004082 | -0.024244 | 0.042166  | 0.007032  | -0.081243 | 0.026162  |
    | on   | -0.023709 | -0.038306 | -0.16072  | -0.171599 | 0.150983  | 0.042044  |
    | that | 0.062037  | 0.100348  | -0.059753 | -0.041444 | 0.041156  | 0.166704  |
    | )    | 0.052312  | 0.072473  | -0.02067  | -0.015581 | 0.063368  | -0.017216 |
    | (    | 0.051408  | 0.186162  | 0.03028   | -0.048425 | 0.051376  | 0.004989  |
    | with | 0.091825  | -0.081649 | -0.087926 | -0.061273 | 0.043528  | 0.107864  |
    | was  | 0.046042  | -0.058529 | 0.040581  | 0.067748  | 0.053724  | 0.041067  |
    | as   | 0.025248  | -0.012519 | -0.054685 | -0.040581 | 0.051061  | 0.114956  |
    | it   | 0.028606  | 0.106391  | 0.025065  | 0.023486  | 0.011184  | 0.016715  |
    | by   | -0.096704 | 0.150165  | -0.01775  | -0.07178  | 0.004458  | 0.098807  |
    | be   | -0.109489 | -0.025908 | 0.025608  | 0.076263  | -0.047246 | 0.100489  |
    +------+-----------+-----------+-----------+-----------+-----------+-----------+

How can I convert the SFrame to a dictionary so that column X1 is the key and columns X2 to X7 are the value, as an np.array()?

I tried iterating through the original SFrame row by row and doing something like this:

    >>> import graphlab as gl
    >>> import numpy as np
    >>> x = gl.SFrame()
    >>> a = np.array([1,2,3])
    >>> w = 'foo'
    >>> x.append(gl.SFrame({'word':[w], 'vector':[a]}))
    Columns:
        vector  array
        word    str
    Rows: 1
    Data:
    +-----------------+------+
    | vector          | word |
    +-----------------+------+
    | [1.0, 2.0, 3.0] | foo  |
    +-----------------+------+
    [1 rows x 2 columns]

Is there any other way to do the same?


EDITED

I tried @papayawarrior's solution. It works as long as the entire data frame fits in memory, but a few quirks make it behave oddly.

Assume my original input is the SFrame shown above (but with 501 columns), stored in a .csv file. I have this code to read it into the desired dictionary:

    def get_embeddings(embedding_gzip, size):
        coltypes = [str] + [float] * size
        sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                                column_type_hints=coltypes, header=False, quote_char='\0')
        sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
        df = sf.to_dataframe().set_index('X1')
        print list(df)
        return df.to_dict(orient='dict')['X2']

But strangely, this raises the following error:

      File "sts_compose.py", line 28, in get_embeddings
        return df.to_dict(orient='dict')['X2']
    KeyError: 'X2'

So when I check the column names before converting to a dictionary, I find that they are not "X1" and "X2"; instead, list(df) prints ['X501', 'X3'].

Is there something wrong with the way I converted graphlab.SFrame -> pandas.DataFrame -> dict?

I know that I can solve the problem by doing this instead, but the question remains: "How did the column names become so weird?":

    def get_embeddings(embedding_gzip, size):
        coltypes = [str] + [float] * size
        sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                                column_type_hints=coltypes, header=False, quote_char='\0')
        sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
        df = sf.to_dataframe().set_index('X1')
        col_names = list(df)
        return df.to_dict(orient='dict')[col_names[1]]
2 answers

Edited to address the new questions in the post.

@Adrien Renaud is spot on about using the SFrame.pack_columns method, but for the final step I would suggest the pandas DataFrame.to_dict method, provided your dataset fits in memory.

    >>> import graphlab as gl
    >>> sf = gl.SFrame({'X1': ['cat', 'dog'], 'X2': [1, 2], 'X3': [3, 4]})
    >>> sf
    +-----+----+----+
    | X1  | X2 | X3 |
    +-----+----+----+
    | cat | 1  | 3  |
    | dog | 2  | 4  |
    +-----+----+----+
    >>> sf2 = sf.rename({'X1': 'word'})
    >>> sf2 = sf2.pack_columns(column_prefix='X', new_column_name='vector')
    >>> sf2
    +------+--------+
    | word | vector |
    +------+--------+
    | cat  | [1, 3] |
    | dog  | [2, 4] |
    +------+--------+
    >>> df = sf2.to_dataframe().set_index('word')
    >>> result = df.to_dict(orient='dict')['vector']
    >>> result
    {'cat': [1, 3], 'dog': [2, 4]}
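As for the odd column names in the question, a likely explanation is an off-by-one in the column list passed to pack_columns. With header=False the columns are named X1 through X501, but range(2, size+1) stops at X500 when size is 500, so X501 is never packed. Since no new_column_name is given, the packed column also gets an auto-generated name rather than 'X2'. The sketch below only reproduces the name bookkeeping with plain Python lists; it makes no GraphLab calls, and size = 500 is an assumption inferred from the "501 columns" in the question.

```python
# Plain-Python stand-in for the column bookkeeping; no GraphLab calls are made.
size = 500  # assumed value, inferred from the 501 columns mentioned in the question

# With header=False and [str] + [float] * size type hints there are size + 1 columns.
all_columns = ['X' + str(i) for i in range(1, size + 2)]

# Columns packed in the question: range(2, size + 1) stops at X500.
packed = ['X' + str(i) for i in range(2, size + 1)]
left_over = [c for c in all_columns if c not in packed]

# Packing every float column needs an upper bound of size + 2.
packed_fixed = ['X' + str(i) for i in range(2, size + 2)]
left_over_fixed = [c for c in all_columns if c not in packed_fixed]

print(left_over)        # ['X1', 'X501']
print(left_over_fixed)  # ['X1']
```

Passing new_column_name='vector' to pack_columns, as in the session above, avoids relying on whatever name is generated for the packed column.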

Is there any other way to do the same?

Yes, you can use the pack_columns method of the SFrame class.

    import graphlab as gl
    data = gl.SFrame()
    data.add_column(gl.SArray(['foo', 'bar']), 'X1')
    data.add_column(gl.SArray([1., 3.]), 'X2')
    data.add_column(gl.SArray([2., 4.]), 'X3')
    print data
    +-----+-----+-----+
    | X1  | X2  | X3  |
    +-----+-----+-----+
    | foo | 1.0 | 2.0 |
    | bar | 3.0 | 4.0 |
    +-----+-----+-----+
    [2 rows x 3 columns]

    import array
    data = data.pack_columns(['X2', 'X3'], dtype=array.array, new_column_name='vector')
    data = data.rename({'X1':'word'})
    print data
    +------+------------+
    | word | vector     |
    +------+------------+
    | foo  | [1.0, 2.0] |
    | bar  | [3.0, 4.0] |
    +------+------------+
    [2 rows x 2 columns]

    b=data['vector'][0]
    print type(b)
    <type 'array.array'>

How can I convert the SFrame to a dictionary so that column X1 is the key and columns X2 to X7 are the value, as an np.array()?

I did not find a built-in method to convert SFrame to dict. You can try the following (this can be very slow):

    a={}
    def dump_sframe_to_dict(row, a):
        a[row['word']]=row['vector']
    data.apply(lambda x: dump_sframe_to_dict(x, a))
    print a
    {'foo': array('d', [1.0, 2.0]), 'bar': array('d', [3.0, 4.0])}
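If the packed data fits in memory, a dict comprehension over the rows may be a simpler alternative to calling apply for its side effect. The sketch below uses a plain list of row dicts as a hypothetical stand-in for data, so it runs without GraphLab. My understanding is that iterating over an SFrame yields one dict per row, but treat that as an assumption and check it against your version.

```python
from array import array

# Hypothetical stand-in for the packed SFrame: one dict per row,
# with the same 'word' and 'vector' columns as above.
rows = [
    {'word': 'foo', 'vector': array('d', [1.0, 2.0])},
    {'word': 'bar', 'vector': array('d', [3.0, 4.0])},
]

# Map each word to its vector in a single pass over the rows.
embeddings = {row['word']: row['vector'] for row in rows}

print(list(embeddings['foo']))  # [1.0, 2.0]
```

Wrapping each value in np.array(...) while building the dictionary would give the NumPy arrays asked for in the question.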

Source: https://habr.com/ru/post/1240286/

