Given an SFrame like this:
    +------+-----------+-----------+-----------+-----------+-----------+-----------+
    |  X1  |    X2     |    X3     |    X4     |    X5     |    X6     |    X7     |
    +------+-----------+-----------+-----------+-----------+-----------+-----------+
    | the  | -0.060292 |   0.06763 | -0.036891 |  0.066684 |  0.024045 |  0.099091 |
    | ,    |  0.026625 |  0.073101 | -0.027073 | -0.019504 |   0.04173 |  0.038811 |
    | .    | -0.005893 |  0.093791 |  0.015333 |  0.046226 |  0.032791 |  0.110069 |
    | of   | -0.050371 |  0.031452 |   0.04091 |  0.033255 | -0.009195 |  0.061086 |
    | and  |  0.005456 |  0.063237 | -0.075793 | -0.000819 |  0.003407 |  0.053554 |
    | to   |   0.01347 |  0.043712 | -0.087122 |  0.015258 |   0.08834 |  0.139644 |
    | in   | -0.019466 |  0.077509 | -0.102543 |  0.034337 |  0.130886 |  0.032195 |
    | a    | -0.072288 | -0.017494 | -0.018383 |  0.001857 |  -0.04645 |  0.133424 |
    | is   |  0.052726 |  0.041903 |  0.163781 |  0.006887 |  -0.07533 |  0.108394 |
    | for  | -0.004082 | -0.024244 |  0.042166 |  0.007032 | -0.081243 |  0.026162 |
    | on   | -0.023709 | -0.038306 |  -0.16072 | -0.171599 |  0.150983 |  0.042044 |
    | that |  0.062037 |  0.100348 | -0.059753 | -0.041444 |  0.041156 |  0.166704 |
    | )    |  0.052312 |  0.072473 |  -0.02067 | -0.015581 |  0.063368 | -0.017216 |
    | (    |  0.051408 |  0.186162 |   0.03028 | -0.048425 |  0.051376 |  0.004989 |
    | with |  0.091825 | -0.081649 | -0.087926 | -0.061273 |  0.043528 |  0.107864 |
    | was  |  0.046042 | -0.058529 |  0.040581 |  0.067748 |  0.053724 |  0.041067 |
    | as   |  0.025248 | -0.012519 | -0.054685 | -0.040581 |  0.051061 |  0.114956 |
    | it   |  0.028606 |  0.106391 |  0.025065 |  0.023486 |  0.011184 |  0.016715 |
    | by   | -0.096704 |  0.150165 |  -0.01775 |  -0.07178 |  0.004458 |  0.098807 |
    | be   | -0.109489 | -0.025908 |  0.025608 |  0.076263 | -0.047246 |  0.100489 |
    +------+-----------+-----------+-----------+-----------+-----------+-----------+
How can I convert this SFrame into a dictionary in which column X1 is the key and columns X2 through X7 form the value as an np.array()?
I tried iterating through the original SFrame row by row and doing something like this:
    >>> import graphlab as gl
    >>> import numpy as np
    >>> x = gl.SFrame()
    >>> a = np.array([1,2,3])
    >>> w = 'foo'
    >>> x.append(gl.SFrame({'word':[w], 'vector':[a]}))
    Columns:
        vector  array
        word    str

    Rows: 1

    Data:
    +-----------------+------+
    |      vector     | word |
    +-----------------+------+
    | [1.0, 2.0, 3.0] | foo  |
    +-----------------+------+
    [1 rows x 2 columns]
Is there any other way to do the same?
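To make the goal concrete, here is a minimal sketch of the mapping I am after. It uses plain Python dicts as stand-in rows instead of a real SFrame, on the assumption that iterating an SFrame yields one dict per row. `rows_to_embeddings` is a hypothetical helper name, and the rows are truncated to three value columns for brevity.

```python
import numpy as np

# Stand-in rows (assumption: a real SFrame yields one dict per row when iterated).
rows = [
    {'X1': 'the', 'X2': -0.060292, 'X3': 0.06763, 'X4': -0.036891},
    {'X1': ',', 'X2': 0.026625, 'X3': 0.073101, 'X4': -0.027073},
]


def rows_to_embeddings(rows, key='X1'):
    """Map each row's key column to a NumPy vector of its remaining columns."""
    embeddings = {}
    for row in rows:
        # Order value columns numerically (X2, X3, ..., X10), not lexically.
        value_cols = sorted((c for c in row if c != key),
                            key=lambda c: int(c[1:]))
        embeddings[row[key]] = np.array([row[c] for c in value_cols],
                                        dtype=float)
    return embeddings


embeddings = rows_to_embeddings(rows)
```

Each value in `embeddings` is a one-dimensional float array, keyed by the word in X1.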
EDITED
I tried @papayawarrior's solution. It works as long as the entire data frame fits in memory, but a few quirks make it behave oddly.
Suppose my input looks like the SFrame above, but with 501 columns and stored in a .csv file. I have this code to read it into the desired dictionary:
    def get_embeddings(embedding_gzip, size):
        coltypes = [str] + [float] * size
        sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                                column_type_hints=coltypes, header=False, quote_char='\0')
        sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
        df = sf.to_dataframe().set_index('X1')
        print list(df)
        return df.to_dict(orient='dict')['X2']
Strangely, this raises the following error:
      File "sts_compose.py", line 28, in get_embeddings
        return df.to_dict(orient='dict')['X2']
    KeyError: 'X2'
When I check the column names before converting to a dictionary, I find that they are not "X1" and "X2": list(df) prints ['X501', 'X3'].
Is there something wrong with the way I convert graphlab.SFrame -> pandas.DataFrame -> dict?
I know I can work around the problem as follows, but the question remains: how did the column names end up so strange?
    def get_embeddings(embedding_gzip, size):
        coltypes = [str] + [float] * size
        sf = gl.SFrame.read_csv('compose-vectors/' + embedding_gzip, delimiter='\t',
                                column_type_hints=coltypes, header=False, quote_char='\0')
        sf = sf.pack_columns(['X'+str(i) for i in range(2, size+1)])
        df = sf.to_dataframe().set_index('X1')
        col_names = list(df)
        return df.to_dict(orient='dict')[col_names[1]]
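One possible explanation, not verified against the GraphLab documentation: if `size` is 500 (an assumption; the question only states that there are 501 columns), then `range(2, size+1)` stops at X500, so X501 is never packed and survives as a column of its own. The packed column would then get an auto-generated name, which may be where 'X3' comes from; that naming behavior is a guess. The off-by-one part can be checked in plain Python, using the hypothetical helpers `packed_names` and `all_names`:

```python
def packed_names(size):
    """Column names passed to pack_columns in get_embeddings."""
    return ['X' + str(i) for i in range(2, size + 1)]


def all_names(n_columns):
    """Default names X1..Xn assumed for a headerless file with n columns."""
    return ['X' + str(i) for i in range(1, n_columns + 1)]


size = 500        # assumed: 500 float columns after the word column
n_columns = 501   # stated in the question

packed = set(packed_names(size))
# Columns that are not packed and therefore remain separate in the SFrame.
left_over = [c for c in all_names(n_columns) if c not in packed]
print(left_over)
```

Under these assumptions the unpacked columns are X1 and X501, which matches the 'X501' seen in list(df). Using `range(2, size+2)` would cover X501 as well, and naming the packed column explicitly (if your version of `pack_columns` supports that) would avoid depending on the generated name.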