Is it a bad idea to treat a DataFrame's integer position and a SciPy matrix row index as the same thing?

I have a pipeline that accepts a pandas DataFrame, df, with several text columns, combines them into a single document per row, and vectorizes the documents. The result is a scipy.sparse.csr_matrix; call it X.

Later, I run nearest-neighbor queries against the rows of X (which correspond to the rows of my original DataFrame). When I want to, say, display the text name of a vector's nearest neighbor, I use the vector's integer position in X, like this:

>>> print("Nearest neighbor name is", df.iloc[position_in_x]['my_name'])

Is this a bad move, or can an integer position in a DataFrame be considered static if I don't add or remove from the DataFrame?

I wonder how others have dealt with this. One solution that occurs to me is to store the row vectors of X as a new column in df.

Thanks!

1 answer

I'm not sure about iloc in this case, but if you want something more robust, you can always use label-based selection with the loc attribute. It keeps working even after you reorder the rows or add new ones. loc selects rows by index label (not by integer position, as iloc does). With the default index the labels are simply 0, 1, 2, ..., so they coincide with the row numbers of the matrix.

In [132]: df1
Out[132]: 
   x   y events
0  5  20       
2  7  22       
4  9  24       

In [133]: df2
Out[133]: 
   x   y events
1  6  21       
3  8  23       

In [134]: df3 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0

In [135]: df3
Out[135]: 
   x   y events
0  5  20       
2  7  22       
4  9  24       
1  6  21       
3  8  23       

In [137]: df3.loc[3,:]
Out[137]: 
x          8
y         23
events      
Name: 3, dtype: object
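
Applied to the setup in the question, one way to make this robust is to record the index labels of df at the moment X is built, then translate a row position in X back into a label before looking it up. The following is only a minimal sketch under assumptions: the toy df, the hand-built X, and the names row_labels and shuffled are illustrative stand-ins, not part of the original pipeline.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for the asker's data; the index is deliberately not 0..n-1.
df = pd.DataFrame(
    {"my_name": ["alpha", "beta", "gamma"]},
    index=[10, 20, 30],
)

# Stand-in for the vectorizer output: row i of X was built from row i of df.
X = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))

# Record which index label each row of X came from, at the time X is built.
row_labels = df.index.to_numpy()

# Reordering df afterwards changes row positions but not labels.
shuffled = df.sort_values("my_name", ascending=False)

position_in_x = 0                    # e.g. returned by a nearest-neighbor query
label = row_labels[position_in_x]    # label of the df row behind X[0]

name_by_label = shuffled.loc[label, "my_name"]              # original row
name_by_position = shuffled.iloc[position_in_x]["my_name"]  # whatever is first now
print(name_by_label, name_by_position)
```

As long as df is never reordered, filtered, or extended after X is built, iloc with the position in X and loc with the saved label return the same row; the saved labels only matter once that stops being true.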


Source: https://habr.com/ru/post/1607358/

