Is it possible to perform bulk operations during assignment?

I searched Stack Overflow carefully and did not find any useful results. At this point I'm not even sure this is possible, but since I'm a beginner I figured I would at least ask here.

Basically, I have several data sets, each containing about 8 million rows, and I do not want to iterate over every row. I have read in several places that vectorization is almost always the fastest way to operate on pandas DataFrames, but I cannot think of a way to write my script without a loop. Speed is critical, because I would rather not keep my computer running for a month straight.

I need to take two values from one DataFrame, use them as the row and column indices into another DataFrame, and set that cell to 1. Consider the following code:

    >>> import pandas as pd
    >>> df1 = pd.DataFrame([[1,2],[3,4],[5,6]])
    >>> df1.columns = ['A','B']
    >>> df1
       A  B
    0  1  2
    1  3  4
    2  5  6
    >>> df2 = pd.DataFrame(0, index=list(df1['B']), columns=list(df1['A']))
    >>> df2
       1  3  5
    2  0  0  0
    4  0  0  0
    6  0  0  0

Right now I have a for loop that works like this:

    >>> listA = list(df1['A'])
    >>> listB = list(df1['B'])
    >>> row_count = len(listB)
    >>> for index in range(row_count):
    ...     col = listA[index]
    ...     row = listB[index]
    ...     df2[col][row] = 1
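(As an aside: a minimal sketch of the same element-by-element loop written with .at, which assigns a single cell by row and column label and avoids the chained df2[col][row] indexing; it is still a loop, not the vectorized approach being asked about.)

    import pandas as pd

    df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B']), columns=list(df1['A']))

    # .at sets one scalar per label pair; same result as the loop above
    for a, b in zip(df1['A'], df1['B']):
        df2.at[b, a] = 1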

Looping with range() over index positions already seems significantly faster than iterrows(). But I am trying to make the script run as fast as possible (because of the sheer volume of data), so I am wondering whether I can get rid of the loop entirely. I figured there is a good chance pandas has a DataFrame method I don't know about that can do this very efficiently.

Any help is appreciated.

Edit: the suggested duplicate does not address my problem, because my goal is not to set the diagonal values to 1; that was just a coincidence in my example because my sample DataFrames are so simple. Also, my apologies if this is not how edits are supposed to be formatted; I'm new to this community.

+6
3 answers

option 6 is my best attempt.

edit
For option 6, instead of overwriting with the assignment, you can increment instead. This little tweak gives you a count.

 df2.values[row_indexers, col_indexers] += 1 
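A hedged caveat on that tweak: with plain fancy-indexed +=, numpy applies the increment only once per unique (row, col) position, so repeated (A, B) pairs in df1 would not accumulate. If true counts are needed, np.add.at is the duplicate-safe spelling; a minimal sketch, building the integer indexers with get_indexer rather than the searchsorted call used in option 6 below:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [1, 6], [1, 6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=sorted(df1['B'].unique()), columns=sorted(df1['A'].unique()))

    row_indexers = df2.index.get_indexer(df1['B'])
    col_indexers = df2.columns.get_indexer(df1['A'])

    # unbuffered addition: the repeated (1, 6) pair contributes twice here,
    # whereas df2.values[row_indexers, col_indexers] += 1 would add it only once
    np.add.at(df2.values, (row_indexers, col_indexers), 1)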

option 1

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    df1.groupby(list('AB')).size().gt(0).mul(1) \
       .reindex(df2.unstack().index, fill_value=0) \
       .unstack(0)



option 2

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    df2.update(pd.Series(1, mux).unstack(0))
    df2
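For intuition, a small sketch of the intermediate frame that df2.update consumes in option 2; update only writes the non-NaN cells, which is why the zeros elsewhere in df2 survive (the commented output is what I would expect for this df1, not taken from the original post):

    import pandas as pd

    df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [1, 6]], columns=['A', 'B'])
    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()

    # level 0 of mux holds the A values, so unstack(0) turns them into columns
    print(pd.Series(1, mux).unstack(0))
    #      1    3    5
    # 2  1.0  NaN  NaN
    # 4  NaN  1.0  NaN
    # 6  1.0  NaN  1.0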



option 3

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    df2.where(pd.Series(False, mux).unstack(0, fill_value=True), 1)



option 4

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
    df2[pd.Series(True, mux).unstack(0, fill_value=False)] = 1
    df2



option 5

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    for i, (a, b) in df1.iterrows():
        df2.set_value(b, a, 1)
    df2


option 6
inspired by @ayhan and @Divakar

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    row_indexers = df2.index.values.searchsorted(df1.B.values)
    col_indexers = df2.columns.values.searchsorted(df1.A.values)
    df2.values[row_indexers, col_indexers] = 1
    df2
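One note on option 6: searchsorted only returns correct positions when df2's index and columns are sorted, which they happen to be for this sample. If the labels came out of unique() in arbitrary order, Index.get_indexer gives the same integer positions without the sortedness assumption; a sketch:

    import pandas as pd

    df1 = pd.DataFrame([[1, 2], [3, 4], [5, 6], [1, 6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    # get_indexer maps labels to integer positions regardless of label order;
    # a -1 would flag a label missing from df2, which should not happen here
    row_indexers = df2.index.get_indexer(df1['B'])
    col_indexers = df2.columns.get_indexer(df1['A'])
    df2.values[row_indexers, col_indexers] = 1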



timing
given sample
the code:

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=['A', 'B'])
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))

    def pir1():
        return df1.groupby(list('AB')).size().gt(0).mul(1) \
                  .reindex(df2.unstack().index, fill_value=0) \
                  .unstack(0)

    def pir2():
        mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
        df2.update(pd.Series(1, mux).unstack(0))

    def pir3():
        mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
        return df2.where(pd.Series(False, mux).unstack(0, fill_value=True), 1)

    def pir4():
        mux = pd.MultiIndex.from_arrays(df1.values.T).drop_duplicates()
        df2[pd.Series(True, mux).unstack(0, fill_value=False)] = 1

    def pir5():
        for i, (a, b) in df1.iterrows():
            df2.set_value(b, a, 1)

    def pir6():
        row_indexers = df2.index.values.searchsorted(df1.B.values)
        col_indexers = df2.columns.values.searchsorted(df1.A.values)
        df2.values[row_indexers, col_indexers] = 1
        return df2

    def ayhan1():
        row_indexers = [df2.index.get_loc(r_label) for r_label in df1.B]
        col_indexers = [df2.columns.get_loc(c_label) for c_label in df1.A]
        df2.values[row_indexers, col_indexers] = 1

    def jez1():
        return pd.get_dummies(df1.set_index('B')['A']).groupby(level=0).max()
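The timing results were shown as plots in the original post. A minimal sketch of how the comparison could be rerun with the standard timeit module, assuming the functions above are defined (the repeat and call counts are arbitrary):

    import timeit

    for func in (pir1, pir2, pir3, pir4, pir5, pir6, ayhan1, jez1):
        # best of 3 repeats of 100 calls each; shrink the numbers for the large sample
        best = min(timeit.repeat(func, number=100, repeat=3))
        print('{}: {:.6f} s per call'.format(func.__name__, best / 100))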


much larger sample
the code:

    import numpy as np
    from itertools import combinations
    from string import ascii_letters

    letter_pairs = [t[0] + t[1] for t in combinations(ascii_letters, 2)]

    df1 = pd.DataFrame(dict(A=np.random.randint(0, 100, 10000),
                            B=np.random.choice(letter_pairs, 10000)))
    df2 = pd.DataFrame(0, index=list(df1['B'].unique()), columns=list(df1['A'].unique()))


+3

It seems to me that you need pd.get_dummies, but first set_index with column B:

    print (df1.set_index('B')['A'])
    B
    2    1
    4    3
    6    5
    Name: A, dtype: int64

    print (pd.get_dummies(df1.set_index('B')['A']))
       1  3  5
    B
    2  1  0  0
    4  0  1  0
    6  0  0  1

If there are duplicates, you need groupby with the max aggregate:

    df1 = pd.DataFrame([[1,2],[3,4],[5,6], [1,6]])
    df1.columns = ['A','B']
    print (df1)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  1  6

    df2 = pd.get_dummies(df1.set_index('B')['A'])
    df2 = df2.groupby(level=0).max()
    print (df2)
       1  3  5
    B
    2  1  0  0
    4  0  1  0
    6  1  0  1

Alternative from DYZ's edit (resets the index and groups by the column):

    print(pd.get_dummies(df1.set_index('B')['A']).reset_index().groupby('B').max())
+5

numpy supports this type of indexing / assignment. As far as I know, pandas does not have this capability.
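For context, a minimal numpy-only sketch of the kind of integer-array index assignment being referred to (the array and index values here are just illustrative):

    import numpy as np

    arr = np.zeros((3, 3))
    rows = np.array([0, 1, 2, 0])
    cols = np.array([2, 1, 0, 0])

    # every (rows[i], cols[i]) position is set in one vectorized operation
    arr[rows, cols] = 1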

Suppose this is your DataFrame:

    df = pd.DataFrame(np.zeros((5, 5)), index=list('abcde'), columns=list('ABCDE'))
    df
    Out:
         A    B    C    D    E
    a  0.0  0.0  0.0  0.0  0.0
    b  0.0  0.0  0.0  0.0  0.0
    c  0.0  0.0  0.0  0.0  0.0
    d  0.0  0.0  0.0  0.0  0.0
    e  0.0  0.0  0.0  0.0  0.0

And these are your indexers:

    df1 = pd.DataFrame({'C1': ['a', 'c', 'a', 'd', 'e', 'b', 'd'],
                        'C2': ['B', 'D', 'A', 'E', 'A', 'A', 'E']})
    df1
    Out:
      C1 C2
    0  a  B
    1  c  D
    2  a  A
    3  d  E
    4  e  A
    5  b  A
    6  d  E

At this point, you can drop duplicate index pairs with:

 df1 = df1.drop_duplicates() 

Now, numpy supports indexing of the form arr[df1.C1, df1.C2], but it requires integer positions, not labels. You can use index.get_loc for that conversion; it's pretty fast.

    row_indexers = [df.index.get_loc(r_label) for r_label in df1.C1]
    col_indexers = [df.columns.get_loc(c_label) for c_label in df1.C2]

If you access the underlying numpy array via the DataFrame's values attribute, you can do:

    df.values[row_indexers, col_indexers] = 1
    df
    Out:
         A    B    C    D    E
    a  1.0  1.0  0.0  0.0  0.0
    b  1.0  0.0  0.0  0.0  0.0
    c  0.0  0.0  0.0  1.0  0.0
    d  0.0  0.0  0.0  0.0  1.0
    e  1.0  0.0  0.0  0.0  0.0
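An assumption worth flagging: writing through df.values only sticks because this frame holds a single dtype, so .values is a view of the underlying block. If the frame mixed dtypes, .values would be a freshly built copy and the assignment would be silently lost; a sketch of a positional fallback that always writes into the frame itself, reusing the indexers from above:

    # iat does positional scalar assignment directly on the DataFrame,
    # so it works whether or not .values is a view
    for r, c in zip(row_indexers, col_indexers):
        df.iat[r, c] = 1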

The question was how to perform assignment using arrays. Therefore, I assumed that df2 already exists and looks like this:

    df1 = pd.DataFrame([[1,2], [3,4], [5,6], [1,6]], columns=list('AB'))

    rows = df1['B'].unique()
    cols = df1['A'].unique()
    df2 = pd.DataFrame(0.0, index=rows, columns=cols)
    df2
    Out:
         1    3    5
    2  0.0  0.0  0.0
    4  0.0  0.0  0.0
    6  0.0  0.0  0.0

Now, if you apply my solution, the result will be the same:

    row_indexers = [df2.index.get_loc(r_label) for r_label in df1.B]
    col_indexers = [df2.columns.get_loc(c_label) for c_label in df1.A]
    df2.values[row_indexers, col_indexers] = 1
    df2
    Out:
         1    3    5
    2  1.0  0.0  0.0
    4  0.0  1.0  0.0
    6  1.0  0.0  1.0

Again, this is a solution that assumes you already have df2 and want to perform the assignment. get_dummies or groupby will just read the index pairs and give you the binary matrix. If your main goal is that reshaping, they probably make more sense. But when you say assignment, I take it as something more general (for example, df2.values[row_indexers, col_indexers] += 3).

+3

Source: https://habr.com/ru/post/1012863/

