Pandas column indexing for search?

In a relational database, we can create indexes on columns to speed up querying and joining on those columns. I want to do the same with a pandas dataframe. The row index does not seem to play the same role as an index in a relational database.

The question is: does pandas index columns by default for lookups?

If not, is it possible to index columns manually, and how do I do it?

Edit: I have read the pandas docs and searched everywhere, but nobody mentions indexing and search/merge performance in pandas. Nobody seems to care about this issue, although it is important in relational databases. Can someone comment on indexing and performance in pandas?

Thanks.

+5
1 answer

As @pvg mentioned, the pandas model is not that of an in-memory relational database, so drawing analogies with SQL and its idiosyncrasies will not help us much here. Instead, let's look at the underlying problem: you are effectively trying to speed up column lookups and joins.

You can speed up the join significantly by setting the column you want to join on as the index of both dataframes (the left and right dataframes you want to join), and then sorting both indexes.

Here is an example showing the kind of speedup you can get when joining on sorted indexes:

import pandas as pd
from numpy.random import randint

# Creating DATAFRAME #1
columns1 = ['column_1', 'column_2']
rows_df_1 = []

# generate 500 rows
# each element is a number between 0 and 100
for i in range(0, 500):
    row = [randint(0, 100) for x in range(0, 2)]
    rows_df_1.append(row)

df1 = pd.DataFrame(rows_df_1)
df1.columns = columns1
print(df1.head())

The first dataframe looks like this:

Out[]:
   column_1  column_2
0        83        66
1        91        12
2        49         0
3        26        75
4        84        60

Create the second dataframe:

# Creating DATAFRAME #2
columns2 = ['column_3', 'column_4']
rows_df_2 = []

# generate 500 rows
# each element is a number between 0 and 100
for i in range(0, 500):
    row = [randint(0, 100) for x in range(0, 2)]
    rows_df_2.append(row)

df2 = pd.DataFrame(rows_df_2)  # build df2 from rows_df_2, not rows_df_1
df2.columns = columns2

The second dataframe looks like this:

Out[]:
   column_3  column_4
0        19        26
1        78        44
2        44        43
3        95        47
4        48        59

Now let's say you want to join these two dataframes on column_1 == column_3:

# setting the join columns as indexes for each dataframe
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')

# joining
%time df1.join(df2)

Out[]: CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 46 ms

As you can see, simply setting the join columns as the dataframe indexes and joining takes about 46 milliseconds. Now let's try joining *after sorting the indexes*:

# sorting indexes
df1 = df1.sort_index()
df2 = df2.sort_index()

# joining again, now on sorted indexes
%time df1.join(df2)

Out[]: CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs

The join now takes about 9.78 µs, which is much faster.
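A single %time run can be noisy, so if you want to compare the approaches more systematically you could use %timeit, which repeats the measurement. The sketch below is untested and simply reuses rows_df_1 and rows_df_2 from the example above; it compares a plain column merge, a join on unsorted indexes, and a join on sorted indexes (your numbers will depend on your machine):

# untested sketch: comparing the three approaches with %timeit
# (uses rows_df_1 and rows_df_2 generated above)
left = pd.DataFrame(rows_df_1, columns=['column_1', 'column_2'])
right = pd.DataFrame(rows_df_2, columns=['column_3', 'column_4'])

# baseline: merge on plain columns, no index
%timeit left.merge(right, left_on='column_1', right_on='column_3')

# join on (unsorted) indexes
%timeit left.set_index('column_1').join(right.set_index('column_3'))

# join on sorted indexes (sorting done once, outside the timed statement)
left_sorted = left.set_index('column_1').sort_index()
right_sorted = right.set_index('column_3').sort_index()
%timeit left_sorted.join(right_sorted)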

I believe you can apply the same sorting idea to pandas columns: sort the columns lexicographically and reorder the dataframe accordingly. I have not tested the code below, but something like this should give you faster column lookups:

import numpy as np
import pandas as pd

# Let's assume df is a dataframe with thousands of columns
df = pd.read_csv('csv_file.csv')

# reorder the columns lexicographically
columns = np.sort(df.columns)
df = df[columns]

Column lookups should be much faster now; it would be great if someone could check this on a dataframe with a thousand columns.
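For anyone who wants to try it, here is an untested sketch of how such a check could look, using %timeit on a synthetic dataframe with 1,000 columns (the column names and sizes here are made up for illustration):

# untested sketch: does sorting column labels speed up column lookups?
import numpy as np
import pandas as pd

n_rows, n_cols = 100, 1000
data = np.random.randint(0, 100, size=(n_rows, n_cols))

# columns in random order vs. lexicographically sorted order
shuffled_names = ['col_{:04d}'.format(i) for i in np.random.permutation(n_cols)]
df_unsorted = pd.DataFrame(data, columns=shuffled_names)
df_sorted = df_unsorted[np.sort(df_unsorted.columns)]

%timeit df_unsorted['col_0500']
%timeit df_sorted['col_0500']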

+2

Source: https://habr.com/ru/post/1265119/
