As @pvg mentioned - the pandas model does not lend itself to an in-memory relational database, so drawing an analogy with SQL and its idiosyncrasies will not help us here. Instead, let's look at the problem at its core: you are effectively trying to speed up column searches / joins.
You can speed up the join significantly by setting the column you want to join on as the index of both dataframes (the left and the right dataframe), and then sorting both indexes.
Here is an example showing the speedup you can get from joining on sorted indexes:
```python
import pandas as pd
from numpy.random import randint
```
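Build the first dataframe with random integers. A minimal sketch - the row count (10,000) and value range ([0, 100)) are assumptions chosen to match the sample output below:

```python
columns1 = ['column_1', 'column_2']
rows_df_1 = []

# 10,000 rows of random integers in [0, 100); both sizes are assumptions
for _ in range(10000):
    rows_df_1.append([randint(0, 100), randint(0, 100)])

df1 = pd.DataFrame(rows_df_1, columns=columns1)
df1.head()
```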
The first dataframe looks like this:
```
Out[]:
   column_1  column_2
0        83        66
1        91        12
2        49         0
3        26        75
4        84        60
```
Create a second dataframe:
```python
columns2 = ['column_3', 'column_4']
rows_df_2 = []
```
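Filling it in presumably mirrors the first dataframe; a sketch under the same assumptions:

```python
# fill with random rows, mirroring df1 (row count and value range assumed)
for _ in range(10000):
    rows_df_2.append([randint(0, 100), randint(0, 100)])

df2 = pd.DataFrame(rows_df_2, columns=columns2)
df2.head()
```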
The second dataframe looks like this:
```
Out[]:
   column_3  column_4
0        19        26
1        78        44
2        44        43
3        95        47
4        48        59
```
Now let's say you want to join these two dataframes on `column_1 == column_3`:
```python
# set the join columns as each dataframe's index
df1 = df1.set_index('column_1')
df2 = df2.set_index('column_3')
```
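The join itself was presumably timed with IPython's `%time` magic, along these lines:

```python
%time df3 = df1.join(df2)
```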
As you can see, simply setting the join columns as the dataframes' indexes and then joining takes about 46 milliseconds. Now try joining *after sorting the indexes*:
```python
# sort the indexes
df1 = df1.sort_index()
df2 = df2.sort_index()

%time df3 = df1.join(df2)
```

```
Out[]:
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 9.78 µs
```
It takes about 9.78 µs - several orders of magnitude faster. When both indexes are sorted (monotonic), pandas can take a much cheaper merge path instead of doing the general alignment work, which is where the speedup comes from.
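You can confirm that the indexes are in the sorted state pandas' fast paths look for:

```python
# both should print True after sort_index()
print(df1.index.is_monotonic_increasing)
print(df2.index.is_monotonic_increasing)
```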
I believe you can apply the same sorting idea to pandas columns: sort the column labels lexicographically and reindex the dataframe accordingly, so that the column index is monotonic. I have not tested the code below, but something like this should give you faster column searches:
```python
import numpy as np

# reorder the columns so the column index is sorted (monotonic);
# the reindex call is an untested sketch of the idea described above
df = df.reindex(columns=np.sort(df.columns))
```
Column searches should be much faster now - it would be great if someone could verify this on a dataframe with a thousand columns.
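A quick way to run that check yourself; a sketch where the column count, row count, and labels are all arbitrary choices:

```python
import timeit

import numpy as np
import pandas as pd

# a wide frame with 1,000 shuffled, string-labelled columns
labels = [f'col_{i}' for i in np.random.permutation(1000)]
wide = pd.DataFrame(np.random.randn(10000, 1000), columns=labels)

# the same frame with its columns sorted lexicographically
sorted_wide = wide.reindex(columns=np.sort(wide.columns))

# compare column-lookup times on the unsorted vs. sorted column index
print(timeit.timeit(lambda: wide['col_500'], number=100000))
print(timeit.timeit(lambda: sorted_wide['col_500'], number=100000))
```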