What is the fastest (and most efficient) way to create a new column in a pandas DataFrame that is a function of the other rows?
Consider the following example:
import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)
Which gives:
   id      word
0   1       cat
1   2       hat
2   3       hag
3   4       hog
4   5       dog
5   6  elephant
Suppose I want to create a new column bar containing a value based on the output of a function foo that compares the word in the current row against the words in the other rows of the DataFrame.
def foo(word1, word2):
    return foobar

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    # For every other row, apply foo against the current word and count
    # how many of the results fall below the threshold.
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value
This produces the correct result, but it uses itertuples() and apply(), which becomes prohibitively slow for large DataFrames.
Is there a way to vectorize (is that the correct term?) this approach? Or is there another, better (faster) way to do this?
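For what it's worth, here is the sort of thing I mean by "vectorized", sketched with a toy foo (absolute difference in word length, purely for illustration) that broadcasts with numpy; the actual foo I have in mind (see the notes below) does not reduce to an expression like this:

import numpy as np

# Toy, trivially vectorizable stand-in for foo: absolute difference in
# word length. The threshold of 2 is just for illustration.
threshold = 2
lengths = pandas_df['word'].str.len().to_numpy()
pairwise = np.abs(lengths[:, None] - lengths[None, :])  # foo(word_i, word_j) for all pairs
np.fill_diagonal(pairwise, threshold)                    # exclude the self-comparison
pandas_df['bar'] = (pairwise < threshold).sum(axis=1)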
Notes / Updates:
- I originally had an edit/Levenshtein distance in mind for foo. For example, if foo is nltk.metrics.distance.edit_distance and threshold is set to 2, the expected output is:
   id      word  bar
0   1       cat  1.0
1   2       hat  2.0
2   3       hag  2.0
3   4       hog  2.0
4   5       dog  1.0
5   6  elephant  0.0
- I am also interested in doing this with spark DataFrames; a solution in either pandas or spark would work for me.
- In spark, this kind of pairwise comparison can be expressed as a cartesian product, and I was able to translate that idea to pandas (below). It works, but it still relies on apply():
import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

# Build the cartesian product of the DataFrame with itself.
i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

# Edit distance for every pair of words (still a row-wise apply).
cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)

# Count pairs under the threshold; subtract 1 to drop the self-comparison.
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
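As a rough sanity check (assuming the snippet above has just run), renaming the count column recovers the bar values from the expected output above, although as integer counts rather than the floats the original loop produces:

# The '- 1' above removed each word's self-comparison from the count,
# so the remaining column already holds the desired values.
pandas_df2 = pandas_df2.rename(columns={'dist': 'bar'})
print(pandas_df2)  # bar: 1, 2, 2, 2, 1, 0 for cat ... elephant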