New DataFrame column as a generic function of other rows (pandas)

What is the fastest (and most efficient) way to create a new column in a DataFrame that is a function of the other rows in pandas?

Consider the following example:

import pandas as pd

d = {
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
}
pandas_df = pd.DataFrame(d)

Which gives:

   id word
0   1  cat
1   2  hat
2   3  hag
3   4  hog
4   5  dog
5   6  elephant

Suppose I want to create a new column bar containing a value based on the output of a function foo that compares the word in the current row with the words in the other rows of the DataFrame.

def foo(word1, word2):
    # do some calculation
    return foobar  # in this example, the return type is numeric

threshold = some_threshold

for index, _id, word in pandas_df.itertuples():
    value = sum(
        pandas_df[pandas_df['word'] != word].apply(
            lambda x: foo(x['word'], word),
            axis=1
        ) < threshold
    )
    pandas_df.loc[index, 'bar'] = value

This produces the correct result, but it uses itertuples() and apply(), which is prohibitively slow for large DataFrames.

Is there a way to vectorize (is that the correct term?) this approach? Or is there another, better (faster) way to do this?

Notes / Updates:

  • foo in my case is a string-distance function such as the Levenshtein (edit) distance; the exact function is not essential, I am interested in the general approach.

For example, with foo = nltk.metrics.distance.edit_distance and a threshold of 2 (exclusive), the result would be:

   id word        bar
0   1  cat        1.0
1   2  hat        2.0
2   3  hag        2.0
3   4  hog        2.0
4   5  dog        1.0
5   6  elephant   0.0
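For reference, the table above can be reproduced without nltk by using a plain dynamic-programming edit distance (the `levenshtein` helper below is my own stand-in, not part of the original post, so the snippet is self-contained):

```python
import pandas as pd

def levenshtein(s1, s2):
    # Classic dynamic-programming edit distance (one row of the DP table at a time).
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (c1 != c2)))   # substitution
        prev = cur
    return prev[-1]

pandas_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
})
threshold = 2

# For each word, count the *other* words within the (exclusive) threshold.
pandas_df['bar'] = [
    sum(levenshtein(w, other) < threshold
        for other in pandas_df['word'] if other != w)
    for w in pandas_df['word']
]
print(pandas_df['bar'].tolist())  # [1, 2, 2, 2, 1, 0]
```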
  1. I previously asked this question for Spark DataFrames, and a cartesian join was suggested there. Since I can convert between pandas and Spark, I would like to know how to do this efficiently in pandas as well.

  2. Following the Spark suggestion, I tried the equivalent cartesian join in pandas. It gives the right answer, but it is slow and memory-hungry (the cartesian product has N*N rows), and it still relies on apply().

Here is that attempt:

import numpy as np
from nltk.metrics.distance import edit_distance as edit_dist

pandas_df2 = pd.DataFrame(d)

i, j = np.where(np.ones((len(pandas_df2), len(pandas_df2))))
cart = pandas_df2.iloc[i].reset_index(drop=True).join(
    pandas_df2.iloc[j].reset_index(drop=True), rsuffix='_r'
)

cart['dist'] = cart.apply(lambda x: edit_dist(x['word'], x['word_r']), axis=1)
pandas_df2 = (
    cart[cart['dist'] < 2].groupby(['id', 'word']).count()['dist'] - 1
).reset_index()
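As an aside, the cartesian product itself can be built more directly with merge(how='cross') (available in pandas 1.2+), avoiding the np.where(np.ones(...)) index trick; a sketch on the same sample frame:

```python
import pandas as pd

pandas_df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
})

# Cross join: every row paired with every row, giving N*N rows.
# suffixes=('', '_r') reproduces the word / word_r column names used above.
cart = pandas_df2.merge(pandas_df2, how='cross', suffixes=('', '_r'))
print(cart.shape)  # (36, 4)
```

This only changes how the product is built; the expensive apply() over N*N rows remains.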

Answer:

With N rows you have to perform N*N "comparisons" (every row against every other row), so the approach is inherently O(n^2) and will be expensive for large DataFrames no matter what. Still, there are several ways to speed it up:


1. Use multiprocessing:

Split the DataFrame into parts and process the parts in parallel. If your machine has, say, 16 cores, split the DataFrame into 16 parts and let each core handle one part.

Note that every part still has to be compared against the whole original df, so each worker needs access to it. For example:

import numpy as np
from multiprocessing import cpu_count, Pool

def work(part):
    """
    Args:
        part (DataFrame) : a part (collection of rows) of the whole DataFrame.

    Returns:
        DataFrame: the same part, with the desired property calculated and added as a new column
    """
    # Note that we are using the original df (pandas_df) as a global variable
    # But changes made in this function will not be global (a side effect of using multiprocessing).
    for index, _id, word in part.itertuples(): # iterate over the "part" tuples
        value = sum(
            pandas_df[pandas_df['word'] != word].apply( # Calculate the desired function using the whole original df
                lambda x: foo(x['word'], word),
                axis=1
            ) < threshold
        )
        part.loc[index, 'bar'] = value
    return part

# New code starts here ...

cores = cpu_count() #Number of CPU cores on your system

data_split = np.array_split(pandas_df, cores) # Split the DataFrame into parts
pool = Pool(cores) # Create a new process pool
new_parts = pool.map(work, data_split) # apply the function `work` to each part, this will give you a list of the new parts
pool.close() # close the pool
pool.join()
new_df = pd.concat(new_parts) # Concatenate the new parts

This keeps the OP's logic unchanged and simply distributes the rows across cores, so the speedup is bounded by the number of cores; I have not benchmarked it against the OP's data.


2. "Work smarter":

Look at the problem/function itself rather than brute-forcing all pairs. Often most of the comparisons can be avoided entirely.


3. Exploit properties of the distance function:

Since the OP clarified that foo is the Levenshtein edit distance with a small threshold, we can use known properties of that metric to skip most pairs. In particular:

The Levenshtein distance (LD) between two strings is always greater than or equal to the difference of their lengths, i.e. LD(s1,s2) >= abs(len(s1)-len(s2)).

So you only need to compare words whose lengths l1 and l2 satisfy abs(l1-l2) <= limit, and can skip every other pair outright (with the OP's exclusive threshold of 2, that means a length difference of at most 1).

Also, the distance is symmetric: LD(s1,s2) = LD(s2,s1), so each unordered pair needs to be computed only once, halving the work.

Grouping the words by length is a single O(n) pass, which is cheap compared to the comparisons themselves.
How much does this help? It depends on the data.
For illustration: if you had 10^9 words but each word only had about 10^3 length-compatible "candidates", you would perform roughly 10^9 * 10^3 / 2 comparisons instead of 10^9 * 10^9 (the numbers are made up). And you can still combine this with the multiprocessing from point 1.
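The length-based pruning plus symmetry is easy to quantify on the sample data: take each unordered pair once and keep only pairs whose length difference is within the threshold (a sketch using itertools.combinations; the pair counts hold for this toy dataset only):

```python
from itertools import combinations

words = ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
threshold = 2

# Symmetry: combinations() yields each unordered pair exactly once.
all_pairs = list(combinations(words, 2))

# Length lower bound: LD(s1, s2) >= abs(len(s1) - len(s2)),
# so pairs with a length gap >= threshold can never be under it.
candidates = [(a, b) for a, b in all_pairs
              if abs(len(a) - len(b)) < threshold]

print(len(all_pairs), len(candidates))  # 15 10
```

Here 'elephant' drops out of every candidate pair, so only the C(5,2) = 10 pairs of three-letter words ever reach the expensive distance function.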


Another answer (using groupby):

Because the threshold is 2 (exclusive), a match requires an edit distance of at most 1, so only words whose lengths differ by at most 1 need to be compared at all (the same length bound as point 3 in the answer above).

First, group the words by length:

df["length"] = df.word.str.len()
df.groupby("length")[["id", "word"]]

Now only words within the same group (or, in general, in groups whose lengths differ by 1) have to be compared. Here that means comparing the five three-letter words among themselves, while elephant sits alone in its group.

Comparing words of the same length:

For two words of the same length, an edit distance below the threshold of 2 means they are either identical or differ in exactly one position (a single substitution). So we can split each word into characters, compare position by position, and count matching characters.

Pandas makes this kind of elementwise comparison easy:

# assuming we had groupped the df.
df_len_3 = pd.DataFrame({"word": ['cat', 'hat', 'hag', 'hog', 'dog']})
# turn it into chars
splitted = df_len_3.word.apply(lambda x: pd.Series(list(x)))

    0   1   2
0   c   a   t
1   h   a   t
2   h   a   g
3   h   o   g
4   d   o   g

splitted.loc[0] == splitted # compare one word to all words

    0       1       2
0   True    True    True  -> comparing to itself is always all true.
1   False   True    True
2   False   True    False
3   False   False   False
4   False   False   False


splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

0    1
1    2
2    2
3    2
4    1
dtype: int64

Let's break this one-liner down:

splitted.apply(lambda x: (x == splitted).sum(axis=1).ge(len(x)-1), axis=1).sum(axis=1) - 1

The lambda x: (x == splitted) part compares each row x against the entire frame, exactly like splitted.loc[0] == splitted above, producing a True/False frame for every word.

The inner sum(axis=1) then counts, for each word, how many characters match in (x == splitted).

Next, ge checks whether that count is at least len(x)-1, i.e. at most one character differs (a single substitution).

Finally, the outer sum(axis=1) counts how many words matched, and we subtract 1 because every word always matches itself.

Keep in mind that this only covers words of equal length and substitutions; to also handle insertions and deletions you would need to extend the comparison to adjacent length groups.
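Putting the grouping and the character comparison together for the sample data (a sketch; as noted, it only counts substitutions, which is sufficient here because no two candidate words differ in length):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'word': ['cat', 'hat', 'hag', 'hog', 'dog', 'elephant']
})
df['length'] = df.word.str.len()
df['bar'] = 0.0

# Within each length group, count words that differ by at most one character.
for _, group in df.groupby('length'):
    splitted = group.word.apply(lambda x: pd.Series(list(x)))
    counts = splitted.apply(
        lambda x: (x == splitted).sum(axis=1).ge(len(x) - 1),
        axis=1
    ).sum(axis=1) - 1  # subtract the self-match
    df.loc[group.index, 'bar'] = counts.astype(float)

print(df['bar'].tolist())  # [1.0, 2.0, 2.0, 2.0, 1.0, 0.0]
```

This matches the OP's expected table, including bar = 0.0 for elephant, which never leaves its own group.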


Source: https://habr.com/ru/post/1691943/

