A new column based on a conditional selection from the values of two other columns in the Pandas DataFrame

Question


I have a DataFrame that contains stock values.

It looks like this:

 >>> Data
              Open   High    Low  Close   Volume  Adj Close
 Date
 2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04

When I try to create a conditional new column with the following if statement:

 Data['Test'] =Data['Close'] if Data['Close'] > Data['Open'] else Data['Open']

I get the following error:

 Traceback (most recent call last):
   File "<pyshell#116>", line 1, in <module>
     Data[1]['Test'] =Data[1]['Close'] if Data[1]['Close'] > Data[1]['Open'] else Data[1]['Open']
 ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I tried all(), as the error message suggests:

 Data[1]['Test'] =Data[1]['Close'] if all(Data[1]['Close'] > Data[1]['Open']) else Data[1]['Open']

As a result, the entire ['Open'] column was selected. That is not what I wanted: the condition should pick the higher of the ['Open'] and ['Close'] values in each row.

Any help is appreciated.

Thanks.

+6

python pandas python-3.3

Uninvited guest Jul 21 '13 at 16:11


3 answers

 In [7]: df = DataFrame(randn(10,2),columns=list('AB'))

 In [8]: df
 Out[8]:
           A         B
 0 -0.954317 -0.485977
 1  0.364845 -0.193453
 2  0.020029 -1.839100
 3  0.778569  0.706864
 4  0.033878  0.437513
 5  0.362016  0.171303
 6  2.880953  0.856434
 7 -0.109541  0.624493
 8  1.015952  0.395829
 9 -0.337494  1.843267

This is a where conditional, saying: give me the value of A if A > B, otherwise give me B.

 # where() keeps A where the condition is True and takes B elsewhere.
 # The in-place form of the same thing would be
 # df.loc[~(df['A']>df['B']),'A'] = df['B']
 In [9]: df['A'].where(df['A']>df['B'],df['B'])
 Out[9]:
 0   -0.485977
 1    0.364845
 2    0.020029
 3    0.778569
 4    0.437513
 5    0.362016
 6    2.880953
 7    0.624493
 8    1.015952
 9    1.843267
 dtype: float64

In this case, taking the row-wise max is equivalent:

 In [10]: df.max(1)
 Out[10]:
 0   -0.485977
 1    0.364845
 2    0.020029
 3    0.778569
 4    0.437513
 5    0.362016
 6    2.880953
 7    0.624493
 8    1.015952
 9    1.843267
 dtype: float64
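For anyone who wants to run this outside an IPython session with the pylab-style imports, here is a minimal self-contained sketch of the same idea. The seeded random data and the names picked and row_max are assumptions made for illustration, not part of the session above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the seed only makes the run repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=list("AB"))

# Keep A where A > B, otherwise take B.
picked = df["A"].where(df["A"] > df["B"], df["B"])

# For two numeric columns without NaN this matches the row-wise maximum.
row_max = df.max(axis=1)
same = np.array_equal(picked.to_numpy(), row_max.to_numpy())
print(same)
```

where() is the more general tool, since the condition can be anything; max(axis=1) only covers the "take the larger one" case.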

+3

Jeff Jul 21 '13 at 16:44


The problem is that you are asking Python to evaluate a condition (Data['Close'] > Data['Open']) that contains more than one boolean value. You do not want to use any or all, since either one collapses the comparison to a single True or False and so sets Data['Test'] to the whole of Data['Open'] or the whole of Data['Close'].

There may be a cleaner method, but one approach is to use a mask (a boolean array with one entry per row):

 mask = Data['Close'] > Data['Open']
 Data['Test'] = pandas.concat([Data['Close'][mask].dropna(), Data['Open'][~mask].dropna()]).reindex_like(Data)
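The same mask can also be used with .loc, which avoids the concat and reindex steps. This is only a sketch under assumptions: the three-row Data frame below is made up to stand in for the asker's data.

```python
import pandas as pd

# Hypothetical prices standing in for the asker's Data frame.
Data = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# One boolean per row: True where Close is higher than Open.
mask = Data["Close"] > Data["Open"]

# Start from Open everywhere, then overwrite the rows where Close is higher.
Data["Test"] = Data["Open"]
Data.loc[mask, "Test"] = Data.loc[mask, "Close"]
print(Data)
```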

0

Sajjan singh Jul 21 '13 at 16:26


DSM · Accepted Answer · 2013-07-21T16:44:26+0000

Starting from a DataFrame like:

 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04

The simplest thing I can think of will be:

 >>> df["Test"] = df[["Open", "Close"]].max(axis=1)
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23

df.ix[:,["Open", "Close"]].max(axis=1) might be a little faster, but I don't think it's as nice to look at. (Note that .ix has since been removed from pandas; .loc is the current spelling.)

Alternatively, you could use .apply on the rows:

 >>> df["Test"] = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23

Or fall back to numpy:

 >>> df["Test"] = np.maximum(df["Open"], df["Close"])
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23

The main problem is that if/else does not play well with arrays, because if (something) always coerces something into a single bool. It is not equivalent to "for each element in the array, pick one value if the condition holds and the other if it does not", or anything like that.
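To make that concrete, here is a small sketch that triggers the error on purpose and then does the element-wise selection with np.where instead. The three-row frame and the names cond and raised are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the Open/Close values from the example above.
df = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})
cond = df["Close"] > df["Open"]

# `if cond:` calls bool(cond), which pandas refuses for more than one element.
try:
    bool(cond)
    raised = False
except ValueError:
    raised = True

# np.where makes the choice row by row: Close where cond is True, else Open.
df["Test"] = np.where(cond, df["Close"], df["Open"])
print(raised)
print(df)
```

np.where is handy when the two branches are not simply "the larger value", since any boolean condition can go in the first argument.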
