A new column based on a conditional selection from the values of two other columns in a Pandas DataFrame

I have a DataFrame that contains stock values.

It looks like this:

 >>> Data
               Open   High    Low  Close   Volume  Adj Close
 Date
 2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04

When I try to create a conditional new column with the following if statement:

 Data['Test'] =Data['Close'] if Data['Close'] > Data['Open'] else Data['Open'] 

I get the following error:

 Traceback (most recent call last):
   File "<pyshell#116>", line 1, in <module>
     Data[1]['Test'] =Data[1]['Close'] if Data[1]['Close'] > Data[1]['Open'] else Data[1]['Open']
 ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Then I tried all(), as the message suggests:

 Data[1]['Test'] =Data[1]['Close'] if all(Data[1]['Close'] > Data[1]['Open']) else Data[1]['Open'] 

As a result, the entire ['Open'] column was selected. That is not what I wanted: for each row, I want to pick the higher of the ['Open'] and ['Close'] values.

Any help is appreciated.

Thanks.

+6
3 answers

Starting from a DataFrame like this:

 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04

The simplest thing I can think of will be:

 >>> df["Test"] = df[["Open", "Close"]].max(axis=1)
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23

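df.ix[:,["Open", "Close"]].max(axis=1) might be a little faster, but I don't think it is as nice to look at. (.ix has since been removed from pandas; df.loc[:, ["Open", "Close"]] selects the same columns.)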

Alternatively, you can use .apply on the rows:

 >>> df["Test"] = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23

Or fall back to numpy (imported as np):

 >>> df["Test"] = np.maximum(df["Open"], df["Close"])
 >>> df
          Date   Open   High    Low  Close   Volume  Adj Close   Test
 0  2013-07-08  76.91  77.81  76.85  77.04  5106200      77.04  77.04
 1  2013-07-00  77.04  79.81  71.81  72.87  1920834      77.04  77.04
 2  2013-07-10  72.87  99.81  64.23  93.23  2934843      77.04  93.23
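To check that the three approaches above really agree, here is a small self-contained sketch. It assumes a current pandas/numpy install and rebuilds only the Open and Close columns from the sample rows; the variable names via_max, via_apply and via_numpy are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the answer above; only Open and Close matter here.
df = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# 1. Row-wise max over the two columns.
via_max = df[["Open", "Close"]].max(axis=1)
# 2. Python's max applied to each row (usually the slowest option).
via_apply = df.apply(lambda row: max(row["Open"], row["Close"]), axis=1)
# 3. Element-wise maximum from numpy.
via_numpy = np.maximum(df["Open"], df["Close"])

# All three hold the same values.
assert via_max.equals(via_apply)
assert via_max.equals(via_numpy)

df["Test"] = via_max
print(df)
```

On a frame this small the choice is a matter of taste; on large frames the vectorized versions (1 and 3) tend to be much faster than .apply.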

The basic problem is that if/else does not play well with arrays, because if (something) always coerces something into a single bool. It is not equivalent to "for each element of the array, pick this if the condition holds" or anything like that.
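A minimal sketch of that difference, assuming two small Series with the sample values (s_open, s_close, error_message and elementwise are illustrative names, and np.where is just one of several element-wise alternatives):

```python
import numpy as np
import pandas as pd

s_open = pd.Series([76.91, 77.04, 72.87])
s_close = pd.Series([77.04, 72.87, 93.23])

# `if` needs one bool, but the comparison yields one bool per row,
# so pandas refuses to guess and raises ValueError.
error_message = None
try:
    result = s_close if s_close > s_open else s_open
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# np.where makes the same choice element by element instead.
elementwise = pd.Series(np.where(s_close > s_open, s_close, s_open))
print(elementwise.tolist())
```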

+4
 In [7]: df = DataFrame(randn(10,2),columns=list('AB'))
 In [8]: df
 Out[8]:
           A         B
 0 -0.954317 -0.485977
 1  0.364845 -0.193453
 2  0.020029 -1.839100
 3  0.778569  0.706864
 4  0.033878  0.437513
 5  0.362016  0.171303
 6  2.880953  0.856434
 7 -0.109541  0.624493
 8  1.015952  0.395829
 9 -0.337494  1.843267

This is a conditional expression: give me the value of A where A > B, otherwise give me B.

 # this gives the same values as the in-place assignment
 # df.loc[~(df['A']>df['B']),'A'] = df['B'], but returns a new Series
 In [9]: df['A'].where(df['A']>df['B'],df['B'])
 Out[9]:
 0   -0.485977
 1    0.364845
 2    0.020029
 3    0.778569
 4    0.437513
 5    0.362016
 6    2.880953
 7    0.624493
 8    1.015952
 9    1.843267
 dtype: float64

In this case, max is equivalent:

 In [10]: df.max(1)
 Out[10]:
 0   -0.485977
 1    0.364845
 2    0.020029
 3    0.778569
 4    0.437513
 5    0.362016
 6    2.880953
 7    0.624493
 8    1.015952
 9    1.843267
 dtype: float64
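The same equivalence can be checked in a standalone script. This is only a sketch: it uses numpy's default_rng with an arbitrary seed instead of the randn call from the session above, so the numbers will differ, and the names picked and row_max are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # the seed is arbitrary
df = pd.DataFrame(rng.standard_normal((10, 2)), columns=list("AB"))
original_a = df["A"].copy()

# Keep A where A > B, otherwise take B.
picked = df["A"].where(df["A"] > df["B"], df["B"])
# Row-wise maximum over both columns.
row_max = df.max(axis=1)

# With only these two columns, both give the same values.
assert picked.equals(row_max)
# where returns a new Series; the A column itself is untouched.
assert df["A"].equals(original_a)
```

where is the more general tool: it still works when the replacement is not simply "the larger value", for example when the condition involves a third column.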
+3

The problem is that you are asking Python to evaluate a condition ( Data['Close'] > Data['Open'] ) that contains more than one boolean value. You do not want to use any or all either, since that collapses the comparison into a single bool and sets the whole of Data['Test'] to either Data['Open'] or Data['Close'] .
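To make that concrete, here is a short sketch on the sample values (the lowercase data frame and the names comparison, all_rows, any_row and collapsed are illustrative, not from the question):

```python
import pandas as pd

data = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

comparison = data["Close"] > data["Open"]  # one bool per row: True, False, True

all_rows = bool(comparison.all())  # False: not every Close beats its Open
any_row = bool(comparison.any())   # True: at least one does

# With a single bool, the conditional expression picks one whole column.
collapsed = data["Close"] if all_rows else data["Open"]
print(collapsed.tolist())  # the Open column, unchanged
```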

There may be a cleaner method, but one approach is to use a mask (logical array):

 mask = Data['Close'] > Data['Open']
 Data['Test'] = pandas.concat([Data['Close'][mask].dropna(), Data['Open'][~mask].dropna()]).reindex_like(Data)
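A possibly cleaner variant of the mask idea, sketched under the assumption that assigning through .loc is acceptable here. Note that this is a different technique from the concat/reindex_like line above, and the lowercase data frame is a stand-in built from the sample values.

```python
import pandas as pd

data = pd.DataFrame({
    "Open": [76.91, 77.04, 72.87],
    "Close": [77.04, 72.87, 93.23],
})

# Boolean mask: True on rows where Close is higher than Open.
mask = data["Close"] > data["Open"]

# Start from Open everywhere, then overwrite only the masked rows with Close.
data["Test"] = data["Open"]
data.loc[mask, "Test"] = data.loc[mask, "Close"]

print(data)
```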
0

Source: https://habr.com/ru/post/949936/

