I find it hard to understand how pandas and/or numpy handle NaN values. I am extracting subsets of a pandas DataFrame to compute t-statistics. For example, I want to know whether the mean of x2 differs significantly between the group whose x1 value is A and the group whose x1 value is B (sorry that this is not a working example, but I do not know how to recreate the NaN values that appear in my DataFrame; the source data is read using read_csv, and the CSV marks missing values with NA):
import numpy as np
import pandas as pd
import scipy.stats as st
A = data[data['x1']=='A']['x2']
B = data[data['x1']=='B'].x2
A
2 3
3 1
5 2
6 3
10 3
12 2
15 2
16 0
21 0
24 1
25 1
28 NaN
31 0
32 3
...
677 0
681 NaN
682 3
683 1
686 2
Name: praxiserf, Length: 335, dtype: float64
That is, I have two pandas.core.series.Series objects on which I then want to perform a t-test. However, using
st.ttest_ind(A, B)
returns:
(array(nan), nan)
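A minimal sketch of how this result might be reproduced. The column names x1 and x2 come from the description above, but the CSV contents are invented purely for illustration, and the exact return type of ttest_ind depends on the SciPy version:

```python
import io

import numpy as np
import pandas as pd
import scipy.stats as st

# Hypothetical CSV text; "NA" is one of the strings read_csv treats as missing by default.
csv_text = """x1,x2
A,3
A,1
A,NA
A,2
B,0
B,NA
B,1
B,2
"""

data = pd.read_csv(io.StringIO(csv_text))

A = data[data['x1'] == 'A']['x2']
B = data[data['x1'] == 'B']['x2']

# With NaN present in the inputs, the default behaviour propagates NaN into the result.
t_stat, p_value = st.ttest_ind(A, B)
print(t_stat, p_value)
```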
So, I assume, ttest_ind cannot cope with the NaN values. However, computing the means on the Series works:
A.mean(), B.mean()
1.5802, 1.2
However, if I convert the Series to an array, the NaN values are still there:
A_array = np.asarray(A)
A_array
array([ 3., 1., 2., 3., 3., 2., 2., 0., 0., 1., 1.,
nan, 0., 3., ..., 1., nan, 0., 3. ])
That is, the NaN values are no longer skipped, and the mean of the array comes out as NaN:
A_array.mean()
nan
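The difference observed above can be sketched on a small made-up Series (the values are illustrative, not taken from the real data). As far as I can tell, Series.mean skips NaN by default through its skipna parameter, while ndarray.mean does not, and np.nanmean is the NaN-aware NumPy counterpart:

```python
import numpy as np
import pandas as pd

# Illustrative values only, with one missing entry.
s = pd.Series([3.0, 1.0, np.nan, 2.0])
arr = np.asarray(s)  # the NaN survives the conversion

series_mean = s.mean()                   # pandas skips NaN by default (skipna=True)
series_mean_strict = s.mean(skipna=False)  # NaN is propagated when skipping is turned off
array_mean = arr.mean()                  # NumPy propagates NaN
array_nanmean = np.nanmean(arr)          # NaN-aware mean over the remaining values

print(series_mean, series_mean_strict, array_mean, array_nanmean)
```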
So, how can I convert the Series to arrays and/or run the t-test correctly?
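One possible approach, sketched on made-up values rather than the real data, is to drop the missing entries from each group before calling ttest_ind. The names A_clean and B_clean are hypothetical. Whether silently discarding the NaN rows is appropriate depends on why the values are missing:

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Illustrative stand-ins for the two groups, each with one missing value.
A = pd.Series([3.0, 1.0, np.nan, 2.0, 3.0])
B = pd.Series([0.0, np.nan, 1.0, 2.0, 1.0])

# Drop the NaN entries, then convert to plain NumPy arrays.
A_clean = A.dropna().to_numpy()
B_clean = B.dropna().to_numpy()

# Independent two-sample t-test on the remaining observations.
t_stat, p_value = st.ttest_ind(A_clean, B_clean)
print(t_stat, p_value)
```

Recent SciPy versions also document a nan_policy argument on ttest_ind (with 'omit' as one of its options), which may make the explicit dropna step unnecessary.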