NaN treatment

I have a hard time understanding how pandas and/or numpy handle NaN values. I am extracting subsets of a pandas DataFrame to compute t-statistics. For example, I want to know whether the mean of x2 differs significantly between the group whose x1 value is A and the group whose x1 value is B. (Sorry that this is not a working example, but I do not know how to recreate the NaN values that appear in my DataFrame. The source data is read with read_csv, and the CSV marks missing values with NA.)

import numpy as np
import pandas as pd
import scipy.stats as st

# data is the DataFrame loaded with pd.read_csv (not shown)
A = data[data['x1'] == 'A']['x2']
B = data[data['x1'] == 'B']['x2']
A

2      3
3      1
5      2
6      3
10     3
12     2
15     2
16     0
21     0
24     1
25     1
28   NaN
31     0
32     3
...
677     0
681   NaN
682     3
683     1
686     2
Name: praxiserf, Length: 335, dtype: float64

That is, I have two objects of type pandas.core.series.Series on which I then want to perform a t-test. However, using

st.ttest_ind(A, B)

returns:

(array(nan), nan)

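I assume this is because ttest_ind takes arrays as input, and there seems to be a problem with the NaN values when my series are converted to arrays. However, if I compute the means of the original series, I get: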

A.mean(), B.mean()

1.5802, 1.2

However, when I convert a series to an array, I get:

A_array = np.asarray(A)
A_array

array([ 3., 1., 2., 3., 3., 2., 2., 0., 0., 1., 1.,
        nan, 0., 3., ..., 1., nan, 0., 3. ])

And if I now compute the mean of this array, which contains NaN, NaN is returned:

A_array.mean()

nan

How can I get from my series to an array in which the NaN values are dropped/ignored?
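For anyone who wants to reproduce the behaviour described in the question, here is a minimal sketch. The CSV text is invented for illustration (it is not the asker's data); it only assumes, as stated above, that the file marks missing values with NA, which read_csv parses as NaN by default.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text standing in for the real file; "NA" marks missing values.
csv_text = """x1,x2
A,3
A,NA
A,1
B,2
B,NA
B,0
"""

data = pd.read_csv(io.StringIO(csv_text))  # "NA" is read as NaN by default
A = data[data["x1"] == "A"]["x2"]
A_array = np.asarray(A)

print(A.mean())             # pandas skips NaN by default -> 2.0
print(A_array.mean())       # plain numpy mean propagates NaN -> nan
print(np.nanmean(A_array))  # NaN-aware numpy variant -> 2.0
```

The difference comes from the defaults: Series.mean() skips NaN (skipna=True), while ndarray.mean() does not.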

+4

2 answers

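pandas uses bottleneck's nanmean functions, which handle NaNs specially. numpy does not. For the t-test you want to throw the NaN values away first:

# filter each sample on its own, since A and B differ in length and index
st.ttest_ind(A[np.isfinite(A)], B[np.isfinite(B)])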
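Dropping the NaN values before the test can also be written with Series.dropna(). The sketch below uses made-up samples (not the data from the question) and treats the two groups as independent, so each one is cleaned separately.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Made-up samples for illustration; they are not the data from the question.
A = pd.Series([3.0, 1.0, np.nan, 2.0, 3.0, 0.0])
B = pd.Series([1.0, np.nan, 0.0, 2.0, 1.0])

# Drop NaN from each group independently; the groups may differ in length.
A_clean = A.dropna()
B_clean = B.dropna()

t_stat, p_value = st.ttest_ind(A_clean, B_clean)
print(t_stat, p_value)
```

If a plain array is needed afterwards, A.dropna().to_numpy() gives one without the NaN entries.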
+5

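ttest_ind now has a parameter, "nan_policy", that controls how NaNs are handled. With nan_policy "propagate", the default, the result is nan. With "raise", an error is raised if the input contains NaN. With "omit", NaN values are ignored.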

st.ttest_ind(A, B, nan_policy="omit")
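To see the policies side by side, here is a small sketch on invented arrays. It assumes a SciPy version that supports nan_policy, and it compares "omit" against removing the NaN values by hand.

```python
import numpy as np
import scipy.stats as st

# Made-up samples for illustration only.
a = np.array([3.0, 1.0, np.nan, 2.0, 3.0, 0.0])
b = np.array([1.0, np.nan, 0.0, 2.0, 1.0])

default_result = st.ttest_ind(a, b)                  # default policy: NaN propagates
omit_result = st.ttest_ind(a, b, nan_policy="omit")  # NaN entries are ignored
manual_result = st.ttest_ind(a[~np.isnan(a)], b[~np.isnan(b)])  # NaN removed by hand

print(default_result.statistic)  # nan
print(omit_result.statistic, manual_result.statistic)
```

The last two statistics should agree, since "omit" amounts to running the test on the non-NaN observations of each sample.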


+1

Source: https://habr.com/ru/post/1539767/

