Pandas dataframe - remove values from a group with less than X rows

Question


I need to calculate the standardized mean values of a time series (monthly frequency), but I also need to exclude "incomplete" years (those with fewer than 12 months) from the calculation.

NumPy / SciPy "working" version:

    import numpy as np
    import scipy.stats as sts

    url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
    npdata = np.genfromtxt(url, skip_header=1)
    unique_enso_year = [int(value) for value in set(npdata[:, 0])]
    nin34 = np.zeros(len(unique_enso_year))
    for ind, year in enumerate(unique_enso_year):
        indexes = np.flatnonzero(npdata[:, 0]==year)
        if len(indexes) == 12:
            nin34[ind] = np.mean(npdata[indexes, 9])
        else:
            nin34[ind] = np.nan
    nin34x = (nin34 - sts.nanmean(nin34)) / sts.nanstd(nin34)

    array([[  1.02250000e+00,   5.15000000e-01,  -6.73333333e-01,  -7.02500000e-01,
              1.16666667e-01,   1.32916667e+00,  -1.10333333e+00,  -8.11666667e-01,
              1.51666667e-01,   6.42500000e-01,   6.49166667e-01,   3.71666667e-01,
              4.05000000e-01,  -1.98333333e-01,  -4.79166667e-01,   1.24666667e+00,
             -1.44166667e-01,  -1.18166667e+00,  -8.89166667e-01,  -2.51666667e-01,
              7.36666667e-01,   3.02500000e-01,   3.83333333e-01,   1.19166667e-01,
              1.70833333e-01,  -5.25000000e-01,  -7.35000000e-01,   3.75000000e-01,
             -4.50833333e-01,  -8.30000000e-01,  -1.41666667e-02,              nan]])

Pandas attempt:

    import pandas as pd
    from datetime import datetime

    def parse(yr, mon):
        date = datetime(year=int(yr), day=2, month=int(mon))
        return date

    url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
    data = pd.read_table(url, sep=' ', header=0, skiprows=0,
                         parse_dates = [['YR', 'MON']], skipinitialspace=True,
                         index_col=0, date_parser=parse)
    grouped = data.groupby(lambda x: x.year)
    zscore = lambda x: (x - x.mean()) / x.std()
    transformed = grouped.transform(zscore)
    print(transformed['ANOM.3'])

    YR_MON
    1982-01-02   -0.986922
    1982-02-02   -1.179216
    1982-03-02   -1.179216
    1982-04-02   -0.885119
    1982-05-02   -0.376105
    1982-06-02    0.087664
    1982-07-02   -0.161188
    1982-08-02    0.098975
    1982-09-02    0.415695
    1982-10-02    1.049134
    1982-11-02    1.286674
    1982-12-02    1.829622
    1983-01-02    1.715072
    1983-02-02    1.428598
    1983-03-02    0.976272
    ...
    2012-03-02   -0.999284
    2012-04-02   -0.663736
    2012-05-02   -0.063283
    2012-06-02    0.572491
    2012-07-02    0.961020
    2012-08-02    1.314227
    2012-09-02    0.925699
    2012-10-02    0.537170
    2012-11-02    0.660793
    2012-12-02   -0.169245
    2013-01-02   -1.001483
    2013-02-02   -0.924445
    2013-03-02    0.462223
    2013-04-02    1.386668
    2013-05-02    0.077037
    Name: ANOM.3, Length: 377, dtype: float64

This is not what I want, because the calculation also includes 2013 (which has only 5 months).

To extract what I want, I need to do something like:

 (grouped.mean()['ANOM.3'][:-1] - sts.nanmean(grouped.mean()['ANOM.3'][:-1])) / sts.nanstd(grouped.mean()['ANOM.3'][:-1])

But this assumes that I already know the last year is incomplete, and I also lose the np.nan that should appear in place of the 2013 value.

So now I am trying to do the selection in pandas itself, for example:

 grouped2 = data.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None).reset_index(drop=True)

This gives me the correct values, but it produces a new DataFrame without the timestamp index. I'm sure there is a simple and elegant way to do this. Thanks for any help!


python numpy scipy pandas

user1013346 Jun 22 '13 at 23:26


2 answers

Here's a solution. It is a little hacky in places, because your dates fall on the 2nd of every month.

It starts the same:

    In [205]: import pandas as pd

    In [206]: from datetime import datetime

    In [207]: from datetime import timedelta

    In [208]: def parse(yr, mon):
       .....:     date = datetime(year=int(yr), day=2, month=int(mon))
       .....:     return date
       .....:

    In [209]: url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'

    In [210]: data = pd.read_table(url, sep=' ', header=0, skiprows=0, parse_dates = [['YR', 'MON']], skipinitialspace=True, index_col=0, date_parser=parse)

    In [211]: grouped = data.groupby(lambda x: x.year)

Get full years:

    In [212]: full_year = grouped['NINO1+2'].count() == 12

    In [213]: full_year
    Out[213]:
    1982     True
    1983     True
    1984     True
    1985     True
    1986     True
    1987     True
    1988     True
    1989     True
    1990     True
    1991     True
    1992     True
    1993     True
    1994     True
    1995     True
    1996     True
    1997     True
    1998     True
    1999     True
    2000     True
    2001     True
    2002     True
    2003     True
    2004     True
    2005     True
    2006     True
    2007     True
    2008     True
    2009     True
    2010     True
    2011     True
    2012     True
    2013    False
    dtype: bool

Now we need to get the index into the right data type and aligned with the data. This could probably be simplified a bit:

    In [214]: strt = data.index[0] - timedelta(1)

    In [215]: idx = pd.DatetimeIndex(start=strt, periods=len(full_year - 1), freq='BA-JAN')

    In [216]: idx = idx + timedelta(1)  # Get to 2nd of each month

    In [232]: idx
    Out[232]:
    <class 'pandas.tseries.index.DatetimeIndex'>
    [1982-01-02 00:00:00, ..., 2013-01-02 00:00:00]
    Length: 32, Freq: None, Timezone: None

    In [233]: full_year.index = idx

This is a key step:

 In [234]: full_year = full_year.reindex_like(data, method='ffill')

And hopefully this is correct:

    In [235]: data.ix[full_year].tail()
    Out[235]:
                NINO1+2  ANOM  NINO3  ANOM.1  NINO4  ANOM.2  NINO3.4  ANOM.3  \
    YR_MON
    2012-08-02    20.99  0.35  25.72    0.73  29.10    0.42    27.55    0.73
    2012-09-02    20.83  0.49  25.28    0.43  29.12    0.43    27.24    0.51
    2012-10-02    20.68 -0.11  24.93    0.01  29.16    0.50    26.98    0.29
    2012-11-02    21.21 -0.38  25.11    0.14  29.17    0.54    27.01    0.36
    2012-12-02    22.13 -0.68  24.91   -0.23  28.71    0.23    26.46   -0.11

                Unnamed: 10
    YR_MON
    2012-08-02          NaN
    2012-09-02          NaN
    2012-10-02          NaN
    2012-11-02          NaN
    2012-12-02          NaN

Just work with data.ix[full_year] and you should be good to go.
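The same "full year" mask can probably be built without realigning the index by hand. The following is a minimal sketch, not part of the original answer: it assumes a synthetic stand-in for the NOAA table (29 monthly rows dated on the 2nd, so 2013 has only 5 months) and uses `transform("count")` so that each row carries the row count of its own year.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the NOAA data (an assumption for illustration):
# 29 monthly rows dated on the 2nd, i.e. 2011 and 2012 complete, 2013 with 5 months.
index = pd.date_range("2011-01-01", periods=29, freq="MS") + pd.Timedelta(days=1)
data = pd.DataFrame({"ANOM.3": np.arange(29, dtype=float)}, index=index)

# For every row, the number of rows in that row's year, aligned to data.index.
rows_per_year = data.groupby(data.index.year)["ANOM.3"].transform("count")

# Boolean mask on the original DatetimeIndex: True only for complete years.
full_year = rows_per_year == 12

# Selecting with the mask keeps the timestamp index intact.
complete = data[full_year]
print(complete.tail())
```

Because the mask shares the DataFrame's index, no `reindex_like` or forward fill step is needed in this sketch.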


Tomugspurger Jun 23 '13 at 16:18


user1013346 · Accepted Answer · 2013-06-23T10:48:19+0000

I found this way:

    import pandas as pd
    import scipy.stats as sts

    # parse() is the date parser defined in the question
    url='http://www.cpc.ncep.noaa.gov/data/indices/sstoi.indices'
    ts_raw = pd.read_table(url, sep=' ', header=0, skiprows=0,
                           parse_dates = [['YR', 'MON']], skipinitialspace=True,
                           index_col=0, date_parser=parse)
    # keep only the years that have all 12 months
    ts_year_group = ts_raw.groupby(lambda x: x.year).apply(lambda sdf: sdf if len(sdf) > 11 else None)
    # rebuild a monthly timestamp index for the remaining rows
    ts_range = pd.date_range(ts_year_group.index[0][1], ts_year_group.index[-1][1]+pd.DateOffset(months=1), freq="M")
    ts = pd.DataFrame(ts_year_group.values, index=ts_range, columns=ts_year_group.keys())
    ts_fullyears_group = ts.groupby(lambda x: x.year)
    annual_mean = ts_fullyears_group.mean()['ANOM.3']
    nin_anomalies = (annual_mean - sts.nanmean(annual_mean)) / sts.nanstd(annual_mean)
    nin_anomalies

    1982    1.527215
    1983    0.779877
    1984   -0.970047
    1985   -1.012997
    1986    0.193297
    1987    1.978809
    1988   -1.603259
    1989   -1.173755
    1990    0.244837
    1991    0.967632
    1992    0.977449
    1993    0.568807
    1994    0.617893
    1995   -0.270568
    1996   -0.684120
    1997    1.857320
    1998   -0.190803
    1999   -1.718612
    2000   -1.287880
    2001   -0.349106
    2002    1.106301
    2003    0.466953
    2004    0.585987
    2005    0.196978
    2006    0.273062
    2007   -0.751613
    2008   -1.060856
    2009    0.573715
    2010   -0.642396
    2011   -1.200752
    2012    0.000633
    Name: ANOM.3, dtype: float64

I am sure there is a better way to do the same :/
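One possibly simpler route, sketched here under assumptions and not taken from the original answer, is `groupby(...).filter(...)`, which drops whole groups while preserving the original timestamp index. The data below is a synthetic stand-in for the NOAA table (random values, 2011 and 2012 complete, 2013 with 5 months), and the final `reindex` is one way to get a NaN for the incomplete year, as the question asks.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the NOAA data (an assumption for illustration).
index = pd.date_range("2011-01-01", periods=29, freq="MS") + pd.Timedelta(days=1)
rng = np.random.default_rng(0)
data = pd.DataFrame({"ANOM.3": rng.normal(size=29)}, index=index)

# Drop every year with fewer than 12 rows; the DatetimeIndex is preserved.
full_years = data.groupby(data.index.year).filter(lambda sdf: len(sdf) == 12)

# Annual means of the complete years, then their z-scores.
# Series.std() divides by n - 1 by default.
annual_mean = full_years.groupby(full_years.index.year)["ANOM.3"].mean()
zscore = (annual_mean - annual_mean.mean()) / annual_mean.std()

# Reindex over all years in the raw data so incomplete years appear as NaN.
zscore = zscore.reindex(data.index.year.unique())
print(zscore)
```

The threshold `len(sdf) == 12` plays the role of "X rows" from the title and can be swapped for any minimum group size.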
