Plotting the CDF of a pandas Series in Python

Question

Is there a way to do this? I can't seem to find an easy way to plot a CDF directly from a pandas Series.

+27

python pandas series cdf

wolfsatthedoor Aug 29 '14 at 23:05

7 answers

A CDF (cumulative distribution function) plot is basically a graph with the sorted values on the x-axis and the cumulative distribution on the y-axis. So, I would create a new series with the sorted values as the index and the cumulative distribution as the values.

First create an example series:

 import pandas as pd
 import numpy as np
 ser = pd.Series(np.random.normal(size=100))

Sort the series:

 ser = ser.sort_values()

Now, before continuing, add the last (and largest) value again. This step is especially important for small sample sizes to get an unbiased CDF:

 ser[len(ser)] = ser.iloc[-1]

Create a new series with sorted values as an index and cumulative distribution as values:

 cum_dist = np.linspace(0., 1., len(ser))
 ser_cdf = pd.Series(cum_dist, index=ser)

Finally, plot the function as steps:

 ser_cdf.plot(drawstyle='steps')
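To see concretely what these steps produce, here is a minimal sketch on a three-value sample. The sample values are an assumption chosen so the result can be checked by hand; the steps themselves are the ones above.

```python
import numpy as np
import pandas as pd

# Hypothetical tiny sample, chosen so the result can be checked by hand.
ser = pd.Series([3.0, 1.0, 2.0])

ser = ser.sort_values()        # values are now 1.0, 2.0, 3.0
ser[len(ser)] = ser.iloc[-1]   # repeat the largest value: 1.0, 2.0, 3.0, 3.0

cum_dist = np.linspace(0., 1., len(ser))   # 0, 1/3, 2/3, 1
ser_cdf = pd.Series(cum_dist, index=ser)

print(ser_cdf)
```

Without the repeated value, the three points would be mapped to 0, 0.5 and 1, so the step heights would no longer be multiples of 1/3.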

+12

kadee Aug 12 '15 at 16:57

This is the easiest way.

 import pandas as pd
 df = pd.Series(range(100))
 df.hist(cumulative=True)

[Image: cumulative bar graph]
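Note that `cumulative=True` on its own accumulates raw counts, so the y-axis ends at the sample size rather than at 1. A rough sketch of the numbers behind such a plot, using `np.histogram` as a stand-in for the binning (an assumption made for illustration) with 10 bins, the pandas default for `hist`:

```python
import numpy as np
import pandas as pd

data = pd.Series(range(100))

counts, edges = np.histogram(data, bins=10)
cum_counts = counts.cumsum()            # what cumulative=True accumulates
cum_fraction = cum_counts / len(data)   # rescaled so the last bin is 1

print(cum_counts[-1], cum_fraction[-1])
```

Passing `density=True` together with `cumulative=True` to `hist` asks matplotlib to do this normalization in the plot itself.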

+8

wroscoe Sep 21 '16 at 23:52

I came here looking for such a plot with bars and a CDF line:

This can be achieved like this:

 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 series = pd.Series(np.random.normal(size=10000))
 fig, ax = plt.subplots()
 ax2 = ax.twinx()
 # density replaces the older normed argument, which newer matplotlib releases no longer accept
 n, bins, patches = ax.hist(series, bins=100, density=False)
 n, bins, patches = ax2.hist(
     series, cumulative=1, histtype='step', bins=100, color='tab:orange')
 plt.savefig('test.png')

If you want to remove the vertical line, how to do that is explained here. Or you can just do:

 ax.set_xlim((ax.get_xlim()[0], series.max()))

I also saw an elegant solution here on how to do this with seaborn.

+5

tommy.carstensen Aug 30 '18 at 0:13

This seemed to me to be a simple way:

 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt

 heights = pd.Series(np.random.normal(size=100))

 # empirical CDF
 def F(x, data):
     return float(len(data[data <= x])) / len(data)

 vF = np.vectorize(F, excluded=['data'])

 plt.plot(np.sort(heights), vF(x=np.sort(heights), data=heights))
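For larger samples, calling a Python function once per point can be slow. One possible alternative, not part of the answer above, is to let `np.searchsorted` count how many sorted values are less than or equal to each point. The helper name `ecdf_at` is hypothetical.

```python
import numpy as np
import pandas as pd

def ecdf_at(x, data):
    """Fraction of observations in data that are <= each value in x."""
    sorted_data = np.sort(np.asarray(data))
    # side='right' counts the entries <= x, matching data[data <= x]
    return np.searchsorted(sorted_data, x, side='right') / len(sorted_data)

heights = pd.Series(np.random.normal(size=100))
xs = np.sort(heights)
ys = ecdf_at(xs, heights)
```

The pair `xs`, `ys` can then be passed to `plt.plot` in the same way as the vectorized function.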

+2

annon Jan 18 '16 at 4:01

I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:

 import pandas as pd
 import numpy as np  # used only to create example data
 series = pd.Series(np.random.normal(size=10000))
 cdf = series.value_counts().sort_index().cumsum()
 cdf.plot()
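As written, `cumsum()` over the value counts yields cumulative counts, so the curve ends at the number of observations. If a curve ending at 1 is wanted, one option is `value_counts(normalize=True)`. A sketch on a small made-up series:

```python
import pandas as pd

series = pd.Series([2, 1, 2, 3, 2])   # made-up data with a repeated value

# normalize=True returns relative frequencies instead of counts
cdf = series.value_counts(normalize=True).sort_index().cumsum()
print(cdf)
```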

0

jk. Sep 27 '18 at 9:14

If you are also interested in the values, not just the plot:

 import numpy as np  # needed for the continuous examples further down
 import pandas as pd
 # If you are in jupyter
 %matplotlib inline

This will always work (discrete and continuous distributions)

 # Define your series
 s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
 df = pd.DataFrame(s)

 # Get the frequency, PDF and CDF for each value in the series
 # Frequency
 stats_df = df \
     .groupby('value') \
     ['value'] \
     .agg('count') \
     .pipe(pd.DataFrame) \
     .rename(columns = {'value': 'frequency'})
 # PDF
 stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
 # CDF
 stats_df['cdf'] = stats_df['pdf'].cumsum()
 stats_df = stats_df.reset_index()
 stats_df

 # Plot the discrete Probability Mass Function and CDF.
 # Technically, the 'pdf' label in the legend and the table should be 'pmf'
 # (Probability Mass Function) since the distribution is discrete.
 # If you don't have too many values / usually the discrete case
 stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)

An alternative example, for a sample drawn from a continuous distribution or when you have many distinct values:

 # Define your series
 s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')

 # ... all the same calculation stuff to get the frequency, PDF, CDF

 # Plot
 stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
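Since the same frequency/PDF/CDF calculation is reused for both examples, it may be convenient to wrap it in a function. A sketch, where the name `cdf_table` is hypothetical and the body uses `value_counts` rather than the `groupby` chain above:

```python
import pandas as pd

def cdf_table(s):
    """Return a DataFrame with value, frequency, pdf and cdf columns."""
    freq = s.value_counts().sort_index()
    out = pd.DataFrame({
        'value': freq.index,
        'frequency': freq.to_numpy(),
    })
    out['pdf'] = out['frequency'] / out['frequency'].sum()
    out['cdf'] = out['pdf'].cumsum()
    return out

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')
stats_df = cdf_table(s)
print(stats_df)
```

The resulting `stats_df` has the same columns as before, so either plotting call shown above should apply to it unchanged.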

For continuous distributions only

Please note: if it is reasonable to assume that each value occurs only once in the sample (typically the case for continuous distributions), then groupby() + agg('count') is not necessary (since the count is always 1).

In this case, the percentile rank can be used to get to the CDF directly.

Use your judgment when taking this kind of shortcut! :)

 # Define your series
 s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
 df = pd.DataFrame(s)

 # Get to the CDF directly (rank the 'value' column so a single Series is assigned)
 df['cdf'] = df['value'].rank(method = 'average', pct = True)

 # Sort and plot
 df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
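To illustrate why the shortcut needs care, consider a made-up sample with a tie. With `method='average'` the tied values receive a rank between their positions, which differs from the fraction of observations less than or equal to the value; `method='max'` reproduces that fraction. A small sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3], name='value')   # made-up data with a tie

avg_rank = s.rank(method='average', pct=True)
max_rank = s.rank(method='max', pct=True)

# Direct definition of the empirical CDF, for comparison
ecdf = s.apply(lambda v: (s <= v).mean())

print(pd.DataFrame({'value': s, 'average': avg_rank,
                    'max': max_rank, 'ecdf': ecdf}))
```

When every value in the sample is distinct, the two rank methods coincide, which is the situation the shortcut assumes.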

0

Raphvanns Jan 22 '19 at 22:09

Dan Frank · Accepted Answer · 2014-10-15 23:57

I believe the functionality you are looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.

Here is the relevant documentation.

 In [10]: import matplotlib.pyplot as plt

 In [11]: plt.hist?
 ...
 Plot a histogram.

 Compute and draw the histogram of *x*. The return value is a
 tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
 [*patches0*, *patches1*,...]) if the input contains multiple
 data.
 ...
 cumulative : boolean, optional, default : True
     If 'True', then a histogram is computed where each bin gives the
     counts in that bin plus all bins for smaller values. The last bin
     gives the total number of datapoints. If 'normed' is also 'True'
     then the histogram is normalized such that the last bin equals 1.
     If 'cumulative' evaluates to less than 0 (eg, -1), the direction
     of accumulation is reversed. In this case, if 'normed' is also
     'True', then the histogram is normalized such that the first bin
     equals 1.
 ...

For example:

 In [12]: import pandas as pd

 In [13]: import numpy as np

 In [14]: ser = pd.Series(np.random.normal(size=1000))

 In [15]: ser.hist(cumulative=True, density=1, bins=100)
 Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>

 In [16]: plt.show()
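The normalization described in the docstring can be checked on the arrays that `hist` returns. A minimal sketch, assuming a matplotlib version that accepts `density` (the `Agg` backend is selected only so the snippet runs without a display):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

ser = pd.Series(np.random.normal(size=1000))

fig, (ax_up, ax_down) = plt.subplots(1, 2)
# Normalized cumulative histogram: the last bin should reach 1
n_up, bins, _ = ax_up.hist(ser, bins=100, cumulative=True, density=True)
# A negative cumulative reverses the direction: the first bin should reach 1
n_down, _, _ = ax_down.hist(ser, bins=100, cumulative=-1, density=True)

print(n_up[-1], n_down[0])
```

The reversed version gives the share of observations at or above each bin, which can be handy when the upper tail is of interest.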
