Python: Matplotlib - Probability Graph for Multiple Datasets

Question


I have several datasets (distributions) as follows:

    set1 = [1,2,3,4,5]
    set2 = [3,4,5,6,7]
    set3 = [1,3,4,5,8]

How do I make a scatter plot of the above datasets where the y-axis is the probability (i.e., the percentile within each set's distribution: 0%-100%) and the x-axis shows the names of the datasets? In JMP this is called the "Quantile Plot".

Something like the attached image: [image]

Please enlighten. Thanks.

[EDIT]

My data is in a CSV file that looks like this:

[image]

Using the JMP analysis tool, I can plot the probability distribution (QQ-plot / Normal Quantile Plot), as shown below:

[image]

I believe that Joe Kington has almost solved my problem, but I'm wondering how to turn the raw CSV data into probability arrays or percentiles.

I'm doing this to automate some statistical analysis in Python, so that I don't have to depend on JMP for plotting.


python numpy matplotlib probability percentile

siva Jun 13 '11 at 3:21


1 answer

Joe Kington · Accepted Answer · 2011-06-14T05:06:38+0000

I don’t quite understand what you want, so I'm going to guess here ...

Do you want the Probability / Percentage to be a cumulative histogram?

So, for a single plot, you'd have something like this? (Plotted with markers, as you showed above, instead of the more traditional step plot.)

    import scipy.stats
    import numpy as np
    import matplotlib.pyplot as plt

    # 100 values from a normal distribution with a std of 3 and a mean of 0.5
    data = 3.0 * np.random.randn(100) + 0.5

    counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)
    x = np.arange(counts.size) * dx + start

    plt.plot(x, counts, 'ro')
    plt.xlabel('Value')
    plt.ylabel('Cumulative Frequency')

    plt.show()
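If the y-axis should read as a probability (0% to 100%) rather than a raw count, the cumulative counts can be divided by the sample size. The following is only a minimal sketch of that idea: it uses `np.histogram` and `np.cumsum` in place of `scipy.stats.cumfreq`, and the helper name `cumulative_percent` is made up for illustration.

```python
import numpy as np


def cumulative_percent(data, numbins=20):
    """Return (upper bin edges, cumulative % of samples at or below each edge).

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    counts, edges = np.histogram(data, bins=numbins)
    cumulative = np.cumsum(counts)
    # Normalize by the total number of samples so the last value is 100
    percent = 100.0 * cumulative / cumulative[-1]
    return edges[1:], percent


# Same kind of test data as above, with a seeded generator for repeatability
rng = np.random.default_rng(0)
data = 3.0 * rng.standard_normal(100) + 0.5

x, percent = cumulative_percent(data)
```

Plotting `percent` against `x` (for example with `plt.plot(x, percent, 'ro')`) should then give the same shape as before, with the y-axis running from 0 to 100.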

If this is roughly what you want for a single plot, there are several ways to put multiple plots on one figure. The easiest is to use subplots.

Here, we'll generate some datasets and plot them on different subplots with different symbols...

    import itertools
    import scipy.stats
    import numpy as np
    import matplotlib.pyplot as plt

    # Generate some data... (Using a list to hold it so that the datasets don't
    # have to be the same length...)
    numdatasets = 4
    stds = np.random.randint(1, 10, size=numdatasets)
    means = np.random.randint(-5, 5, size=numdatasets)
    values = [std * np.random.randn(100) + mean for std, mean in zip(stds, means)]

    # Set up several subplots
    fig, axes = plt.subplots(nrows=1, ncols=numdatasets, figsize=(12,6))

    # Set up some colors and markers to cycle through...
    colors = itertools.cycle(['b', 'g', 'r', 'c', 'm', 'y', 'k'])
    markers = itertools.cycle(['o', '^', 's', r'$\Phi$', 'h'])

    # Now let's actually plot our data...
    for ax, data, color, marker in zip(axes, values, colors, markers):
        counts, start, dx, _ = scipy.stats.cumfreq(data, numbins=20)
        x = np.arange(counts.size) * dx + start
        ax.plot(x, counts, color=color, marker=marker,
                markersize=10, linestyle='none')

    # Next we'll set the various labels...
    axes[0].set_ylabel('Cumulative Frequency')
    labels = ['This', 'That', 'The Other', 'And Another']
    for ax, label in zip(axes, labels):
        ax.set_xlabel(label)

    plt.show()

If we want this to look like one continuous plot, we can just squeeze the subplots together and turn off some of the boundaries. Just add the following before calling plt.show():

    # Because we want this to look like a continuous plot, we need to hide the
    # boundaries (aka "spines") and yticks on most of the subplots
    for ax in axes[1:]:
        ax.spines['left'].set_color('none')
        ax.spines['right'].set_color('none')
        ax.yaxis.set_ticks([])
    axes[0].spines['right'].set_color('none')

    # To reduce clutter, let's leave off the first and last x-ticks.
    for ax in axes:
        xticks = ax.get_xticks()
        ax.set_xticks(xticks[1:-1])

    # Now, we'll "scrunch" all of the subplots together, so that they look like one
    fig.subplots_adjust(wspace=0)

Hope this helps, anyway!

Edit: If you want the percentile values instead of a cumulative histogram (I really shouldn't have used 100 as the sample size!), that's easy to do.

Just do something like this (using numpy.percentile instead of normalizing things manually):

    # Replacing the for loop from before...
    plot_percentiles = range(0, 110, 10)
    for ax, data, color, marker in zip(axes, values, colors, markers):
        x = np.percentile(data, plot_percentiles)
        ax.plot(x, plot_percentiles, color=color, marker=marker,
                markersize=10, linestyle='none')
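As for going from raw CSV data to percentiles, here is a rough sketch. The actual file layout isn't visible here, so this assumes one column per dataset with a header row of dataset names. The in-memory `csv_text` stands in for the real file, and `percentiles_from_csv` is a hypothetical helper name, not a library function.

```python
import io

import numpy as np

# Assumed layout: a header row of dataset names, then one column per dataset.
# The values are the three example sets from the question.
csv_text = """set1,set2,set3
1,3,1
2,4,3
3,5,4
4,6,5
5,7,8
"""


def percentiles_from_csv(source, percentiles):
    """Map each column name to the values at the requested percentiles.

    Hypothetical helper; `source` can be a filename or a file-like object.
    """
    table = np.genfromtxt(source, delimiter=",", names=True)
    result = {}
    for name in table.dtype.names:
        column = table[name]
        # Empty cells are read as NaN, so drop them (columns may differ in length)
        column = column[~np.isnan(column)]
        result[name] = np.percentile(column, percentiles)
    return result


plot_percentiles = list(range(0, 110, 10))
per_set = percentiles_from_csv(io.StringIO(csv_text), plot_percentiles)
```

Each entry of `per_set` can then take the place of `x` in the loop above, with the dictionary keys used as the x-axis labels. For a real file, passing its path instead of the `io.StringIO` object should work the same way.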
