How can I plot a scatterplot with approx. 20 million data points?

I am trying to create a scatterplot with matplotlib that consists of approx. 20 million data points. Even with the alpha value set as low as it can go before the points become invisible, the result is a completely black plot.

plt.scatter(timedPlotData, plotData, alpha=0.01, marker='.') 

The x axis represents a continuous timeline of about 2 months, and the y axis consists of 150k consecutive integer values.

Is there a way to build all the points so that their distribution is still visible over time?

Thank you for your help.

4 answers

There is more than one way to do this. Many people have suggested a heatmap / kernel density estimate / 2D histogram. @Bucky suggested using a moving average. In addition, you can fill between a moving minimum and a moving maximum, and plot a moving average on top. I often call this a "chunkplot", but that's a terrible name. The implementation below assumes that your time (x) values are monotonically increasing. If they aren't, simply sort y by x before "chunking" in the chunkplot function.

Here are a few different ideas. Which one is best depends on what you want to emphasize in the plot. Note that this will be rather slow to render, but mostly because of the scatterplot. The other plotting styles are much faster.

    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    import datetime as dt

    np.random.seed(1977)

    def main():
        x, y = generate_data()
        fig, axes = plt.subplots(nrows=3, sharex=True)
        for ax in axes.flat:
            ax.xaxis_date()
        fig.autofmt_xdate()

        axes[0].set_title('Scatterplot of all data')
        axes[0].scatter(x, y, marker='.')

        axes[1].set_title('"Chunk" plot of data')
        chunkplot(x, y, chunksize=1000, ax=axes[1],
                  edgecolor='none', alpha=0.5, color='gray')

        axes[2].set_title('Hexbin plot of data')
        axes[2].hexbin(x, y)

        plt.show()

    def generate_data():
        # Generate a very noisy but interesting timeseries
        x = mdates.drange(dt.datetime(2010, 1, 1), dt.datetime(2013, 9, 1),
                          dt.timedelta(minutes=10))
        num = x.size
        y = np.random.random(num) - 0.5
        y.cumsum(out=y)
        y += 0.5 * y.max() * np.random.random(num)
        return x, y

    def chunkplot(x, y, chunksize, ax=None, line_kwargs=None, **kwargs):
        if ax is None:
            ax = plt.gca()
        if line_kwargs is None:
            line_kwargs = {}

        # Wrap the array into a 2D array of chunks, truncating the last chunk if
        # chunksize isn't an even divisor of the total size.
        # (This part won't use _any_ additional memory)
        numchunks = y.size // chunksize
        ychunks = y[:chunksize*numchunks].reshape((-1, chunksize))
        xchunks = x[:chunksize*numchunks].reshape((-1, chunksize))

        # Calculate the max, min, and means of chunksize-element chunks...
        max_env = ychunks.max(axis=1)
        min_env = ychunks.min(axis=1)
        ycenters = ychunks.mean(axis=1)
        xcenters = xchunks.mean(axis=1)

        # Now plot the bounds and the mean...
        fill = ax.fill_between(xcenters, min_env, max_env, **kwargs)
        line = ax.plot(xcenters, ycenters, **line_kwargs)[0]
        return fill, line

    main()

[Figure: the scatterplot, "chunk" plot, and hexbin plot produced by the code above.]


For each day, count the frequency of each value (collections.Counter will do this nicely), and then draw a heatmap of those per-day counts. For publication, use shades of gray for the heatmap colors.
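A minimal sketch of this idea, using made-up stand-in data (the 60-day span, the scaled-down value range of 0-150, and the sample size are assumptions for illustration, not from the question):

```python
from collections import Counter

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Hypothetical stand-in data: a day index and an integer value per point.
n_days, n_values, n_points = 60, 150, 100_000
days = np.random.randint(0, n_days, size=n_points)
values = np.random.randint(0, n_values, size=n_points)

# One Counter per day: frequency of each value on that day.
counters = {day: Counter() for day in range(n_days)}
for day, val in zip(days, values):
    counters[day][val] += 1

# Assemble the per-day counts into a (value, day) grid for imshow.
grid = np.zeros((n_values, n_days))
for day, counter in counters.items():
    for val, count in counter.items():
        grid[val, day] = count

fig, ax = plt.subplots()
# Reversed grayscale colormap: darker cells = more points, publication-friendly.
ax.imshow(grid, aspect='auto', origin='lower', cmap='gray_r')
ax.set_xlabel('Day')
ax.set_ylabel('Value')
plt.show()
```

With real data you would first bucket each timestamp into its day (e.g. with `datetime.date`) and use those dates as the Counter keys.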


My recommendation would be to sort the raw data and apply a moving average to it before plotting. This should leave the averages and trends intact for the time period of interest, while giving you far less clutter on the chart.
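This suggestion can be sketched as follows, on hypothetical noisy data (the window size and the synthetic trend are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Hypothetical noisy data, not sorted in time.
t = np.random.uniform(0, 60, size=200_000)                       # day index
y = np.sin(t / 10) * 50_000 + np.random.normal(0, 20_000, t.size)

# Sort by time first so the moving average is taken over consecutive samples.
order = np.argsort(t)
t_sorted, y_sorted = t[order], y[order]

# Simple moving average via convolution with a boxcar (uniform) window.
window = 5000
kernel = np.ones(window) / window
y_smooth = np.convolve(y_sorted, kernel, mode='valid')
t_smooth = t_sorted[window - 1:]   # align times with the 'valid' output length

fig, ax = plt.subplots()
ax.plot(t_smooth, y_smooth, color='black')
ax.set_xlabel('Day')
ax.set_ylabel('Moving average')
plt.show()
```

The smoothed series has far less variance than the raw data, so the underlying trend stays visible without drawing millions of individual points.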


Bin the values into ranges for each day and build a 2D histogram of day vs. value range, with the count as the third dimension.

That way you can clearly see the number of events in each value band on each day.
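A quick sketch of this with np.histogram2d, on made-up data (the 60 daily bins and 100 value bands are illustrative choices, not from the question):

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)

# Hypothetical data: fractional day index and integer values up to 150k.
t = np.random.uniform(0, 60, size=500_000)
y = np.random.randint(0, 150_000, size=500_000)

# Bin into (day, value-band) cells; the counts are the third dimension.
counts, day_edges, val_edges = np.histogram2d(t, y, bins=[60, 100])

fig, ax = plt.subplots()
# Transpose counts so days run along x and value bands along y.
ax.pcolormesh(day_edges, val_edges, counts.T, cmap='gray_r')
ax.set_xlabel('Day')
ax.set_ylabel('Value band')
plt.show()
```

Unlike the per-point scatter, the histogram's memory and drawing cost depend only on the number of bins, so 20 million points pose no problem.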


Source: https://habr.com/ru/post/954092/
