Filtering histogram edges and counts

Consider calculating a histogram of a numpy array that returns percentages:

    import numpy as np

    # 500 random numbers between 0 and 10,000
    values = np.random.uniform(0, 10000, 500)

    # Histogram using e.g. 200 buckets
    perc, edges = np.histogram(values, bins=200, weights=np.zeros_like(values) + 100/values.size)

The above returns two arrays:

  • perc, containing the percentage of the values that fall between each pair of consecutive edges, edges[ix] and edges[ix+1], out of the total.
  • edges, of length len(perc)+1.
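
To make the shapes of the two arrays concrete, here is a minimal check. It assumes a seeded generator (`np.random.default_rng(0)`), which is my choice for repeatability and not part of the question, and it uses `np.full_like` as an equivalent way to build the weights.

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, an assumption for repeatability
values = rng.uniform(0, 10000, 500)

# Each value carries a weight of 100/N, so each bin sums to a percentage of the total.
weights = np.full_like(values, 100 / values.size)
perc, edges = np.histogram(values, bins=200, weights=weights)

print(len(perc), len(edges))   # one more edge than bins
print(perc.sum())              # approximately 100
```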

Now say that I want to filter perc and edges so that I end up with only the percentages and edges for the values contained in a new range [m, M].

That is, I want to work with the sub-arrays of perc and edges that correspond to the values inside [m, M]. Needless to say, the new array of percentages will still refer to fractions of the total count of the input array. We just want to filter perc and edges to get the correct sub-arrays.

How can I post-process perc and edges to do this?

The values of m and M can be any numbers. In the above example we can assume, for instance, m = 0 and M = 200.

2 answers
    m = 0; M = 200
    mask = (m < edges) & (edges < M)
    >>> edges[mask]
    array([  37.4789683 ,   87.07491593,  136.67086357,  186.2668112 ])

Let's work with a smaller dataset to make it easier to understand:

    np.random.seed(0)
    values = np.random.uniform(0, 100, 10)
    values.sort()
    >>> values
    array([ 38.34415188,  42.36547993,  43.75872113,  54.4883183 ,
            54.88135039,  60.27633761,  64.58941131,  71.51893664,
            89.17730008,  96.36627605])

    # Histogram using e.g. 10 buckets
    perc, edges = np.histogram(values, bins=10,
                               weights=np.zeros_like(values) + 100./values.size)

    >>> perc
    array([ 30.,   0.,  20.,  10.,  10.,  10.,   0.,   0.,  10.,  10.])

    >>> edges
    array([ 38.34415188,  44.1463643 ,  49.94857672,  55.75078913,
            61.55300155,  67.35521397,  73.15742638,  78.9596388 ,
            84.76185122,  90.56406363,  96.36627605])

    m = 0; M = 50
    mask = (m <= edges) & (edges < M)
    >>> mask
    array([ True,  True,  True, False, False, False, False, False, False,
           False, False], dtype=bool)
    >>> edges[mask]
    array([ 38.34415188,  44.1463643 ,  49.94857672])
    >>> perc[mask[:-1]][:-1]
    array([ 30.,   0.])

    m = 40; M = 60
    mask = (m < edges) & (edges < M)
    >>> edges[mask]
    array([ 44.1463643 ,  49.94857672,  55.75078913])
    >>> perc[mask[:-1]][:-1]
    array([  0.,  20.])
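
The same idea can be wrapped in a small helper. This is only a sketch: `filter_histogram` is a hypothetical name, and it assumes that "inside [m, M]" means a bin is kept only when both of its edges fall within the range. It builds one boolean per bin instead of one per edge, which avoids the mask[:-1] and [:-1] bookkeeping.

```python
import numpy as np

def filter_histogram(perc, edges, m, M):
    """Keep only the bins that lie entirely inside [m, M] (hypothetical helper)."""
    left, right = edges[:-1], edges[1:]      # left and right edge of each bin
    keep = (left >= m) & (right <= M)        # one boolean per bin
    idx = np.flatnonzero(keep)
    if idx.size == 0:
        return perc[:0], edges[:0]
    # Edges are sorted, so the kept bins are contiguous: take the edges from the
    # first kept bin's left edge through the last kept bin's right edge.
    return perc[idx[0]:idx[-1] + 1], edges[idx[0]:idx[-1] + 2]

np.random.seed(0)
values = np.sort(np.random.uniform(0, 100, 10))
perc, edges = np.histogram(values, bins=10,
                           weights=np.full_like(values, 100.0 / values.size))

sub_perc, sub_edges = filter_histogram(perc, edges, 40, 60)
print(sub_perc)    # percentages of the bins fully inside [40, 60]
print(sub_edges)   # always one more edge than percentages
```

The percentages are untouched, so they still refer to the total count of the original array.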

Well, you may need a little math for this. The bins all have the same width, so you can work out which bin is the first to include and which is the last by using the width of each bin:

    bin_width = edges[1] - edges[0]

Now calculate the first and last valid bin:

    import math

    first = math.floor((m - edges[0]) / bin_width) + 1   # How many bins to drop from the left
    last = math.floor((edges[-1] - M) / bin_width) + 1   # How many bins to drop from the right

(Omit the +1 in both if you want to include the bins that contain m or M, but then be careful that first and last do not become negative!)

Now you know how many bins to include:

    valid_edges = edges[first:-last]
    valid_perc = perc[first:-last]

This excludes the first `first` bins and the last `last` bins.

Perhaps I did not pay enough attention to rounding and there is an off-by-one error in there, but I think the idea is sound. :-)

You probably need to catch special cases, such as M > edges[-1], but I haven't included them for readability.
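
One possible way to handle those special cases is to clamp the two counts and slice with an explicit stop index. This is a sketch, not a tested drop-in: `trim_uniform` is a hypothetical name, it assumes evenly spaced bins, and it keeps the +1 convention above, so the bins containing m and M are dropped.

```python
import math
import numpy as np

def trim_uniform(perc, edges, m, M):
    """Drop the bins up to and including those containing m and M (hypothetical helper)."""
    n = len(perc)
    bin_width = edges[1] - edges[0]
    first = math.floor((m - edges[0]) / bin_width) + 1   # bins to drop from the left
    last = math.floor((edges[-1] - M) / bin_width) + 1   # bins to drop from the right
    first = min(max(first, 0), n)    # m below the first edge: drop nothing on the left
    last = min(max(last, 0), n)      # M above the last edge: drop nothing on the right
    stop = max(n - last, first)      # explicit stop index avoids the perc[first:-0] trap
    if stop == first:
        return perc[:0], edges[:0]   # nothing left inside the range
    return perc[first:stop], edges[first:stop + 1]

np.random.seed(0)
values = np.random.uniform(0, 100, 10)
perc, edges = np.histogram(values, bins=10,
                           weights=np.full_like(values, 100.0 / values.size))

print(trim_uniform(perc, edges, 40, 60))
print(trim_uniform(perc, edges, 0, 200))   # range wider than the data: everything is kept
```

Using `n - last` as the stop index matters because `perc[first:-0]` is an empty slice, which is exactly what would happen when M lies beyond the last edge.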


Or, if the bins are not evenly spaced, use boolean masks instead of the arithmetic:

    first = edges[edges < m].size   # edges below m = bins to drop from the left
    last = edges[edges > M].size    # edges above M = bins to drop from the right
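
Since edges is always sorted, the same counts can also be obtained with np.searchsorted. The sketch below assumes you want only the bins lying entirely inside [m, M]; `trim_sorted` and the sample arrays are made up for illustration.

```python
import numpy as np

def trim_sorted(perc, edges, m, M):
    """Keep bins lying entirely inside [m, M]; edges need not be evenly spaced."""
    lo = np.searchsorted(edges, m, side="left")        # index of the first edge >= m
    hi = np.searchsorted(edges, M, side="right") - 1   # index of the last edge <= M
    if hi <= lo:
        return perc[:0], edges[:0]                     # fewer than two edges in range
    return perc[lo:hi], edges[lo:hi + 1]

edges = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])      # illustrative uneven bins
perc = np.array([10.0, 20.0, 30.0, 40.0])

sub_perc, sub_edges = trim_sorted(perc, edges, 0.5, 150.0)
print(sub_perc)    # [20. 30.]
print(sub_edges)   # [  1.  10. 100.]
```

Because it works with start and stop indices, this variant has no trouble when m or M fall outside the range of the edges.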

Source: https://habr.com/ru/post/1242827/

