Data structure for storing data and calculating average

Question

Data structure for storing data and calculating average

In this problem, we are interested in a data structure that supports the storage of infinite numbers of parallel vectors of the Y axis.

Each node contains a location (X-axis value) and height (Y-axis value). We can assume that there are no two vectors in one place.

Please report an effective data structure that supports:

init((x1,y1)(x2,y2)(x3,y3)...(xn,yn)) - DS will contain all n vectors, and VECTOR # i - xi VECTOR # i hieght is yi. We also know that x1 < x2 < x3 < ... < xn (nothing is known about y) - complexity = O (n) on average
insert(x,y) - add a vector with location x and height y. - complexity = O (logn), amortized on average.
update(x,y) - update the vector # x height to y. - complexity = O (logn) worst case
average_around(x) - return the average value of the average number of log neighbors of x-complexity = O (1) on average

Cosmic complexity: O (n)

+4

data-structures average

user1168623 Jan 25 '12 at 6:59

source share

1 answer

phimuemue · Answer 1 · 2012-03-28T09:57:39+0000

I cannot give a complete answer, but it may be a hint in the right direction.

Key ideas :

Suppose you calculated the average value of n numbers a_1,...,a_n , then this is the average value of avg=(a_1+...+a_n)/n . If we now replace a_n with b , we can recalculate the new average value as follows: avg'=(a_1+...+a_(n-1)+b)/n , or - more simply - avg'=((avg*n)-a_n+b)/n . This means that if we exchange one element, we can recalculate the average value using the original average value using simple quick operations and do not need to iterate over all the elements involved in the average.

Note. I assume that you want to have log n neighbors on each side, i.e. overall we have 2 log(n) neighbors. You can simply adapt it if you want to have log(n) neighbors in general. Moreover, since log n in most cases not be a natural number, I assume that you are talking about floor(log n) , but I will just write log n for simplicity.

The main thing that I am considering is the fact that you must specify the middle element x in O (1). So, I suppose, you need to somehow recompute this mean and save it. So, I would save the following in node:

x value
y value
middle around

Note that update(x,y) is done strictly in O (log n), if you have this structure: if you update the x element to a height y , you should consider the 2log(n) neighbors whose average value is affected by this change . You can recalculate each of these averages in O (1):

Suppose update(x,y) affects element b , whose average should also be updated. Then just multiply average(b) by the number of neighbors ( 2log(n) as above). Then we subtract the old y-value of x and add the new (updated) y y-value x . After that, divide by 2 log(n) . This ensures that we now update the average for element b . This is due only to some calculations and therefore can be performed in O (1). Since we have 2log n neighbors, update works in O(2log n)=O(log n) .

When you insert a new element e , you need to update the average of all elements affected by this new element e . This essentially does, as in the update routine. However, you must be careful when log n (or exactly floor(log n) ) changes its value. If floor(log n) remains the same (which in most cases will be), you can just do the analog things described in update , however you will have to “remove” the height of one element and “add” the height of the newly added element. In these “good” cases, the runtime is again strictly O(log n) .

Now, when floor(log n) changes (increases by 1), you need to update for all elements. That is, you must perform operation O (1) for elements n , which will lead to runtime O(n) . However, it very rarely happens that floor(log n) incremented by 1 (you need to double the value of n to increase floor(log n) by 1 - if we are talking about log to base 2, which is not unusual in computer science). Denote this time by c*n or just cn .

Thus, we consider the sequence of inserts: the first insert needs to be updated: c*1 , the second insert needs to be updated: 2*c . The next time the expensive insert rises, this is the fourth insert: 4*c , then eight inserts: 8c , sixteenth insert: 16*c . The distance between two expensive inserts doubles each time:

 insert # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 .. cost 1c 2c 1 4c 1 1 1 8c 1 1 1 1 1 1 1 16c 1 1 ..

Since remove is not required, we can continue our analysis without any “special cases” and consider only the sequence of inserts. You see that most inserts cost 1, and some of them are expensive (1,2,4,8,16,32, ...). So, if we have m inserts in general, we are rough log m expensive inserts and about m-log m cheap inserts. For simplicity, we assume just m cheap inserts.

Then we can calculate the cost for m inserts:

  log m ---- \ i m*1 + / 2 ---- i=0

m*1 counts cheap operations, the sum is expensive. It can be shown that all this is no more than 4m (in fact, you can even show the best ratings quite easily, but for us it is enough).

Thus, we see that the cost of operations m not more than 4m . Thus, the operation with one insert costs no more than 4m/m=4 , therefore, O(1) amortized.

So, there are 2 things left:

How to save all records?
How to initialize data structure in O (n)?

I suggest storing all entries in a skip list or some tree that guarantees logarithmic search operations (otherwise, inserting and updating requires more than O (log n) to find the correct position). Note that the data structure must be built in O (n) - this should not be a big problem, provided that the elements are sorted according to their x coordinate.

To initialize the data structure in O (n), I suggest starting with an element with index log n and calculating its average simple way (sum, 2log n neighbors, divide by 2 log n ).

Then you move the index one more and calculate average(index) using average(index-1) : average(index)=average(index-1)*log(n)-y(index-1-log(n))+y(index+log(n)) .

That is, we take a similar approach, as with the update. This means that the calculation of average costs is O(log n + n*1)=O(n) . Thus, we can calculate the average values in O (n).

Please note that you need to take into account some details that I have not described here (for example, borderline cases: an element with index 1 does not have log(n) neighbors on both sides - how do you do this?).

Data structure for storing data and calculating average

More articles: