Calculating covariance in MySQL on a single table

I have a financial MySQL database containing a table with the following schema:

+-----------+---------------------+------+-----+---------+-------+
| Field     | Type                | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| symbol_id | tinyint(3) unsigned | YES  | MUL | NULL    |       |
| timestamp | timestamp(6)        | YES  | MUL | NULL    |       |
| buy_sell  | char(1)             | YES  |     | NULL    |       |
| price     | decimal(10,6)       | YES  | MUL | NULL    |       |
+-----------+---------------------+------+-----+---------+-------+

There are 200 unique symbol_id values. Ultimately, I want to calculate the covariance (by hour) of price for all pairs of symbols. To start, I can settle for calculating the covariance of a single pair and then iterate.

To calculate the covariance, I need two arrays of the same length (in this case, of price). I am struggling with how to write this as a single query, so that I do not have to have all the records returned to me and compute the covariance locally.

Here is what I am trying to accomplish, as two pseudo-SQL queries:

 SELECT (AVG(price1*price2) - AVG(price1)*AVG(price2)) as covar FROM data 

and

 SELECT price AS price1 WHERE HOUR(timestamp)=1 AND symbol_id=1 LIMIT(MIN(COUNT(price1,price2)))
 SELECT price AS price2 WHERE HOUR(timestamp)=1 AND symbol_id=2 LIMIT(MIN(COUNT(price1,price2)))

The first statement takes two arrays of equal length, price1 and price2, and computes their covariance. The second selects the prices of two different symbols, all from trades occurring in hour 1, and limits the returned values to the same length.

With my limited knowledge of SQL, I am having trouble working out how to combine these queries. Any help is much appreciated. Ultimately, a single query that calculates the pairwise covariance for a specific period of time would be great.

1 answer

I am a bit confused. Covariance is calculated over paired observations, such as two measurements taken at the same time. (See, for example, the answer to the question http://www.mathworks.com/matlabcentral/newsreader/view_thread/134856 )

With the LIMIT clause, you throw away valuable data, which affects accuracy. Also, I'm not sure about this, but I think that LIMIT without an ORDER BY can return different rows at different times, so the result of your calculation may not be well defined.

If you compute covariances by the hour, you are treating all the prices that occur within an hour as the same measurement, so I would suggest calculating the covariance on the average price of each hour.
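As a rough illustration of the hourly-average idea, here is a minimal sketch, not a definitive implementation. It uses Python's built-in sqlite3 module and an in-memory table only so that it can be run anywhere. The table name `data` comes from the question's pseudo-SQL; the sample rows and the helper names `make_demo_db` and `hourly_covariance` are made up for the example. The query averages price per calendar hour for each symbol, joins the two series on the hour so that only hours where both symbols traded are paired, and then applies the same AVG(xy) - AVG(x)*AVG(y) formula from the question (the population covariance). In MySQL, I would expect DATE_FORMAT(`timestamp`, '%Y-%m-%d %H') to take the place of SQLite's strftime.

```python
import sqlite3

# Covariance of hourly average prices for one pair of symbols.
# The join keeps only hours in which BOTH symbols have trades.
# Assumption: on MySQL, DATE_FORMAT(`timestamp`, '%Y-%m-%d %H') would
# replace SQLite's strftime(...) used here.
HOURLY_COVAR_SQL = """
SELECT AVG(a.avg_price * b.avg_price)
       - AVG(a.avg_price) * AVG(b.avg_price) AS covar
FROM (SELECT strftime('%Y-%m-%d %H', timestamp) AS hr,
             AVG(price) AS avg_price
      FROM data
      WHERE symbol_id = ?
      GROUP BY hr) AS a
JOIN (SELECT strftime('%Y-%m-%d %H', timestamp) AS hr,
             AVG(price) AS avg_price
      FROM data
      WHERE symbol_id = ?
      GROUP BY hr) AS b
  ON a.hr = b.hr
"""


def make_demo_db():
    """Build an in-memory table shaped like the question's schema (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data ("
        "symbol_id INTEGER, timestamp TEXT, buy_sell TEXT, price REAL)"
    )
    rows = [
        (1, "2024-01-02 01:05:00", "B", 10.0),
        (1, "2024-01-02 01:40:00", "S", 12.0),
        (1, "2024-01-02 02:10:00", "B", 14.0),
        (1, "2024-01-02 03:15:00", "B", 20.0),   # symbol 2 has no trades this hour
        (2, "2024-01-02 01:20:00", "S", 5.0),
        (2, "2024-01-02 02:05:00", "B", 7.0),
        (2, "2024-01-02 02:50:00", "S", 9.0),
        (2, "2024-01-02 04:30:00", "B", 100.0),  # symbol 1 has no trades this hour
    ]
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    return conn


def hourly_covariance(conn, symbol_a, symbol_b):
    """Return the population covariance of hourly average prices, or None
    when the two symbols share no hour with trades."""
    row = conn.execute(HOURLY_COVAR_SQL, (symbol_a, symbol_b)).fetchone()
    return row[0]


if __name__ == "__main__":
    print(hourly_covariance(make_demo_db(), 1, 2))
```

With the sample rows, the paired hourly averages are (11, 5) and (14, 8). The unmatched hours drop out of the join, so no LIMIT is needed to equalise the lengths. Looping over symbol pairs, or grouping a self-join by pair, would extend this to all 200 symbols.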

If you do not consider the prices within an hour to be part of the same measurement, then you have a missing data problem: you do not have a price2 observation for every moment at which a price1 occurred, and vice versa. (See, for example, https://stats.stackexchange.com/questions/20457/is-it-possible-to-compute-a-covariance-matrix-with-unequal-sample-sizes )


Source: https://habr.com/ru/post/1487093/

