Algorithm for assessing the similarity of sets of numbers

What is an algorithm for comparing multiple sets of numbers with a target set to determine which ones are the most “similar”?

One way to use this algorithm would be to compare today's hourly weather forecast with historical weather records to find a day that had similar weather.

The similarity of the two sets is a bit subjective, so the algorithm really just needs to distinguish between good matches and bad matches. We have a lot of historical data, so I would like to try to narrow down the number of days that users need to view, automatically throwing out sets that are not close and trying to put the “best” matches at the top of the list.

Edit: Ideally, the results of the algorithm would be comparable across different data sets. For example, the root-mean-square error proposed by Niles gives pretty good results, but the numbers generated by comparing temperatures cannot be compared with the numbers generated from other data, such as wind speed or precipitation, because the scale of the data is different. Some of the non-weather data is very large, so the root-mean-square error algorithm produces numbers in the hundreds of thousands, compared with the tens or hundreds produced for temperature.

+4
11 answers

I think the root-mean-square error metric would work for applications like weather. It is easy to calculate and gives numbers that make sense.

Since you want to compare measurements over time, you can simply leave the missing values out of the calculation.

For values that are not time-ordered, or are even unsorted, multidimensional scatter data is a bit trickier. Choosing a good distance metric becomes part of the art of analyzing such data.
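For concreteness, here is a minimal Python sketch of that idea, assuming two equal-length hourly series in which missing readings are represented as None (the function name and that convention are mine, not from this answer):

    import math

    def rmse(forecast, historical):
        # Root-mean-square error between two aligned series.
        # Pairs where either value is missing (None) are skipped.
        pairs = [(f, h) for f, h in zip(forecast, historical)
                 if f is not None and h is not None]
        if not pairs:
            return float("inf")  # no overlapping data, treat as a non-match
        return math.sqrt(sum((f - h) ** 2 for f, h in pairs) / len(pairs))

    # Lower scores mean the historical day was closer to today's forecast.
    today = [61, 63, 68, None, 74]
    candidate_day = [60, 64, 67, 71, 73]
    print(rmse(today, candidate_day))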

+4
source

Use the Pearson correlation coefficient. I figured out how to calculate it in an SQL query; the write-up can be found here: http://vanheusden.com/misc/pearson.php
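If you are not working in SQL, a rough Python equivalent looks like this (my own sketch, not the linked query):

    import math

    def pearson(xs, ys):
        # Pearson correlation coefficient of two equal-length series.
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        var_x = sum((x - mean_x) ** 2 for x in xs)
        var_y = sum((y - mean_y) ** 2 for y in ys)
        return cov / math.sqrt(var_x * var_y)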

+2

In finance, Beta is used to measure the correlation of two series of numbers. For example, Beta can answer the question: "Over the past year, how much would IBM's price have risen on a day when the S&P 500 rose by 5%?" It deals with percentage moves, so the two series can have different scales.

In my example, Beta is Covariance(IBM, S&P 500) / Variance(S&P 500).

Wikipedia has pages explaining Covariance, Variance, and Beta: http://en.wikipedia.org/wiki/Beta_(finance)
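As a rough sketch, the calculation on daily percentage returns might look like this (the series names are illustrative, not from the answer):

    def beta(asset_returns, benchmark_returns):
        # Beta = Cov(asset, benchmark) / Var(benchmark), on percentage returns.
        n = len(asset_returns)
        mean_a = sum(asset_returns) / n
        mean_b = sum(benchmark_returns) / n
        cov = sum((a - mean_a) * (b - mean_b)
                  for a, b in zip(asset_returns, benchmark_returns)) / n
        var_b = sum((b - mean_b) ** 2 for b in benchmark_returns) / n
        return cov / var_b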

+1

Have a look at statistics resources. I think what you are looking for is correlation.

+1

As an example, suppose you measure temperature, wind, and precipitation. We will call these elements "features." Reasonable ranges might be:

  • Temp: -50 to 100 °F (I'm in Minnesota, USA).
  • Wind: 0 to 120 mph (not sure whether that is realistic, but bear with me).
  • Precip: 0 to 100

Start by normalizing your data. Temp has a range of 150 units, Wind 120 units, and Precip 100 units. Multiply your Wind values by 1.25 and your Precip values by 1.5 to put them on roughly the same "scale" as Temp. You can get fancy here and make rules that weight one feature more heavily than the others. In this example, Wind might have a huge range but usually stay within a narrower band, so you would want to weight it lower so it does not distort your results.

Now treat each measurement as a point in multidimensional space. This example works in three-dimensional space (temp, wind, precip). The nice thing is that if you add more features, you simply increase the dimensionality of the space, but the math stays the same. In any case, we want to find the historical points that are closest to our current point. The easiest way to do that is Euclidean distance. So measure the distance from our current point to each historical point and keep the closest matches:

    for each historicalpoint
        distance = sqrt( pow(currentpoint.temp   - historicalpoint.temp,   2)
                       + pow(currentpoint.wind   - historicalpoint.wind,   2)
                       + pow(currentpoint.precip - historicalpoint.precip, 2))
        if distance is smaller than the largest distance in our match collection
            add historicalpoint to our match collection
            remove the match with the largest distance from our match collection
    next
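For illustration, here is a runnable Python version of the same idea; the field names, scale factors, and the top_n parameter are assumptions for the sketch, not part of the answer:

    import heapq
    import math

    # Per-feature multipliers to bring everything onto roughly the same scale.
    SCALE = {"temp": 1.0, "wind": 1.25, "precip": 1.5}

    def distance(a, b):
        # Euclidean distance between two observations after rescaling.
        return math.sqrt(sum((SCALE[k] * (a[k] - b[k])) ** 2 for k in SCALE))

    def closest_days(current, history, top_n=10):
        # Return the top_n historical observations closest to the current one.
        return heapq.nsmallest(top_n, history, key=lambda day: distance(current, day))

    today = {"temp": 72, "wind": 10, "precip": 0}
    history = [
        {"date": "1998-07-04", "temp": 75, "wind": 8, "precip": 0},
        {"date": "2001-01-15", "temp": -5, "wind": 30, "precip": 20},
    ]
    print(closest_days(today, history, top_n=1))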

This is a brute-force approach. If you have time, you can get much fancier. Multidimensional data can be represented as trees, such as kd-trees or R-trees. If you have a lot of data, comparing your current observation against every historical observation will be too slow, and trees speed up the search. You may also want to take a look at Data Clustering and Nearest Neighbor Search.

Cheers.

+1

Talk to a statistician.

Really.

They do this for a living.

You write that "the similarity of the two sets is a bit subjective," but it is not subjective at all: it is a matter of determining the appropriate similarity criteria for your problem domain.

This is one of those situations where you are much better off talking to a professional than asking a bunch of programmers.

+1

First of all, ask yourself whether these are sets or ordered sequences.

I am guessing they are ordered sequences with duplicates. The most obvious algorithm is to choose a tolerance within which two numbers are considered the same, and then count the number of positions where the numbers match to within that tolerance.
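A minimal sketch of that idea, assuming two aligned sequences and a tolerance chosen for the data (both the alignment and the default tolerance are my assumptions):

    def count_close(xs, ys, tolerance=2.0):
        # Count positions where the two sequences agree to within the tolerance.
        return sum(1 for x, y in zip(xs, ys) if abs(x - y) <= tolerance)

Higher counts would then rank as better matches.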

0
source

I have a solution implemented for this in my application, but I am looking to see whether there is something better or more "correct". For each historical day, I do the following:

    function calculate_score(historical_set, forecast_set) {
        double c = correlation(historical_set, forecast_set);
        double avg_history = average(historical_set);
        double avg_forecast = average(forecast_set);
        // penalize matches whose overall level differs from the forecast
        double penalty = abs(avg_history - avg_forecast) / avg_forecast;
        return c - penalty;
    }

Then I sort all the results from high to low.

Since the correlation is a value from -1 to 1 that says whether the numbers fall or rise together, I then "penalize" it with the percentage difference between the averages of the two sets of numbers.

0
source

Several times, you mentioned that you do not know the distribution of the data, which is of course true. I mean, tomorrow could be a day that is 150 degrees Fahrenheit with 2000 km/h winds, but it seems unlikely.

I would say that you have a very good idea of the distribution, since you have a long historical record. Given that, you can put everything in terms of quantiles of the historical distribution and work with the absolute or squared difference of the quantiles across all the variables. This is another normalization method, but one that accounts for non-linearities in the data.

Normalization in any style should make all the variables comparable.

As an example, say that today is a windy, hot day: it might have a temp quantile of .75 and a wind quantile of .75. The .76 quantile for temperature might be 1 degree away, while the .76 quantile for wind might be 3 km/h away.

This focus on the empirical distribution is easy to understand and can be more robust than a conventional estimate (such as the standard error).
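A rough illustration of the idea, assuming the historical values of each variable are available as a sorted list (the helper names are mine):

    from bisect import bisect_left

    def quantile_of(value, sorted_history):
        # Empirical quantile of a value within the sorted historical values.
        return bisect_left(sorted_history, value) / len(sorted_history)

    def quantile_distance(current, historical_day, history_by_feature):
        # Sum of absolute quantile differences across all features.
        total = 0.0
        for feature, sorted_history in history_by_feature.items():
            q_now = quantile_of(current[feature], sorted_history)
            q_then = quantile_of(historical_day[feature], sorted_history)
            total += abs(q_now - q_then)
        return total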

0

Will the two datasets be ordered or not?

If ordered, are the indices the same? Equally spaced?

If the indices are common (temperatures measured on the same days but in different places, for example), you can regress the first data set against the second and then test whether the slope is 1 and the intercept is 0.
http://stattrek.com/AP-Statistics-4/Test-Slope.aspx?Tutorial=AP

Otherwise, you can run two regressions of the y values against their indices: http://en.wikipedia.org/wiki/Correlation . You would still compare the slopes and intercepts.
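If you happen to be in Python, scipy.stats.linregress covers the common-index case; a quick sketch with made-up series:

    from scipy import stats

    place_a = [61, 63, 68, 71, 74]
    place_b = [60, 64, 67, 72, 75]

    # Regress one series against the other; a close match has slope ~1 and intercept ~0.
    result = stats.linregress(place_a, place_b)
    print(result.slope, result.intercept, result.rvalue)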

====

If unordered, I think you want to look at the cumulative distribution functions: http://en.wikipedia.org/wiki/Cumulative_distribution_function

One of the main tests is Kolmogorov-Smirnov: http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test

You could also look at the

Student's t-test, http://en.wikipedia.org/wiki/Student%27s_t-test

or the Wilcoxon signed-rank test http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test

to test for equality of means between the two samples.

And you can test for equality of variances with Levene's test: http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm

Note: it is possible for dissimilar datasets to have the same mean and variance. Depending on how strict you want to be (and how much data you have), you might consider testing for equality of higher moments as well.
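Most of these tests are available in scipy.stats; a quick sketch, assuming two unordered samples a and b of equal length (the sample values are made up):

    from scipy import stats

    a = [61, 63, 68, 71, 74, 70, 66]
    b = [60, 64, 67, 72, 75, 69, 65]

    print(stats.ks_2samp(a, b))   # Kolmogorov-Smirnov: compares the two distributions
    print(stats.ttest_ind(a, b))  # Student's t-test: equality of means
    print(stats.wilcoxon(a, b))   # Wilcoxon signed-rank test (paired samples)
    print(stats.levene(a, b))     # Levene's test: equality of variances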

0

Perhaps you can view each set of numbers as a vector (each number in the set being one component of the vector).

Then you can simply use the dot product to calculate the similarity of two given vectors (i.e., two sets of numbers).

You may need to normalize your vectors.

Read more: Cosine similarity
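A minimal sketch of the dot-product / cosine-similarity idea (the function name is mine):

    import math

    def cosine_similarity(a, b):
        # Cosine of the angle between two equal-length vectors; 1.0 means identical direction.
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)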

0

Source: https://habr.com/ru/post/1277220/

