How to "checksum" an array of floating point numbers?

What is a quick and easy way to “checksum” an array of floating point numbers, while tolerating a small amount of inaccuracy?

E.g. I have two algorithms that should (in theory, with infinite precision) output the same array. But they work differently, so floating point errors will accumulate differently, although the lengths of the arrays should be exactly equal. I would like a quick and easy way to check whether the arrays are effectively the same. I could, of course, compare the numbers pairwise and report the maximum error; but one algorithm is in C++ and the other is in Mathematica, and I don't want the hassle of writing the numbers to a file or pasting them from one system into the other. That is why I want a simple checksum.

I could just add up all the numbers in the array. If the length of the array is N, and I can tolerate an error of 0.0001 in each number, then I would check whether abs(sum1-sum2)<0.0001*N. But this simplistic "checksum" is not reliable: it is blind, for example, to an error of +10 in one entry and -10 in another. (And in general, probability theory says the error probably grows like sqrt(N), not like N.) Of course, any checksum is a low-dimensional summary of a chunk of data, so it will miss some errors, if not most... but simple checksums are nonetheless useful for detecting non-malicious, bug-type errors.
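For concreteness, this naive one-dimensional version would look something like the following sketch (the 0.0001 is just my example tolerance):

#include <cmath>
#include <cstddef>
#include <vector>

// Naive one-dimensional "fuzzy checksum": just the sum of the entries.
double sum_checksum(const std::vector<double>& x) {
    double s = 0.0;
    for (double v : x)
        s += v;
    return s;
}

// Accept if the sums agree to within tol per element
// (blind to errors that cancel, e.g. +10 in one entry and -10 in another).
bool roughly_equal(double sum1, double sum2, std::size_t n, double tol = 0.0001) {
    return std::abs(sum1 - sum2) < tol * double(n);
}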

Or I could create a two-dimensional checksum, [sum(x[n]), sum(abs(x[n]))]. But is that the best I can do, i.e. is there some other function I could use that would be "more orthogonal" to sum(x[n])? And if I used some arbitrary functions, e.g. [sum(f1(x[n])), sum(f2(x[n]))], how does my "raw error tolerance" translate into a "checksum error tolerance"?

I program in C++, but I'm happy to see answers in any language.

+4
4 answers

I spent some time looking for a deterministic answer and couldn't find one. If there is a good answer, it probably requires some heavy-duty mathematics (functional analysis).

I am fairly sure there is no solution based on "discretize in some clever way, then apply a discrete checksum", e.g. "discretize into strings of 0/1/?, where ? means wildcard". Any discretization will have the property that two floating-point numbers that are very close to each other can end up with different discrete codes, and then the discrete checksum won't tell us what we want to know.

However, a very simple randomized scheme should work fine. Generate a pseudo-random string S from the alphabet {+1, -1} and compute csx = sum(X_i * S_i) and csy = sum(Y_i * S_i), where X and Y are my original arrays of floating point numbers. If we model the errors as independent normal random variables with mean 0, then it is easy to compute the distribution of csx - csy. We could do this for several strings S and then run a hypothesis test that the mean error is 0. The number of strings S required for the test is fixed; it does not grow linearly with the size of the arrays, so it satisfies my need for a "low-dimensional summary". The method also provides an estimate of the standard deviation of the error, which may come in handy.
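A minimal sketch of the idea in C++ (the LCG constants, the 3-sigma threshold and the default of 8 strings are arbitrary choices; to use this across systems, the same ±1 string S would of course have to be reproduced on the Mathematica side with the same generator):

#include <cmath>
#include <cstdint>
#include <vector>

// One +-1 projection of the array. A tiny LCG (Knuth's MMIX constants) drives
// the sign sequence, so the same string S is easy to reproduce elsewhere.
double signed_checksum(const std::vector<double>& x, std::uint64_t seed) {
    std::uint64_t state = seed;
    double cs = 0.0;
    for (double v : x) {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        cs += ((state >> 63) ? 1.0 : -1.0) * v;   // top bit chooses +1 or -1
    }
    return cs;
}

// Compare k independent projections. Under the model of independent zero-mean
// errors of size ~tol per entry, |csx - csy| should be on the order of
// tol * sqrt(2N); three times that is used here as a crude acceptance threshold.
bool fuzzy_equal(const std::vector<double>& x, const std::vector<double>& y,
                 double tol, unsigned k = 8) {
    const double thresh = 3.0 * tol * std::sqrt(2.0 * double(x.size()));
    for (std::uint64_t seed = 1; seed <= k; ++seed) {
        if (std::abs(signed_checksum(x, seed) - signed_checksum(y, seed)) > thresh)
            return false;
    }
    return true;
}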

+2

I have the feeling that what you want is possible with something like gray codes. if you can translate your values into gray codes and use some kind of checksum that can correct n bits, you could determine whether the two arrays are the same except for n-1 error bits, right? (each error bit means a number is "off by one", where the mapping is such that this is a change in the least significant digit).

but the exact details are beyond me - especially for floating point values.

I don't know if this helps, but what gray codes solve is the problem of pathological rounding. rounding sounds as if it solves the problem - a naive solution could round and then checksum. but simple rounding always has pathological cases - for example, if we use floor, then 0.9999999 and 1 come out different. the gray-code approach seems to address this, since neighbouring values always differ by a single bit, so a bit-based checksum accurately reflects "distance".

[update:] more precisely, what you want is a checksum that gives an estimate of the hamming distance between your gray-coded sequences (and the gray-coding part is simple if you only care about 0.0001, since you can multiply by 10000 and use integers).

and it seems that such checksums exist: any error-correcting code can be used for error detection. A code with minimum Hamming distance d can detect up to d - 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of detected errors is desired.

so just in case it's not clear (a sketch follows the list):

  • scale by the minimal tolerated error to get integers
  • convert to gray code equivalent
  • use an error-detecting code whose minimum hamming distance exceeds the number of errors you are willing to tolerate
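for example, in c++ the first two steps could look something like this (the 0.0001 tolerance and the 64-bit width are just illustrative; the error-detecting code of step 3 is not shown):

#include <cmath>
#include <cstdint>
#include <vector>

// standard binary-to-gray conversion: adjacent integers differ in exactly one bit
std::uint64_t to_gray(std::uint64_t n) {
    return n ^ (n >> 1);
}

// scale each value so one integer unit equals one tolerated error step,
// then gray-code the result
std::vector<std::uint64_t> gray_encode(const std::vector<double>& x, double tol = 0.0001) {
    std::vector<std::uint64_t> out;
    out.reserve(x.size());
    for (double v : x)
        out.push_back(to_gray(static_cast<std::uint64_t>(std::llround(v / tol))));
    return out;
}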

but i'm still not sure this works. you still get pathological rounding in the conversion from float to integer. so it seems you would need a minimum hamming distance of 1 + len(data) (the worst case, with a rounding error on every value). is that possible? probably not for large arrays.

maybe ask again with better tags / a better description, now that a general direction is possible? or just add tags now? we need someone who does this for a living. [i added some tags]

+3

Try the following:

 #include <complex>
 #include <cmath>
 #include <iostream>

 // PARAMETERS
 const size_t no_freqs = 3;
 const double freqs[no_freqs] = {0.05, 0.16, 0.39}; // (for example)

 int main() {
     // Running sums of a few arbitrary points of the Fourier transform.
     std::complex<double> spectral_amplitude[no_freqs];
     for (size_t i = 0; i < no_freqs; ++i)
         spectral_amplitude[i] = 0.0;

     size_t n_data = 0;
     {
         std::complex<double> datum;
         while (std::cin >> datum) {
             // Accumulate datum * exp(i * freq * n) for each chosen frequency.
             for (size_t i = 0; i < no_freqs; ++i) {
                 spectral_amplitude[i] += datum * std::exp(
                     std::complex<double>(0.0, 1.0) * freqs[i] * double(n_data));
             }
             ++n_data;
         }
     }

     std::cout << "Fuzzy checksum:\n";
     for (size_t i = 0; i < no_freqs; ++i) {
         std::cout << real(spectral_amplitude[i]) << "\n";
         std::cout << imag(spectral_amplitude[i]) << "\n";
     }
     std::cout << "\n";

     return 0;
 }

It returns just a few arbitrary points of the Fourier transform of the whole data set. These serve as the fuzzy checksum.
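A possible way to compare two such checksums, assuming a per-element tolerance like the question's 0.0001: since |exp(i*w*n)| = 1, each spectral amplitude can differ by at most N times the per-element error, so something along these lines would do (the function name is just for illustration):

#include <cmath>
#include <complex>
#include <cstddef>

// Accept if every spectral amplitude agrees within the worst-case bound n_data * tol.
bool spectra_match(const std::complex<double>* a, const std::complex<double>* b,
                   std::size_t no_freqs, std::size_t n_data, double tol) {
    for (std::size_t i = 0; i < no_freqs; ++i)
        if (std::abs(a[i] - b[i]) > tol * double(n_data))
            return false;
    return true;
}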

+2

How about computing a standard integer checksum on data obtained by zeroing out the least significant digits - the ones you don't care about?
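A minimal sketch of this idea, assuming the question's 0.0001 tolerance and using a plain wrapping 64-bit sum as a stand-in for the "standard integer checksum" (note it still has the rounding-boundary pathology mentioned in another answer: a value sitting near a boundary may round differently in the two systems):

#include <cmath>
#include <cstdint>
#include <vector>

// Round each value onto a grid of size tol, then checksum the resulting integers.
// The wrapping sum could be replaced by any standard integer checksum (CRC, etc.).
std::uint64_t rounded_checksum(const std::vector<double>& x, double tol = 0.0001) {
    std::uint64_t sum = 0;
    for (double v : x)
        sum += static_cast<std::uint64_t>(std::llround(v / tol));
    return sum;
}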

+1

Source: https://habr.com/ru/post/1402155/
