Let me break it down a bit. There are two main types of behavior you want to test:
1. The Twitter HyperLogLog implementation works correctly, i.e. it gives a good estimate of the number of distinct elements.
2. Your code that consumes the HyperLogLog structures (for example, your counters) feeds elements into them when it should.
Note that behavior #2 is easy to verify at build time with unit tests rather than integration tests. This is preferable and will catch most of the problems.
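A unit test for behavior #2 can replace the real HyperLogLog with a test double and assert that the consuming code feeds it. This is only a sketch with hypothetical names (`UniqueVisitorCounter`, `record_visit`, `FakeHLL` are invented for illustration, not part of any real API):

```python
class FakeHLL:
    """Test double standing in for the real HyperLogLog structure."""
    def __init__(self):
        self.added = []

    def add(self, item):
        self.added.append(item)


class UniqueVisitorCounter:
    """Hypothetical consuming code: feeds each visit into the HLL."""
    def __init__(self, hll):
        self.hll = hll

    def record_visit(self, user_id):
        self.hll.add(user_id)


# Behavior #2: the counter must push every visit into the HLL structure.
fake = FakeHLL()
counter = UniqueVisitorCounter(fake)
counter.record_visit("alice")
counter.record_visit("bob")
assert fake.added == ["alice", "bob"]
```

Because the double records exactly what it was fed, this test pins down "increments when it should" without caring how the real structure estimates anything.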
Case #1 can itself be split into three sub-cases:
A: the number of elements is 0;
B: the number of elements is small (5, 100, or 1000);
C: the number of elements is large (millions/billions).
Again, cases A and B can and should be tested at build time with unit tests. You have to decide on acceptable error margins for your application and assert that the estimates fall within those margins - it doesn't really matter that you chose HyperLogLog as the underlying estimation method; the tests should treat the estimator as a black box. I would say a 10% error is reasonable for most purposes, but it really depends on your specific application. These limits should represent the worst accuracy your application can live with. For example, a critical error counter may not be able to tolerate ANY estimation error at all, so using HyperLogLog for it should break the unit test. A counter of distinct users might be able to live with a 50% estimation error - that is up to you.
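A black-box unit test for cases A and B might look like the sketch below. The estimator is passed in as a plain function so the test never mentions HyperLogLog; the exact set-based stand-in is only there so the sketch runs, and in real code you would plug in the HLL-backed estimator instead:

```python
ERROR_MARGIN = 0.10  # acceptable relative error; tune per application


def exact_estimator(items):
    """Stand-in so the sketch is runnable; swap in the HLL-backed estimator."""
    return len(set(items))


def assert_estimate_within_margin(items, estimator=exact_estimator):
    truth = len(set(items))
    est = estimator(items)
    if truth == 0:
        assert est == 0  # case A: empty input must estimate zero
    else:
        # case B: relative error stays within the chosen margin
        assert abs(est - truth) / truth <= ERROR_MARGIN


assert_estimate_within_margin([])                                  # case A
assert_estimate_within_margin([f"user-{i}" for i in range(1000)])  # case B
```

Because only `ERROR_MARGIN` and the injected estimator encode the requirements, you can later swap the estimation method without rewriting the test, exactly as the black-box argument above suggests.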
So that leaves the last case - testing that the HyperLogLog implementation gives a good estimate for a large number of elements. This cannot be verified at build time, and here an integration test really is the way to go. However, depending on how much you trust Twitter's HyperLogLog implementation, you might consider NOT TESTING it at all - Twitter should have done that already. This may seem like a departure from best practice, but given the overhead an integration test can carry, it might be worth it in your case.
If you do decide to write an integration test, you will need to simulate the traffic you expect in production and generate it from several sources, since you will be generating millions/billions of requests. You can record a sample of real production traffic and replay it in the test (probably the most accurate method), or characterize what your traffic looks like and generate similar synthetic traffic. Again, the error margin should be chosen per application, and you should be able to swap the estimation method for a better one without breaking the test.
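The shape of such a test can be sketched as follows. The `MinimalHLL` here is a toy HyperLogLog written only so the sketch runs end to end - it is NOT Twitter's implementation, and the traffic generator is a made-up stand-in for recorded production traffic (in a real test you would replay your sample against the real structure, likely at a much larger scale):

```python
import hashlib
import math


class MinimalHLL:
    """Toy HyperLogLog, standing in for the real implementation under test."""

    def __init__(self, p=12):
        self.p = p            # 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest give the rank
        # (position of the leftmost 1-bit in the remaining bits).
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:  # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw


def simulate_traffic(n_unique, repeats=3):
    """Synthetic stand-in for replayed production traffic: n_unique distinct
    user ids, each seen several times (duplicates must not inflate the count)."""
    for _ in range(repeats):
        for i in range(n_unique):
            yield f"user-{i}"


ERROR_MARGIN = 0.10  # choose per application, as discussed above

hll = MinimalHLL()
n_unique = 50_000
for event in simulate_traffic(n_unique):
    hll.add(event)

relative_error = abs(hll.estimate() - n_unique) / n_unique
assert relative_error <= ERROR_MARGIN
```

The assertion only constrains the relative error against the known true cardinality of the generated traffic, so the estimator behind `hll` could be swapped for a better one without touching the test.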