Estimating the Number of Results for a Google App Engine Query

I am trying to estimate the total number of results for App Engine queries that return large result sets.

To do this, I assigned each entity a random floating-point number between 0 and 1. Then I ran the query whose total result count I wanted to estimate, with the following three settings:

  * I ordered by the random numbers I had assigned, in ascending order
  * I set the offset to 1000
  * I fetched only one entity

Then I plugged the random value of the entity I fetched into the following formula to estimate the total number of results (since I used 1000 as the offset above, OFFSET here is 1000):

  1 / RANDOM * OFFSET

The idea is that since each entity is assigned a uniform random number and I sort by that number, the random value found at a given offset should be proportional to that offset's share of the total: with N entities, the value at offset k should be roughly k / N, so N should be roughly k divided by the value found (here, k = 1000).
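
As a sketch, the whole scheme looks something like this with the old db API; the Item model and its rand property are hypothetical names, not anything prescribed by App Engine:

    import random
    from google.appengine.ext import db

    class Item(db.Model):
        # 'rand' is assigned once, at creation time.
        rand = db.FloatProperty()

    def create_item():
        # Each entity gets a uniform random number in [0, 1).
        item = Item(rand=random.random())
        item.put()
        return item

    def estimate_count(offset=1000):
        # Fetch the single entity sitting at 'offset' in random-key order.
        results = Item.all().order('rand').fetch(limit=1, offset=offset)
        if not results:
            return None  # fewer than 'offset' entities exist
        # With N uniform values sorted ascending, the value at rank k is
        # about k / N, so N is about k / value (i.e. 1 / RANDOM * OFFSET).
        return offset / results[0].rand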

The problem I am facing is that the estimates I get are consistently low, and the smaller the offset, the lower they are. I expected smaller offsets to give less accurate estimates, but I expected the error to fall both above and below the actual number of results.

Below is a chart demonstrating what I'm talking about. As you can see, the predictions become more consistent (accurate) as the offset increases from 1000 to 5000. After that, the predictions closely follow a fourth-degree polynomial fit (y = -5E-15x^4 + 7E-10x^3 - 3E-05x^2 + 0.3781x + 51608).

Am I making a mistake here, or is the standard Python random number generator not distributing numbers uniformly enough for this purpose?

Thanks!

[Chart: estimated vs. actual result count by offset, with fourth-degree polynomial fit]

Edit:

It turns out this problem was my own mistake. In another part of the program, I was grabbing entities from the beginning of the ordering, performing an operation on them, and then assigning them a new random number. This left the random numbers more densely packed toward the end of the range.
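
Schematically, the bug looked something like the following; Item, rand, and do_work are stand-in names for illustration:

    import random

    def process_next_buggy():
        # Always takes the entity with the lowest random value...
        item = Item.all().order('rand').get()
        do_work(item)  # stand-in for the actual processing step
        # ...and then gives it a NEW random value. Low values keep getting
        # resampled while untouched entities keep theirs, so over time the
        # distribution of 'rand' piles up toward the high end.
        item.rand = random.random()
        item.put()

    # The fix: assign 'rand' once at creation and never touch it again.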

I dug a little deeper into this approach, fixed the problem, and tried it again on a different query (so the number of results differs from the previous one). I found that this idea can indeed be used to estimate the total number of results for a query. Note that the "error" is very similar for offsets close to one another. When I made the scatter chart in Excel, I expected the prediction accuracy at each offset to form a cloud: offsets near the very beginning would produce a larger, less dense cloud that converges to a very small, dense one around the actual value as the offsets get larger. That is not what happened, as you can see in the chart below of how far off the predictions were at each offset. Where I expected a point cloud, there is a line instead.

[Chart: prediction error at each offset, forming a line rather than a point cloud]

This is a graph of the maximum error at or beyond each offset. For example, the maximum error for any offset past 10,000 was less than 1%:

[Chart: maximum error for all offsets at or beyond a given offset]

3 answers

Quick thought:

For instance: if the actual number of entities is 60,000, the question is what the probability is that your 1000th (2000th, 3000th, ...) sample falls in an interval [l, u] such that the total estimated from that sample is within an acceptable error of 60,000.

If the acceptable margin of error is 5%, the interval [l, u] is [0.015873015873015872, 0.017543859649122806] (that is, [1000/63000, 1000/57000]), and I don't think that probability is very large.
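
To put a number on that: the k-th smallest of n uniform(0, 1) values follows a Beta(k, n - k + 1) distribution, so the probability can be computed directly. A sketch using scipy (my choice of tool, for offline analysis rather than on GAE):

    from scipy.stats import beta

    n = 60000   # actual number of entities
    k = 1000    # offset of the sampled entity
    err = 0.05  # acceptable relative error

    # The estimate is k / value, so an estimate within +/-5% of n
    # corresponds to the sampled value lying in [l, u]:
    l = float(k) / (n * (1 + err))  # 1000 / 63000
    u = float(k) / (n * (1 - err))  # 1000 / 57000

    # The k-th order statistic of n uniforms is Beta(k, n - k + 1).
    p = beta.cdf(u, k, n - k + 1) - beta.cdf(l, k, n - k + 1)
    print(p)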


When using GAE, it makes much more sense not to do a lot of read work at query time: the datastore is built and optimized for very fast queries. In this case, it is actually more efficient to maintain a count of your results as you create the entities.

If you have a standard query, it's pretty simple: just create sharded counter entities. You can seed these using a mapreduce job to get an initial count.
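
A minimal sketch of such a sharded counter (the shard count and model names are my own choices for illustration):

    import random
    from google.appengine.ext import db

    NUM_SHARDS = 20  # arbitrary; more shards = higher write throughput

    class CounterShard(db.Model):
        # One shard of a named counter; key_name encodes name and index.
        name = db.StringProperty(required=True)
        count = db.IntegerProperty(default=0)

    def increment(name):
        # Write to a random shard so concurrent increments rarely contend.
        index = random.randint(0, NUM_SHARDS - 1)
        key_name = '%s-%d' % (name, index)
        def txn():
            shard = CounterShard.get_by_key_name(key_name)
            if shard is None:
                shard = CounterShard(key_name=key_name, name=name)
            shard.count += 1
            shard.put()
        db.run_in_transaction(txn)

    def get_count(name):
        # Reading sums all shards; cache this if you read it often.
        return sum(s.count for s in CounterShard.all().filter('name =', name))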

If your queries can be dynamic, this is harder. If you know the range of possible queries you might execute, you would create a counter for each query that can be run.

If the range of possible queries is unbounded, you might consider combining counters or using them in more creative ways.

If you tell us the query you are trying to run, someone may have a better idea.


This does not directly address the statistical aspect of your question, but would the count() method of the Query object work for you? Or have you tried it and found it unsuitable? According to the docs, it is only marginally faster than retrieving all of the data, but on the plus side it gives you the actual number of results.

http://code.google.com/appengine/docs/python/datastore/queryclass.html#Query_count
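
For reference, usage would be something like this (Item again being a stand-in model name):

    # count() walks the index rather than fetching the entities themselves,
    # so it is cheaper than a full fetch but still does work proportional
    # to the number of results.
    total = Item.all().count(100000)  # optional cap on how far to count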


Source: https://habr.com/ru/post/1389219/

