How to minimize datastore writes initiated by the mapreduce library?

There are 3 parts to this question:

I have an application in which users create objects that other users can update within 5 minutes. After 5 minutes, the objects time out and become invalid. I store the objects as datastore entities. To enforce the timeout, I have a cron job that runs once a minute to clear out expired objects.
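For concreteness, here is a minimal sketch of what such an entity could look like. The `OpenObject` model name, `created_at` property, and `is_expired` helper are assumptions for illustration, not names from the original app:

    from datetime import datetime, timedelta
    from google.appengine.ext import db

    class OpenObject(db.Model):  # hypothetical model for the expiring objects
        # Datastore timestamps are UTC, so compare against utcnow().
        created_at = db.DateTimeProperty(auto_now_add=True)

        def is_expired(self):
            # Objects stop being updatable 5 minutes after creation.
            return datetime.utcnow() - self.created_at > timedelta(minutes=5)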

Most of the time there are no active objects. In that case, the mapreduce handler checks each entity it receives and does nothing if the entity is not active; it issues no writes. Even so, my free datastore write quota runs out from the mapreduce calls in about 7 hours. By my rough math, it looks like just firing off a mapreduce costs ~120 writes per call. (Rough math: 60 calls/hour * 7 hours = 420 calls; 50,000 free write ops / 420 calls ≈ 120 writes/call.)
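A sketch of what that no-op-unless-expired handler could look like, assuming the hypothetical `OpenObject` model above; `op.db.Delete` is the mapreduce library's datastore-delete operation, and nothing is yielded (so nothing is written) for still-active entities:

    from mapreduce import operation as op

    def maphandler(entity):
        # Only expired entities produce a datastore write (the delete);
        # active entities fall through and the handler yields nothing.
        if entity.is_expired():
            yield op.db.Delete(entity)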

Q1: Can anyone verify that just firing off a mapreduce causes ~120 datastore writes?

To get around this, I check the datastore before kicking off the mapreduce:

    def cronhandler():
        # Keys-only count is a cheap existence check, capped at 1000 entities.
        count = model.all(keys_only=True).count(limit=1000)
        if count:
            # Roughly one shard per 100 entities.
            shards = (count / 100) + 1
            from mapreduce import control
            control.start_map("Timeout open objects",
                              "expire.maphandler",
                              "expire.OpenOrderInputReader",
                              {'entity_kind': 'model'},
                              shard_count=shards)
        return HttpResponse()

Q2: Is this the best way to avoid the mapreduce-induced datastore writes? Is there a better way to configure the mapreduce to avoid extraneous writes? I was thinking this might be possible with a better custom InputReader.

Q3: I assume more shards result in more extraneous datastore writes from mapreduce bookkeeping. Is limiting the shard count to the expected number of entities I'll need to write the right approach?

+4
3 answers

I guess what I ended up doing is the best way to go about things. The mapreduce API seems to use the datastore to keep track of launched jobs and to synchronize workers. By default the API uses 8 workers. Reducing the number of workers reduces the number of datastore writes, but it also reduces wall-time performance.
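As an illustration, here is the start_map call from the question with the shard count pinned low; a sketch reusing the question's job names, trading wall-time for fewer bookkeeping writes:

    from mapreduce import control

    # Fewer shards -> fewer bookkeeping entities written per run,
    # at the cost of slower wall-time (the default is 8).
    control.start_map("Timeout open objects",
                      "expire.maphandler",
                      "expire.OpenOrderInputReader",
                      {'entity_kind': 'model'},
                      shard_count=1)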

0

What if you kept your objects in memcache instead of the datastore? My only worry is whether memcache is consistent across all instances running the application, but if it is, the problem has a very clean solution.
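App Engine's memcache is in fact shared app-wide rather than per-instance, though entries can be evicted at any time, so this only fits if occasionally losing an object early is acceptable. A minimal sketch, with hypothetical `create_object`/`get_object` helpers and a made-up key scheme:

    from google.appengine.api import memcache

    def create_object(obj_id, data):
        # time=300 gives a 5-minute TTL, so expiry is automatic: no cron,
        # no mapreduce, and no datastore writes at all.
        memcache.set('obj:%s' % obj_id, data, time=300)

    def get_object(obj_id):
        # Returns None once the entry has expired (or been evicted early).
        return memcache.get('obj:%s' % obj_id)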

+2

This doesn't exactly answer your question, but could you reduce the frequency of the cron?

Instead of deleting models as soon as they become invalid, simply filter them out of the queries your users see.

For instance:

    import datetime

    now = datetime.datetime.now()
    five_minutes_ago = now - datetime.timedelta(minutes=5)
    q = model.all()
    q.filter('created_at >=', five_minutes_ago)

Or, if you don't want to use an inequality filter, you could use an equality filter based on five-minute buckets, as in the sketch below.
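A sketch of that bucketing variant, assuming a hypothetical `bucket` DateTimeProperty written at creation time:

    import datetime

    def five_minute_bucket(dt):
        # Round down to the containing five-minute block, e.g. 12:03 -> 12:00.
        return dt.replace(minute=dt.minute - dt.minute % 5,
                          second=0, microsecond=0)

    # At write time: entity.bucket = five_minute_bucket(entity.created_at).
    # At read time, only the two buckets that can still hold live objects:
    now = datetime.datetime.now()
    current = five_minute_bucket(now)
    q = model.all()
    q.filter('bucket IN', [current,
                           current - datetime.timedelta(minutes=5)])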

Then you run your cron every hour or so to clear inactive models.

The downside of this approach is that entities fetched directly by key will still be returned, so you'd need to verify that they're still valid before handing them to the user.

+1
