We are currently using Cassandra ( http://cassandra.apache.org/ ) for time series data. Cassandra reads are very fast, but before presenting the data we have to run a series of calculations over it (effectively we simulate SQL's SUM and GROUP BY, which Cassandra does not support out of the box).
We are familiar with Python (to some extent) and wrote a script that queries our Cassandra cluster, does the math, and presents the result in JSON format:
query = ( "SELECT query here...") startTimeQuery = time.time() # Executes cassandra query rslt = cassession.execute(query) print("--- %s seconds to query ---" % (time.time() - startTimeQuery)) tally = {} startTimeCalcs = time.time() for row in rslt: userid = row.site_user_id revenue = (int(row.revenue) - int(row.reversals_revenue or 0)) accepted = int(row.accepted or 0) reversals_revenue = int(row.reversals_revenue or 0) error = int(row.error or 0) impressions_negative = int(row.impressions_negative or 0) impressions_positive = int(row.impressions_positive or 0) rejected = int(row.rejected or 0) reversals_rejected = int(row.reversals_rejected or 0) if tally.has_key(userid): tally[userid]["revenue"] += revenue tally[userid]["accepted"] += accepted tally[userid]["reversals_revenue"] += reversals_revenue tally[userid]["error"] += error tally[userid]["impressions_negative"] += impressions_negative tally[userid]["impressions_positive"] += impressions_positive tally[userid]["rejected"] += rejected tally[userid]["reversals_rejected"] += reversals_rejected else: tally[userid] = { "accepted": accepted, "error": error, "impressions_negative": impressions_negative, "impressions_positive": impressions_positive, "rejected": rejected, "revenue": revenue, "reversals_rejected": reversals_rejected, "reversals_revenue": reversals_revenue } print("--- %s seconds to calculate results ---" % (time.time() - startTimeCalcs)) startTimeJson = time.time() jsonOutput =json.dumps(tally) print("--- %s seconds for json dump ---" % (time.time() - startTimeJson)) print("--- %s seconds total ---" % (time.time() - startTimeQuery)) print "Array Size: " + str(len(tally))
This is the output:
--- 0.493520975113 seconds to query ---
--- 23.1472680569 seconds to calculate results ---
--- 0.546246051788 seconds for json dump ---
--- 24.1871240139 seconds total ---
Array Size: 198124
We spend most of the time in the calculation step. We know the problem is not so much the sums and groupings themselves; it is the sheer size of the result set (almost 200,000 rows) that we loop over in Python.
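For what it is worth, the per-row membership test in the loop above can be dropped with `collections.defaultdict`. This is a minimal sketch against the same column names (`rslt` being the result set from the script above); the row loop itself is still the dominant cost, so whether this helps much is worth measuring:

import json
from collections import defaultdict

FIELDS = ("accepted", "error", "impressions_negative",
          "impressions_positive", "rejected",
          "reversals_rejected", "reversals_revenue")

# Each new userid gets a zeroed counter dict automatically,
# so there is no "if userid in tally" branch per row.
tally = defaultdict(lambda: dict.fromkeys(FIELDS + ("revenue",), 0))

for row in rslt:
    entry = tally[row.site_user_id]
    # Net revenue, as in the original loop
    entry["revenue"] += int(row.revenue) - int(row.reversals_revenue or 0)
    for field in FIELDS:
        entry[field] += int(getattr(row, field) or 0)

# defaultdict is a dict subclass, so json.dumps works unchanged
jsonOutput = json.dumps(tally)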
We have heard good things about numpy, but the nature of our data means the size of the matrix is not known in advance.
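This is not from the original script, but one route often suggested in this situation is pandas, which is built on numpy and does not need sizes known up front: `DataFrame.groupby(...).sum()` pushes the whole aggregation loop into compiled code. A sketch, assuming the driver returns rows with the column names used above:

import pandas as pd

# The Cassandra driver returns namedtuple rows by default, so the
# result set can be loaded straight into a DataFrame.
df = pd.DataFrame(list(rslt))

numeric_cols = ["revenue", "accepted", "reversals_revenue", "error",
                "impressions_negative", "impressions_positive",
                "rejected", "reversals_rejected"]
df[numeric_cols] = df[numeric_cols].fillna(0).astype(int)

# Net revenue, then the vectorized equivalent of
# SUM(...) GROUP BY site_user_id in a single call.
df["revenue"] = df["revenue"] - df["reversals_revenue"]
tally = df.groupby("site_user_id")[numeric_cols].sum()
jsonOutput = tally.to_json(orient="index")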
We are looking for any advice on how to approach this, including a completely different programming approach.