Large web datasets in Python - how to work with very large arrays?

We are currently using Cassandra ( http://cassandra.apache.org/ ) for time-series data. Cassandra reads very quickly, but we must perform a series of calculations on our data before presenting it (effectively we simulate the functionality of SQL's SUM and GROUP BY — something Cassandra does not support out of the box).

We are familiar with Python (to some extent) and decided to write a script that queries our Cassandra cluster, does the math, and presents the result in JSON format:

    import json
    import time

    query = "SELECT query here..."

    startTimeQuery = time.time()
    # Execute the Cassandra query (cassession is our driver session object)
    rslt = cassession.execute(query)
    print("--- %s seconds to query ---" % (time.time() - startTimeQuery))

    tally = {}
    startTimeCalcs = time.time()
    for row in rslt:
        userid = row.site_user_id
        revenue = int(row.revenue) - int(row.reversals_revenue or 0)
        accepted = int(row.accepted or 0)
        reversals_revenue = int(row.reversals_revenue or 0)
        error = int(row.error or 0)
        impressions_negative = int(row.impressions_negative or 0)
        impressions_positive = int(row.impressions_positive or 0)
        rejected = int(row.rejected or 0)
        reversals_rejected = int(row.reversals_rejected or 0)
        if userid in tally:  # dict.has_key() is Python 2 only; "in" works everywhere
            tally[userid]["revenue"] += revenue
            tally[userid]["accepted"] += accepted
            tally[userid]["reversals_revenue"] += reversals_revenue
            tally[userid]["error"] += error
            tally[userid]["impressions_negative"] += impressions_negative
            tally[userid]["impressions_positive"] += impressions_positive
            tally[userid]["rejected"] += rejected
            tally[userid]["reversals_rejected"] += reversals_rejected
        else:
            tally[userid] = {
                "accepted": accepted,
                "error": error,
                "impressions_negative": impressions_negative,
                "impressions_positive": impressions_positive,
                "rejected": rejected,
                "revenue": revenue,
                "reversals_rejected": reversals_rejected,
                "reversals_revenue": reversals_revenue,
            }
    print("--- %s seconds to calculate results ---" % (time.time() - startTimeCalcs))

    startTimeJson = time.time()
    jsonOutput = json.dumps(tally)
    print("--- %s seconds for json dump ---" % (time.time() - startTimeJson))
    print("--- %s seconds total ---" % (time.time() - startTimeQuery))
    print("Array Size: " + str(len(tally)))

This is the output:

    --- 0.493520975113 seconds to query ---
    --- 23.1472680569 seconds to calculate results ---
    --- 0.546246051788 seconds for json dump ---
    --- 24.1871240139 seconds total ---
    Array Size: 198124

We spend a lot of time in the calculation step, and we know the problem is not so much the sums and the groupings themselves: it is the sheer number of rows we iterate over.

We have heard good things about numpy, but the nature of our data means we do not know the size of the arrays in advance.
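(For context on what we are doing per row: the unknown result size is not in itself a blocker, since a `collections.defaultdict` grows as rows arrive. A minimal sketch of the accumulation, using invented dict-shaped rows rather than real driver rows, and omitting our revenue-minus-reversals adjustment:)

```python
from collections import defaultdict

FIELDS = ("accepted", "error", "impressions_negative", "impressions_positive",
          "rejected", "revenue", "reversals_rejected", "reversals_revenue")

def aggregate(rows):
    # Each missing user gets a fresh zeroed bucket; no has_key/in branching needed.
    tally = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for row in rows:
        bucket = tally[row["site_user_id"]]
        for f in FIELDS:
            bucket[f] += int(row.get(f) or 0)  # treat missing/None as 0
    return dict(tally)

# Invented sample rows for illustration only
rows = [
    {"site_user_id": 1, "revenue": 10, "accepted": 2},
    {"site_user_id": 1, "revenue": 5, "accepted": 1},
    {"site_user_id": 2, "revenue": 7},
]
result = aggregate(rows)
print(result[1]["revenue"])  # 15
```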

We are looking for any advice on how to approach this, including a completely different programming approach.

+5
2 answers

I did a very similar kind of processing, and the processing times bothered me too. I think you are overlooking something important: the result object that execute() returns does not contain all the rows you want yet. Instead, it holds a paginated result and fetches rows as you iterate over it in the for loop. This is based on personal observation, although I cannot provide more technical detail about it.

I suggest you isolate the query from the result processing by adding a simple rslt = list(rslt) immediately after the execute() call. This forces Python to walk through all the rows in the result, and forces the Cassandra driver to fetch every page, before processing starts.
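As a sketch of the effect (with a stand-in generator instead of the real paginated ResultSet, since the same thing happens with any lazy iterable):

```python
import time

def paged_result(n):
    # Stand-in for the driver's lazy, paginated ResultSet (hypothetical);
    # with the real driver, network fetches happen during iteration.
    for i in range(n):
        yield i

start = time.time()
rows = list(paged_result(100000))  # materialize everything up front
fetch_time = time.time() - start

start = time.time()
total = sum(rows)  # now this measures only the computation
calc_time = time.time() - start
print(total)  # 4999950000
```

With the real driver, the first timer now absorbs the page fetches, and the second timer measures only your math.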

I think you will find that much of what looked like processing time was actually query time, masked by the driver's use of paginated results.

+1

Cassandra 2.2 and later allows users to define aggregate functions. You can use them to sum columns on the Cassandra side. See the DataStax documentation on user-defined aggregates.
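A sketch of what that could look like, sent from the Python driver. The function and aggregate names are illustrative, not from the question, and user-defined functions must be enabled in cassandra.yaml (`enable_user_defined_functions: true`):

```python
# CQL for a user-defined aggregate (Cassandra 2.2+) that sums an int column
# server-side. Names (state_sum, sum_int) are made up for this example.
STATE_FUNC = """
CREATE OR REPLACE FUNCTION state_sum(state bigint, val int)
    CALLED ON NULL INPUT
    RETURNS bigint
    LANGUAGE java
    AS 'if (val != null) { state = state + val; } return state;';
"""

SUM_AGGREGATE = """
CREATE OR REPLACE AGGREGATE sum_int(int)
    SFUNC state_sum
    STYPE bigint
    INITCOND 0;
"""

print(STATE_FUNC)
print(SUM_AGGREGATE)

# Against a live cluster you would run (requires a connected session):
# session.execute(STATE_FUNC)
# session.execute(SUM_AGGREGATE)
# rows = session.execute("SELECT sum_int(revenue) FROM your_table WHERE ...")
```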

0

Source: https://habr.com/ru/post/1242308/
