MapReduce using SQL Server as a data source

I am currently exploring the possibility of using MapReduce to support incremental views in SQL Server.

Basically, I want to use MapReduce to maintain materialized views.

I'm a bit stuck thinking about how to partition my data. I don't really have a Big Data situation (50 GB at most), but I do have a lot of complexity and the performance problems that come with it, and I want to see whether those problems would go away with a MapReduce / NoSQL approach.

Regarding MapReduce, my current problem is partitioning. Since I'm using SQL Server as the data source, data locality isn't really an issue, so I don't need to ship data around; each worker should simply be able to pull its own slice of the data based on the map definition.

I intend to expose the data entirely through LINQ, and perhaps something like Entity Framework, mostly to provide a familiar interface. That's somewhat beside the point, but it's the route I'm currently exploring.

Now, how do I partition my data? I have a primary key, and I have map and reduce definitions expressed as expression trees (ASTs, if you're not familiar with LINQ).
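
To make concrete the kind of definitions I have in mind, here is a minimal sketch in C#. Everything in it (OrderRow, ViewDefinition, the "total per customer" view) is a made-up illustration, not an existing API; the only point is that map and reduce stay as expression trees, so they can later be inspected, split across workers, or translated into SQL.

```csharp
using System;
using System.Linq.Expressions;

// Hypothetical source row; in practice this would be an EF entity.
public class OrderRow
{
    public int OrderId { get; set; }      // primary key
    public int CustomerId { get; set; }
    public decimal Amount { get; set; }
}

// Hypothetical container for a map/reduce view definition.
// The lambdas are kept as expression trees (ASTs) rather than compiled delegates.
public class ViewDefinition<TSource, TKey, TValue>
{
    public Expression<Func<TSource, TKey>> MapKey { get; set; }
    public Expression<Func<TSource, TValue>> MapValue { get; set; }
    public Expression<Func<TValue, TValue, TValue>> Reduce { get; set; }
}

public static class Views
{
    // A "total amount per customer" view, defined purely as expression trees.
    public static readonly ViewDefinition<OrderRow, int, decimal> TotalPerCustomer =
        new ViewDefinition<OrderRow, int, decimal>
        {
            MapKey   = row => row.CustomerId,
            MapValue = row => row.Amount,
            Reduce   = (a, b) => a + b
        };
}
```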

  • First, how do I come up with a way to split the whole input and partition the original problem? (I think I'll need to use SQL Server window functions such as ROW_NUMBER and NTILE; see the sketch after this list.)

  • Second, and more importantly, how do I do this incrementally? That is, when rows are added to or changed in the source data, how do I keep the amount of recomputation to a minimum? (A sketch of one idea also follows below.)
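
For the first point, here is a rough sketch of what I mean, assuming a table dbo.Orders with primary key OrderId (both invented for the example): SQL Server assigns every row to one of N buckets with NTILE, and each worker pulls only its own bucket.

```csharp
using System.Data;
using Microsoft.Data.SqlClient; // or System.Data.SqlClient

public static class Partitioner
{
    // Hypothetical: return the rows belonging to one worker's bucket.
    // bucketCount is the number of workers; bucket is 1-based, as NTILE is.
    public static SqlDataReader ReadBucket(SqlConnection conn, int bucketCount, int bucket)
    {
        const string sql = @"
            WITH numbered AS
            (
                SELECT o.*,
                       NTILE(@bucketCount) OVER (ORDER BY o.OrderId) AS bucket
                FROM dbo.Orders AS o
            )
            SELECT *
            FROM numbered
            WHERE bucket = @bucket;";

        var cmd = new SqlCommand(sql, conn);
        cmd.Parameters.Add("@bucketCount", SqlDbType.Int).Value = bucketCount;
        cmd.Parameters.Add("@bucket", SqlDbType.Int).Value = bucket;
        return cmd.ExecuteReader();
    }
}
```

An alternative that avoids every worker re-running the NTILE scan would be to compute the bucket boundaries once (the minimum and maximum OrderId per bucket) and hand each worker a plain range predicate on the primary key.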

I've been looking at CouchDB for inspiration; it seems to have a way of doing exactly this, but how can I get some of those benefits on top of SQL Server?
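
For the second point, the closest analogue to CouchDB's incremental views that I can think of on SQL Server is to track which source rows changed since the last refresh, then re-map only those rows and re-reduce only the affected groups. A rough sketch of that bookkeeping, assuming dbo.Orders carries a rowversion column named RowVersion (the table and column names are invented; the rowversion type and Change Tracking are real SQL Server features):

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

public static class IncrementalRefresh
{
    // Hypothetical: return the map keys (here CustomerId) whose groups must be
    // re-reduced, based on a rowversion watermark saved at the end of the last run.
    public static SqlDataReader DirtyGroups(SqlConnection conn, byte[] lastWatermark)
    {
        const string sql = @"
            -- RowVersion is assumed to be a rowversion column on dbo.Orders
            SELECT DISTINCT o.CustomerId
            FROM dbo.Orders AS o
            WHERE o.RowVersion > @lastWatermark;";

        var cmd = new SqlCommand(sql, conn);
        cmd.Parameters.Add("@lastWatermark", SqlDbType.Timestamp).Value = lastWatermark;
        return cmd.ExecuteReader();
    }
}
```

Only the groups returned here would be re-mapped and re-reduced; the rest of the materialized view stays untouched. This doesn't handle deletes by itself, which is where SQL Server's built-in Change Tracking might be the better source of "what changed since the last run".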

1 answer

I've run into something similar. I think you should forget about window functions, since they serialize your process; in other words, all the workers end up waiting on a single query.

What we tested, and which "works", is splitting the data across a larger number of tables (each month gets its own x tables) and running separate analytical processes against those partitions, marking the data as processed / unprocessed / possibly bad / etc. after the reduce step.
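
To illustrate the kind of layout I mean (all names invented for the example, a sketch only): each month gets its own staging table, and every row carries a status flag that the reduce step updates.

```csharp
using Microsoft.Data.SqlClient;

public static class MonthlyPartitions
{
    // Hypothetical: create one staging table per month, with a Status column
    // the reduce step updates (0 = unprocessed, 1 = processed, 2 = suspect).
    public static void EnsureMonthTable(SqlConnection conn, int year, int month)
    {
        string table = $"dbo.Orders_{year:D4}_{month:D2}";
        string sql = $@"
            IF OBJECT_ID(N'{table}', N'U') IS NULL
            CREATE TABLE {table}
            (
                OrderId    INT            NOT NULL PRIMARY KEY,
                CustomerId INT            NOT NULL,
                Amount     DECIMAL(18, 2) NOT NULL,
                Status     TINYINT        NOT NULL DEFAULT 0
            );";

        using var cmd = new SqlCommand(sql, conn);
        cmd.ExecuteNonQuery();
    }
}
```

Because each analytical job then touches only its own month's tables, the jobs don't fight over locks on one big table, which is the locking problem mentioned below.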

Tests with a single partitioned table ran into locking problems.

This will certainly add a bit more complexity to your current solution.

