I need help choosing a database for our project. We are developing a web application that collects and analyzes data about user behavior (a poor explanation, but I cannot give many more details; web analytics data is one of our main data sets). According to our estimates, we will insert about 200 million rows per week into the database, plus data derived from that raw data. Data must be kept for at least six months.
I have spent the last week and a half gathering information on various solutions, but there is so much of it that I feel lost. The most promising ones I have found are Cassandra, HBase and Hive. I also looked at MongoDB, Redis, and some others, but they seemed to address different needs, or their communities were not as active.
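To give a sense of scale, here is a rough back-of-the-envelope sketch of what those numbers imply. The 26-week approximation of six months and the average row size are assumptions made only for illustration; only the 200 million rows per week figure comes from our estimate.

```python
# Rough sizing sketch. Only ROWS_PER_WEEK comes from our estimate;
# the other constants are assumptions for illustration.
ROWS_PER_WEEK = 200_000_000
RETENTION_WEEKS = 26          # assumption: six months is about 26 weeks
ASSUMED_BYTES_PER_ROW = 200   # hypothetical average raw row size

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def retained_rows(rows_per_week=ROWS_PER_WEEK, weeks=RETENTION_WEEKS):
    """Raw rows held at any time once the retention window is full."""
    return rows_per_week * weeks


def average_insert_rate(rows_per_week=ROWS_PER_WEEK):
    """Average rows per second, assuming a perfectly even load."""
    return rows_per_week / SECONDS_PER_WEEK


def raw_storage_gb(bytes_per_row=ASSUMED_BYTES_PER_ROW):
    """Uncompressed raw size in GB, ignoring indexes and replication."""
    return retained_rows() * bytes_per_row / 1e9


if __name__ == "__main__":
    print(f"retained rows:       {retained_rows():,}")
    print(f"average inserts/sec: {average_insert_rate():.0f}")
    print(f"raw storage (GB):    {raw_storage_gb():.0f}")
```

Real traffic will be bursty, so peak insert rates are likely well above the average, and derived data, indexes and replication add to the storage figure.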
- The entire application will run on Amazon EC2. As a start-up, the pay-as-you-go model fits us like a glove. The simpler the database is to manage in the cloud, the better.
- Scalability is important. The amount of data that we will generate varies greatly and will grow over time.
- We cannot pay huge licensing fees. Otherwise, we could use something like http://www.vertica.com/ .
- We need to do all kinds of data analysis, and the simpler the analyses are to write, the better. I was thinking about using MapReduce for this task; HBase seems to have better support for it than Cassandra, and Hive has its own query language. Real-time analysis is not required; we can compute the results once a day and load them back into the database for quick retrieval.
- Compression support would be nice, but is not necessary (disk space is cheap :).
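To make the "compute once a day and load the results back" idea concrete, here is a minimal map/reduce-style sketch in plain Python. It does not use any Hadoop, HBase or Hive API; the event fields (`ts`, `page`) and the function names are hypothetical and only stand in for whatever our raw rows actually contain.

```python
from collections import defaultdict
from datetime import datetime, timezone


def map_event(event):
    """Map step: emit ((day, page), 1) for one raw event.

    `event` is a hypothetical dict with a Unix timestamp `ts` (seconds)
    and a `page` string.
    """
    day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date().isoformat()
    yield (day, event["page"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def daily_page_views(events):
    """Run the map and reduce steps over an iterable of raw events."""
    pairs = (pair for event in events for pair in map_event(event))
    return reduce_counts(pairs)


if __name__ == "__main__":
    sample = [
        {"ts": 0, "page": "/home"},
        {"ts": 10, "page": "/home"},
        {"ts": 86400, "page": "/home"},
    ]
    # Each (day, page) total would become one summary row written back
    # to the database for quick retrieval.
    for (day, page), views in sorted(daily_page_views(sample).items()):
        print(day, page, views)
```

In a real cluster the map and reduce steps would run in parallel over partitions of the raw data; the sketch only shows the shape of the daily job.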
MySql ( ..), , , - - db , - , .