I have a MySQL database with a few (five, to be exact) huge tables. It is essentially a data warehouse built on a star schema. Table sizes range from about 1 GB up to 700 GB for the fact table, and the whole database comes to roughly 1 TB. I have now been tasked with running analytics on these tables, which may even involve joins.

A simple analytical query against this database would be "find the number of smokers per state and list them in descending order"; that requirement translates into a simple query like
select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;
This query (and many others of a similar nature) takes an extremely long time to run against this database, on the order of tens of hours.
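For reference, the kind of covering index I would expect to help a query like the one above is shown below (just a sketch; the index name is made up, and whether it actually helps at this scale, given our insert load, is exactly what I am not sure about):

alter table abc add index idx_smoking_state (smokingStatus, state);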
This database also takes heavy insert traffic, meaning thousands of rows are added every few minutes.
Given this scenario, how can I solve the problem? I looked into Cassandra, which seems easy enough to set up, but I'm not sure how easy it would be to run analytic queries against it, especially ones that need WHERE clauses and GROUP BY constructs.
I also looked at Hadoop, but I'm not sure how to run RDBMS-style queries on it. I'm also not sure I want to invest right away in at least three machines for the name node, ZooKeeper and the data nodes!! On top of that, our company prefers Windows-based solutions.
I also thought about pre-calculating all the data into simpler pivot/summary tables, but that limits my ability to run ad-hoc queries.
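To illustrate what I mean by pre-calculating, a summary table for the smoker example could be maintained roughly like this (a sketch only; the summary table, its column types and the refresh approach are hypothetical):

create table smokers_by_state (
  state varchar(64) not null,   -- hypothetical type/length
  smokers bigint not null,
  primary key (state)
);

-- refreshed periodically (or maintained incrementally on insert)
replace into smokers_by_state (state, smokers)
select state, count(smokingStatus)
from abc
where smokingStatus = 'current smoker'
group by state;

Every new kind of aggregate would need its own table like this, which is exactly the flexibility problem.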
Are there any other ideas that I can implement?
EDIT
Here is the setup of the MySQL environment:

1) master-slave setup
2) master is used for inserts/updates
3) slave is used for reads and for running stored procedures
4) all tables are InnoDB, with one file per table
5) indexes on string as well as int columns
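In my.cnf terms, the InnoDB part of that setup boils down to something like the following (a sketch; the buffer pool size is a placeholder, not our actual value):

[mysqld]
# one tablespace file per table, as mentioned above
innodb_file_per_table = 1
# placeholder value, not our real setting
innodb_buffer_pool_size = 64G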
Pre-calculated values are an option, but the requirements for these kinds of aggregated values keep changing.