Running analytics in a huge MySQL database

I have a MySQL database with several (five, to be precise) huge tables. It is essentially a data warehouse built on a star schema. Table sizes range from 1 GB up to 700 GB for the fact table, and the whole database comes to about 1 TB. I have now been tasked with running analytics on these tables, which may even involve joins. A simple analytical query on this database would be "find the number of smokers per state and display them in descending order". That requirement translates into a simple query like:

select state, count(smokingStatus) as smokers
from abc
where smokingStatus = 'current smoker'
group by state
order by smokers desc;

This query (and many others of a similar nature) takes a very long time to run on this database, on the order of tens of hours.

The database is also under heavy insert load, with thousands of rows being added every few minutes.

Given this scenario, how can I solve the problem? I have looked at Cassandra, which seems easy to set up, but I'm not sure how easy it would be to run analytical queries against it, especially ones that need a WHERE clause and GROUP BY.

I have also looked at Hadoop, but I'm not sure how to run RDBMS-style queries on it, and I'm not keen to invest right away in at least three machines for the name node, ZooKeeper, and the data nodes. On top of that, our company prefers Windows-based solutions.

I have also thought about precalculating all the data into simpler summary tables, but that limits my ability to run a variety of queries.

Are there any other ideas I could try?

EDIT

Here is the setup of the MySQL environment:

1) master-slave setup
2) the master handles inserts/updates
3) the slave handles reads and runs the stored procedures
4) all tables are InnoDB with innodb_file_per_table enabled
5) there are indexes on string columns as well as on int columns

Precalculated values are an option, but the requirements for these kinds of aggregates keep changing, so they are hard to keep up to date.

+4
3 answers

Looking at this from the perspective of getting MySQL to work better, rather than building a completely new architecture:

First, find out what is actually happening: EXPLAIN the queries that cause problems, rather than guessing.
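
For example, on the question's abc fact table, the plan for the problem query can be inspected like this (a minimal sketch, reusing the query from the question):

    explain select state, count(smokingStatus) as smokers
    from abc
    where smokingStatus = 'current smoker'
    group by state
    order by smokers desc;

If the type column of the output says ALL, the query is doing a full table scan; an index on (smokingStatus, state) would cover this particular query, letting it be answered from the index alone.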

Having said that, I'm going to guess at what is happening, since I have no query plans. My guesses: (a) your indexes are not being used correctly and you are getting a bunch of avoidable table scans, (b) your database server is tuned for OLTP, not analytical queries, (c) writes hitting the tables while you read slow things down considerably, (d) working with string values instead of integers is just plain slow, and (e) you have some inefficient queries with terrible joins (everyone has some of those).

To improve the situation, I would investigate the following (in roughly this order):

  • Check the query plans and make sure the existing indexes are being used correctly; look for table scans, and make sure the queries actually make sense.

  • Move the analytic queries off the OLTP system. The tuning needed for quick inserts and short queries is very different from the tuning for queries that potentially read most of a large table. This could mean having a separate analytic slave with a different configuration (and possibly different table types; I'm not sure what the current state of the art in MySQL is).

  • Get the strings out of the fact table. Instead of having a smokingStatus column with string values like (say) "current smoker", "quit recently", "quit 1+ years", "never smoked", move those values into a separate lookup table and keep integer keys in the fact table (this will also help the size of the indexes). A sketch follows this list.

  • Stop updating the tables while queries are running. If the indexes are shifting underneath a running query, I don't see good things happening. It has (happily) been a long time since I had to worry about MySQL replication, so I can't remember offhand whether you can hold off applying writes on the analytic slave without too much drama; see the second sketch after this list.

  • If you get to this point without solving the performance problem, then it's time to think about moving off MySQL. I would first look at Infobright (it is open source / $$ and based on MySQL, so it is probably the easiest to slot into your existing system: make sure the data gets into the Infobright database, point your analytical queries at the Infobright server, and leave the rest of the system as it is), or see whether Vertica ever releases its Community Edition. Hadoop + Hive has a lot of moving parts; it is pretty cool (and great for resumes), but if it is used only for the analytic part of your system, it may need more care and feeding than the other options.
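
Here is a sketch of the string-extraction idea from the third bullet. The table and column names come from the question's example; the status values and the index name are illustrative assumptions:

    -- dimension table holding each distinct status exactly once
    create table smoking_status (
      id tinyint unsigned not null primary key,
      status varchar(20) not null unique
    ) engine=InnoDB;

    insert into smoking_status (id, status) values
      (1, 'current smoker'), (2, 'quit recently'),
      (3, 'quit 1+ years'), (4, 'never smoked');

    -- the fact table keeps only the small integer key
    alter table abc add column smoking_status_id tinyint unsigned;
    update abc a
      join smoking_status s on a.smokingStatus = s.status
      set a.smoking_status_id = s.id;  -- backfill a 700 GB table in batches in practice
    alter table abc drop column smokingStatus;
    alter table abc add index ix_status_state (smoking_status_id, state);

    -- the analytic query now filters on a 1-byte integer instead of a string
    select state, count(*) as smokers
    from abc
    where smoking_status_id = 1
    group by state
    order by smokers desc;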
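
On the fourth bullet: if the analytic queries run on a dedicated slave, replication can be paused around a long-running query using the standard slave commands, something like:

    stop slave sql_thread;   -- stop applying replicated writes; the relay log keeps buffering
    -- ... run the long analytical queries here ...
    start slave sql_thread;  -- resume and let the slave catch up from the relay log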

+1

1 TB is not that big, and MySQL should be able to handle it. At the very least, simple queries like that shouldn't take hours! It may not be very helpful without knowing more context, but here are some questions you can ask yourself, mostly about how the data is used:

  • Is there a way to separate the reads and the writes? How many reads and how many writes do you do per day? Can you live with some delay, e.g. write into a new table each day and merge it into the existing table at the end of the day? (See the sketch after this list.)

  • What do most of your queries look like? Are they mostly aggregation queries? Can you do some partial aggregation in advance? Can you pre-calculate the number of new smokers every day? (A sketch follows the list as well.)

  • Can you use Hadoop for the aggregation step above? Hadoop is very good at this. Basically, use Hadoop only for the daily or batch processing and store the results back in the database.

  • On the DB side, are you using InnoDB or MyISAM? Are the indexes on string columns? Can you convert them to int, etc.?
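
A minimal sketch of the "write to a new table each day" idea from the first bullet, with abc_today as an assumed name for the staging table:

    create table abc_today like abc;          -- same structure as the fact table
    -- point the day's inserts at abc_today, then at end of day:
    insert into abc select * from abc_today;  -- merge into the big table in one batch
    truncate table abc_today;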
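
And a sketch of the daily partial aggregation from the second bullet. The summary table and the recorded_on timestamp column are assumptions for illustration:

    -- small summary table populated once a day by a batch job
    create table daily_smokers_by_state (
      day     date not null,
      state   varchar(32) not null,
      smokers int unsigned not null,
      primary key (day, state)
    ) engine=InnoDB;

    insert into daily_smokers_by_state
    select current_date, state, count(*)
    from abc
    where smokingStatus = 'current smoker'
      and recorded_on = current_date          -- assumed column marking the load date
    group by state;

    -- analytic queries then read the tiny summary table, not the 700 GB fact table
    select state, sum(smokers) as smokers
    from daily_smokers_by_state
    group by state
    order by smokers desc;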

Hope that helps

+1

MySQL has a serious limitation that prevents it from performing well in scenarios like this: the lack of parallel query execution. It cannot use multiple processor cores for a single query.
Hadoop has an RDBMS-flavoured add-on called Hive. It is an application that translates queries written in HiveQL (its SQL dialect) into MapReduce jobs. Since it is effectively a thin layer on top of Hadoop, it inherits Hadoop's linear scalability.
I would suggest deploying Hive alongside MySQL, replicating the daily data into it, and running the heavy aggregations there. That would offload a serious portion of the load from MySQL. You would still need MySQL for the short interactive queries, which are usually backed by indexes; Hive is inherently non-interactive, since every query takes at least several tens of seconds. A sketch follows.
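
As a rough sketch of what this looks like on the Hive side (the table and column names mirror the question's example; the daily load step is elided):

    -- HiveQL: define the table once, load the daily data into it,
    -- and the same aggregation runs as a MapReduce job
    create table abc (
      state string,
      smokingStatus string
    );

    select state, count(smokingStatus) as smokers
    from abc
    where smokingStatus = 'current smoker'
    group by state
    order by smokers desc;
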
Cassandra is built for key-value style access and has no scalable built-in GROUP BY capability. There is DataStax Brisk, which integrates Cassandra with Hive/MapReduce, but mapping your schema onto Cassandra may not be trivial, and you still would not get the flexibility and indexing capabilities of an RDBMS.

Bottom line: Hive alongside MySQL should be a good solution.

0

Source: https://habr.com/ru/post/1402306/

