Using Hive for real-time queries

First of all, I want to make clear that I am still learning Hive and Hadoop (and big data in general), so apologies in advance for any lack of proper vocabulary.

I am embarking on a huge (at least for me) project that involves dealing with volumes of data I have never handled before, since I have always worked mainly with MySQL.

For this project, a set of sensors will produce about 125,000,000 data points 5 times per hour (15,000,000,000 per day), which is several times more than I have ever inserted into any MySQL table.

I understand that one approach would be to use Hadoop MapReduce and Hive to query and analyze the data.

The problem I am facing is that, from what I have learned so far, Hive works primarily through scheduled batch ("cron-style") jobs rather than real-time queries; a query can take many hours and requires additional infrastructure.

I was thinking of populating MySQL tables with the results of Hive queries, since the data needed for real-time querying will be about 1,000,000,000 rows, but I am wondering whether this is the right approach or whether I should learn some other technology.

Is there any technology specifically designed for real-time queries over big data that I should learn?

Any advice would be greatly appreciated!

1 answer

This is a difficult question. Let's start by looking at the technologies you mentioned in your question and go from there:

  • MySQL: this should be obvious to anyone who has used MySQL (or any other relational database) that the traditional ready-made MySQL version will not support the volume you are talking about. Calculating the back of the envelope is enough to tell us that assuming your sensor inserts are only 100 bytes, you are talking about 15 billion x 100 bytes = 1.5 trillion bytes or 1.396 terabytes per day. This is really big data, especially if you plan to store it for more than one or two days.

  • Hive: Hive, of course, can process such a volume of data (I and many others did it), but, as you note, you do not receive requests in real time. Each request will be in batch mode, and if you need fast requests, you need to first aggregate the data.

Now this brings us to the real question: what queries do you need to run? If you need to run arbitrary queries in real time and can never predict what those queries might be, you will probably have to look at relatively expensive proprietary data stores such as Vertica, Greenplum, Microsoft PDW, etc. They will cost a lot of money, but they (and others) can handle the load you are talking about.

If, on the other hand, you can predict with some accuracy the kinds of queries that will be run, then something like Hive may make sense. Store the raw data there, use the power of batch queries to do the heavy lifting, and periodically build aggregated summary tables in MySQL or another relational database to serve your low-latency query needs.
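As a minimal sketch of that pattern, here is the kind of rollup the periodic Hive batch job would compute (a per-sensor hourly average, a hypothetical metric chosen for illustration); the aggregated output is what you would load into a MySQL summary table for low-latency queries:

```python
from collections import defaultdict

def hourly_averages(readings):
    """Roll raw (sensor_id, hour, value) rows up into per-sensor hourly averages.

    Stands in for the periodic Hive batch job; in practice this would be a
    GROUP BY query over the raw table, not in-memory Python.
    """
    acc = defaultdict(lambda: [0.0, 0])  # (sensor_id, hour) -> [sum, count]
    for sensor_id, hour, value in readings:
        bucket = acc[(sensor_id, hour)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in acc.items()}

raw = [("s1", 0, 10.0), ("s1", 0, 20.0), ("s2", 0, 5.0)]
print(hourly_averages(raw))  # {('s1', 0): 15.0, ('s2', 0): 5.0}
```

The key design point is that the expensive scan over billions of raw rows happens once per period in batch, and the serving database only ever sees the much smaller aggregate.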

Another alternative is something like HBase. HBase gives you low-latency access to distributed data, but you lose two critical things you are probably used to: a query language (HBase has no SQL) and built-in aggregation. To aggregate in HBase you need to run a MapReduce job, though that job can then write its results back into HBase for low-latency access.
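To make that concrete, here is a toy sketch of the map/shuffle/reduce structure such a job would have. The row-key scheme ("sensor_id:hour") and the max aggregate are hypothetical choices for illustration, not anything HBase prescribes:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    # Emit one (key, value) pair per stored reading.
    for row_key, value in rows:
        yield row_key, value

def reduce_phase(pairs):
    # Shuffle/sort by key, then aggregate each group; in a real job the
    # results would be written back to an HBase table for fast lookups.
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: max(v for _, v in group)
            for key, group in groupby(ordered, key=itemgetter(0))}

rows = [("s1:00", 3.2), ("s1:00", 4.8), ("s2:00", 1.1)]
print(reduce_phase(map_phase(rows)))  # {'s1:00': 4.8, 's2:00': 1.1}
```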


Source: https://habr.com/ru/post/1437496/
