First of all, I want to make clear that I am only just learning about Hive and Hadoop (and big data in general), so please forgive any lack of proper vocabulary.
I am embarking on a huge (at least for me) project that involves dealing with far more data than I have handled in the past; until now I have worked mainly with MySQL.
For this project, a set of sensors will produce about 125,000,000 data points 5 times per hour (roughly 15,000,000,000 per day), which is several times more than everything I have ever inserted into any single MySQL table.
I understand that one approach would be to use Hadoop MapReduce and Hive to query and analyze the data.
The problem I am facing is that, from what I have learned so far, Hive works mainly as scheduled batch jobs (like "cron jobs") rather than supporting real-time queries: a query can take many hours and requires additional infrastructure.
I was thinking of creating MySQL tables populated from the results of Hive queries, since the data that needs to be queried in real time will be about 1,000,000,000 rows, but I am wondering whether this is the right approach or whether I should learn some other technologies.
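To make the idea concrete, here is a minimal sketch of what I have in mind; the table and column names (`sensor_readings`, `hourly_agg`, etc.) are placeholders, not a real schema:

```sql
-- Hive (HiveQL): batch-aggregate the raw sensor readings.
-- Names below are illustrative only.
CREATE TABLE hourly_agg AS
SELECT
    sensor_id,
    to_date(reading_time) AS reading_day,
    hour(reading_time)    AS reading_hour,
    AVG(reading_value)    AS avg_value,
    COUNT(*)              AS num_readings
FROM sensor_readings
GROUP BY sensor_id, to_date(reading_time), hour(reading_time);
```

The much smaller aggregated table would then be exported into MySQL (for example with Sqoop's `sqoop export`) so that the application can query it with normal low-latency SQL.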
Is there any technology I should learn that is specifically designed for real-time queries on big data?
Any advice would be greatly appreciated!