Efficient and scalable JSON data warehouse with NoSQL databases

We are working on a project that will collect log and audit data and store it in a data warehouse for archiving and some viewing purposes. We are not quite sure which database will work for us.

  • we need to store small JSON documents of about 150 bytes, for example audit: {timestamp: '86346512', host: 'foo', username: 'bar', task: 'foo', result: 0} or journal: {timestamp: '86346512', host: 'foo', terminalid: 1, type: 'bar', rc: 0}
  • we expect about a million records per day, about 150 MB of data
  • data will be saved and read but never changed
  • data should be stored in an efficient manner, e.g. the binary format used by Apache Avro
  • data can be deleted once its retention period expires
  • user queries such as 'get audit for user and time period' or 'get journal for terminalid and time period'
  • replicated database for fault tolerance
  • scalable

We are currently evaluating NoSQL databases such as Hadoop / Hbase, CouchDB, MongoDB, and Cassandra. Are these databases the right data warehouse for us? Which one is best suited? Are there any better options?

+6
4 answers
  • One million inserts per day is about 10 inserts per second. Most databases can handle this, and it is well below the maximum insert rate we get from Cassandra on reasonable hardware (around 50,000 inserts/sec).

  • Your requirement that data can be deleted once its retention period expires is a perfect fit for Cassandra's column TTLs - when you insert data, you can specify how long to keep it, and the background compaction process will delete the data once that timeout is reached.

  • "data should be stored in an efficient manner, such as the binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats the values ​​as opaque byte sequences, so you can encode the values ​​as you like. You can also consider expanding the value into a series of columns, allowing you to perform more complex queries.

  • user queries such as "get the audit for a user and a time period" - in Cassandra you would model this by making the row key the user identifier and the column key the time of the event (most likely a timeuuid). You would then use the get_slice call (or, even better, CQL) to satisfy this query (a sketch follows this list).

  • or "get the log for termid and time period" - as indicated above, the row key must be terminal and the column column must be a timestamp. It should be noted that in Cassandra (as in most stores without connection) it is typical to insert data more than once (in different devices) to optimize for different requests.

  • Cassandra has a very sophisticated replication model where you can specify different consistency levels per operation. Cassandra is also a highly scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!).
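
To make the modelling above concrete, here is a minimal sketch using the DataStax Python driver and CQL. The keyspace, table and column names, the contact point, the 90-day TTL and the replication factor are all assumptions chosen to match the example records, not something stated in the question.

```python
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # contact point is an assumption
session = cluster.connect()

# One partition per user; events are clustered by a time-based UUID, so
# "get audit for user and time period" becomes a single slice query.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.audit (
        username   text,
        event_time timeuuid,
        host       text,
        task       text,
        result     int,
        PRIMARY KEY (username, event_time)
    )
""")

# Insert with a TTL (90 days here, an assumed retention period); compaction
# removes the data once the TTL expires, as described above.
session.execute(
    "INSERT INTO logs.audit (username, event_time, host, task, result) "
    "VALUES (%s, %s, %s, %s, %s) USING TTL 7776000",
    ("bar", uuid.uuid1(), "foo", "foo", 0),
)

# "Get audit for user and time period" as a slice over the clustering key.
rows = session.execute(
    "SELECT event_time, host, task, result FROM logs.audit "
    "WHERE username = %s "
    "AND event_time > maxTimeuuid(%s) AND event_time < minTimeuuid(%s)",
    ("bar", "2015-01-01", "2015-02-01"),
)
for row in rows:
    print(row.host, row.task, row.result)
```

The same pattern, with terminalid as the partition key instead of username, would cover the journal query.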

Having said all that, your requirements could easily be satisfied by a more traditional database with simple master-slave replication; nothing here is too demanding.

+11

Avro supports schema evolution and is well suited to this problem.

If your system does not require low-latency data loading, consider appending the data to files in a reliable file system rather than loading it directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less prone to outages than a live database. In addition, the separation of duties ensures that your query traffic never affects the data-collection system.
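
As a rough sketch of that collection path, here is how the example audit record could be appended to an Avro container file with the Python avro package. The schema fields simply mirror the record from the question; the file name and compression codec are arbitrary choices.

```python
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Schema mirroring the example audit record; Avro stores the schema with the
# file, which is what makes later schema evolution possible.
schema = avro.schema.parse("""
{
  "type": "record",
  "name": "Audit",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "host",      "type": "string"},
    {"name": "username",  "type": "string"},
    {"name": "task",      "type": "string"},
    {"name": "result",    "type": "int"}
  ]
}
""")

writer = DataFileWriter(open("audit-2015-01-01.avro", "wb"), DatumWriter(),
                        schema, codec="deflate")
writer.append({"timestamp": 86346512, "host": "foo",
               "username": "bar", "task": "foo", "result": 0})
writer.close()
```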

If you only have a few queries to run, you can leave the files in their native format and write your own map-reduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the raw data files. Hive lets you run arbitrary, SQL-like queries against your raw data files. Or, since you only have 150 MB/day, you could simply load it into MySQL with read-only tables.
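
For the "write your own map-reduce" route, at this volume even a single-process Python script that scans the collected Avro files is enough. A sketch, assuming the directory layout and the "events per user" report are just made-up examples:

```python
import glob
from collections import Counter

from avro.datafile import DataFileReader
from avro.io import DatumReader

# Count audit events per user across all collected files (the map and reduce
# steps collapsed into one local pass).
counts = Counter()
for path in glob.glob("/data/audit/audit-*.avro"):   # path layout is an assumption
    reader = DataFileReader(open(path, "rb"), DatumReader())
    for record in reader:
        counts[record["username"]] += 1
    reader.close()

for user, n in counts.most_common():
    print(user, n)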

If for some reason you need the complexity of an interactive system, HBase or Cassandra may be a good fit, but be aware that you will spend a considerable amount of time playing "DBA", and 150 MB/day is so little data that you probably do not need the complexity.

+4

We use Hadoop / HBase, and I have looked at Cassandra; both generally use the row key as the means of retrieving data quickly, although of course (at least in HBase) you can still apply filters on the column data or do the filtering client-side. For example, in HBase you can say "give me all rows from key1 up to, but not including, key2".

So if you design your keys correctly, you can get everything for one user, or one host, or one user on one host, and so on - but it requires a properly designed key. If most of your queries need to run against a timestamp, you would include that as part of the key, for example.

How often do you need to query the data versus write it? If you only need to run reports, and it is acceptable for them to take 10, 15 or more minutes, but you do a lot of small writes, then HBase with Hadoop MapReduce (or Hive or Pig as higher-level query languages) will work very well.
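
As a rough illustration of the key design described in this answer, here is a sketch using the happybase Thrift client. The "username|zero-padded-timestamp" key layout, the table name, the Thrift host and the d: column family are assumptions for the example, not part of the answer.

```python
import happybase

connection = happybase.Connection("hbase-thrift-host")   # host is an assumption
table = connection.table("audit")

# Because the row key starts with the username and ends with a zero-padded
# timestamp, "all events for user 'bar' in a time window" is a plain range
# scan from the start key up to, but not including, the stop key.
start = b"bar|0000086340000"
stop  = b"bar|0000086350000"
for key, data in table.scan(row_start=start, row_stop=stop):
    print(key, data[b"d:task"], data[b"d:result"])
```

Queries that do not match the key layout (e.g. by host only) would need a second table keyed differently, or a filter scan.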

+2

If your JSON data has variable fields, then a schema-less model such as Cassandra's may best suit your needs. I would decompose the data into columns rather than storing it as a single binary value, which makes querying simpler. At your data rate it would take about 20 years to fill a 1 TB disk, so I would not worry about compression.

For your example, you could create two column families: Audit and Journal. The row keys would be TimeUUIDs (i.e. timestamp + MAC address, to make them unique). The audit row you gave would then have four columns: host:'foo' , username:'bar' , task:'foo' and result:0 . Other rows can have different columns.

A range scan over the row keys lets you efficiently execute queries over time periods (if you use the ByteOrderedPartitioner). You can then use secondary indexes to query by user and terminal.
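
A quick sketch of the row-key idea, using only the Python standard library: a version-1 UUID already combines the current timestamp with the host's MAC address, and the JSON document is broken out into individual columns. The column names simply follow the example record from the question.

```python
import json
import uuid

# Version-1 UUIDs are built from the current timestamp plus the MAC address,
# which is exactly the "Timestamp + MAC" uniqueness described above.
row_key = uuid.uuid1()

audit_json = ('{"timestamp": 86346512, "host": "foo", '
              '"username": "bar", "task": "foo", "result": 0}')
record = json.loads(audit_json)

# Decompose the document into one column per field for the Audit column family.
columns = {name: value for name, value in record.items() if name != "timestamp"}
print(row_key, columns)
```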

+1
