How reliable is Elasticsearch as a primary data store against factors such as data loss and data availability?

I am working on a project that requires a generic dashboard where users can group, filter and drill down on different fields. For this, we are looking for a search store that allows slicing and dicing of the data.

There will be multiple data sources, and their data will be stored in the search store. Some pre-computation on the source data may be required, which can be done by intermediate components.

I have looked through several blogs to understand whether ES can be used reliably as a primary data store. It mostly depends on the use case. Some information about our use case:

  • About 300 million records per year, each 1-2 KB.
  • Assuming we store one year of data, we are at 300 GB today, but this may rise to 400-500 GB as the data grows.
  • I am not yet sure how we will push the data, but roughly it could reach 2-3 million records per 5 minutes.
  • Search request volume is low, but the queries are complex and can span data from the last 6 weeks to 6 months.
  • Documents will be indexed on almost all of their fields.
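
As a rough cross-check of the figures above, here is a minimal back-of-envelope sketch. It assumes decimal units (1 GB = 1,000,000 KB) and counts raw document size only; replica copies and index overhead are not included, so the real on-disk footprint would likely be larger.

```python
# Back-of-envelope sizing from the figures listed above.
RECORDS_PER_YEAR = 300_000_000
RECORD_SIZE_KB = (1, 2)                # stated size range per record
PEAK_BATCH = (2_000_000, 3_000_000)    # records per 5-minute window
WINDOW_SECONDS = 5 * 60


def raw_volume_gb(records, size_kb):
    """Raw volume in decimal GB (no replicas, no index overhead)."""
    return records * size_kb / 1_000_000


def peak_rate(records, seconds=WINDOW_SECONDS):
    """Average records per second needed to absorb one peak window."""
    return records / seconds


volume_gb = tuple(raw_volume_gb(RECORDS_PER_YEAR, s) for s in RECORD_SIZE_KB)
rate = tuple(round(peak_rate(b)) for b in PEAK_BATCH)

print(volume_gb)  # (300.0, 600.0)
print(rate)       # (6667, 10000)
```

So the raw data is roughly 300-600 GB per year, and the peak ingest works out to roughly 6,700-10,000 records per second.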

Some blogs claim that ES is reliable enough to be used as a primary data store -

And some blogs say that ES has a few limitations -

Has anyone used Elasticsearch as the sole source of truth, without a primary store such as PostgreSQL, DynamoDB or RDS? I have read that ES has certain problems, such as split brain and index corruption, that can lead to data loss. So I would like to know whether anyone has used ES this way and what data problems they ran into.

Thanks.

+47
search-engine nosql full-text-search elasticsearch
Apr 24 '15 at 7:32
2 answers

Short answer: it depends on your use case, but you probably do not want to use it as your primary store.

Longer answer: you should really understand all of the issues that can arise around resiliency and data loss. Elastic has some excellent documentation on these issues, which you should read and understand before using it as your primary data store. Aphyr's post on the topic is also a good resource.

If you understand the risks you are taking and consider them acceptable (for example, because a small amount of data loss is not a problem for your application), then feel free to go ahead and try it.

+29
Jul 13 '15 at 8:47

It is generally recommended that you build redundancy into your storage. For example, a quick and reliable approach is to first push everything as flat data to static storage such as S3, and then have ES pull and index the data from there. If you need more flexibility, for example to use an ORM, you can put an RDS or Redshift layer in between. This way, the data can always be rebuilt in ES.
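
To illustrate the "archive first, index second" idea, here is a minimal sketch. A local JSON-lines file stands in for S3. The function names, the `events` index name, and the `id` field are all hypothetical. It only builds the NDJSON body expected by the Elasticsearch `_bulk` endpoint; actually sending it to a cluster is left out.

```python
import json
import tempfile
from pathlib import Path


def archive_records(records, archive_path):
    """Append raw records as JSON lines to the durable archive (stand-in for S3)."""
    with open(archive_path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, sort_keys=True) + "\n")


def bulk_body_from_archive(archive_path, index_name):
    """Rebuild an Elasticsearch _bulk request body (NDJSON) from the archive."""
    lines = []
    with open(archive_path, encoding="utf-8") as f:
        for raw in f:
            rec = json.loads(raw)
            # One action line followed by one source line per document.
            action = {"index": {"_index": index_name, "_id": rec["id"]}}
            lines.append(json.dumps(action))
            lines.append(json.dumps(rec, sort_keys=True))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "events.jsonl"
    # 1) Write to the archive first; this is the source of truth.
    archive_records([{"id": "1", "field": "a"}, {"id": "2", "field": "b"}], archive)
    # 2) Index (or re-index after a failure) purely from the archive.
    body = bulk_body_from_archive(archive, "events")

print(body)
```

Because the index is always derived from the archive, losing or corrupting an ES index only costs a re-index, not the data itself.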

How you balance redundancy against flexibility and performance depends on your needs and requirements. If there is a lot of data, you can store the raw data statically and index only parts of it in ES.

AWS Lambda offers useful features for this:

Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and provide high-speed search. AWS Lambda makes it easy to keep everything in sync by running a function that automatically updates the index in Amazon DynamoDB every time objects are added to or updated in Amazon S3.
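
A minimal sketch of such a function, assuming the standard S3 event notification shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The in-memory `INDEX` dict is a stand-in for the real DynamoDB (or ES) write, and the bucket and key in the sample event are made up for illustration.

```python
from urllib.parse import unquote_plus

INDEX = {}  # in-memory stand-in for the DynamoDB (or ES) metadata index


def extract_metadata(event):
    """Yield one metadata dict per object in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        yield {
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size"),
            "event": record.get("eventName"),
        }


def handler(event, context):
    """Lambda entry point: upsert metadata for every added or updated object."""
    count = 0
    for item in extract_metadata(event):
        INDEX[(item["bucket"], item["key"])] = item
        count += 1
    return {"indexed": count}


# Hypothetical sample event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "logs/2015+04+24.json", "size": 1024},
            },
        }
    ]
}

result = handler(sample_event, None)
print(result)  # {'indexed': 1}
```

In a real deployment the dict assignment would be replaced by a write to the actual index, and the function would be subscribed to the bucket's object-created events.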

+4
Apr 24 '15 at 7:57