Database normalization and fast searching

I am working on the technical architecture for a content integration solution. The data from the solution provider runs to millions of rows, is normalized to 3NF, and is updated on a regular schedule (most likely daily). The data is also broken down to a very fine level of atomicity.

I need to be able to search and query this data, and my current inclination is to leave the normalized data alone and build a denormalized database from it (OLTP to OLAP, in effect). The "transfer" would be a purpose-built program that holds the necessary business logic on top of the raw copying work and runs on a schedule as required. The denormalized database would then reduce the atomicity and let keyword searches and queries run efficiently. I was considering using Lucene.NET to run the keyword searches against the denormalized database.
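
To make this concrete, here is a very rough sketch of what I had in mind for the indexing/search side, using Lucene.NET 4.8. The FlattenedItem shape, the field names and the index path are placeholders of mine, not anything from the actual feed:

```csharp
// Sketch only: index one denormalized "document" per item and run keyword searches.
// Assumes Lucene.NET 4.8; FlattenedItem and the field names are hypothetical.
using System.Collections.Generic;
using System.Linq;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers.Classic;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

public record FlattenedItem(string Id, string Title, string Body);

public static class ContentIndex
{
    private const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Called by the scheduled "transfer" program after it has flattened the 3NF rows.
    public static void Build(string indexPath, IEnumerable<FlattenedItem> items)
    {
        using var dir = FSDirectory.Open(indexPath);
        var analyzer = new StandardAnalyzer(Version);
        using var writer = new IndexWriter(dir, new IndexWriterConfig(Version, analyzer));

        foreach (var item in items)
        {
            var doc = new Document();
            doc.Add(new StringField("id", item.Id, Field.Store.YES));     // exact key, not analyzed
            doc.Add(new TextField("title", item.Title, Field.Store.YES)); // analyzed for keyword search
            doc.Add(new TextField("body", item.Body, Field.Store.YES));
            writer.AddDocument(doc);
        }
        writer.Commit();
    }

    // Keyword search against the denormalized index; returns the matching ids.
    public static IList<string> Search(string indexPath, string keywords, int maxHits = 20)
    {
        using var dir = FSDirectory.Open(indexPath);
        using var reader = DirectoryReader.Open(dir);
        var searcher = new IndexSearcher(reader);
        var parser = new QueryParser(Version, "body", new StandardAnalyzer(Version));

        return searcher.Search(parser.Parse(keywords), maxHits).ScoreDocs
                       .Select(hit => searcher.Doc(hit.Doc).Get("id"))
                       .ToList();
    }
}
```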

So, before I sing loudly from the hills that this is the way forward, I wanted to get some expert opinion on the matter and on what the perceived "best practice" is. Is the approach I have suggested the best one, given the data that will be provided to me? It has been suggested that I could instead use some kind of "search engine" to search the normalized data directly. That scared me a little, but it raises the question: which search engine, and how?

Opinions, flames, bad language and help appreciated :)

2 answers

I have built reporting databases and data warehouses based on data held in normalized form. There is quite a lot of work in the transfer program (ETL). Given your description of the data feed, it may be that some of that work has already been done for you by the provider.

Millions of rows is not a lot these days. You may well get away with report-oriented views over the existing database. Try it and see.
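
As a purely illustrative sketch (Item, Category, Supplier and the view name are made-up placeholders, not your real schema), such a "denormalization" could start life as nothing more than a view created from your .NET code:

```csharp
// Sketch only: a report-oriented view over hypothetical normalized tables
// (Item / Category / Supplier are placeholders, not the real provider schema).
using System.Data.SqlClient;

public static class ReportingViews
{
    private const string CreateFlattenedItemView = @"
        CREATE VIEW dbo.FlattenedItem AS
        SELECT  i.ItemId,
                i.Title,
                i.Description,
                c.CategoryName,
                s.SupplierName
        FROM    dbo.Item i
        JOIN    dbo.Category c ON c.CategoryId = i.CategoryId
        JOIN    dbo.Supplier s ON s.SupplierId = i.SupplierId;";

    public static void Create(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(CreateFlattenedItemView, conn);
        cmd.ExecuteNonQuery();
    }
}
```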

The biggest benefit of building an OLAP-oriented database is not speed; it is flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of production category? Bam! Done!" And so on.
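
To illustrate, assuming hypothetical DimDate, DimCategory and FactContent tables rather than anything in your feed: once the data sits in a dimensional model, switching the period or the breakdown is just a different GROUP BY over the same fact table.

```csharp
// Sketch only: with a dimensional (OLAP-style) model, changing the period or the
// breakdown is just a different GROUP BY over the same fact table.
// DimDate, DimCategory and FactContent are hypothetical placeholder names.
public static class DimensionalQueries
{
    // "We want it monthly."
    public const string TotalsByMonth = @"
        SELECT d.CalendarYear, d.MonthOfYear, SUM(f.Quantity) AS Total
        FROM   dbo.FactContent f
        JOIN   dbo.DimDate d ON d.DateKey = f.DateKey
        GROUP BY d.CalendarYear, d.MonthOfYear;";

    // "Actually, make that quarterly, broken down by marketing category."
    public const string TotalsByQuarterAndMarketingCategory = @"
        SELECT d.CalendarYear, d.QuarterOfYear, c.MarketingCategory, SUM(f.Quantity) AS Total
        FROM   dbo.FactContent f
        JOIN   dbo.DimDate d     ON d.DateKey = f.DateKey
        JOIN   dbo.DimCategory c ON c.CategoryKey = f.CategoryKey
        GROUP BY d.CalendarYear, d.QuarterOfYear, c.MarketingCategory;";
}
```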


A well-normalized model (3NF/BCNF) gives the best average performance and the fewest modification anomalies across the widest range of scenarios. That counts for a lot, so I would start from there. As your requirements are fuzzy, it seems the most sensible option.

Actually, the most sensible thing would be to revisit the requirements until they are a little more "definite" ;)

Also, if you can get some early extracts from your data provider, you can experiment with them and get a feel for how the data is distributed (not everyone lives in the same country, some countries have more people than others, not everyone has children, and the number of children per person varies a lot between countries). This matters a great deal, because it is what lets the query optimizer make the right decisions.
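
As a sketch of what I mean by experimenting with an extract (the table and column names below are hypothetical), a handful of GROUP BY counts will show you how skewed the key columns are:

```csharp
// Sketch only: profile how values are distributed in an early extract so you can see
// how skewed the key columns are. Table and column names are hypothetical, and the
// string-built SQL is only acceptable because this is a one-off local profiling tool.
using System;
using System.Data.SqlClient;

public static class ExtractProfiler
{
    public static void PrintDistribution(string connectionString, string table, string column)
    {
        var sql = $@"
            SELECT {column} AS Value, COUNT(*) AS Cnt
            FROM   {table}
            GROUP BY {column}
            ORDER BY COUNT(*) DESC;";

        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(sql, conn);
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader["Value"]}: {reader["Cnt"]} rows");
    }
}
```

Pointing it at something like a Person table's CountryCode column in an early extract will tell you immediately how lopsided the distribution is, which is exactly the kind of skew the optimizer's statistics need to capture.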

Apart from that, I agree with everything Walter said, and I have given him my vote as well.

