Should you duplicate data?

Question


This is a very general question about database design for products that have multiple running campaigns that can share datasets. I am trying to understand the reasons why I should or should not do something like that.

I thought about having a raw dataset and then copying part of that set to the campaign, so that the campaign always has its historical data. For example, even if the raw data is updated, the campaign data will not change. However, the problem is that this creates a lot of duplication, and I'm not sure it is a good design. Any insight would be appreciated.

+5

database-design database-schema

Strawberry Apr 01 '17 at 5:30


3 answers

The answer to this question depends on the priorities of the project.

If the ability to view historical data is an important requirement, then such duplication will be necessary. There will be a subset of the tables that should be "versioned". For example, you might have a product_version table with date_from and date_to columns indicating when each version is or was effective. Or you can go further and put the version information (for example, period and status) in an abstract_version table, which every versioned table references with a foreign key. Whenever a new version is created, you will initially need to take a copy of the old data and then allow it to be changed.
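A minimal sketch of the date_from / date_to idea, using Python's built-in sqlite3 module so it can be run as-is. Only the product_version table name and its two date columns come from the answer above; the other columns (product_id, name, price), the helper functions new_version and as_of, and the sample data are assumptions made for illustration. Treating date_to as exclusive, with NULL meaning "current version", is one possible convention, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_version (
        product_id INTEGER NOT NULL,
        name       TEXT    NOT NULL,
        price      REAL    NOT NULL,
        date_from  TEXT    NOT NULL,  -- ISO date, inclusive
        date_to    TEXT               -- ISO date, exclusive; NULL = current version
    )
""")


def new_version(conn, product_id, changes, effective):
    """Copy the current version, apply `changes`, and close the old version."""
    row = conn.execute(
        "SELECT name, price FROM product_version "
        "WHERE product_id = ? AND date_to IS NULL",
        (product_id,),
    ).fetchone()
    if row is None:
        raise ValueError(f"no current version for product {product_id}")
    name, price = row
    # Close the old version on the day the new one takes effect.
    conn.execute(
        "UPDATE product_version SET date_to = ? "
        "WHERE product_id = ? AND date_to IS NULL",
        (effective, product_id),
    )
    # Insert the copy, overriding only the columns that changed.
    conn.execute(
        "INSERT INTO product_version VALUES (?, ?, ?, ?, NULL)",
        (product_id, changes.get("name", name),
         changes.get("price", price), effective),
    )


def as_of(conn, product_id, day):
    """Return (name, price) for the version that was effective on `day`."""
    return conn.execute(
        "SELECT name, price FROM product_version "
        "WHERE product_id = ? AND date_from <= ? "
        "AND (date_to IS NULL OR date_to > ?)",
        (product_id, day, day),
    ).fetchone()


# Hypothetical sample data: one product whose price changes on 2017-04-01.
conn.execute(
    "INSERT INTO product_version VALUES (1, 'Widget', 10.0, '2017-01-01', NULL)"
)
new_version(conn, 1, {"price": 12.0}, "2017-04-01")

print(as_of(conn, 1, "2017-02-15"))  # version in effect before the change
print(as_of(conn, 1, "2017-04-11"))  # version in effect after the change
```

Every change adds a row rather than overwriting one, which is exactly the duplication being discussed: the old values remain queryable, and every "current" query has to filter on date_to.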

But such an approach inevitably comes at the cost of increased complexity. As anecdotal evidence, the project I'm working on is a large one that has gone significantly over its original budget, not least because of the difficulties associated with preserving historical data.

+1

Steve chambers Apr 11 '17 at 14:10


When you talk about duplicate data, you mean a row whose data matches another row in the same table. If two rows with the same values can legitimately occur in practice, they cannot be considered duplicate data. At the very least, the timestamps of the rows will differ. In the worst case, if two rows are entered within the same millisecond and the timestamps match, then the person who entered them must be different. In a nutshell, if two rows with the same values are practically possible and functionally valid, they should be treated as correct, because some other values will distinguish them, such as campaign number, participant, timestamp, etc.

You also need to consider the data archiving strategy and the value of the stored data (whether there is a business value or a management need for it). If there is no such business value (use in a data warehouse, data mining, etc.), it is recommended to move the data to an archive database so that the OLTP system can use its database efficiently.

In your case, if historical campaign data adds value for end users (e.g., showing it in charts) or for management (to show response trends or explain recurring behavior in the campaign), it is worth keeping. Otherwise, I see no reason to store it in the same table.

+1

Barani Apr 11 '17 at 18:02


TK Bruin · Accepted Answer · 2017-04-04T17:17:40+0000

Actually, this is a great question. Database design for transactional (OLTP) systems aims to avoid storing the same information in several places.

However, saving historical values does not violate that principle. In fact, you are storing a different value from the normal transactional data: the value as it was at a particular point in time.

For example, let's say you have a sales region associated with a specific customer in the customer table. When you record a sale, you may want to save the region in the header table of the sales order. This is not necessarily duplication of data, but a good design in case sales regions change. In this case, you are capturing the region that applied to the order at the time it was placed.

Tomorrow, the customer's region may change, and you can still create reports based on the historically correct region.
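A minimal sketch of the sales-region example, again using Python's built-in sqlite3 module. The table and column names (customer, sales_order_header, region_at_order), the place_order helper, and the sample rows are all assumptions made for illustration; the answer only describes the idea of copying the region into the order header.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        region      TEXT NOT NULL       -- the customer's current region
    );
    CREATE TABLE sales_order_header (
        order_id        INTEGER PRIMARY KEY,
        customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date      TEXT NOT NULL,
        region_at_order TEXT NOT NULL   -- snapshot taken when the order is placed
    );
""")


def place_order(db, customer_id, order_date):
    """Insert an order header, copying the customer's region as of now."""
    cur = db.execute(
        "INSERT INTO sales_order_header (customer_id, order_date, region_at_order) "
        "SELECT customer_id, ?, region FROM customer WHERE customer_id = ?",
        (order_date, customer_id),
    )
    return cur.lastrowid


# Hypothetical sample data: the customer moves regions between two orders.
db.execute("INSERT INTO customer VALUES (1, 'Acme', 'North')")
first = place_order(db, 1, "2017-04-03")
db.execute("UPDATE customer SET region = 'South' WHERE customer_id = 1")
second = place_order(db, 1, "2017-04-04")

# Reporting by the snapshot column keeps the first order under 'North'.
orders_by_region = db.execute(
    "SELECT region_at_order, COUNT(*) FROM sales_order_header "
    "GROUP BY region_at_order ORDER BY region_at_order"
).fetchall()
print(orders_by_region)
```

Joining sales_order_header to customer.region instead would report both orders under the customer's current region, which is the historical inaccuracy the snapshot column avoids.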
