Normalization and data recording

Question

I am a junior developer (5 months on the job), and I have a question about data normalization. As I understand it, the general principle behind data normalization is to create an RDBMS where data redundancy is kept to a minimum. In my project, one of the DB people created the database. We have 50+ tables, and the tables are usually very fragmented, i.e. a table has two or three columns and that's it. Now, when it comes to writing SQL queries, this has become a bit of a minor hassle, as each query involves combing through several tables and joining them together. I was wondering if this is a side effect of data normalization? Or does it point to something else?

I know that the easiest thing for me would be to design tables around the queries I need to write. That would create a database with a lot of redundant data, but I was curious whether there is a happy medium?

As a postscript, I don't want to come across as if I am whining about my work; I am genuinely curious to learn more about this. My work environment is not the friendliest, so I don't feel comfortable asking my colleagues this question. However, I would appreciate any thoughts, books, tutorials, or opinions from more experienced people.

Thanks.

+6

design sql database database-normalization

jrdeveloper Jun 22 '11 at 0:40

7 answers

Now, as I understand it, the general principle of data normalization is to create an RDBMS where data redundancy is minimized.

Ummm, ok.

In my project, one of the DB people created the database. We have 50+ tables, and the tables in the database are usually very fragmented, i.e. a table has two or three columns and that's it.

The number of tables says nothing about good or bad design. Some businesses need only one or two. Others need more. I have worked on databases at a Fortune 500 company that had thousands of tables.

The number of columns says nothing about good or bad design. And the number of columns has nothing to do with fragmentation. I will say that tables with relatively few columns are usually a good sign. Not always a good sign, but usually a good sign.

Now, when it comes to writing SQL queries, this has become a bit of a minor hassle, as each query involves combing through several tables and joining them together. I was wondering if this is a side effect of data normalization? Or does it point to something else?

There are two different reasons for this:

When you normalize a table, you reduce redundancy (and increase data integrity) by identifying functional dependencies, isolating the functionally dependent columns in one or more new tables, and removing them from the original table. Thus, normalizing a table in the sense of moving from a lower normal form to a higher normal form

always increases the number of tables,
always reduces the number of columns in the original table, and
sometimes requires a join to retrieve data for humans.

Another common practice is to replace strings with ID numbers. This has nothing to do with normalization. (There is no such thing as "ID number normal form.") Replacing strings with ID numbers

always increases the number of tables,
does not change the number of columns in the original table (unless you normalize at the same time), and
always requires a join to retrieve data for humans.
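
To make the ID-number case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The orders and statuses tables, their columns, and the sample rows are all hypothetical; they are not taken from the OP's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical lookup table: each status string gets an ID number.
    CREATE TABLE statuses (
        status_id   INTEGER PRIMARY KEY,
        status_name TEXT NOT NULL UNIQUE
    );
    -- The referencing table keeps the same number of columns;
    -- the string column is simply swapped for an ID column.
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        status_id INTEGER NOT NULL REFERENCES statuses (status_id)
    );
    INSERT INTO statuses VALUES (1, 'open'), (2, 'shipped');
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 1);
""")

# Without a join, a person only sees ID numbers.
raw_rows = conn.execute(
    "SELECT order_id, status_id FROM orders ORDER BY order_id"
).fetchall()

# The join is what brings the readable string back.
readable_rows = conn.execute("""
    SELECT o.order_id, s.status_name
    FROM orders AS o
    JOIN statuses AS s ON s.status_id = o.status_id
    ORDER BY o.order_id
""").fetchall()

print(raw_rows)
print(readable_rows)
```

One extra table appears, the column count of orders is unchanged, and every human-readable query needs the join.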

There seems to be some confusion in other parts of this thread. I understand that, strictly speaking, none of the following relates directly to the OP's question.

1NF is the "one value" principle. It has nothing to do with whether a row is "atomic". In the relational model, atomic does not refer to rows; it refers to values.

"One value" means that each intersection of a row and a column contains one value. (In other words, the value is "atomic." But the word "atomic" has some unfortunate connotations, so most modern practitioners avoid it.) That value does not have to be simple; it can be arbitrarily complex. But if it has parts that have meaning by themselves, the DBMS either ignores those parts altogether or provides functions to manipulate them. (You do not have to write your own functions to manipulate the parts.)

I think the simplest example is a date. Dates have parts: year, month, and day. The DBMS either ignores those parts (as in SELECT CURRENT_DATE) or provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)).

Attempts to dodge the "one value" principle lead to its corollary: the "no repeating groups" principle.

A repeating group involves several values from the same domain, all having the same meaning. Thus, a table like the following is an example of one kind of repeating group. (There are other kinds.) The values for "phone_1" and "phone_2" come from the same domain, and they have the same meaning: user "n" has phone numbers (phone_1 and phone_2). (The primary key is "user_id".)

 user_id  phone_1         phone_2
 1        (111) 222-3333  (111) 222-3334
 2        (111) 222-3335  (111) 222-3336

But the following table, although very similar, does not have a repeating group. The values come from the same domain, but they do not have the same meaning. (The primary key is "user_id".)

 user_id  home_phone      work_phone
 3        (111) 222-3333  (111) 222-3334
 4        (111) 222-3335  (111) 222-3336
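
For the first table (phone_1, phone_2), one common way to remove the repeating group is to store one row per phone number. This is only a sketch, run through Python's sqlite3 module; the user_phones table name and the extra third number are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical replacement for the phone_1 / phone_2 columns:
    -- one row per (user, phone number).
    CREATE TABLE user_phones (
        user_id INTEGER NOT NULL,
        phone   TEXT NOT NULL,
        PRIMARY KEY (user_id, phone)
    );
    INSERT INTO user_phones VALUES
        (1, '(111) 222-3333'), (1, '(111) 222-3334'),
        (2, '(111) 222-3335'), (2, '(111) 222-3336');
    -- A third number is just another row, not a phone_3 column.
    INSERT INTO user_phones VALUES (1, '(111) 222-3337');
""")

# Count the numbers stored for each user.
phones_per_user = conn.execute("""
    SELECT user_id, COUNT(*)
    FROM user_phones
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(phones_per_user)
```

With this layout the number of phones per user is a property of the data rather than of the table definition.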

2NF is the "whole key" principle. It has nothing to do with the number of keys; a table with "n" columns might have "n" keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you are doing normalization exercises), if you see the word key by itself, think "candidate key".

Instead, 2NF is concerned with candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that each non-prime attribute be functionally dependent on all the columns of each candidate key, not just on some of the columns of any candidate key. (A non-prime attribute is an attribute that is not part of any candidate key.)

The following example is adapted from the Wikipedia entry on 2NF. (The primary key is {employee, skill}.)

 Table: employee_skills
 employee  skill           current_work_location
 --
 Jones     Typing          114 Main Street
 Jones     Shorthand       114 Main Street
 Jones     Whittling       114 Main Street
 Bravo     Light Cleaning  73 Industrial Way
 Ellis     Alchemy         73 Industrial Way
 Ellis     Flying          73 Industrial Way
 Harrison  Light Cleaning  73 Industrial Way

Although it is true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it is also functionally dependent on just part of the primary key, employee. This table is not in 2NF.

You cannot get around a 2NF problem by assigning a surrogate key to every row. (The primary key is es_id; there is a UNIQUE constraint on the former primary key, {employee, skill}.)

 Table: employee_skills
 es_id  employee  skill           current_work_location
 --
 1      Jones     Typing          114 Main Street
 2      Jones     Shorthand       114 Main Street
 3      Jones     Whittling       114 Main Street
 4      Bravo     Light Cleaning  73 Industrial Way
 5      Ellis     Alchemy         73 Industrial Way
 6      Ellis     Flying          73 Industrial Way
 7      Harrison  Light Cleaning  73 Industrial Way

It should be obvious that adding the id number did nothing to remove the partial dependency employee->current_work_location. Without removing the partial dependency, this table is still not in 2NF.
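
What does remove the partial dependency is moving current_work_location into a table keyed by employee alone. The sketch below shows one possible decomposition using Python's sqlite3 module; the employees table name and the new address '1 New Road' are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- current_work_location depends on employee alone, so it lives here.
    CREATE TABLE employees (
        employee              TEXT PRIMARY KEY,
        current_work_location TEXT NOT NULL
    );
    -- Only the columns that need the whole key {employee, skill} remain.
    CREATE TABLE employee_skills (
        employee TEXT NOT NULL REFERENCES employees (employee),
        skill    TEXT NOT NULL,
        PRIMARY KEY (employee, skill)
    );
    INSERT INTO employees VALUES
        ('Jones', '114 Main Street'),
        ('Bravo', '73 Industrial Way'),
        ('Ellis', '73 Industrial Way'),
        ('Harrison', '73 Industrial Way');
    INSERT INTO employee_skills VALUES
        ('Jones', 'Typing'), ('Jones', 'Shorthand'), ('Jones', 'Whittling'),
        ('Bravo', 'Light Cleaning'),
        ('Ellis', 'Alchemy'), ('Ellis', 'Flying'),
        ('Harrison', 'Light Cleaning');
""")

# Moving Jones now touches exactly one row, however many skills Jones has.
cursor = conn.execute(
    "UPDATE employees SET current_work_location = ? WHERE employee = ?",
    ("1 New Road", "Jones"),
)
rows_updated = cursor.rowcount

# A join rebuilds the original three-column shape for people to read.
jones_rows = conn.execute("""
    SELECT s.employee, s.skill, e.current_work_location
    FROM employee_skills AS s
    JOIN employees AS e ON e.employee = s.employee
    WHERE s.employee = 'Jones'
    ORDER BY s.skill
""").fetchall()

print(rows_updated)
print(jones_rows)
```

In the single-table version the same move would have to touch three rows, and missing one would leave Jones with two locations.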

3NF is the "no transitive dependencies" principle. It is not necessarily about derived or calculated data, as you can tell from the Wikipedia example adapted here. (The primary key is {tournament, year}. This table is not in 3NF.)

 Table: tournament_winners
 tournament            year  winner          winner_date_of_birth
 --
 Indiana Invitational  1998  Al Fredrickson  21 July 1975
 Cleveland Open        1999  Bob Albertson   28 September 1968
 Des Moines Masters    1999  Al Fredrickson  21 July 1975
 Indiana Invitational  1999  Chip Masterson  14 March 1977

Two dependencies show that this table has a transitive dependency.

The values in winner_date_of_birth appear to be functionally dependent on the primary key. Each value of the primary key determines one and only one value for winner_date_of_birth. But . . .
The values in winner_date_of_birth also appear to be functionally dependent on the winner. Each value for the winner determines one and only one value for winner_date_of_birth.

Given these two obvious functional dependencies and an understanding of what tournament, winner and date of birth mean, we can say that

winner → winner_date_of_birth is a functional dependency,
{tournament, year} → winner is a functional dependency, and
{tournament, year} → winner_date_of_birth is a transitive dependency.
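
One possible decomposition that removes the transitive dependency is sketched below with Python's sqlite3 module. The winners table name is an assumption, and the birth dates are stored as ISO-8601 text purely for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- winner -> date_of_birth gets its own table, keyed by winner.
    CREATE TABLE winners (
        winner        TEXT PRIMARY KEY,
        date_of_birth TEXT NOT NULL
    );
    -- {tournament, year} -> winner stays in the original table.
    CREATE TABLE tournament_winners (
        tournament TEXT NOT NULL,
        year       INTEGER NOT NULL,
        winner     TEXT NOT NULL REFERENCES winners (winner),
        PRIMARY KEY (tournament, year)
    );
    INSERT INTO winners VALUES
        ('Al Fredrickson', '1975-07-21'),
        ('Bob Albertson', '1968-09-28'),
        ('Chip Masterson', '1977-03-14');
    INSERT INTO tournament_winners VALUES
        ('Indiana Invitational', 1998, 'Al Fredrickson'),
        ('Cleveland Open', 1999, 'Bob Albertson'),
        ('Des Moines Masters', 1999, 'Al Fredrickson'),
        ('Indiana Invitational', 1999, 'Chip Masterson');
""")

# Each birth date is now stored once, even for a two-time winner.
stored_birth_dates = conn.execute("SELECT COUNT(*) FROM winners").fetchone()[0]

# A join rebuilds the original four-column table.
rebuilt = conn.execute("""
    SELECT t.tournament, t.year, t.winner, w.date_of_birth
    FROM tournament_winners AS t
    JOIN winners AS w ON w.winner = t.winner
    ORDER BY t.year, t.tournament
""").fetchall()

print(stored_birth_dates)
print(rebuilt)
```

The join returns the same four rows as the original table, but Al Fredrickson's date of birth can no longer disagree with itself.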

+3

Mike Sherrill 'Cat Recall' Jun 24 '11 at 23:15

Database views are a critical tool in this dilemma. This great introduction says:

Here is the good news: you don't have to work with the normalized tables! ... It is very simple (at least for database administrators) to create an abstraction layer of joined views on top of the normalized data tables, putting the underlying tables completely behind the scenes and out of sight.

+2

krubo Jun 22 '11 at 3:09

It sounds like data normalization, but I would need to know more about the schema, the business case, etc., to make that call reliably. If you have control over the database, you can write a view, which is a generalized query that joins the tables. To improve performance, you can create an indexed or materialized view (the name depends on the database platform: indexed view in SQL Server, materialized view in Oracle).

Almost any database primer will help you with these concepts. If you use SQL Server and are really interested in learning more, SQL Server Books Online is a great resource.

+1

Bobby d Jun 22 '11 at 0:50

Having a large number of tables is certainly a sign of a well-normalized database design. It can be a pain when writing queries, but it is much better than having data get out of sync.

I sometimes write reports that run against a database with thousands of tables. Every night a program runs that loads data from the production tables into a data warehouse so that we can report on it more easily. The data warehouse tables are much less normalized, which makes writing queries much simpler. You might consider this if it makes sense in your situation.

0

jncraton Jun 22 '11 at 0:48

Without seeing the data, it is hard to say whether your database is over-normalized (or simply not normalized correctly; spreading fields across several tables does not by itself mean the data is normalized). Generally speaking, you will probably have to join multiple tables to see useful data in a well-normalized database.

You can create views that join the tables together, and then query the view. That will probably help with selecting data.
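
As a rough illustration of the view idea, the sketch below uses Python's sqlite3 module with two hypothetical tables (customers and orders) and a hypothetical view named order_summary; none of these names come from the OP's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical normalized tables.
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        total_cents INTEGER NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 2500), (11, 2, 990), (12, 1, 400);

    -- The join is written once, inside the view.
    CREATE VIEW order_summary AS
    SELECT o.order_id, c.name AS customer_name, o.total_cents
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Day-to-day queries read from the view as if it were a single wide table.
ada_orders = conn.execute(
    "SELECT order_id, total_cents FROM order_summary "
    "WHERE customer_name = ? ORDER BY order_id",
    ("Ada",),
).fetchall()

print(ada_orders)
```

The underlying tables stay normalized; only the query text gets shorter.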

0

jlnorsworthy Jun 22 '11 at 0:51

In a well-designed database, the joins you need in your queries should be fairly easy to code. The downside is that you have verbose SQL. The upsides are huge:

Consistent updates to tables.
Quick adaptation to changing business needs. Well-designed databases can usually handle queries that were not even thought of during the initial design.
Quick accommodation of new entities. It is relatively easy to add new data entities and attributes to a well-designed database. Making even seemingly simple changes to a denormalized database can be a nightmare.

0

James anderson Jun 22 '11 at 3:40

S. Lott · Accepted Answer · 2011-06-22T00:52:23+0000

The general principle of data normalization is the creation of an RDBMS where data redundancy is minimized.

Partly true.

Normalization is not about redundancy.

It is about update anomalies.

1NF is the "do not use arrays" rule. Breaking 1NF means that a row is not atomic but a collection, and independent updates within the collection will not work out well. There would be locking and slowness.

2NF is the one-key rule. Each row has exactly one key, and everything in the row depends on that key. There are no dependencies on part of the key. Some people like to talk about candidate keys and natural keys and foreign keys; they may or may not exist. 2NF is satisfied when all attributes depend on one key. If the key is a single-column surrogate key, this normal form is trivially satisfied.

If 2NF is broken, you have columns that depend on part of the key but not on the whole key. Suppose you have a table with (part number, revision number) as the key and color and weight as attributes, where the weight depends on the whole key but the color depends only on the part number. You have a 2NF problem: you can update the color on some rows of a part but not on others, creating data anomalies.
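
The anomaly is easy to reproduce. Here is a small sketch using Python's sqlite3 module; the parts table, its column names, and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table that breaks 2NF: the key is (part_number, revision).
    CREATE TABLE parts (
        part_number  TEXT NOT NULL,
        revision     INTEGER NOT NULL,
        color        TEXT NOT NULL,     -- depends on part_number only
        weight_grams INTEGER NOT NULL,  -- depends on the whole key
        PRIMARY KEY (part_number, revision)
    );
    INSERT INTO parts VALUES
        ('P-100', 1, 'red', 120),
        ('P-100', 2, 'red', 115);

    -- An update that happens to touch only one revision of the part.
    UPDATE parts SET color = 'blue'
    WHERE part_number = 'P-100' AND revision = 2;
""")

# The same part now reports two different colors.
colors = conn.execute("""
    SELECT DISTINCT color
    FROM parts
    WHERE part_number = 'P-100'
    ORDER BY color
""").fetchall()

print(colors)
```

Nothing in the table definition stops this; keeping color in a table keyed by part number alone would.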

3NF is the key-only rule. If you put derived data in a row and change the derived result, it no longer matches the source columns. If you change a source column without updating the derived value, you have a problem too. Yes, triggers are a bad hack-around for a design that breaks 3NF. That is not the point. The point is merely to define 3NF and show that it prevents an update problem.

Each query involves combing through several tables and joining them together. I was wondering if this is a side effect of data normalization?

It is.
