Now, as I understand it, the general principle of data normalization is to create an RDBMS where data redundancy is minimized.
Ummm, ok.
In my project, one of the DB people created the database. We have 50+ tables, and the tables in the database are usually very fragmented, i.e. a table has two or three columns and that's it.
The number of tables says nothing about good or bad design. Some businesses require one or two. Others need more. I worked on databases in a Fortune 500 that had thousands of tables.
The number of columns says nothing about good or bad design. And the number of columns has nothing to do with fragmentation. I will say that tables with relatively few columns are usually a good sign. Not always a good sign, but usually a good sign.
Now, when it comes to writing SQL queries, it has become a bit of a minor problem, as each query involves combining several tables and joining them together. I was wondering if this is a side effect of data normalization? Or does this indicate something else?
There are two different reasons for this:
When you normalize a table, you reduce redundancy (and increase data integrity) by defining functional dependencies, isolating functionally dependent columns in one or more new tables, and removing them from the original table. Thus, normalizing a table in the sense of moving from a lower normal form to a higher normal form
- always increases the number of tables,
- always reduces the number of columns in the source table and
- sometimes requires a join to retrieve data for human use.
Another common practice is to replace strings with id numbers. This has nothing to do with normalization. (There is no such thing as "id number normal form.") Replacing strings with id numbers
- always increases the number of tables,
- does not change the number of columns in the source table (unless it is done at the same time as normalizing), and
- always requires a join to retrieve data for human use.
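A minimal sketch of the id-number substitution, using Python's built-in sqlite3 module purely for illustration. The table and column names (colors, shirts, color_id) are hypothetical and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical lookup table: each string gets an id number.
    CREATE TABLE colors (
        color_id   INTEGER PRIMARY KEY,
        color_name TEXT NOT NULL UNIQUE
    );
    -- The referencing table keeps the same number of columns;
    -- a string column has simply been swapped for an id column.
    CREATE TABLE shirts (
        shirt_id INTEGER PRIMARY KEY,
        color_id INTEGER NOT NULL REFERENCES colors (color_id)
    );
    INSERT INTO colors VALUES (1, 'red'), (2, 'blue');
    INSERT INTO shirts VALUES (10, 1), (11, 2), (12, 1);
""")

# Without a join, a person reading the rows sees only id numbers.
raw_rows = conn.execute(
    "SELECT shirt_id, color_id FROM shirts ORDER BY shirt_id"
).fetchall()

# A join is needed to get the readable strings back.
readable_rows = conn.execute("""
    SELECT s.shirt_id, c.color_name
    FROM shirts AS s
    JOIN colors AS c ON c.color_id = s.color_id
    ORDER BY s.shirt_id
""").fetchall()

print(raw_rows)
print(readable_rows)
```

The join here comes from the id-number substitution, not from moving to a higher normal form.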
There seems to be some confusion in other parts of this thread. I understand that, strictly speaking, none of the following is directly related to the OP's question.
1NF is the "one value" principle. It has nothing to do with rows being "atomic". In the relational model, "atomic" does not refer to rows; it refers to values.
"One value" means that each intersection of a row and a column contains one value. (In other words, the value is "atomic". But the word "atomic" has some unfortunate connotations, so most modern practitioners avoid it.) The value does not have to be simple; it can be arbitrarily complex. But if it has parts that are themselves meaningful, the dbms either ignores those parts entirely or provides functions to manipulate them. (You do not have to write functions to manipulate the parts yourself.)
I think the simplest example is a date. A date has parts: year, month, and day. The dbms either ignores those parts (as in SELECT CURRENT_DATE) or provides functions to manipulate them (as in SELECT EXTRACT(YEAR FROM CURRENT_DATE)).
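The same two behaviours can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: SQLite has no EXTRACT and no dedicated date type, so its strftime function stands in for EXTRACT here, and the date literal is an arbitrary fixed value chosen so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The dbms hands back the date as one value and ignores its parts.
whole = conn.execute("SELECT date('2012-03-15')").fetchone()[0]

# The dbms supplies a function that reaches inside the value,
# so no hand-written parsing code is needed.
year = conn.execute("SELECT strftime('%Y', '2012-03-15')").fetchone()[0]

print(whole, year)
```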
Attempts to dodge the "one value" principle lead to a corollary: the "no repeating groups" principle.
A repeating group involves several values from the same domain, all having the same meaning. Thus, a table like the following is an example of one kind of repeating group. (There are other kinds.) The values for "phone_1" and "phone_2" come from the same domain, and they have the same meaning: user "n" has phone numbers (phone_1 and phone_2). (The primary key is "user_id".)
user_id  phone_1         phone_2
1        (111) 222-3333  (111) 222-3334
2        (111) 222-3335  (111) 222-3336
But the following table, although very similar, does not have a repeating group. The values come from the same domain, but they do not have the same meaning. (The primary key is "user_id".)
user_id  home_phone      work_phone
3        (111) 222-3333  (111) 222-3334
4        (111) 222-3335  (111) 222-3336
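One common way to remove the phone_1/phone_2 repeating group is to store one row per user and phone number. The sketch below uses Python's built-in sqlite3 module; the table name user_phones and the extra third number are hypothetical, and nothing above prescribes this particular layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (user, phone) instead of phone_1, phone_2, ...
    CREATE TABLE user_phones (
        user_id INTEGER NOT NULL,
        phone   TEXT    NOT NULL,
        PRIMARY KEY (user_id, phone)
    );
    INSERT INTO user_phones VALUES
        (1, '(111) 222-3333'), (1, '(111) 222-3334'),
        (2, '(111) 222-3335'), (2, '(111) 222-3336');
    -- A third number for user 1 needs a new row, not a new column.
    INSERT INTO user_phones VALUES (1, '(111) 222-3337');
""")

# Count how many phone numbers each user has.
phone_counts = conn.execute("""
    SELECT user_id, COUNT(*)
    FROM user_phones
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(phone_counts)
```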
2NF is the "whole key" principle. It has nothing to do with the number of keys; a table with n columns may have n keys. (See, for example, this other SO answer.) In the relational model (and, by extension, when you do normalization exercises), if you see the word "key" by itself, think "candidate key".
Instead, 2NF refers to candidate keys that have multiple columns. When a candidate key has multiple columns, 2NF requires that each non-prime attribute is functionally dependent on all columns of each candidate key, and not just on some columns of any candidate key. (The non-prime attribute is an attribute that is not part of any candidate key.)
The following example is adapted from the Wikipedia entry on 2NF. (The primary key is {employee, skill}.)
Table: employee_skills
employee  skill           current_work_location
--
Jones     Typing          114 Main Street
Jones     Shorthand       114 Main Street
Jones     Whittling       114 Main Street
Bravo     Light Cleaning  73 Industrial Way
Ellis     Alchemy         73 Industrial Way
Ellis     Flying          73 Industrial Way
Harrison  Light Cleaning  73 Industrial Way
Although it is true that the non-prime column current_work_location is functionally dependent on the primary key {employee, skill}, it is also functionally dependent on just one part of the primary key, employee. This table is not in 2NF.
You cannot fix a 2NF problem by assigning a surrogate key to each row. (The primary key is es_id, and there is a UNIQUE constraint on the former primary key, {employee, skill}.)
Table: employee_skills
es_id  employee  skill           current_work_location
--
1      Jones     Typing          114 Main Street
2      Jones     Shorthand       114 Main Street
3      Jones     Whittling       114 Main Street
4      Bravo     Light Cleaning  73 Industrial Way
5      Ellis     Alchemy         73 Industrial Way
6      Ellis     Flying          73 Industrial Way
7      Harrison  Light Cleaning  73 Industrial Way
It should be obvious that adding the id number did nothing to remove the partial dependency employee → current_work_location. Without removing the partial dependency, this table is still not in 2NF.
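What does remove the partial dependency is a decomposition. The sketch below, again using Python's built-in sqlite3 module, splits the sample data into two tables; the names employees and employee_skills_old are hypothetical. Note that the first query can only show that the sample rows are consistent with employee → current_work_location; whether the dependency really holds is a question about what the data means, not about seven rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_skills_old (
        employee TEXT NOT NULL,
        skill    TEXT NOT NULL,
        current_work_location TEXT NOT NULL,
        PRIMARY KEY (employee, skill)
    );
    INSERT INTO employee_skills_old VALUES
        ('Jones',    'Typing',         '114 Main Street'),
        ('Jones',    'Shorthand',      '114 Main Street'),
        ('Jones',    'Whittling',      '114 Main Street'),
        ('Bravo',    'Light Cleaning', '73 Industrial Way'),
        ('Ellis',    'Alchemy',        '73 Industrial Way'),
        ('Ellis',    'Flying',         '73 Industrial Way'),
        ('Harrison', 'Light Cleaning', '73 Industrial Way');

    -- The partially dependent column moves to a table keyed on employee.
    CREATE TABLE employees (
        employee TEXT PRIMARY KEY,
        current_work_location TEXT NOT NULL
    );
    CREATE TABLE employee_skills (
        employee TEXT NOT NULL REFERENCES employees (employee),
        skill    TEXT NOT NULL,
        PRIMARY KEY (employee, skill)
    );
    INSERT INTO employees
        SELECT DISTINCT employee, current_work_location
        FROM employee_skills_old;
    INSERT INTO employee_skills
        SELECT employee, skill FROM employee_skills_old;
""")

# Employees that map to more than one location in the sample data.
violations = conn.execute("""
    SELECT employee FROM employee_skills_old
    GROUP BY employee
    HAVING COUNT(DISTINCT current_work_location) > 1
""").fetchall()

# Joining the two new tables gives back the original rows.
rejoined = conn.execute("""
    SELECT es.employee, es.skill, e.current_work_location
    FROM employee_skills AS es
    JOIN employees AS e ON e.employee = es.employee
    ORDER BY es.employee, es.skill
""").fetchall()
original = conn.execute("""
    SELECT employee, skill, current_work_location
    FROM employee_skills_old
    ORDER BY employee, skill
""").fetchall()
employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
```

Each work location is now stored once per employee, and the join is the price of reading it back alongside the skills.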
3NF is the "no transitive dependencies" principle. It is not necessarily about derived or calculated data, as you can tell from the Wikipedia example adapted here. (The primary key is {tournament, year}. This table is not in 3NF.)
Table: tournament_winners
tournament            year  winner          winner_date_of_birth
--
Indiana Invitational  1998  Al Fredrickson  21 July 1975
Cleveland Open        1999  Bob Albertson   28 September 1968
Des Moines Masters    1999  Al Fredrickson  21 July 1975
Indiana Invitational  1999  Chip Masterson  14 March 1977
Two dependencies show that this table has a transitive dependency.
- The values in winner_date_of_birth appear to be functionally dependent on the primary key. Each primary key value determines one and only one value for winner_date_of_birth. But . . .
- The values in winner_date_of_birth also appear to be functionally dependent on winner. Each value for winner determines one and only one value for winner_date_of_birth.
Given these two apparent functional dependencies and an understanding of what tournament, winner, and date of birth mean, we can say that
- winner → winner_date_of_birth is a functional dependency, and
- {tournament, year} → winner is a functional dependency, and
- {tournament, year} → winner_date_of_birth is a transitive dependency.
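A sketch of the usual fix, using Python's built-in sqlite3 module: move the transitively dependent column into a table keyed on winner. The table name winners and the ISO date strings are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- winner -> date_of_birth gets its own table.
    CREATE TABLE winners (
        winner        TEXT PRIMARY KEY,
        date_of_birth TEXT NOT NULL
    );
    -- {tournament, year} -> winner stays in the original table.
    CREATE TABLE tournament_winners (
        tournament TEXT    NOT NULL,
        year       INTEGER NOT NULL,
        winner     TEXT    NOT NULL REFERENCES winners (winner),
        PRIMARY KEY (tournament, year)
    );
    INSERT INTO winners VALUES
        ('Al Fredrickson', '1975-07-21'),
        ('Bob Albertson',  '1968-09-28'),
        ('Chip Masterson', '1977-03-14');
    INSERT INTO tournament_winners VALUES
        ('Indiana Invitational', 1998, 'Al Fredrickson'),
        ('Cleveland Open',       1999, 'Bob Albertson'),
        ('Des Moines Masters',   1999, 'Al Fredrickson'),
        ('Indiana Invitational', 1999, 'Chip Masterson');
""")

# A join rebuilds the four-column view for human readers.
rows = conn.execute("""
    SELECT t.tournament, t.year, t.winner, w.date_of_birth
    FROM tournament_winners AS t
    JOIN winners AS w ON w.winner = t.winner
    ORDER BY t.year, t.tournament
""").fetchall()

# Each birth date is now stored once, however many tournaments were won.
stored_birth_dates = conn.execute("SELECT COUNT(*) FROM winners").fetchone()[0]
```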