Normalization in plain English

I understand the concept of normalizing a database, but I always have a hard time explaining it in plain English, especially in an interview. I have read the Wikipedia article, but it is still hard to explain the concept to non-developers. “Create a database so you don’t get duplicate data” is the first thing that comes to mind.

Does anyone have a good way to explain the concept of normalizing a database in plain English? And what good examples show the differences between the first, second, and third normal forms?

Say you go in for an interview and the interviewer asks: explain the concept of normalization and how you would go about designing a normalized database.

What are the key points interviewers are looking for?

+35
sql database terminology normalization database-normalization
Feb 25 '10 at 5:26
11 answers

Well, if I had to explain this to my wife, it would be something like this:

The basic idea is to avoid duplicating large chunks of data.

Take a list of people and the country they come from. Instead of storing the name of the country, which can be as long as “Bosnia and Herzegovina”, for every person, we simply store a number that refers to a table of countries. So instead of storing 100 copies of “Bosnia and Herzegovina”, we store 100 copies of #45. Now if in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will only have to change it in one place. Well, sort of.

Now, to explain 2NF, I would change the example and assume we also keep a list of the countries each person has visited. Instead of keeping a table like:

 Person  CountryVisited  AnotherInformation  DOB
 Faruz   USA             Blah Blah           1/1/2000
 Faruz   Canada          Blah Blah           1/1/2000

I would create three tables: one with the list of countries, one with the list of people, and a third that connects the two. This gives me a lot of freedom: I can change a person's information or a country's information in one place. It also lets me remove the duplicate rows, which is what normalization is after.
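A minimal SQL sketch of that three-table layout (the table and column names here are illustrative, not from the original answer):

    -- each country name is stored exactly once
    CREATE TABLE countries (
        country_id INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL      -- e.g. 'Bosnia and Herzegovina'
    );

    CREATE TABLE people (
        person_id  INTEGER PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        dob        DATE,
        country_id INTEGER REFERENCES countries (country_id)  -- country of origin
    );

    -- junction table: one row per (person, country visited) pair
    CREATE TABLE country_visits (
        person_id  INTEGER REFERENCES people (person_id),
        country_id INTEGER REFERENCES countries (country_id),
        PRIMARY KEY (person_id, country_id)
    );

Renaming a country now touches exactly one row in countries, and no personal details are repeated for every visit.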

+24
Feb 25 '10 at 5:40
source share

A one-to-many relationship should be represented as two separate tables linked by a foreign key. If you try to cram a logical one-to-many relationship into a single table, then you are breaking normalization, which leads to dangerous problems.

Say that you have a database of your friends and their cats. Since a person can have more than one cat, we have a one-to-many relationship between humans and cats. This requires two tables:

 Friends
 Id | Name | Address
 -------------------------
 1  | John | The Road 1
 2  | Bob  | The Belltower

 Cats
 Id | Name   | OwnerId
 ---------------------
 1  | Kitty  | 1
 2  | Edgar  | 2
 3  | Howard | 2

(Cats.OwnerId is a foreign key referencing Friends.Id.)

The above design is fully normalized and conforms to all recognized levels of normalization.
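As a sketch, the two-table design could be declared in SQL roughly like this (the column types are assumed, not part of the original answer):

    CREATE TABLE Friends (
        Id      INTEGER PRIMARY KEY,
        Name    VARCHAR(50) NOT NULL,
        Address VARCHAR(100)
    );

    CREATE TABLE Cats (
        Id      INTEGER PRIMARY KEY,
        Name    VARCHAR(50) NOT NULL,
        OwnerId INTEGER NOT NULL REFERENCES Friends (Id)  -- each cat belongs to exactly one friend
    );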

But say that I tried to hold the same information in a single table like this:

 Friends and cats
 Id | Name | Address       | CatName
 -----------------------------------
 1  | John | The Road 1    | Kitty
 2  | Bob  | The Belltower | Edgar
 3  | Bob  | The Belltower | Howard

(This is the kind of design I might come up with if I were used to Excel worksheets but not relational databases.) A single-table approach forces me to repeat some information if I want the data to stay consistent. The problem with this design is that some facts, such as Bob's address being The Belltower, are repeated twice. That redundancy makes it harder to query and modify the data and (worst of all) makes it possible to introduce logical inconsistencies.

E.g. if Bob moves, I have to make sure I change the address on both rows. If Bob gets another cat, I have to be careful to repeat his name and address exactly as they appear on the other two rows. And if I make a typo in Bob's address on one of the rows, then suddenly the database holds contradictory information about where Bob lives. An unnormalized database cannot prevent the introduction of inconsistent and self-contradictory data, and is therefore not reliable. This is clearly unacceptable.

Normalization cannot protect you from entering wrong data. What normalization does prevent is the possibility of inconsistent data.

It is important to note that normalization depends on business decisions. If you have a customer database and you decide to record only one address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If, however, you decide to allow each customer to register more than one address, then that same table design is no longer normalized, because now there is a one-to-many relationship between customer and address. Therefore you cannot tell whether a database is normalized just by looking at it; you also have to understand the business model behind it.
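If the business does allow multiple addresses per customer, the normalized variant would look roughly like this (a sketch; the table and column names are made up):

    CREATE TABLE Customers (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName VARCHAR(100) NOT NULL
    );

    -- one-to-many: each customer may have any number of addresses
    CREATE TABLE CustomerAddresses (
        AddressID  INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers (CustomerID),
        Address    VARCHAR(200) NOT NULL
    );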

+16
Jul 25 '10 at 14:34

Here is what I ask interviewees:

Why don't we use a single table for the application instead of using multiple tables?

The answer is normalization. As already mentioned, redundancy must be avoided, because redundancy leads to update anomalies.

+9
Feb 25 '10 at 5:57

This is not a complete explanation, but one of the goals of normalization is to allow for growth without awkwardness.

For example, if you have a user table, and each user will have one and only one phone number, it is fine to have a phonenumber column in that table.

However, if each user can have a variable number of phone numbers, it would be awkward to have columns such as phonenumber1, phonenumber2, etc. There are two reasons for this:

  • If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
  • Every user with fewer than 3 phone numbers has empty columns in their row.

Instead, you want a phonenumber table, where each row contains a phone number and a foreign key reference to the row in the user table it belongs to. No empty columns are needed, and each user can have as many or as few phone numbers as necessary.
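A sketch of that layout (the names are illustrative):

    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        name    VARCHAR(100) NOT NULL
    );

    -- one row per phone number; a user can have zero, one, or many
    CREATE TABLE phonenumbers (
        phonenumber_id INTEGER PRIMARY KEY,
        user_id        INTEGER NOT NULL REFERENCES users (user_id),
        phonenumber    VARCHAR(20) NOT NULL
    );

Adding a fourth number for a user is just another row in phonenumbers, not a schema change.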

+6
Feb 26 '10

One side point about normalization: a fully normalized database is space-efficient, but is not necessarily the most time-efficient for data processing, depending on usage patterns.

Traversing several tables to gather all the pieces of information from their normalized locations takes time. In high-load situations (millions of rows per second flying around, thousands of simultaneous clients, e.g. credit card transaction processing), where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized ones.

For more on this, look for the SQL books written by Ken Henderson.

+6
Apr 23 '10 at 17:16

I would say that normalization is like keeping notes efficiently, so to speak:

If you had a note saying that you need to go shopping for ice cream, then without normalization you would end up with another note saying that you need to go shopping for ice cream, one in every pocket.

Now, in real life you would never do it that way, so why do it in a database?

As for the design and implementation part, that is where you can go back to the "jargon" and keep it away from laymen, but I suppose you could simplify it. You would say what is needed first, and then, when normalization comes into it, you say you will make sure of the following:

  • No table should contain repeating groups of information.
  • No table should contain data that is not functionally dependent on that table's primary key.
  • For 3NF, I like the way Bill Kent puts it: every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.

I think it can be more impressive if you also talk about denormalization, and the fact that the best structure will not always be in the highest normal form.

+5
Feb 25 '10 at 5:53

Normalization is a set of rules that are used to create tables linked through relationships.

It helps avoid repetitive entries, reduces the required storage space, prevents the need to restructure existing tables to accommodate new data, and improves query speed.

First normal form: data should be broken down into its smallest units, and tables should not contain repeating groups of columns. Each row is identified by a primary key of one or more columns. For example, if the User table has a single Name column, it should be split into First Name and Last Name. In addition, the User table must have a column such as UserID to identify each specific user.

Second normal form: every non-key column must depend on the entire primary key. For example, if the User table has a column named City, the cities should get their own table with a primary key and the city name; the City column in the User table is then replaced by a CityID foreign key into that table.

Third normal form: no non-key column may depend on another non-key column. For example, in an Orders table a Total column would depend on Unit Price and Quantity, so the Total column should be removed.
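A sketch of that third-normal-form point (table and column names are assumed): the total is not stored, it is derived when queried.

    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        UnitPrice DECIMAL(10,2) NOT NULL,
        Quantity  INTEGER NOT NULL
        -- no Total column: it would depend on UnitPrice and Quantity, not on the key
    );

    -- compute the total when you need it instead of storing it
    SELECT OrderID, UnitPrice * Quantity AS Total
    FROM Orders;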

+5
Feb 26 '10 at 16:27

I teach normalization in my Access courses and break it down a few ways.

After discussing the precursors, storyboarding or planning out the database, I then delve into normalization. I explain the following rules:

Each field should contain the smallest meaningful value:

I write a name field on the board and then put a first and last name in it, like Bill Lemberg. Then I ask the students what problems we will run into when the first and last name are in the same field. I use my own name, Jim Richards, as an example. If the students do not lead me down the path, I take them by the hand and lead them there. :) I tell them my name is a tough one for some people, because I have what some would consider two first names, and some people call me Richard. If you tried to search on my last name, it would be harder for an average person (without wildcards), because my last name is buried at the end of the field. I also tell them that they would have trouble sorting the field by last name, because again my last name is buried at the end.
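For instance, once the name is split into two fields, searching and sorting on the last name becomes trivial (a generic SQL sketch with assumed names; in Access this would typically be done through the query designer):

    CREATE TABLE People (
        PersonID  INTEGER PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName  VARCHAR(50)
    );

    -- easy to find and sort by last name, no wildcards needed
    SELECT FirstName, LastName
    FROM People
    WHERE LastName = 'Richards'
    ORDER BY LastName, FirstName;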

Then I let them know that the level of detail depends on the audience that will be using the database. In our line of work we would not need a separate field for an apartment or suite number when storing people's addresses, but shipping companies such as UPS or FedEx might, so they can easily pull out the apartment or suite they need to go to when they are on the road working from delivery to delivery. So it is pointless to us, but it definitely matters to them.

Avoiding blanks:

I use an analogy to explain why they should avoid blanks. I tell them that Access and most databases do not store blanks the way Excel does. Excel does not care if you type nothing into a cell and will not grow the file, but Access reserves that space until the point in time when you actually use the field. So even when it is empty, it still uses space, and I explain that it also slows their searches down. The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of boots, you have to open up and look in every box for that pair of boots. Empty shoe boxes just waste space in the closet, and they also waste your time when you have to look through them for that particular pair of shoes.

Data Redundancy Prevention:

I show them a table with many duplicate values for customer information, and then tell them that we want to avoid duplicates, because I have sausage fingers and will fat-finger the values if I have to type the same thing over and over. That fat-fingered data will cause my queries to miss the correct records. Instead, we split the data into a separate table and create a relationship using primary and foreign key fields. That way we save space, because we are not typing the name, address, etc. multiple times; instead we just put the customer ID number in the customer field. Then we discuss drop-down lists / combo boxes / lookup lists or whatever Microsoft wants to call them next. :) As a user, you do not want to look up and type a customer number into that customer field every time, so we set up a drop-down list that shows the customers, lets you pick a name, and fills in the customer ID for you. This is a one-to-many relationship: 1 customer can have many different orders.
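The split described above, sketched in generic SQL (Access would normally do this through the table designer; the names here are made up):

    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL,
        Address    VARCHAR(200)
    );

    -- each order stores only the CustomerID, not the name and address again
    CREATE TABLE Orders (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers (CustomerID),
        OrderDate  DATE
    );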

Avoiding repeating groups of fields:

I demonstrate this when I talk about many-to-many relationships. First I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid out something like this:

 (Table1) tblEmployees
 * EmployeeID
   First
   Last
   (Other Fields)....
   Project1
   Project2
   Project3
   Etc.
 **********************************
 (Table2) tblProjects
 * ProjectNum
   ProjectName
   StartDate
   EndDate
   .....

I explain to them that this is not a good way to relate employees to all the projects they work on. First, if we have a new employee, they will not have any projects yet, so all of those fields are wasted; second, if an employee has been here for a long time, they could have worked on 300 projects, so we would have to include 300 project fields. Someone who is new and has only 1 project would have 299 wasted project fields. This design is also flawed because I would have to search every one of the project fields to find all the people who worked on a particular project, since that project number could appear in any of the project fields.
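The usual fix is a third, junction table, which breaks the many-to-many relationship into two one-to-many relationships. A generic SQL sketch (field names assumed, following the table names above):

    CREATE TABLE tblEmployees (
        EmployeeID INTEGER PRIMARY KEY,
        FirstName  VARCHAR(50),
        LastName   VARCHAR(50)
    );

    CREATE TABLE tblProjects (
        ProjectNum  INTEGER PRIMARY KEY,
        ProjectName VARCHAR(100),
        StartDate   DATE,
        EndDate     DATE
    );

    -- one row per employee/project assignment; no wasted Project1..ProjectN columns
    CREATE TABLE tblEmployeeProjects (
        EmployeeID INTEGER REFERENCES tblEmployees (EmployeeID),
        ProjectNum INTEGER REFERENCES tblProjects (ProjectNum),
        PRIMARY KEY (EmployeeID, ProjectNum)
    );

Finding everyone who worked on a given project is then a single lookup on ProjectNum in the junction table.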

That covers quite a few of the basic concepts. Let me know if you have other questions or need help explaining or breaking anything else down in plain English. The wiki page does not read like plain English and can be daunting for some.

+4
Dec 20 '13 at 15:38

I have read the wiki links on normalization many times, but I have found a better overview of normalization in this article. It is a simple, easy-to-understand explanation of normalization up to fourth normal form. Give it a read!

Preview:

What is normalization?

Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals, as they reduce the amount of space a database consumes and ensure that data is logically stored.

http://databases.about.com/od/specificproducts/a/normalization.htm

+1
Apr 23 '10

Database normalization is the formal process of designing your database to eliminate redundant data. The design process consists of:

  • planning what information will be stored in the database;
  • planning what information users will request from it; and
  • documenting the assumptions for review.

Use a data-dictionary or some other metadata view to validate the design.

The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. And do not worry about normalizing data in tables that will only ever have records inserted, never updated, such as history logs or financial transactions.


+1
Aug 28

+1 for the analogy with your wife. I believe that talking to someone without a technical mindset requires a light touch in this kind of conversation.

but...

To add to this conversation, there is another side to the coin (which may be important in an interview).

Along with normalization, you should look at how the database is indexed and how queries are written.

In heavily normalized databases I have found it easy to end up with slow queries, due to bad join operations, poor indexing on the tables, and plain poor design of the tables themselves.

Honestly, it is easier to write bad queries against highly normalized tables.

I think there is a happy medium for each application. At some point you want to be able to pull your data easily, without having to join a ton of tables to get one dataset.

-1
Apr 23 '10 at 17:30


