Database Normalization for Dummies
I'm currently normalizing one of my client's database, and thought it's a nice time to mention about normalization.
Some of you may think that normalization is just for academic studies and does not apply to real-life situations. Some of you have "somewhat heard" about it. Some of you wonder whether it is a food to eat, or a country to visit.
Actually normalization is a way of optimizing data. While optimizing, you ensure data consistency at the same time. There are six levels of normalization which you can find more details in the wikipedia.
There are six levels of normalization. However most applications use third normal form (3NF) and therefore we will be dealing with the first three levels of normalization in this post.
1. First Normal Form (1NF)
This is the most elementary requirement. It states that data (the domain of attribute) should be atomic. No composite values are allowed. The database should contain a separate table for multi-valued data.
For instance the table below is not in its first normal form (and most of you are sure that it is not a good idea to store n favorite colors in n separate columns)
Now let us normalize it.
More organized, isn't it?
Now let us move to the next level of normalization.
2. Second Normal Form (2NF)
Second normal form requires the data to be in the first normal form.
A table is said to be in the 2NF if and only if it is in the 1NF and every non-key column is irreducably (i.e. not partially) dependent on the primary key (1)
(from #1) 2NF applies to tables that have composite primary keys, where two or more columns comprise the primary key. Because if the key were a singleton key, then the relation would be guaranteed to be in at least 2NF by definition (#1), provided that it fullfills the criteria of being 1NF. (2)
Let us clarify with an example:
Here, setting order number and client id as a composite primary key is correct (since they fully determine the relation: a client may have more than one order). The order total is also fully functionally dependent on the primary key.
However "client name" is only dependent on the client id and thus breaks the second normal form (2NF).
Why is it that important? Let us assume we have billions of those records, and all of a sudden alice decides to change her name to "Felicia"(well it's her name, and she can change it as she wishes and demands it from a court). Now you need to update half a billion rows, changing all instances of "Alice" to "Felicia".
Now let us put it into 2NF.
Now instead of a billion updates, you only need to update a single row.
And let us see the third normal form (3NF) finally.
3. Third Normal Form (3NF)
Third normal form states that, the data should be in the second normal form (2NF) first of all.
And in addition there should be no dependencies of a non-key column on a field other than the primary key.
Let us give an example by slightly modifying our first table given in the 2NF example.
This table is in 2NF by definition (#2). Since there is one primary key, all columns are irreducably dependent on that key.
However the table breaks 3NF because client city is dependent on the non-key attribute "client name". Thus we need to take it apart.
Here is the proper 3NF form:
Bottom Line
If all of these seem too complicated, just keep these two in mind and 99% of the time you are done:
Cheers.
bu yaziyi sevdin mi?
hemen
una ekle!
Some of you may think that normalization is just for academic studies and does not apply to real-life situations. Some of you have "somewhat heard" about it. Some of you wonder whether it is a food to eat, or a country to visit.
Actually normalization is a way of optimizing data. While optimizing, you ensure data consistency at the same time. There are six levels of normalization which you can find more details in the wikipedia.
There are six levels of normalization. However most applications use third normal form (3NF) and therefore we will be dealing with the first three levels of normalization in this post.
1. First Normal Form (1NF)
This is the most elementary requirement. It states that data (the domain of attribute) should be atomic. No composite values are allowed. The database should contain a separate table for multi-valued data.
For instance the table below is not in its first normal form (and most of you are sure that it is not a good idea to store n favorite colors in n separate columns)
| person | favorite color 1 | favorite color 2 | favorite color 2 |
|---|---|---|---|
| Alice | red | green | blue |
| Bob | cyan | magenta | black |
Now let us normalize it.
| person |
|---|
| Alice |
| Bob |
| person | favorite color |
|---|---|
| Alice | red |
| Alice | green |
| Alice | blue |
| Bob | cyan |
| Bob | magenta |
| Bob | black |
More organized, isn't it?
Now let us move to the next level of normalization.
2. Second Normal Form (2NF)
Second normal form requires the data to be in the first normal form.
A table is said to be in the 2NF if and only if it is in the 1NF and every non-key column is irreducably (i.e. not partially) dependent on the primary key (1)
(from #1) 2NF applies to tables that have composite primary keys, where two or more columns comprise the primary key. Because if the key were a singleton key, then the relation would be guaranteed to be in at least 2NF by definition (#1), provided that it fullfills the criteria of being 1NF. (2)
Let us clarify with an example:
| order number (PK) | client id (PK) | client name | order total |
|---|---|---|---|
| 1 | 1 | alice | $100 |
| 2 | 1 | alice | $120 |
| 3 | 2 | bob | $130 |
| 4 | 2 | bob | $130 |
Here, setting order number and client id as a composite primary key is correct (since they fully determine the relation: a client may have more than one order). The order total is also fully functionally dependent on the primary key.
However "client name" is only dependent on the client id and thus breaks the second normal form (2NF).
Why is it that important? Let us assume we have billions of those records, and all of a sudden alice decides to change her name to "Felicia"(well it's her name, and she can change it as she wishes and demands it from a court). Now you need to update half a billion rows, changing all instances of "Alice" to "Felicia".
Now let us put it into 2NF.
| order number (PK) | client id (PK) | order total |
|---|---|---|
| 1 | 1 | $100 |
| 2 | 1 | $120 |
| 3 | 2 | $130 |
| 4 | 2 | $130 |
| client id (PK) | client name |
|---|---|
| 1 | Alice |
| 2 | Bob |
Now instead of a billion updates, you only need to update a single row.
And let us see the third normal form (3NF) finally.
3. Third Normal Form (3NF)
Third normal form states that, the data should be in the second normal form (2NF) first of all.
And in addition there should be no dependencies of a non-key column on a field other than the primary key.
Let us give an example by slightly modifying our first table given in the 2NF example.
| order number (PK) | client name | client city | order total |
|---|---|---|---|
| 1 | alice | istanbul | $100 |
| 2 | alice | istanbul | $120 |
| 3 | bob | izmir | $130 |
| 4 | bob | izmir | $130 |
This table is in 2NF by definition (#2). Since there is one primary key, all columns are irreducably dependent on that key.
However the table breaks 3NF because client city is dependent on the non-key attribute "client name". Thus we need to take it apart.
Here is the proper 3NF form:
| order number (PK) | client name | order total |
|---|---|---|
| 1 | alice | $100 |
| 2 | alice | $120 |
| 3 | bob | $130 |
| 4 | bob | $130 |
| client name | client city |
|---|---|
| alice | istanbul |
| bob | izmir |
Bottom Line
If all of these seem too complicated, just keep these two in mind and 99% of the time you are done:
- Group logical parts apart.
- Create relations to protect referential integrity.
Cheers.
bu yaziyi sevdin mi?
hemen
una ekle!









