Normalization (database) - Normalization Techniques and Extensions
Understand how to apply 1NF‑5NF (including BCNF, 4NF, 5NF), why denormalization and refactoring are used, and how lossless‑join decomposition preserves data.
Summary
Read Summary
Flashcards
Save Flashcards
Quiz
Take Quiz
Quick Practice
What is the primary requirement for each field to satisfy First Normal Form?
1 of 11
Summary
Understanding Database Normalization: A Step-by-Step Guide
Introduction: Why Normalization Matters
Database normalization is the process of organizing data in a database schema to minimize redundancy and ensure data integrity. When you build a database table without following normalization principles, you often end up with repeated data, update anomalies, and storage inefficiency. Normalization solves these problems by breaking tables into well-structured, related tables through a series of normal forms. Each normal form represents a higher level of organization and removes specific types of data problems.
The good news? You don't need to understand all seven normal forms in existence. In practice, most databases are normalized to the third normal form (3NF), which handles the vast majority of real-world issues.
The Three Essential Normal Forms
First Normal Form (1NF): Atomic Values Only
The Rule: Every field in a table must contain a single, atomic (indivisible) value. No multi-valued fields are allowed.
Think about what "atomic" means: it's a value that cannot be broken down further into meaningful parts. If a field contains multiple values separated by commas or stored as a list, it violates first normal form.
The Problem: Consider this example table of employees and their skills:
Notice that Employee 426 has two different rows—one for "Typing" and one for "Shorthand". Similarly, Employee 519 appears twice, once for "Public Speaking" and once for "Carpentry". This is actually a violation of 1NF because the intent was to store multiple skills per employee in a single row. While the current display shows separate rows, if these were stored as comma-separated values in a single field (like "Typing, Shorthand"), that would clearly violate 1NF.
The Fix: To achieve 1NF, each skill must be in its own row. If Employee 426 has two skills, they get two separate rows in the table. Alternatively, you could create a separate "Skills" table and link it to the Employees table through a primary key relationship.
Why It Matters: Atomic values make it easy to search, sort, and update data. If you store "Typing, Shorthand" as one value, you can't easily query for all employees with Shorthand skills.
Second Normal Form (2NF): Full Key Dependency
The Rule: Every non-key attribute must depend on the entire primary key, not just part of it. This applies specifically to tables with composite (multi-part) primary keys.
This concept is a bit tricky, so let's break it down carefully. If your primary key is made up of two or more attributes, every other column in the table must relate to that complete key combination, not just one part of it.
The Problem: Imagine a table with a compound primary key of (Faculty ID + Course Code) that stores course and faculty information:
Here, the primary key is Faculty ID plus Course Code (the combination uniquely identifies each row). But notice that "Faculty Name" and "Faculty Hire Date" depend only on the Faculty ID part of the key—not on the complete key including Course Code. Faculty ID 389 will always be "Dr. Giddens" with hire date "10-Feb-1985", regardless of which course is listed. This violates 2NF because these attributes are partially dependent on the key.
The Fix: Split the table into two tables:
A Faculty table with columns: Faculty ID (primary key), Faculty Name, Faculty Hire Date
A Courses table with columns: Faculty ID (foreign key), Course Code (primary key)
Now, in the Faculty table, all non-key attributes depend on the entire key (Faculty ID). In the Courses table, Course Code fully describes what we're storing. Each table has a logical, complete dependency.
Why It Matters: Removing partial dependencies prevents update anomalies. In the original bad table, if you want to update Dr. Giddens' hire date, you'd have to update it in every row where Faculty ID 389 appears. Split it correctly, and you update it once.
Third Normal Form (3NF): No Transitive Dependencies
The Rule: No non-key attribute should depend on another non-key attribute. In other words, every non-key column must depend directly on the primary key—not through some other column.
A transitive dependency is when column C depends on column B, which depends on column A (the primary key). This indirect chain of dependency violates 3NF.
The Problem: Consider a table tracking student enrollments:
| Student ID | Student Name | Major ID | Major Name |
|---|---|---|---|
| 101 | Alice | M1 | Computer Science |
| 102 | Bob | M2 | Biology |
| 101 | Alice | M1 | Computer Science |
Here, Student ID is the primary key. Student Name depends directly on Student ID (good). But Major Name depends on Major ID, not directly on Student ID. If a student is enrolled in two majors, you'd have confusion about whether "Computer Science" belongs to the student or to Major ID "M1". The transitive dependency creates this problem.
The Fix: Create two tables:
A Students table: Student ID (primary key), Student Name, Major ID (foreign key)
A Majors table: Major ID (primary key), Major Name
Now Major Name depends directly on its own primary key (Major ID), and each table has proper, direct dependencies.
Why It Matters: This prevents data anomalies and makes updates cleaner. If Computer Science changes its name to "CS", you update it in one place (the Majors table), not in every student record.
<extrainfo>
Most databases stop at 3NF, and for good reason. By this point, you've eliminated the vast majority of data problems while keeping your schema relatively simple and easy to understand. Most real-world databases aim for 3NF as their normalization target.
</extrainfo>
Advanced Normal Forms
Boyce–Codd Normal Form (BCNF)
The Rule: For every determinant (an attribute that determines another attribute), that determinant must be a candidate key.
BCNF is a stricter version of 3NF. It's rarely necessary in practice, but it matters when you have overlapping candidate keys or complex key structures. Good news: if your table already satisfies 3NF and has only a compound primary key with no non-key attributes, it automatically satisfies BCNF. This means you're already there without extra work.
<extrainfo>
Fourth Normal Form (4NF)
The Rule: A table must satisfy BCNF and have no non-trivial multivalued dependencies.
A multivalued dependency occurs when one attribute's values are independent of another attribute's values. For example, if you track which courses a faculty member teaches and which textbooks they use, these might be independent facts. Storing both in one table can create confusing combinations. 4NF requires you to split these into separate tables. This is rarely tested in introductory database courses.
Fifth Normal Form (5NF)
The Rule: A table must satisfy 4NF and can be reconstructed from smaller tables through a lossless join without any data loss.
5NF is an extremely advanced concept dealing with rare edge cases where even 4NF-compliant tables have join-dependency problems. It's primarily of theoretical interest and virtually never appears in practice.
</extrainfo>
Important Supporting Concepts
Lossless Join Decomposition
When you normalize a database by splitting one table into multiple tables, you're performing a decomposition. A lossless join decomposition guarantees that you can rejoin the tables later and get back exactly the original data without any loss of information.
This is essential to normalization. If you couldn't reconstruct your original data from the decomposed tables, you'd have lost information—a disaster! When you split the Faculty table into separate Faculty and Courses tables (as we did above), a lossless join means that when you join Faculty back to Courses on Faculty ID, you recover all original rows and columns perfectly.
Every valid normalization transformation you perform maintains lossless joins—it's a core principle of the process.
Denormalization: When Normal Isn't Better
Definition: Denormalization is the intentional introduction of redundancy into a normalized database schema to improve query performance.
Wait—didn't we just spend all this time eliminating redundancy? Yes, but sometimes the performance cost of joining many tables together outweighs the benefit of having clean, normalized data.
When You Might Denormalize: If a frequently-run query requires joining 8 tables together, and the business can't afford the time that query takes, you might denormalize by storing some redundant data in one table to eliminate the need for those joins. You're trading storage space and update complexity for query speed.
The Trade-off: The downside is that now when you update that redundant data, you have to update it in multiple places. This is why you only denormalize when performance analysis shows it's necessary, and you do it carefully.
Database Refactoring
Definition: Database refactoring is the practice of making small, incremental changes to a database schema while preserving its functional behavior.
Think of this as evolution, not revolution. Rather than completely redesigning your database schema at once, you make small, testable changes—often guided by normalization principles—and verify that applications still work correctly. For example, you might split one table into two to improve normalization, but you'd do it gradually, perhaps creating the new table first, migrating data over time, then deprecating the old table.
Database refactoring is less about the theory of normalization and more about the practical, safe way to improve an existing schema in production systems where you can't afford downtime.
Flashcards
What is the primary requirement for each field to satisfy First Normal Form?
Each field must contain a single atomic value.
What type of fields, such as a set of subjects, violate the rules of First Normal Form?
Multi-valued fields.
To satisfy Second Normal Form, every non-key attribute must depend on what specific element?
The entire candidate key.
What specific type of dependency is eliminated when achieving Third Normal Form?
Transitive dependencies.
In the context of Third Normal Form, what is a transitive dependency?
A non-key attribute depending on another non-key attribute.
What two conditions allow a table already in Third Normal Form to automatically satisfy Boyce–Codd Normal Form?
It contains only a compound primary key.
It has no non-key attributes.
What specific issue does Fourth Normal Form address within a database table?
Multi-valued dependencies.
What is the primary goal of Fifth Normal Form regarding table decompositions?
Ensuring decompositions do not cause loss of information when joining tables.
What is the intentional purpose of introducing redundancy through Denormalization?
To improve query performance.
What does Database Refactoring involve while preserving the schema's functional behavior?
Making small, incremental changes.
What does a Lossless Join Decomposition ensure happens when a database is split and then reconstructed?
No loss of information occurs.
Quiz
Normalization (database) - Normalization Techniques and Extensions Quiz Question 1: To satisfy Second Normal Form (2NF), what must be true about every non‑key attribute?
- It depends on the whole candidate key (correct)
- It must be atomic
- It cannot be a foreign key
- It must be part of a composite primary key
Normalization (database) - Normalization Techniques and Extensions Quiz Question 2: Which of the following column designs would violate First Normal Form?
- A column that stores a list of subjects in a single field (correct)
- A column that stores a single integer ID
- A column that stores a date value
- A column that stores a foreign key referencing another table
Normalization (database) - Normalization Techniques and Extensions Quiz Question 3: Which of the following statements correctly describes a property of a relation that is in Third Normal Form (3NF)?
- No non‑key attribute depends on another non‑key attribute. (correct)
- Every attribute is part of a candidate key.
- A table may contain multivalued dependencies among attributes.
- The primary key must consist of a single column.
Normalization (database) - Normalization Techniques and Extensions Quiz Question 4: For a decomposition to be lossless, which condition must hold after joining the resulting tables?
- The original rows can be reconstructed exactly. (correct)
- The number of rows in each table remains unchanged.
- All foreign key constraints are eliminated.
- Each table contains a copy of the primary key.
Normalization (database) - Normalization Techniques and Extensions Quiz Question 5: Which of the following statements is true about a relation that is in Third Normal Form and whose only attributes form a composite primary key?
- It also satisfies Boyce–Codd Normal Form (BCNF) (correct)
- It may still violate BCNF because composite keys are not candidates
- It cannot be in BCNF unless it has a single‑column key
- BCNF is unrelated to the presence of non‑key attributes
Normalization (database) - Normalization Techniques and Extensions Quiz Question 6: A table with a compound primary key where one attribute does not uniquely determine the other attribute violates which normal form?
- Fourth Normal Form (4NF) (correct)
- Second Normal Form (2NF)
- Third Normal Form (3NF)
- Boyce–Codd Normal Form (BCNF)
Normalization (database) - Normalization Techniques and Extensions Quiz Question 7: If a decomposition of a relation can be joined back together without producing any spurious tuples, the relation is said to satisfy which normal form?
- Fifth Normal Form (5NF) (correct)
- Fourth Normal Form (4NF)
- Third Normal Form (3NF)
- First Normal Form (1NF)
Normalization (database) - Normalization Techniques and Extensions Quiz Question 8: Denormalization most directly improves which aspect of a database system?
- Query performance (correct)
- Storage efficiency
- Referential integrity
- Security enforcement
To satisfy Second Normal Form (2NF), what must be true about every non‑key attribute?
1 of 8
Key Concepts
Database Normalization Forms
First Normal Form
Second Normal Form
Third Normal Form
Boyce–Codd Normal Form
Fourth Normal Form
Fifth Normal Form
Database Optimization Techniques
Denormalization
Database Refactoring
Lossless Join Decomposition
Definitions
First Normal Form
A relational schema where each field contains only atomic, indivisible values.
Second Normal Form
A schema in which every non‑key attribute fully depends on the entire primary key.
Third Normal Form
A design that eliminates transitive dependencies among non‑key attributes.
Boyce–Codd Normal Form
A stricter form of 3NF where every determinant is a candidate key.
Fourth Normal Form
A normal form that removes undesirable multi‑valued dependencies.
Fifth Normal Form
A schema that guarantees lossless join decomposition for all join dependencies.
Denormalization
The deliberate addition of redundancy to a normalized database to boost performance.
Database Refactoring
Incremental, behavior‑preserving modifications to a database schema.
Lossless Join Decomposition
Splitting a database into tables that can be recombined without losing information.