Data warehouse - Modeling Design and Data Marts
Understand the differences between operational and analytic databases, the dimensional versus normalized modeling approaches, and the key design methodologies and data‑mart types.
Summary
Read Summary
Flashcards
Save Flashcards
Quiz
Take Quiz
Quick Practice
What core principles do operational databases prioritize for transaction recording?
1 of 27
Summary
Operational Databases versus Analytic Databases
Understanding the Fundamental Difference
Databases serve different purposes in organizations, and understanding these purposes is essential to database design. Operational databases and analytic databases are optimized for completely different workloads and questions.
Operational databases answer the question: How do I quickly record this transaction right now? Analytic databases answer the question: What patterns or trends exist in our data?
Operational Databases: Speed and Integrity
Operational databases prioritize data integrity and fast transaction recording. They use database normalization and entity-relationship models to organize data logically. Normalization means that a single business transaction—like a customer purchase—is typically spread across many tables (a customer table, an orders table, a products table, and so on).
This design is efficient because it avoids storing the same information multiple times. For example, instead of storing a customer's name and address in every single order record, you store it once in a customer table and reference it.
The database engine that powers operational databases is row-oriented, meaning it stores entire rows together. This makes inserting and updating individual records very fast, which is essential when you need to process thousands of transactions per second.
Analytic Databases: Answers over Speed
Analytic databases have the opposite priority. Instead of optimizing for inserting one row at a time, they optimize for answering complex questions about large amounts of data. These databases use column-oriented storage, meaning data from a single column is stored together, rather than entire rows.
Why does this matter? Consider a question like "What was our total revenue by product category last quarter?" You only need the revenue and category columns, not customer names, addresses, order IDs, or dozens of other fields. A column-oriented database can quickly find and aggregate just those two columns without loading all the other data.
Analytic databases store aggregated, historical data organized in special schemas (like star schemas) designed to make complex analyses fast. We'll explore these schemas shortly.
Transaction Processing Systems
OLTP: Online Transaction Processing
OLTP systems handle the high volume of short transactions typical of operational databases. These transactions are usually simple operations: inserting a new order, updating a customer record, or deleting an item.
Performance for OLTP is measured in transactions per second (TPS). A system that can handle 10,000 TPS is considered high-performing.
OLAP: Online Analytical Processing
OLAP systems handle a much lower rate of transactions—but the transactions are far more complex. Instead of "insert this one order," a transaction might be "show me total sales by region and product, broken down by month, for the last three years, comparing this year to last year."
Performance for OLAP is measured in response time. Users expect these complex queries to finish in seconds or minutes.
OLAP supports several specific types of operations:
Roll-up: Consolidating data at higher levels of aggregation (e.g., moving from daily sales to monthly sales)
Drill-down: Going into more detail (e.g., moving from regional sales to store-level sales)
Slicing and dicing: Selecting specific subsets of data based on different dimensions (e.g., showing only Q3 data for certain product categories)
Data Organization in Warehouses
Facts: The Numbers That Matter
A fact is a measurable value or metric stored in the warehouse. Think of facts as the numbers your business cares about: revenue, number of orders, customer count, page views, or any other quantifiable outcome.
Facts come in two forms:
Raw facts are reported directly by the source system. For example, "Customer ID 12345 placed an order for $150 on March 15."
Aggregated facts summarize raw facts across dimensions. For example, "Total revenue for the Northeast region in March was $2,000,000."
Two Approaches to Warehouse Storage
Data warehouses typically use one of two fundamentally different storage approaches. Each has tradeoffs, and choosing between them is a critical design decision.
The Dimensional Approach: Star Schema
The dimensional approach uses a star schema that separates facts from dimensions.
Dimensions are reference data that provide context for the facts. Common dimensions include:
Time: Date, month, quarter, year
Customer: Customer name, segment, geography
Product: Product name, category, supplier
Location: City, region, country
The star schema visualizes this relationship: a central fact table contains the measurable metrics and foreign keys pointing to dimension tables arranged around it like points on a star.
Why use a star schema?
The dimensional approach has clear advantages:
Business users can easily understand the structure (dimensions are things they know about; facts are numbers they care about)
Query performance is fast because the schema is optimized for analytical queries
The data cube structure makes multidimensional analysis natural and intuitive
However, there are challenges:
It can be complex to maintain referential integrity when loading data from multiple operational systems
Modifying the warehouse structure is difficult once it's built (adding a new dimension or changing an existing one requires careful planning)
The Normalized Approach: Third Normal Form
The normalized approach stores data following database normalization rules. Data is organized into many subject-area tables with detailed relationships between them. This approach uses many joins to combine data from different tables when answering questions.
Why might you use this approach?
Normalized storage reduces redundancy (the same information isn't stored multiple times). However, this comes with disadvantages:
Business users find it difficult to work with the schema because they must understand complex relationships and write queries with many joins
Using the schema requires detailed knowledge of the database structure
Query writing becomes more complex and error-prone
Design Methodologies for Data Warehouses
Building a data warehouse is a major undertaking, and the order and approach you take matters significantly. There are three primary design methodologies.
Bottom-Up Design
In bottom-up design, you start small and build up. Data marts (focused data warehouses for specific departments) are created first. Each data mart addresses a specific business process or department need—for example, a Sales data mart, a Marketing data mart, and an HR data mart.
As you build multiple data marts, you integrate them using a bus architecture that shares common dimensions and facts. For example, all data marts might use the same Customer and Time dimensions, making it possible to compare metrics across departments.
This approach is pragmatic—you can deliver value quickly to individual departments while building toward a comprehensive enterprise warehouse.
Top-Down Design
In top-down design, you build a comprehensive enterprise data model first. This normalized model stores the most detailed (atomic) data in the central warehouse.
Dimensional data marts are then derived from this warehouse for specific departmental reporting needs. This approach ensures consistency across the enterprise and reduces redundancy in storage.
The tradeoff: it takes longer to deliver results because you must build the entire enterprise model before departmental teams can get their analytics.
Hybrid Design
In hybrid design, you combine elements of both approaches. A spoke-hub paradigm consolidates data from legacy systems into an operational data store (ODS) before loading the warehouse.
The warehouse might store normalized data to minimize redundancy, while data marts built on top use dimensional modeling for user-friendly reporting.
This approach balances the speed-to-value of bottom-up with the consistency guarantees of top-down.
Data Marts
What Is a Data Mart?
A data mart is a focused data warehouse serving a single subject area or functional department. Instead of storing everything an enterprise knows, a data mart stores what a specific group needs: the Sales department's data mart has sales transactions and customer information; the Finance department's might focus on cost centers and budgets.
Data marts are smaller and more specialized than enterprise warehouses, making them faster to query and easier for users in a department to understand and navigate.
Three Types of Data Marts
Data marts can be created in different ways:
Dependent data marts obtain their data from a central enterprise warehouse. They're fed by the warehouse rather than building data directly from source systems. This ensures consistency across the organization.
Independent data marts are built directly from source systems without passing through a central warehouse. Teams build exactly what they need, without waiting for enterprise infrastructure. However, different departments might end up with conflicting data definitions.
Hybrid data marts combine characteristics of both dependent and independent approaches, drawing data from both the central warehouse and source systems.
Flashcards
What core principles do operational databases prioritize for transaction recording?
Data integrity and fast transaction recording.
Which design models are typically used in operational databases to prioritize data integrity?
Database normalization and the entity-relationship model.
How do normalized designs impact the storage of a single business transaction?
They spread the transaction across many tables.
What type of database management system architecture provides high insert and update performance for operational workloads?
Row-oriented DBMS.
For what specific type of data retrieval are analytic databases optimized?
Selecting specific fields rather than all fields.
Which database management system architecture is best suited for complex queries and aggregations in analytics?
Column-oriented DBMS.
What kind of data and schemas do analytic databases typically use?
Aggregated, historical data in multidimensional schemas (like star schemas).
What is the primary characteristic of the transactions handled by OLTP systems?
Large numbers of short transactions (inserts, updates, and deletes).
What is the key metric used to measure OLTP performance?
Number of transactions per second.
What kind of workload is characteristic of OLAP systems?
Low rate of transactions but complex, aggregation-heavy queries.
What is the primary performance metric for OLAP systems?
Response time.
What are the core operations included in OLAP?
Roll-up (consolidation)
Drill-down
Slicing and dicing
In the context of a data warehouse, what is the definition of a "fact"?
A measurable value or metric.
What is the difference between raw facts and aggregated facts?
Raw facts are reported directly by the source; aggregated facts summarize raw facts across dimensions.
How does the dimensional approach (star schema) separate data?
It separates facts (numeric data) from dimensions (contextual reference data).
What structure is formed when facts are linked to dimensions for multidimensional analysis?
A data cube.
What are the primary advantages of the dimensional storage approach?
Ease of understanding for business users
Fast query performance
What are the disadvantages of using a normalized approach in a data warehouse?
Difficulty for users to join many tables
Need for detailed knowledge of the schema
How are multiple data marts integrated into a comprehensive warehouse in the bottom-up approach?
Using a bus architecture that shares common dimensions and facts.
What is built first in a top-down design methodology?
A normalized enterprise data model storing atomic data.
How are departmental reporting needs met in the top-down design model?
Dimensional data marts are derived from the central warehouse.
What paradigm is used in hybrid design to consolidate legacy data before loading the warehouse?
The spoke-hub paradigm (using an operational data store).
In a typical hybrid design, how is data stored in the warehouse versus the data marts?
Normalized form in the warehouse; dimensional modeling in the data marts.
What is the definition of a data mart?
A focused data warehouse serving a single subject area or department.
From where do dependent data marts obtain their data?
A central data warehouse.
How do independent data marts differ from dependent ones in terms of data source?
They are built directly from source systems instead of a central warehouse.
What characterizes a hybrid data mart?
It combines characteristics of both dependent and independent approaches.
Quiz
Data warehouse - Modeling Design and Data Marts Quiz Question 1: In a data warehouse, what term describes a measurable metric stored for analysis?
- Fact (correct)
- Dimension
- Attribute
- Record
Data warehouse - Modeling Design and Data Marts Quiz Question 2: Which design methodology creates data marts first to address specific business processes?
- Bottom‑Up design (correct)
- Top‑Down design
- Hybrid design
- Star schema design
Data warehouse - Modeling Design and Data Marts Quiz Question 3: Which of the following is an OLAP operation?
- Roll‑up (consolidation) of data (correct)
- Index creation for faster inserts
- Normalization of tables
- Transaction logging for recovery
Data warehouse - Modeling Design and Data Marts Quiz Question 4: How does a dependent data mart obtain its data?
- From a central warehouse (correct)
- Directly from source operational systems
- By extracting data from a data lake
- Through ad‑hoc user queries
Data warehouse - Modeling Design and Data Marts Quiz Question 5: What type of database management system is optimized for complex queries and aggregations commonly needed in analytics?
- Column‑oriented DBMS (correct)
- Row‑oriented DBMS
- Document‑oriented NoSQL DBMS
- Graph database
Data warehouse - Modeling Design and Data Marts Quiz Question 6: Which of the following is a key advantage of using a dimensional (star schema) approach for business users?
- Easy to understand and navigate (correct)
- Reduces data storage requirements
- Eliminates the need for any joins
- Provides real‑time transaction processing
Data warehouse - Modeling Design and Data Marts Quiz Question 7: In a top‑down design methodology, what is created first?
- A normalized enterprise data model storing atomic data (correct)
- Dimensional data marts for departmental reporting
- An operational data store
- A data lake of raw data
Data warehouse - Modeling Design and Data Marts Quiz Question 8: Data marts built on top of a warehouse typically use which modeling approach for reporting?
- Dimensional modeling (correct)
- Normalized third normal form
- Entity‑relationship modeling
- No formal modeling (ad‑hoc)
Data warehouse - Modeling Design and Data Marts Quiz Question 9: Which of the following is an example of a typical OLTP transaction?
- Inserting a new order record (correct)
- Running a monthly sales summary report
- Generating a customer segmentation model
- Performing a data warehouse ETL load
Data warehouse - Modeling Design and Data Marts Quiz Question 10: How is the performance of an OLTP system most commonly measured?
- Transactions per second (correct)
- Queries per hour
- Data throughput in GB per day
- Number of concurrent users
In a data warehouse, what term describes a measurable metric stored for analysis?
1 of 10
Key Concepts
Database Types
Operational database
Analytic database
Column‑oriented database
Processing Systems
Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP)
Data Modeling Approaches
Star schema
Data mart
Bottom‑up design
Top‑down design
Hybrid design
Definitions
Operational database
A database optimized for transaction processing that emphasizes data integrity, normalization, and high insert/update performance.
Analytic database
A database designed for complex queries and aggregations, often column‑oriented and using multidimensional schemas like star schemas.
Online Transaction Processing (OLTP)
Systems that handle large volumes of short, simple transactions such as inserts, updates, and deletes.
Online Analytical Processing (OLAP)
Systems that support low‑frequency, complex, aggregation‑heavy queries for multidimensional analysis.
Star schema
A dimensional data modeling approach that separates numeric facts from descriptive dimension tables, forming a simple, query‑friendly structure.
Data mart
A focused subset of a data warehouse that serves a single business area or department.
Bottom‑up design
A methodology where individual data marts are built first and later integrated into an enterprise data warehouse.
Top‑down design
A methodology that creates a comprehensive, normalized enterprise data model before deriving dimensional data marts.
Hybrid design
A design approach that combines normalized enterprise storage with dimensional data marts, often using an operational data store as an intermediate layer.
Column‑oriented database
A database management system that stores data by column, optimizing read‑heavy analytic workloads and aggregations.