Subjects/Technology/Data and AI/Database/Database

Database Architecture and Storage Engine

Understand database architectures, storage engine responsibilities, and key features such as indexing, materialized views, and replication.

Summary

Read Summary

Flashcards

Save Flashcards

Quiz

Take Quiz

Quick Practice

In what two ways can database configuration parameters be tuned?

1 of 14

Summary

Database Management System Essentials Introduction A database management system (DBMS) is software that enables the storage, retrieval, and management of data. To understand how modern databases work, we need to examine their architecture at multiple levels—from how they're configured and tuned, to how they physically store data, to how they're deployed and accessed by users. This study guide covers the essential architectural and operational concepts you need to master. Database Engine and Configuration At the core of every DBMS is a database engine that processes queries and manages data storage. This engine has numerous configuration parameters that can be adjusted to optimize performance for your specific use case. Configuration parameters fall into two categories: Static configuration requires restarting the database to take effect. For example, you might statically configure the maximum amount of main memory the database is allowed to use. Dynamic configuration can be changed while the database is running without interruption. This flexibility is valuable when you need to adjust performance settings on the fly, such as increasing cache size during periods of high demand. The key insight here is that databases aren't one-size-fits-all. A small application might need very little memory, while a large data warehouse might need terabytes. Configurable parameters let the database adapt to different environments. Performance Tuning Performance tuning is the practice of adjusting database settings to optimize response time and throughput. Several key areas are commonly tuned: Indexing strategies: Deciding which columns to index and how to structure those indexes Cache sizes: Determining how much data should be kept in fast memory Query execution plans: Selecting efficient paths for retrieving data Performance tuning is critical because a poorly configured database can be orders of magnitude slower than an optimized one. A query that takes 30 seconds might take 0.3 seconds after proper tuning. <extrainfo> Extensibility and Data Model Variants The terminology used to describe databases often reflects their underlying architecture or data model: Relational database management system (RDBMS): Stores data in tables with rows and columns Object-oriented database management system (OODBMS): Stores data as objects with properties and methods Distributed database management system (DDBMS): Manages data spread across multiple physical locations </extrainfo> Database Architectures How users and applications connect to a database has evolved significantly. Understanding different architectural patterns is essential for recognizing when each approach is most appropriate. Client–Server Architecture The client–server architecture was a major step forward in database design. In this model: The client is the user's desktop computer, running the application The server is a separate machine hosting the database Processing is distributed: the client handles the user interface and some application logic, while the server handles data storage and queries This architecture was revolutionary because it allowed multiple clients to safely access the same database simultaneously, without each user needing their own complete copy of the data. It also meant server resources could be dedicated to database management rather than running diverse applications. Multitier Architecture Modern applications typically use multitier architecture, which adds additional layers between the client and database: Presentation tier: The user interface, typically delivered through a web browser Application tier: Web servers and application servers running business logic Data tier: The database server A critical difference from client–server is that the database is now isolated from the client. Users connect to a web server, which then connects to the application server, which finally connects to the database. Each connection between tiers is carefully managed. This architecture provides several advantages: Security: Database credentials don't travel to the client; the application server handles authentication Scalability: You can run multiple application servers sharing one database, distributing load Flexibility: You can modify business logic on the application server without changing client software Embedded Database Administration Embedded databases are a special case designed for applications that bundle their own database engine. The defining characteristic is zero-administration: these databases are designed to operate with minimal manual configuration or ongoing administrative tasks. Think of an embedded database like SQLite in a mobile app—it works automatically without requiring a database administrator to monitor it, tune it, or perform backups. This contrasts sharply with enterprise databases that typically require dedicated staff to manage them. Storage Concepts While database architecture describes how users access databases, storage concepts describe what happens under the hood—how data is physically written to disk and organized for efficient retrieval. Physical Storage Layer The physical storage layer is where the actual database files live on disk. These files contain: The raw data (table contents) Structural information (how tables are organized) Semantic information (constraints, relationships, data types) Think of the storage layer as the "true" database—everything above it is an abstraction. Even if the server crashes and restarts, the storage layer preserves all this information, allowing the database to reconstruct its state. Storage Engine and Indexing The storage engine is the component responsible for reading data from and writing data to the storage layer. It typically uses the operating system's file system as an intermediary—instead of writing directly to raw disk sectors, it creates and manages files that the OS manages. One of the storage engine's most important responsibilities is managing indexes. An index is a data structure that provides faster access paths to data. Without indexes, the database would need to read every single row to find matching records—a process called a table scan. With proper indexing, the database can jump directly to matching records. For example, if you frequently query films by release year (as in the SQL query shown in img1), the storage engine might create an index on the releaseyear column. When you search for films from 2006, instead of scanning all 25 rows, the index lets the database locate them directly. Row-Oriented vs. Column-Oriented Storage Conventional storage organizes data row-oriented: a row is stored sequentially on disk, with all columns of that row together. Column-oriented storage stores all values of a single column together, followed by all values of the next column, and so on. Why does this matter? It depends on your workload: Row-oriented is faster when you need all columns for a few rows (typical transactional workloads: "Give me all info about customer #5") Column-oriented is faster when you need one column across many rows (typical analytical workloads: "Show me the average replacementcost across all films") Column-oriented databases also compress better because similar values are stored together. A column of release years might compress to a small fraction of its original size. <extrainfo> Correlation databases represent a more advanced storage technique that optimizes for specific query patterns by storing data in correlation clusters that reflect how data is actually accessed together. </extrainfo> Data Redundancy and Distribution Strategies Modern databases employ several strategies to improve availability, performance, and resilience. These involve creating additional copies or views of data. Materialized Views A materialized view is a pre-computed query result stored in the database. Rather than computing a complex query every time a user requests it, the result is already calculated and stored. For example, if you frequently need a report showing the count of films by release year, you could create a materialized view that stores this aggregate. The next time someone requests that report, the database retrieves the pre-computed results rather than scanning millions of rows. The tradeoff is clear: you save query execution time, but you use additional storage space. Additionally, whenever underlying data changes (a new film is added), the materialized view must be refreshed, adding maintenance overhead. Database Replication Database replication creates one or more copies of database objects across different servers or locations. This serves multiple purposes: Data availability: If one server fails, a replica can take over Performance improvement: Read queries can be distributed across multiple replicas, balancing load Resilience: The database survives partial failures (losing some servers but not all) The challenge with replication is consistency: when data is modified on one server, all replicas must be updated. Different databases handle this differently—some updates happen synchronously (ensuring all copies are updated before the operation completes) or asynchronously (updates propagate in the background). Data Virtualization Data virtualization is fundamentally different from replication. Rather than creating copies of data, virtualization provides a unified view of data that lives in multiple locations. Imagine you have customer data in one database, order data in another, and product information in a third. Data virtualization lets you query across all three sources simultaneously without physically copying data to a local database. The data stays where it is—you're accessing it "virtually." This is particularly valuable for analytics and reporting across disparate systems, as you get real-time data without the storage cost and synchronization overhead of replication. Essential Database Operations Database Logging Database logs maintain a historical record of all executed operations. They serve two critical purposes: Audit trails: For compliance and security, tracking who did what and when Recovery: If the database crashes, logs can be replayed to restore the database to a previous state Without logging, a crash could destroy uncommitted work. With proper logging, you can recover reliably from almost any failure. Query Optimization When you submit a query to a database, it must be executed. But there are usually multiple ways to execute the same query—some vastly more efficient than others. The query optimizer is responsible for automatically selecting an efficient execution plan. It analyzes the query, considers available indexes and data distribution, and chooses a strategy. For example, if you're joining two tables, the optimizer might decide to: Use an index on one table to find matching rows quickly Filter rows early to reduce the amount of data being processed Cache intermediate results if they're large Different storage engines have different optimization capabilities, so the optimizer works closely with the storage engine to make informed decisions.

Flashcards

In what two ways can database configuration parameters be tuned?

Statically or dynamically

What three specific types of DBMS terminologies indicate the underlying data model or deployment environment?

Relational database management system (RDBMS) Object‑oriented database management system (OODBMS) Distributed database management system (DDBMS)

How is processing distributed in a traditional client-server database architecture?

Between the application on a client desktop and the database on a server

Which specific components were added to database systems by the multitier architecture?

Application servers and web servers

In a multitier architecture, how is the end-user interface typically delivered?

Through a web browser

What design priority allows embedded databases to operate without manual configuration or ongoing tasks?

Zero-administration

What three elements are contained within the physical materialization of a database in the storage layer?

Data Structure Semantics (needed to reconstruct higher‑level views)

What is the primary responsibility of the storage engine regarding data persistence?

Writing data to permanent storage

What intermediate system does the storage engine often use when writing to permanent storage?

The operating system's file system

What is the primary purpose of using materialized views in a database?

To store frequently needed query results to avoid recomputing them

What are the two main costs or trade-offs associated with using materialized views?

Storage redundancy Update overhead

How does data virtualization enable analytics across multiple sources without creating a local copy?

By providing real-time access to data in its original locations

For what two primary purposes do database logs maintain a history of functions and operations?

Audit Recovery

What is the main function of the query optimizer for each database query?

Selecting an efficient execution plan

Quiz

Which storage model organizes data by columns rather than rows?

1 of 2

Key Concepts

Database Fundamentals

Database Management System (DBMS)

Storage Engine

Database Index

Materialized View

Database Replication

Data Access and Optimization

Data Virtualization

Query Optimizer

Client‑Server Architecture

Multitier Architecture

Column‑oriented Database

Definitions

Database Management System (DBMS)

Software that defines, creates, maintains, and controls access to databases.

Storage Engine

Component of a DBMS that handles writing, reading, and managing data on physical storage.

Database Index

Data structure that accelerates data retrieval by providing fast access paths to rows.

Materialized View

Precomputed query result stored in the database to improve read performance, refreshed periodically.

Database Replication

Process of copying and synchronizing database objects across multiple servers for availability and load balancing.

Data Virtualization

Technology that offers real‑time, unified access to data in its original locations without moving or copying it.

Query Optimizer

Subsystem that selects the most efficient execution plan for a given SQL query.

Client‑Server Architecture

Network model where client applications request services from a server that hosts the database.

Multitier Architecture

System design that separates presentation, application logic, and data storage into distinct layers.

Column‑oriented Database

Type of database that stores data by columns rather than rows, optimizing analytical workloads.