Database Design
Notes from Lis Sulmont's Database Design Course
Most of the content (and all images if not specified differently) are taken from Database Design by Lis Sulmont.
Understand the requirements
Questions worth asking
- Schemas: How should my data be logically organized?
- Normalization: Should my data have minimal dependency and redundancy?
- Views: What joins will be done most often?
- Access control: Should all users of the data have the same level of access?
- DBMS: How do I pick between all the SQL and noSQL options?
OLTP or OLAP?
OLTP
stands for Online Transaction Processing- Find the price of a book
- Update latest customer transaction
- Keep track of employee hours
OLAP
stands for Online Analytical Processing- Calculate books with best profit margin
- Find most loyal customers
- Determine employee of the month
OLTP | OLAP | |
---|---|---|
Purpose | support daily transactions | report and analyze data |
Design | application-oriented | subject-oriented |
Data | up-to-date, operational | consolidated, historical |
Size | snapshot, gigabytes | archive, terabytes |
Queries | simple transaction and frequent updates | comples, aggregate queries & limited updates |
Users | thousands | hundreds |
Database Design by Lis Sulmont
Summary
Aks yourself: Is my project OLAP, OLTP or something else?
Storing Data
Structured vs. unstructured
Structured data | Unstructured data | Semi-structured data |
---|---|---|
Follows a schema | Schemaless | Does not follow larger schema |
Defined data types and relationships | Makes up most of data in the world | self-describing structure |
SQL, tables in a relational database | photos, chat logs, MP3 | NoSQL, XML, JSON |
Beyond traditional databases
Traditional databases | Data warehouses | Data lakes |
---|---|---|
for storing real-time relational structured data | for analyzing archived structured data | for storing data of all structures (flexibility and scalability) |
OLTP | OLAP | Big data |
Data warehouses
Database Design by Lis Sulmont
Overview
- optimized for analytics - OLAP
- organized for reading/aggregating data
- usually read-only
- contains data from multiple sources
- massively parallel processing (MPP)
- typically uses a denormalized schema and dimensional modeling
- Data marts are subsets of data warehouses dedicated to specific topics
Practice
- Deploy a data warehouse on AWS
- Deploy a data warehouse on Azure
- Deploy a data warehouse on Google
Data lakes
- Store all types of data at a lower cost
- e.g., raw, operational databases, IoT device logs, real.time, relational and non-relational
- Retains all data and can take up petabytes
- Schema-on-read as opposed to schema-on-write
- Need to catalog data otherwise becomes a data swamp
- Run big data analytics using services such as Apache Spark and Hadoop Useful for deeplearning and data discovery because activities require so much data
Data flows
ETL
ETL stands for Extract Transform Load
Database Design by Lis Sulmont
ELT
ELT stands for Extract Load Transform
Database Design by Lis Sulmont
What is database design?
- Determines how data is logically stored
- How is data to be read and updated?
- Uses database models: high-level specifications for database structure
- Most popular: relational model
- Some other options: NpSQL models, object-oriented model, network model
- Uses schemas: blueprint of the database
- Defines tables, fields, relationships, indexes, and views
- When inserting data in relational datbases, schemas must be respected
Data modeling
Process of creating a data model
- Conceptual data model: describes entities, relationships, and attributes
- Tools: data structure diagrams, e.g., entity-relational diagrams and UML diagrams
Entities, Relationships, and Attributes
Database Design by Lis Sulmont
- Logical data model: defines tables, columns, relationships
- Tools: database models and schemas, e.g., relational model and star schema
Fastes conversion: Entities become the tables
Database Design by Lis Sulmont
Alternative 1: All in one table to minimize number of joins
Database Design by Lis Sulmont
Alternative 2: Add tables to improve data integrity
Database Design by Lis Sulmont
- Physical data model: describes physical storage
- Tools: partitions, CPUs, indexes, backup systems and tablespaces
Beyond the relational model
Dimensional modeling
Adaptation of the relational model for data warehouse design
- Optimized for OLAP queries: aggregate data, not updating (OLTP)
- Built using the star schema
- Easy to interpret and extend schema
Dimensional models consist of two types of tables:
Fact tables
- Holds records of a metric
- Decided by business use-case
- changes regularly
- connects to dimensions via foreign keys
Dimension tables
- Holds descriptions of attributes
- Does not change as often
Database Design by Lis Sulmont
Organizational principle
- What is being analyzed?
- How often do entities change?
Star and snowflake schema
Dimensional modeling
Star schema
- Fact tables
- holds records of a metric
- changes regularly
- connects to dimensions via foreign keys
- Dimension tables
- holds descriptions of attributes
- does not change as often
Snowflake schema
- More than one dimension
- Dimension tables are normalized
- Database design technique
- Divides tables into smaller tables and connects them via relationships
- Goal: reduce redundancy and increase data integrity
Identify repeating groups of data and create new tables for them