In general, a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data. The data in warehouse is stored per subject area or business domain example, Client, Accounts, Address etc. Data warehouse also gets data from multiple sources and clubs them together, reduces data redundancy and creates one single source of true data. Usually in business terms called as golden source. Historical data is usually maintained in data warehouse and is not altered. In this blog, we will look at some of the key concepts in Data Warehouse projects.
Dimensional Data Model: It is a very commonly used terminology in data warehousing. Let’s define few more terms used in this data model.
Dimension: It is a category of information: All similar information needs to be put into a single dimension. For example, Accounts. It can be considered as a dimension. Different attributes related to Account like Account_Number, Account_Open_Date, Account_Close_Date, Account_Type etc can be attributes of this Accounts dimension.
Attribute: A unique level within dimension. As described above Account_Number, Account_Open_Date, Account_Close_Date, Account_Type etc are attributes of Accounts dimension.
Fact Table: It is table which stores information about organizational measures. For Example, an Account has done a Check transaction of certain Amount. Then your Transaction_Type_ID, Transaction_Amount, Currency_Code_ID are measures of Fact table. Transaction_Type_ID and Currency_Code_ID are surrogate keys of Transaction and Currency Dimension tables.
Lookup Table: The lookup table provides the detailed information about the attributes. For Example. A Transaction codes can be stored in one separate Lookup table. You can have different transaction codes for Checks, Debits, Credits, Wires, International transfers etc.
Datawarehouse is generally designed into Start Schema or SnowFlake Schema.
Star Schema:
In the star schema design, the fact table sits in the middle and is connected to other dimension and lookup tables like a star. Each dimension is represented as a single table. The primary key in each dimension table is related to a foreign key in the fact table. All measures in the fact table are related to all the dimensions that fact table.
Let's look at an example: Assume our data warehouse keeps Aviation Related data. You can have dimensions like Traveler, Passport, Destination, Foreign Contact and Fact table as Travel Itinerary. Sample ER diagram is below.
Snowflake Schema:
Snowflake is an extension to Star schema where each dimension further extends into additional dimensions.
For Example, in Aviation related data model, Destination Dimension can be further divided into more dimensions as shown in below ER diagram.
The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increased number of tables to maintain.
Types of Facts:
There are three types of facts:
Additive: Additive facts are facts that can be summed up through all the dimensions in the fact table.
Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
Slowly Changing Dimensions:
Data is ever changing and so does data in dimensions and fact tables is bound to change. Slowly changing dimensions (SCD) are most common in data warehouse and are used to maintain historical data records within warehouse.
Types of SCD:
Basically, there are 3 types of SCD.
· Type 1: The new record replaces the original record. No trace of the old record exists. No history is maintained.
· Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
· Type 3: The original record is modified to reflect the change.
Let’s take a simple example to explain SCDs. I have opened a Savings Account with a Bank ABC while I was residing at my old address in New York. I have now moved to New Jersey.
Original Record:
Customer ID | Customer Name | Customer DOB | Customer State |
123 | ABCDE | 19-JUN-1975 | New York |
Type 1 Record:
Customer ID | Customer Name | Customer DOB | Customer State |
123 | ABCDE | 19-JUN-1975 | New Jersey |
Type 2 Record:
Customer ID | Customer Name | Customer DOB | Customer State | Current Flag |
123 | ABCDE | 19-JUN-1975 | New York | N |
123 | ABCDE | 19-JUN-1975 | New Jersey | Y |
Type 3 Record:
Customer ID | Customer Name | Customer DOB | Current State | Original State |
123 | ABCDE | 19-JUN-1975 | New Jersey | New York |
There are advantages and disadvantages of each of these SCD’s. Main advantage of SCD1 is that its super easy to maintain however you will miss all historical data. You should go for SCD type 1 only when there is no organizational need to maintain historical data. SCD Type 2 allows us to maintain all historical data however it will cause size of tables to grow very fast eventually decreasing performances on DB. You should use SCD 2 when its absolutely must to maintain historical records. SCD type 3 does not maintain full but partial history.
Factless Fact Table:
A factless fact table is a fact table that does not have any measures. It is essentially an intersection of dimensions.
Junk Dimension:
In data warehouse design, frequently we run into a situation where there are yes/no indicator fields in the source system. Ideally these indicators are placed in Fact tables but sometimes it make more sense to put these indicators in separate Dimensions than Fact table itself. By this we make sure the size of the Fact tables is not increased unnecessarily.