Data Modeling- A brief introduction:
Let's explore how to create different designs of data models for different scenarios… What questions need to be answered? What visuals are we going to use, and whether it will answer the client's questions?
In its simplest terms, data modeling is the visual representation of how the dataset is arranged, so it can be understood just by looking at the entity-relationship diagram of the data model. I will be talking about data models with respect to Power BI only in this blog. A data model cannot be generalized for each business purpose. Data models are created keeping a few questions in mind, like, what purpose does it serve? What business questions does it need to answer? What visualizations do we need to create from that data model?
So, the first step, before creating a data model is finding the answer to the question: What business questions does it need to answer? In other words, what are the requirements of the client?
I will be talking about data models by using an Olympic dataset which I have downloaded from Kaggle.
Here are the table names, which are imported into Power BI.
Let's look at the data model which is automatically created by Power BI after the data is imported.
As you can see, there are no relationships created between a few tables. Only one relationship is created between Tokyo 2021 dataset and Medals, and that is bi-directional, which is again not good for various reasons. If we want to retrieve a few columns by propagating through the relationship from one table to another table with the help of DAX formula, it can’t be done, as there is no relationship between tables. Now, we have to carefully look at the columns and establish relationships manually in Power BI. There are three types of relationships between a fact table and dimension tables. These are:
One to many
Many to one
Many to many
Also, the type of relationships can be unidirectional, for filter propagation from one side to many sides and bi-directional, which cross filters the connected tables. The most efficient relationship is one to many and we should avoid bi-directional relationship, if we are not sure about this relationship, to avoid unnecessary issued in the visualizations.
Let's start with a question or requirement of the business. We will talk about different data models/ relationships in different scenarios.
Scenario 1: What changes are countries seeing in the Olympics over time?
Here we are being asked to do the Performance Analysis over time. Now , time to think, which data model would be perfect in this scenario.
In this scenario, the model should focus on tables which are populated with the data related to time and Performance. In other words, we have to analyze the performance of countries over the years.
I have changes the names of columns in coaches, athletes, noc_region and Teams for better understanding.
Now, the model looks like this:
In this model, fact table should be athletes_events and rest tables should be dimensions. We need Year from athlete_events table, medal information from athlete_events table and Countries from Athletes table. In the below data model, there is a relationship between concerned tables and relationship is established between Athletes table and athletes_events table by using game_type and sports columns.
Here is the visual after plotting Total Medals and countries by Year.
Scenario 2: Is there any correlation between the population of a country and its success in the Olympics?
Now, to answer this question, we need to change the data model. We are now concerned with a country's population and if that factor affects its success. So, the fact table should be Medals and other tables would be dimensions and there should be a relationship established between population_by_count and Medals table, using Team/NOC and countries columns.
The new data model is:
Scenario 3: Does team composition (coaches, athletes, etc.) impact winning the medal?
In this scenario, as we have to look for the tables Athelete_events, athletes, coaches and teams, we have to create relationship between these tables so filter can propagate through the relationships and we can get the value we want. Fact table would be Athelete_events and others would be dimensions. The relationship is created between NOC and countries columns as NOC column has an abbreviation of countries.
Lets see how our data model should look like to answer this questions.
Now for calculating Team composition metrics, we will create measures such as Average team size, Gender diversity and Medals per athlete. These will provide a baseline for analysis of medal success.
We can plot these visuals/ metrics according to requirements.
In real world scenarios, we have to keep the overall picture of business requirements, granularity of the analysis, adaptability and optimization of data model in mind, before designing the tailored data model. The data model should be in such a way that it can extract maximum value out of the data and make sure the numbers in the results are correct.
Thanks for reading!