'Data' is collection of information. But all that data may not be meaningful or useful in its raw form. Clean and quality data is essential to build good visualizations, that can help get better insights to analyze the data and make better data-driven decisions. When data isn’t clean, the decision making based on the analysis of that data may not be accurate, which may lead to failure in what we want to achieve.
What do we mean by clean Data? It means eliminating data that is not required, redundant or irrelevant. We can achieve this by eliminating duplicates, removing columns/rows that are not required, removing null values, eliminating anomalies/errors etc.,
Let’s take a look at how to import, clean and transform data in order to improve data quality using Power BI Desktop.
The first step is to get data. Microsoft Power BI Desktop allows importing data from various data sources like Excel, Text/CSV, PDF, XML, JSON, Oracle Database, SAP Database and many more.
Here on the Ribbon pane, select Get Data, click on More (using Store.csv file for this example)
Select ‘All’ and then, select ‘Text/CSV’ option and click on ‘Connect’
In the pop-up dialogue, select the required file.
After selecting the required data from a data source, we can load it on power BI desktop by clicking on the Load button.
As you can see here, the Store data is now loaded.
Once the data is loaded, we can now Transform the data with the help of the Power Query tool. Select the Store data and click on Transform data
Power Query Editor is a built-in feature in Power BI Desktop, to load, clean and transforms data.
Following are some of the ways to transform data:
Eliminate columns: We can delete unnecessary data from a dataset by eliminating columns that are not required. By doing this, we can reduce the size of the dataset, furthermore freeing up the space in memory.
Renaming a Column:
This feature allows us to rename a column to a more readable or appropriate name based on the requirement. Go to Transform tab, and then click on rename button to rename a column name.
Remove Rows - Remove Top/Bottom/Alternate rows, or Remove Duplicates/Errors:
Based on what we want to achieve in our reports, we may not need all of the data that is stored in a table, to do that we can use the Remove Rows option. This option can also be very useful when we combine multiple datasets, but don’t want all of the rows in our data model. Remove Row option can also be used to remove blank rows, rows with errors or duplicates.
Renaming a Query: This option can be used when we want to rename the query, for reasons like, original table name being too technical or doesn’t make sense to a user, or when we merge multiple tables in a query, and want a name that is more relevant to what we want to achieve. We can easily rename a query in the Power Query Editor by right clicking on the query and selecting ‘Rename’ and changing it as needed.
Split by Delimiter: This option can be used to split a column with text data type into multiple columns, based on the requirement.
In the following example we will split Full name into First Name and Last Name columns.
To do that, select ‘Full Name’ column and click on the Transform tab, then click on the Split Column option, and select ‘by Delimiter’.
In the pop-up dialogue, select the ‘space’ delimiter and click ok.
Full name column gets split into columns shown below. We can replace these column name as ‘First Name’ and ‘Last Name’.
Merge Columns: This options allows us to merge multiple columns to a single column.
Filters: This option is useful when we want to filter out only the most relevant data for our analysis. To filter rows in a column, click on the dropdown arrow on column header, then based on the column data type a Number filter or Text Filter can be applied using some conditions shown below. We can also remove an existing filter. Multiple filters can be applied as per requirement.
Changing incorrect Data types: In the screen shot below, the ‘Order date’ data type is set to text, which is not right. It has to be ‘Date’ data type only then date hierarchies can be created. It is very important to check for any mismatched data types and fix them. If not corrected, this would lead to errors, incorrect calculations, incorrect relationships with other tables or performance issues.
The Column data type can be changed in the Power Query Editor’s Transform tab. From the Transform tab, make the data type selection from the ‘Data type’ tab.
To conclude, data is not always meaningful. So, it is very important to clean and transform data to get high quality information. Quality data can help with decision-making resulting in high success rates. With the help of Power BI desktop and Power Query editor, we can achieve the above and build insightful reports to visualize and analyze data.
Images by Author.
References: https://learn.microsoft.com/en-us/training/modules/clean-data-power-bi/1-introduction