Data cleaning is the most important and initial step for Data Analysis. Raw data needs to be cleaned before we perform any kind of analysis on it. After cleaning, the data may have to be divided into different tables and relationships need to be established between them before proceeding to Analysis. Cleaning the data is very crucial since the whole analysis depends on how accurate the data is. Skewed data leads to incorrect analysis and results. A thorough understanding of the data set, sources and all the available documentation is the fundamental requirement before starting the data cleaning process.
Cleaning in Power BI
The first step to clean data in Power BI is to go to power query using transform option after loading the data. In power query editor, under the view tab, there is 'column quality' option, turn that on to know the quality of each column. This gives information on distribution of empty values, errors and valid data in the column. Checking the 'Column distribution' gives information on unique and distinct values in the column. Before starting the cleaning process turning on these two options will be helpful for better understanding of the columns.
These are the basic cleaning steps that can be considered irrespective of the type of data we are dealing with. All the options that are shown in the below can found by right clicking on the selected column.
Remove duplicates Duplicates can be removed by using the option 'Remove Duplicates' in power BI. However, data must be studied well prior to removing the duplicates rows. We must make sure we are not disturbing the data set which leads to incorrect analysis.
Renaming columns
In raw data column names may not be appropriate or may contain unnecessary information. Such columns need to be renamed for easy understanding. Try to stick to the same format while naming the columns for the whole data set.
Fixing the incorrect data types
Some columns data types may not match the data it contains. For instance, a column containing numerical values might have data type as alpha-numeric or text, then it has to be changed to numerical.
Handling null values or missing values
Null values can be handled in multiple ways depending on the data. Few ways to handle them are replacing null with zero if it's a numerical column and replacing it something like 'N/A' in case of text columns. Another approach is filling the values using statistical methods, if the analyst thinks changing the null values does not skew the data.
To fill the values based on previous cell values, Power BI has an option of 'Fill'. The 'Fill down' option takes the values in a cell and fills the cells having 'null' below it until a cell a that has a non-null value. The 'Fill up' option fills the cells having null values with the value of nearest non-null cell below it.
Correct the incorrect data
All the columns should be checked for any incorrect values. Data should be corrected based on our understanding of the data set. Power BI automatically detects if there are any obvious errors in a column when you enable 'Column quality' in view tab. You can use the option 'Replace Errors' and replace those errors.
Removing unnecessary columns
Raw data may contain a huge number of columns which may not be of any use for analysis. For instance, a column may have a description of Medicine or some disease, this column might have some other use but may not be useful from data analytics point of view. Again, thorough understanding of data is very crucial while doing this. If a potentially useful data column is removed, that is loss of data and analysis will be impacted.
Merging or Splitting columns
Some of the columns might be associated with each other, like giving similar information. We can try to merge them together into one column that can contain information about the all the said columns. On the other hand, there may be columns giving a lot more information which cannot be easily accessed during analysis. These kinds of columns must be split, and data is distributed so that it is readily available for analysis. To split the column, we must add a new column. There is 'Add column' tab in the top taskbar in power query editor.
There are multiple ways of adding a new column. Most common way is add using 'Custom Column' option. It allows us to add a new column based on that logic that is written.
Conclusion
When cleaning, the data must be approached from a Data Analyst point of view. Can I analyze this data? Will I be able to visualize data using the data in this column? If not, how would I prefer to transform this column so that I can use it in my analysis? These are some of the questions that need to be asked. There is no single solution to cleaning the data, everyone may have different cleaning techniques and approaches, but the basic principles remain the same!