Real-world data is often messy, and messy data leads to inaccurate conclusions and wasted resources. For business decisions to be sound, our data needs to be clean and prepared for analysis. Left uncleaned, dirty data can produce misleading insights, poorly informed decisions based on those insights, and distrust in the analytics process overall.
Data Cleaning: Data cleaning (sometimes also known as data cleansing or data wrangling) is an important early step in the data analytics process. This crucial step, which involves preparing and validating data, usually takes place before your core analysis. Data cleaning is the process of fixing or removing incorrect, incomplete, duplicate, corrupted, or incorrectly structured data in a data set. In some cases, these problems arise due to human error during data entry, or when merging different data structures or combining datasets with different terminology.
Well, cleaning data is not an easy task. If we have a small dataset, then we can easily do it by looking at it for some time and then removing the unwanted rows manually. But what if our dataset is big? What if we have millions of rows in the dataset and cleaning them manually is not possible? What if we want to combine multiple datasets from different data sources? We need some SQL functions to help us out of this situation. So let’s see which SQL functions will help us easily clean our datasets.
1. Remove irrelevant data
What counts as irrelevant data varies with the dataset and the requirement, so we first need to figure out what data is relevant to our analysis. Let’s say we are only interested in customers who live in the US. Data from customers who live outside of the US is useless to us, so we should remove it from the dataset. We can filter it out with the following statement:
SELECT * FROM customers WHERE country = 'US';
With this statement, we’ll only keep the records where the customer’s country is listed as "US".
2. Remove duplicate data
"Duplication" just means that you have repeated data in your dataset. This could be due to things like data entry errors or data collection methods. Duplicate data is inefficient to store and results in slow and inefficient processes and workflows. We can return only one occurrence of each duplicated row:
SELECT DISTINCT * FROM customers;
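SELECT DISTINCT only hides duplicates in the query result; the table itself still contains them. As a rough sketch, the snippet below uses Python's built-in sqlite3 module and a made-up customers table (the columns and rows are assumptions for illustration) to list duplicated rows with GROUP BY and HAVING, then delete the extra copies. It relies on SQLite's implicit rowid column; in other databases you would typically use a primary key or ROW_NUMBER() instead.

```python
import sqlite3

# Hypothetical sample data, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", "Mexico"), ("Ana", "Mexico"), ("Bob", "US"), ("Ana", "Mexico")],
)

# List each combination of values that appears more than once.
duplicates = conn.execute(
    """
    SELECT name, country, COUNT(*) AS copies
    FROM customers
    GROUP BY name, country
    HAVING COUNT(*) > 1
    """
).fetchall()

# Keep the first copy of every group (lowest rowid) and delete the rest.
conn.execute(
    """
    DELETE FROM customers
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM customers GROUP BY name, country
    )
    """
)
remaining = conn.execute(
    "SELECT name, country FROM customers ORDER BY name"
).fetchall()

print(duplicates)
print(remaining)
```

Listing the duplicates before deleting them gives you a chance to check whether they really are repeats or just customers who happen to share the same values.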
3. Fix structural errors
Structural errors are inconsistencies in how the data is recorded. They include strange naming conventions, typing mistakes, or incorrect capitalization.
In the PostalCode and Country columns of our data, there are instances of both "N/A" and "Not Applicable" even though they represent the same thing. This should be fixed by standardizing the convention. The following statement returns PostalCode and Country columns in which "N/A" and "Not Applicable" are replaced with NULL values.
SELECT CustomerID, CustomerName, ContactName, Address, City,
IF(PostalCode IN ('Not Applicable', 'N/A'), null, PostalCode) AS PostalCode,
IF(Country IN ('Not Applicable', 'N/A'), null, Country) AS Country
FROM customers;
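The IF() function used above is specific to MySQL. If your database does not support it, a standard CASE expression does the same job. Here is a minimal sketch using Python's built-in sqlite3 module; the sample rows and the reduced column list are assumptions made for the example.

```python
import sqlite3

# Hypothetical sample rows with inconsistent placeholders for missing values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (CustomerName TEXT, PostalCode TEXT, Country TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", "N/A", "Mexico"), ("Bob", "12209", "Not Applicable")],
)

# CASE returns NULL for either placeholder and the original value otherwise.
rows = conn.execute(
    """
    SELECT CustomerName,
           CASE WHEN PostalCode IN ('Not Applicable', 'N/A') THEN NULL
                ELSE PostalCode END AS PostalCode,
           CASE WHEN Country IN ('Not Applicable', 'N/A') THEN NULL
                ELSE Country END AS Country
    FROM customers
    ORDER BY CustomerName
    """
).fetchall()

print(rows)
```

SQL NULL comes back as None in Python, so the two placeholder values end up as None in the result.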
Here are other structural issues you might encounter:
TRIM(), LTRIM(), or RTRIM() functions can be used to remove any unnecessary spaces at the beginning or the end of the string. For example, " Analysis " should be trimmed to "Analysis".
INITCAP() function can be used to fix incorrect capitalization. For example, "analyst" can be converted to "Analyst". It is available in databases such as Oracle and PostgreSQL, but not in MySQL or SQL Server.
UPPER() and LOWER() can be used to convert the text into all uppercase or lowercase letters.
REPLACE() function can be used to replace some of the characters or text. For instance, we want to change "Data Analysis" into "Data Analyst". We can use REPLACE() to replace "Analysis" with "Analyst".
CONCAT() function can be used to combine strings from several columns together (and with hard-coded values). Simply list the values you want to concatenate in order and separate them with commas. For example, CONCAT(First_Name, ', ', Last_Name)
There are other string functions that can be used to clean the data.
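To see a few of these functions in action, here is a small sketch that runs them through Python's built-in sqlite3 module. The literal values are made up for the example. INITCAP() and CONCAT() are left out because the SQLite version bundled with Python may not provide them; the standard || operator is used for concatenation instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each expression cleans one hard-coded example string.
trimmed, upper, lower, replaced, joined = conn.execute(
    """
    SELECT TRIM('  Analysis  '),
           UPPER('analyst'),
           LOWER('ANALYST'),
           REPLACE('Data Analysis', 'Analysis', 'Analyst'),
           'Ada' || ', ' || 'Lovelace'
    """
).fetchone()

print(trimmed, upper, lower, replaced, joined, sep=" | ")
```

In a real cleaning query you would pass column names to these functions instead of literals, for example TRIM(CustomerName).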
4. Do type conversions
The data type lets the database know what to expect from each column and also determines the kind of interactions that can occur. For example, if you want a column to contain only integers, you can use the “int” data type for it. Type conversion is crucial to maintain consistency. For example, numbers are often entered as text, but for them to be processed and used in calculations, they need to be converted to a numeric type.
Implicit Data Conversions: When data needs to be converted from one data type to another, SQL Server can often convert the value automatically (implicitly). For example, if you add two integers that have different data types, such as a SMALLINT and an INT, SQL Server automatically converts the SMALLINT value to INT and produces a numeric sum. If we mix data types such as INT and VARCHAR, SQL Server attempts to convert everything to the type with the highest precedence, which here is INT. This is why we sometimes get errors if we try to use implicit conversions to add a number to a string: a string such as 'abc' cannot be converted to INT.
Explicit Data Conversions: Explicit conversions let you exercise more control over your data type conversions whenever you compare, combine, or move data from one database object to another. To support explicit conversions, SQL Server provides two important functions: CAST and CONVERT. The functions are similar in their ability to convert data. CAST is part of the SQL standard, while CONVERT is a SQL Server extension that also accepts an optional style argument.
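The effect of an explicit conversion is easiest to see on numbers stored as text. The sketch below uses Python's built-in sqlite3 module rather than SQL Server, so only the standard CAST syntax is shown, and the orders table and its values are assumptions made for the example. Keep in mind that SQLite is more lenient with malformed input than SQL Server, which raises an error instead.

```python
import sqlite3

# Hypothetical table where amounts were loaded as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)", [("9",), ("10",), ("100",)])

# Compared as text, '9' sorts after '100', so the "maximum" is wrong.
max_as_text = conn.execute("SELECT MAX(amount) FROM orders").fetchone()[0]

# After an explicit CAST the values are compared as numbers.
max_as_number = conn.execute(
    "SELECT MAX(CAST(amount AS INTEGER)) FROM orders"
).fetchone()[0]

print(repr(max_as_text), max_as_number)
```

Without the cast the query silently returns the wrong answer, which is why converting columns to the right type early in the cleaning process matters.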
5. Handle missing data
Handling missing data is a more complex process. Many algorithms and analysis tools won’t accept missing values, and the right way to handle missing data depends on why it is missing. There are a couple of options, neither of which is optimal because both lose information, but they can be implemented in a way that minimizes the negative impact.
When dealing with missing data, we can use two primary methods: imputation or removal.
The imputation method develops reasonable guesses for missing data. It’s most useful when the percentage of missing data is low. If the portion of missing data is too high, then this method can decrease the integrity of your data.
The other option is to remove data. When data is missing at random, the affected records can be deleted without introducing much bias. Removing data may not be the best option if it leaves too few observations for a reliable analysis. In some situations, observations of specific events or factors are required and cannot simply be dropped.
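Both approaches can be sketched in SQL. The example below uses Python's built-in sqlite3 module with a made-up customers table (the columns and values are assumptions). It imputes a missing age with the average of the known ages via COALESCE, and separately filters out the incomplete rows. Mean imputation is only one of several possible strategies and is used here purely for illustration.

```python
import sqlite3

# Hypothetical sample data with one missing age.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ana", 30), ("Bob", None), ("Cy", 40)],
)

# Imputation: replace a missing age with the average of the known ages.
# AVG() ignores NULLs, so the average is taken over 30 and 40 only.
imputed = conn.execute(
    """
    SELECT name, COALESCE(age, (SELECT AVG(age) FROM customers)) AS age
    FROM customers
    ORDER BY name
    """
).fetchall()

# Removal: keep only the rows where age is present.
complete = conn.execute(
    "SELECT name, age FROM customers WHERE age IS NOT NULL ORDER BY name"
).fetchall()

print(imputed)
print(complete)
```

Both queries leave the original table untouched, so you can compare the two results before deciding which approach suits your analysis.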
6. Find outliers
An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Some outliers represent natural variations in the population, and they should be left as is in your dataset. These are called true outliers. Other outliers are problematic and should be removed because they represent measurement errors, data entry or processing errors, or poor sampling. We can remove outliers if they are irrelevant to our analysis, but otherwise outliers can be a meaningful part of our data.
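One common way to flag candidate outliers is to measure how many standard deviations each value lies from the mean (a z-score). The sketch below does this with Python's built-in sqlite3 module on a made-up orders table. The cut-off of 2 is an assumption chosen for this small sample, not a universal rule, and because SQLite has no built-in standard deviation function the variance is computed from averages. The query only flags values; whether to remove them remains the analyst's call.

```python
import sqlite3

# Hypothetical order amounts with one suspicious value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [(v,) for v in (10, 12, 11, 13, 12, 11, 10, 500)],
)

Z_THRESHOLD = 2  # assumed cut-off; tune it for your own data

# Population variance is AVG(x^2) - AVG(x)^2. A row is flagged when its
# squared distance from the mean exceeds (threshold^2 * variance), which
# is the same as |z-score| > threshold without needing a square root.
outliers = conn.execute(
    """
    WITH stats AS (
        SELECT AVG(amount) AS mean_amount,
               AVG(amount * amount) - AVG(amount) * AVG(amount) AS var_amount
        FROM orders
    )
    SELECT o.amount
    FROM orders AS o, stats AS s
    WHERE (o.amount - s.mean_amount) * (o.amount - s.mean_amount)
          > :z * :z * s.var_amount
    """,
    {"z": Z_THRESHOLD},
).fetchall()

print(outliers)
```

Note that a single extreme value inflates both the mean and the variance, so on very small samples a z-score check can miss outliers; methods based on quartiles are a common alternative.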
7. Normalize data
Normalization is the process of eliminating data redundancy and enhancing data integrity in tables. It also helps to organize the data in the database. It is a multi-step process that puts the data into tabular form and removes duplicated data from the relational tables.
Database normalization proceeds through the following normal forms:
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Fourth Normal Form (4NF)
Fifth Normal Form (5NF)
Sixth Normal Form (6NF)
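As a rough illustration of what normalization does in practice, the sketch below takes a flat table in which customer details are repeated on every order and splits it into a customers table and an orders table linked by a key. It uses Python's built-in sqlite3 module. All table names, columns, and rows are hypothetical, and identifying a customer by name and city is a simplifying assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Flat table: Ana's city is stored once per order.
    CREATE TABLE orders_flat (order_id INTEGER, customer_name TEXT,
                              customer_city TEXT, product TEXT);
    INSERT INTO orders_flat VALUES
        (1, 'Ana', 'Lima', 'Pen'),
        (2, 'Ana', 'Lima', 'Ink'),
        (3, 'Bob', 'Oslo', 'Pen');

    -- Each customer is now stored exactly once.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY,
                            name TEXT, city TEXT);
    INSERT INTO customers (name, city)
        SELECT DISTINCT customer_name, customer_city
        FROM orders_flat
        ORDER BY customer_name;

    -- Orders refer to customers by key instead of repeating their details.
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(customer_id),
                         product TEXT);
    INSERT INTO orders
        SELECT f.order_id, c.customer_id, f.product
        FROM orders_flat AS f
        JOIN customers AS c
          ON c.name = f.customer_name AND c.city = f.customer_city;
    """
)

customer_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# Joining the two tables reproduces the original flat rows.
rebuilt = conn.execute(
    """
    SELECT o.order_id, c.name, c.city, o.product
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
    """
).fetchall()

print(customer_count)
print(rebuilt)
```

After the split, a change to a customer's city has to be made in only one row, which is the kind of redundancy and integrity problem normalization is meant to address.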
8. Validate data
The last step is to validate your data. At the end of the data cleaning process, you need to ensure that your data is valid, accurate, complete, consistent, and uniform. During data validation, we make sure that our data is complete (there are no unexpected blank or null values), unique (the values are not duplicated), and accurate, that it has the expected format and size, and that it does not contain any unaccepted special characters. Validating data can prevent false conclusions based on low-quality data. Data validation is necessary in order to:
Prevent delays on the various projects in your company.
Reduce the risk of making poor business decisions.
Increase the efficiency of the organization.
Protect applications and prevent downtime.
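Validation checks like the ones described in this step can be written as simple counting queries, where a count above zero points to a problem. The sketch below runs three such checks with Python's built-in sqlite3 module. The customers table, its rows, and the very loose email pattern are assumptions made for the example, not a complete validation suite.

```python
import sqlite3

# Hypothetical sample data containing one problem of each kind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [
        (1, "ana@example.com"),
        (2, None),
        (2, "bob@example.com"),
        (3, "not-an-email"),
    ],
)

# Each check counts the rows (or keys) that violate one rule.
checks = {
    # Completeness: no NULL or blank emails.
    "missing_email": """
        SELECT COUNT(*) FROM customers
        WHERE email IS NULL OR TRIM(email) = ''
    """,
    # Uniqueness: every customer_id appears only once.
    "duplicate_ids": """
        SELECT COUNT(*) FROM (
            SELECT customer_id FROM customers
            GROUP BY customer_id HAVING COUNT(*) > 1
        )
    """,
    # Format: a deliberately loose "something@something.something" pattern.
    "bad_email_format": """
        SELECT COUNT(*) FROM customers
        WHERE email IS NOT NULL AND email NOT LIKE '%_@_%._%'
    """,
}

report = {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}
print(report)
```

Keeping the checks in a dictionary makes it easy to add more rules later and to rerun the whole set every time the data is refreshed.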
Conclusion
Data cleaning is probably the most important part of the data analytics process. Good data hygiene isn’t just about data analytics, though; it’s good practice to maintain and regularly update your data anyway. While no data is perfect, tracking errors and understanding where they’re coming from can make the data cleaning process much easier in the future.