I spent the last couple of months analyzing data. No matter how many charts I created or how sophisticated the algorithms were, the results were always misleading. That is how I came to understand why data cleaning is important.
Does data cleaning matter?
Data cleaning is necessary for valid and appropriate analyses. Raw data contain inconsistencies or errors, but cleaning your data helps you minimize or resolve these.
Without data cleaning, you could end up with a Type I error (a false positive) or a Type II error (a false negative) in your conclusion. Incorrect or inconsistent data leads to false conclusions, so how well you clean and understand the data has a large impact on the quality of the results.
RAW DATA VS. CLEAN DATA:
RAW DATA | CLEAN DATA |
Invalid | Valid |
Inaccurate | Accurate |
Incomplete | Complete |
Inconsistent | Consistent |
Duplicate entries | Unique |
Incorrectly formatted | Uniform |
BEFORE CLEANING THE DATA:
Things to consider:
· How was the data collected, and under what conditions?
· What does the data represent?
· Which methods will be used to clean the data, and why?
· Is improving the process worth the time and money you will invest in it?
WORKFLOW OF CLEANING DATA:
INSPECTION:
Inspecting the data is time-consuming and requires several methods of exploring the underlying data to detect errors. Here are some of the things to look for:
· Duplicate data
· Invalid data
· Missing values
· Outliers
· Are there formatting irregularities for dates, or textual or numerical data?
· Do some columns have a lot of missing data?
· Are any rows duplicate entries?
· Do specific values in some columns appear to be extreme outliers?
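The checks above can be sketched with pandas. This is a minimal illustration rather than a complete inspection routine: the sample data, the column names, the expected date format, and the 1.5 × IQR outlier rule are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": [34, 29, 29, None, 410],
    "signup_date": ["2021-01-05", "2021-02-05", "2021-02-05",
                    "2021-03-09", "09/03/2021"],
})

# Missing values: count of missing entries in each column.
missing_per_column = df.isna().sum()

# Duplicate data: number of rows that repeat an earlier row exactly.
duplicate_rows = int(df.duplicated().sum())

# Formatting irregularities: dates that do not match the assumed
# YYYY-MM-DD format become NaT when parsed with errors="coerce".
parsed_dates = pd.to_datetime(df["signup_date"], format="%Y-%m-%d",
                              errors="coerce")
bad_dates = parsed_dates.isna()

# Outliers: flag ages outside 1.5 * IQR of the quartiles (one common rule).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("badly formatted dates:", int(bad_dates.sum()))
print(outliers)
```

The IQR rule is only one way to flag extreme values; a flagged value still needs a human decision about whether it is an error or a genuine observation.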
CLEANING:
Data cleaning involves different techniques depending on the problem and the data type. Several methods can be applied, and each has its own trade-offs. Overall, incorrect data is either removed, corrected, or imputed.
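As a rough sketch of those three options, the example below corrects inconsistent text, removes rows that cannot be repaired, and imputes a missing number. The data, the column names, and the choice of the median for imputation are assumptions made for illustration, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical data with inconsistent text and missing values.
df = pd.DataFrame({
    "city": ["Paris", "paris ", "Rome", "Rome", None],
    "temp_c": [21.0, 19.5, None, 24.0, 18.0],
})

# Correct: normalise whitespace and capitalisation in the text column.
df["city"] = df["city"].str.strip().str.title()

# Remove: drop rows whose city is missing and cannot be recovered.
df = df.dropna(subset=["city"])

# Impute: fill the missing temperature with the column median.
median_temp = df["temp_c"].median()
df["temp_c"] = df["temp_c"].fillna(median_temp)

print(df)
```

Removing rows loses information, while imputing adds values that were never observed, so the right choice depends on how much data is affected and how the result will be used.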
IRRELEVANT DATA:
· Irrelevant data is data that is not actually needed and does not fit the context of the problem we’re trying to solve. Analyze the dataset both column-wise and row-wise to find it.
· Drop a column or row only if you are sure that the data is unimportant.
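A minimal sketch of dropping irrelevant data column-wise and row-wise with pandas follows. The column names and the assumption that only French customers are in scope are hypothetical.

```python
import pandas as pd

# Hypothetical dataset; "internal_note" is assumed to be irrelevant
# to the question being analyzed.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "purchase_amount": [20.5, 13.0, 7.25],
    "internal_note": ["n/a", "n/a", "n/a"],
    "country": ["FR", "FR", "DE"],
})

# Column-wise: drop columns that carry no information for the problem.
irrelevant_columns = ["internal_note"]
df_relevant = df.drop(columns=irrelevant_columns)

# Row-wise: keep only the rows that are in scope (assumed here: France).
df_relevant = df_relevant[df_relevant["country"] == "FR"]

print(df_relevant)
```

Working on a new variable instead of overwriting `df` keeps the original data available in case a dropped column turns out to matter later.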
DUPLICATES:
Duplicates are data points that are repeated in your dataset.
They often occur when, for example:
· Data is combined from different sources.
· A user hits the submit button twice, thinking the form wasn’t actually submitted.
· An online booking request is submitted a second time to correct information that was entered wrongly the first time.
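The two form-related cases above can be sketched with pandas. The booking data and column names are hypothetical, and the rule "keep the latest submission per email and travel date" is an assumption about how a corrected resubmission should be resolved.

```python
import pandas as pd

# Hypothetical bookings: row 2 is an accidental double submit of row 1,
# and row 4 is a corrected resubmission of row 3.
bookings = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "b@example.com",
              "c@example.com", "c@example.com"],
    "travel_date": ["2024-05-01", "2024-05-02", "2024-05-02",
                    "2024-05-03", "2024-05-03"],
    "seats": [2, 1, 1, 3, 4],
    "submitted_at": pd.to_datetime([
        "2024-04-01 10:00", "2024-04-01 11:00", "2024-04-01 11:00",
        "2024-04-01 12:00", "2024-04-01 12:05",
    ]),
})

# Exact duplicates (double submit): keep the first copy of identical rows.
deduped = bookings.drop_duplicates()

# Corrected resubmissions: for the same email and travel date,
# keep only the most recent submission.
latest = (
    deduped.sort_values("submitted_at")
           .drop_duplicates(subset=["email", "travel_date"], keep="last")
)

print(latest)
```

Which columns identify "the same" record is a decision about the data, so the `subset` list should be chosen with the collection process in mind.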
CONVERSION TYPE:
· Make sure numbers are stored as numerical data types.
· Values that can’t be converted to the specified type should be converted to an NA value (or another missing-value marker), with a warning displayed. This indicates that the value is incorrect and must be fixed.
· Watch out for values like “0”, “Not Applicable”, “NA”, “None”, “Null”, or “INF”. They might all mean the same thing: the value is missing.
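One way to sketch this conversion in pandas is shown below. The list of missing-value markers is an assumption for the example; "0" is left out because zero is often a legitimate value, so whether to treat it as missing depends on the dataset. Note that `pd.to_numeric` with `errors="coerce"` does not warn on its own, so the warning is raised explicitly here.

```python
import warnings

import pandas as pd

# Hypothetical raw column read in as text.
raw = pd.Series(["12", "7.5", "NA", "Not Applicable", "abc",
                 "None", "INF", "3"])

# Placeholders assumed to mean "missing" in this dataset.
MISSING_MARKERS = ["NA", "Not Applicable", "None", "Null", "INF"]

# Replace the known markers with a real missing value (NaN).
cleaned = raw.mask(raw.isin(MISSING_MARKERS))

# Convert to numbers; anything unconvertible becomes NaN instead of failing.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Values that were present but could not be converted need attention.
failed = cleaned.notna() & numeric.isna()
if failed.any():
    warnings.warn(
        f"{int(failed.sum())} value(s) could not be converted: "
        f"{cleaned[failed].tolist()}"
    )

print(numeric)
```

Separating known missing markers from genuine conversion failures makes the warning point only at values that actually need to be fixed.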
VERIFYING:
· Re-verify the data and make sure everything is correct.
· This might sometimes involve manual corrections.
· You can visualize and see the difference between raw and clean data.
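A verification step could look like the sketch below, which re-runs a few checks on the cleaned data and compares summary statistics before and after. The data, the `verify` helper, and the assumed valid age range of 0 to 120 are hypothetical. A plot such as a histogram of each version would serve the same purpose as the comparison table.

```python
import pandas as pd

# Hypothetical raw data and a cleaned version derived from it.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara", "Dan"],
    "age": [34, 29, 29, None, 410],
})
clean = raw.drop_duplicates().dropna(subset=["age"])
clean = clean[clean["age"].between(0, 120)]


def verify(df):
    """Return a list of problems still present in the data (hypothetical helper)."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows")
    if df["age"].isna().any():
        problems.append("missing ages")
    if not df["age"].between(0, 120).all():
        problems.append("ages out of range")
    return problems


# Side-by-side summary statistics of the raw and the clean column.
comparison = pd.DataFrame({
    "raw": raw["age"].describe(),
    "clean": clean["age"].describe(),
})

print("remaining problems:", verify(clean))
print(comparison)
```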
REPORTING:
Reporting is as important as cleaning. Without a proper document that records the insights, the report cannot be read.