An easy way to guarantee the quality of your data
In this series, we’ll delve into the world of data analytics and explore real-world scenarios to learn how data quality plays a pivotal role in making informed decisions. Today, we start with the fundamental question: What is data quality and why does it matter?
Data quality refers to the accuracy, consistency, completeness, and dependability of your data.
Think of your data as the foundation of a building. Just as a solid foundation determines how long the building stands, high data quality determines how much you can trust your insights and judgements. Poor data quality can lead to incorrect conclusions, ill-advised initiatives, and missed opportunities.
Let’s put this into practice and assess data quality hands-on. Here is a sample case study you can work through.
Case Study: Mall Customers Dataset
Scenario:
Imagine you are working for a shopping mall and have access to a dataset containing customer information, including CustomerID, Gender, Age, Annual Income, and Spending Score. Your task is to perform a comprehensive analysis to better understand your customers and tailor your marketing strategies accordingly. However, the dataset has some issues that need to be addressed.
Problem:
Upon inspecting the dataset, you discover several data quality challenges:
- Missing Values: Some rows in the dataset have missing values, particularly in the Annual Income column.
- Duplicate Entries: The dataset contains duplicate customer entries, which can skew your analysis.
- Outliers: There appear to be outliers in the Spending Score column that might affect your segmentation.
To address these challenges and ensure the quality of your analysis, you follow these steps:
- Data Profiling: Using Python, you perform data profiling on the dataset. You calculate summary statistics for each column, including mean, median, and standard deviation. Additionally, you identify the percentage of missing values in the Annual Income column and explore the distribution of data points in the Age and Spending Score columns.
- Data Cleaning: Armed with insights from data profiling, you start cleaning the dataset. You handle missing values in the Annual Income column by imputing them with the median value. To deal with duplicate entries, you remove any repeated CustomerID records. To address outliers in the Spending Score column, you apply appropriate outlier detection and handling techniques.
By following these steps in Python, you demonstrate the significance of data quality and show how Python can be a powerful tool for profiling, cleaning, and validating data.
Let’s start our walkthrough.
Load the dataset
import pandas as pd
import numpy as np
data = pd.read_csv("Mall_Customers_uncleaned.csv")
data.head()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 204 non-null int64
1 Gender 204 non-null object
2 Age 202 non-null float64
3 Annual Income (k$) 200 non-null float64
4 Spending Score (1-100) 200 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 8.1+ KB
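The Data Profiling step above mentions calculating summary statistics and the percentage of missing values per column, but the original snippet jumps straight to the null counts. Here is a minimal profiling sketch of those two checks (the exact calls are my choice, not taken from the original code):
# Summary statistics (count, mean, std, quartiles) for the numerical columns
data.describe()
# Percentage of missing values in each column
print((data.isnull().mean() * 100).round(2))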
# Check for missing values in each column
data.isnull().sum()
# output
CustomerID 0
Gender 0
Age 2
Annual Income (k$) 4
Spending Score (1-100) 4
dtype: int64
# Calculate the median of each numerical column
age_median = data['Age'].median()
income_median = data['Annual Income (k$)'].median()
score_median = data['Spending Score (1-100)'].median()
# Fill missing values with the medians (assigning back avoids pandas' chained-assignment pitfalls with inplace=True)
data['Age'] = data['Age'].fillna(age_median)
data['Annual Income (k$)'] = data['Annual Income (k$)'].fillna(income_median)
data['Spending Score (1-100)'] = data['Spending Score (1-100)'].fillna(score_median)
# Check if there are any more missing values
missing_values = data.isnull().sum()
data.duplicated().sum()
# output = 4
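The plan calls for removing repeated CustomerID records, but the snippet above only counts the four duplicates. A minimal removal step, assuming CustomerID should uniquely identify a customer, could look like this:
# Drop repeated customers, keeping the first occurrence of each CustomerID
data = data.drop_duplicates(subset='CustomerID', keep='first')
# Confirm that no duplicates remain
print(data.duplicated(subset='CustomerID').sum())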
data['Gender'].describe()
# output
count 200
unique 4
top Female
freq 111
Name: Gender, dtype: object
data['Gender'].unique()
# output
array(['Male', 'Female', '0', '1'], dtype=object)
# Removing inconsistent values
valid_genders = ['Male', 'Female']
data = data[data['Gender'].isin(valid_genders)]  # assign back to data so the later steps use the filtered rows
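Filtering drops the rows with the unexpected '0' and '1' labels. An alternative, if the meaning of those numeric codes were known, would be to recode rather than drop. The mapping below is purely an assumption for illustration, so verify it against the data source before using it in place of the filter above:
# Assumed mapping for illustration only: '0' -> 'Male', '1' -> 'Female'
gender_map = {'0': 'Male', '1': 'Female'}
data['Gender'] = data['Gender'].replace(gender_map)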
import matplotlib.pyplot as plt
# Select only the numerical columns for outlier analysis
numerical_columns = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
# Set up a color palette for the box plots
colors = ['lightblue', 'lightgreen', 'lightcoral']
# Create box plots for each numerical column
plt.figure(figsize=(12, 8))
for idx, column in enumerate(numerical_columns):
    plt.subplot(1, 3, idx + 1)
    plt.boxplot(data[column], patch_artist=True, boxprops=dict(facecolor=colors[idx]))
    plt.title(f'Box Plot of {column}')
    plt.ylabel(column)
plt.tight_layout()
plt.show()
After identifying the outliers in the box plots, we remove them using the Interquartile Range (IQR) method.
# Define the column with outliers
outlier_column = 'Annual Income (k$)'
# Calculate the Interquartile Range (IQR) for outlier detection
Q1 = data[outlier_column].quantile(0.25)
Q3 = data[outlier_column].quantile(0.75)
IQR = Q3 - Q1
# Define the lower and upper bounds for outlier removal
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Remove outliers from the dataset
data = data[(data[outlier_column] >= lower_bound) & (data[outlier_column] <= upper_bound)]
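The snippet above filters only the Annual Income column, even though the scenario also flagged outliers in the Spending Score column. Here is a small sketch of the same IQR rule wrapped in a reusable helper (the function name and defaults are my own, not from the original code), which could then be applied to any numerical column:
def remove_iqr_outliers(df, column, factor=1.5):
    # Keep only rows whose value lies within [Q1 - factor*IQR, Q3 + factor*IQR]
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Example (not run here, so the counts below stay unchanged):
# data = remove_iqr_outliers(data, 'Spending Score (1-100)')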
data.info()
# output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 198 entries, 0 to 201
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CustomerID 198 non-null int64
1 Gender 198 non-null object
2 Age 198 non-null float64
3 Annual Income (k$) 198 non-null float64
4 Spending Score (1-100) 198 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 9.3+ KB
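As a final step (not shown in the original walkthrough; the output filename is just an example), you might persist the cleaned dataset for downstream analysis:
# Save the cleaned dataset (illustrative filename)
data.to_csv("Mall_Customers_cleaned.csv", index=False)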
You can visit my GitHub to see the complete code.