# An easy way to guarantee the quality of your data

In this series, we’ll delve into the world of data analytics and explore real-world scenarios to learn how data quality plays a pivotal role in making informed decisions. Today, we start with the fundamental question: What is data quality and why does it matter?

Data quality is defined as data’s dependability, accuracy, consistency, and completeness.

Consider your data to be the foundation of a building. High data quality ensures the trustworthiness of your insights and judgements, much as a good foundation provides the durability of a construction. Poor data quality can lead to incorrect conclusions, ill-advised initiatives, and missed opportunities.

Let’s practice identifying data quality issues with a sample case study.

**Case Study: Mall Customers Data**

**Scenario:**

Imagine you are working for a shopping mall and have access to a dataset containing customer information, including CustomerID, Gender, Age, Annual Income, and Spending Score. Your task is to perform a comprehensive analysis to better understand your customers and tailor your marketing strategies accordingly. However, the dataset has some issues that need to be addressed.

**Problem:**

Upon inspecting the dataset, you discover several data quality challenges:

- **Missing Values:** Some rows in the dataset have missing values, particularly in the Annual Income column.
- **Duplicate Entries:** The dataset contains duplicate customer entries, which can skew your analysis.
- **Outliers:** There appear to be outliers in the Spending Score column that might affect your segmentation.

To address these challenges and ensure the quality of your analysis, you follow these steps:

1. **Data Profiling:** Using Python, you perform data profiling on the dataset. You calculate summary statistics for each column, including mean, median, and standard deviation. Additionally, you identify the percentage of missing values in the Annual Income column and explore the distribution of data points in the Age and Spending Score columns.
2. **Data Cleaning:** Armed with insights from data profiling, you start cleaning the dataset. You handle missing values in the Annual Income column by imputing them with the median value. To deal with duplicate entries, you remove any repeated CustomerID records. To address outliers in the Spending Score column, you apply appropriate outlier detection and handling techniques.

By following these steps in Python, you showcase the significance of data quality and demonstrate how Python can be a powerful tool for data profiling, cleaning, and validation.

**Let’s start our simulation**

Load the dataset

```python
import pandas as pd
import numpy as np

data = pd.read_csv("Mall_Customers_uncleaned.csv")
data.head()
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              204 non-null    int64
 1   Gender                  204 non-null    object
 2   Age                     202 non-null    float64
 3   Annual Income (k$)      200 non-null    float64
 4   Spending Score (1-100)  200 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 8.1+ KB
```

```python
# Check for missing values
data.isnull().sum()
```

```
CustomerID                0
Gender                    0
Age                       2
Annual Income (k$)        4
Spending Score (1-100)    4
dtype: int64
```
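The profiling step also calls for the *percentage* of missing values per column; `isnull().mean()` gives the fraction directly. A minimal sketch, shown on a small illustrative frame (the values below are assumptions, not the real dataset):

```python
import pandas as pd
import numpy as np

# Illustrative frame standing in for the loaded dataset (assumed values)
data = pd.DataFrame({
    'Age': [21, np.nan, 35, 47],
    'Annual Income (k$)': [15.0, 16.0, np.nan, np.nan],
})

# isnull() gives a boolean mask; mean() of booleans is the fraction missing
missing_pct = data.isnull().mean() * 100
print(missing_pct)
```

On the real 204-row dataset the same one-liner reports roughly 2% missing in Annual Income and Spending Score.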

```python
# Calculate the median of each numerical column
age_median = data['Age'].median()
income_median = data['Annual Income (k$)'].median()
score_median = data['Spending Score (1-100)'].median()

# Fill missing values with the medians
data['Age'] = data['Age'].fillna(age_median)
data['Annual Income (k$)'] = data['Annual Income (k$)'].fillna(income_median)
data['Spending Score (1-100)'] = data['Spending Score (1-100)'].fillna(score_median)

# Check if there are any more missing values
missing_values = data.isnull().sum()
print(missing_values)
```

```python
# Count duplicate rows
data.duplicated().sum()
```

```
4
```
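The count above shows four duplicates, but the snippet only detects them; the case study also calls for removing repeated CustomerID records. One way to do that with `drop_duplicates` (the tiny frame below is an assumed illustration, not the real data):

```python
import pandas as pd

# Small frame with a repeated customer record (assumed values)
data = pd.DataFrame({
    'CustomerID': [1, 2, 2, 3],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
})

# Keep the first occurrence of each CustomerID and drop the rest
data = data.drop_duplicates(subset='CustomerID', keep='first')
print(len(data))
```

With no `subset` argument, `drop_duplicates` instead requires every column to match, which is stricter; keying on CustomerID matches the case study's definition of a duplicate.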

```python
data['Gender'].describe()
```

```
count        200
unique         4
top       Female
freq         111
Name: Gender, dtype: object
```

```python
data['Gender'].unique()
```

```
array(['Male', 'Female', '0', '1'], dtype=object)
```

```python
# Remove rows with inconsistent gender labels
valid_genders = ['Male', 'Female']
data = data[data['Gender'].isin(valid_genders)]
```

```python
import matplotlib.pyplot as plt

# Select only the numerical columns for outlier analysis
numerical_columns = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']

# Set up a color palette for the box plots
colors = ['lightblue', 'lightgreen', 'lightcoral']

# Create box plots for each numerical column
plt.figure(figsize=(12, 8))
for idx, column in enumerate(numerical_columns):
    plt.subplot(1, 3, idx + 1)
    plt.boxplot(data[column], patch_artist=True, boxprops=dict(facecolor=colors[idx]))
    plt.title(f'Box Plot of {column}')
    plt.ylabel(column)

plt.tight_layout()
plt.show()
```

After identifying the outliers, we remove them using the interquartile range (IQR) method: any value more than 1.5 × IQR below the first quartile or above the third quartile is dropped.

```python
# Define the column with outliers
outlier_column = 'Annual Income (k$)'

# Calculate the interquartile range (IQR) for outlier detection
Q1 = data[outlier_column].quantile(0.25)
Q3 = data[outlier_column].quantile(0.75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outlier removal
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers from the dataset
data = data[(data[outlier_column] >= lower_bound) & (data[outlier_column] <= upper_bound)]
```

```python
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 198 entries, 0 to 201
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              198 non-null    int64
 1   Gender                  198 non-null    object
 2   Age                     198 non-null    float64
 3   Annual Income (k$)      198 non-null    float64
 4   Spending Score (1-100)  198 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 9.3+ KB
```
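The article names validation alongside profiling and cleaning, but the simulation stops after cleaning. A lightweight sketch of post-cleaning checks, using a toy frame with the same columns (the values are assumptions for illustration):

```python
import pandas as pd

# Toy cleaned frame with the case-study columns (assumed values)
data = pd.DataFrame({
    'CustomerID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Female'],
    'Age': [21.0, 35.0, 47.0],
    'Annual Income (k$)': [15.0, 16.0, 17.0],
    'Spending Score (1-100)': [39.0, 81.0, 6.0],
})

# Post-cleaning validation: no missing values, no duplicate IDs,
# only expected gender labels, spending score within its stated range
assert data.isnull().sum().sum() == 0
assert not data['CustomerID'].duplicated().any()
assert data['Gender'].isin(['Male', 'Female']).all()
assert data['Spending Score (1-100)'].between(1, 100).all()
print("All validation checks passed")
```

Running checks like these at the end of every cleaning pipeline turns data quality from a one-off inspection into a repeatable guarantee.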

You can visit my GitHub to see the whole code