Ad Click Prediction with Machine Learning Classification
Online advertising has become a crucial part of digital marketing strategies. Advertisers pay for their ads to be shown to potential customers, but the success of an ad campaign depends on the users clicking on the ads. Therefore, predicting which ads are more likely to be clicked can help advertisers optimize their ad campaigns and save money.
In this project, we will use machine learning to predict which users are likely to click on an ad.
The business team wants to improve its digital advertising so that it reaches potential customers who will actually click on a product, without incurring excessive cost.
Our goal, then, is to develop a machine learning model that can detect users who are likely to convert or be interested in an advertisement, so that we can reduce advertising costs on digital platforms.
Here is the dataset to practice with.
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from collections import defaultdict
from warnings import filterwarnings
filterwarnings('ignore')
Load Data
# load the dataset into a dataframe
df = pd.read_csv('Ad Click Data.csv')
df.head()
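Before the EDA, it is worth checking the column types and how many values are missing, since we impute several columns later. A minimal inspection sketch:
# Column types, non-null counts, and per-column missing-value counts
df.info()
print(df.isnull().sum())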
Exploratory Data Analysis
User Distribution
Luckily, the labels in the data we will use are fairly balanced, so we do not need further preprocessing to deal with imbalanced classes.
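A quick way to verify this is to count the target labels; a minimal sketch using the Clicked on Ad column from the dataset:
# Proportion of each target label; roughly 50/50 means balanced classes
print(df['Clicked on Ad'].value_counts(normalize=True))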
Daily Internet Usage Distribution
The plot above shows the spread of daily internet usage (in minutes). One intriguing pattern stands out: users who use the internet infrequently have a higher chance of clicking on a product than those who use it heavily.
This suggests that people who use the internet infrequently tend to pay closer attention to website advertisements.
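A sketch to reproduce this kind of distribution plot, assuming a recent seaborn (0.11+):
# Daily internet usage split by whether the user clicked the ad
sns.histplot(data=df, x='Daily Internet Usage', hue='Clicked on Ad', kde=True)
plt.title('Daily Internet Usage Distribution')
plt.show()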
Daily Time Spent Distribution
Since internet usage shows such a distinct distribution, we also look at how long users interact with the website itself. The plot above shows that time spent on a website and internet usage have a similar distribution. In other words, even a brief visit to a website might yield potential users.
Internet Usage vs Time Spent on Site
Having seen that internet usage and time spent on the site are distributed similarly, we now try to determine how the two features jointly relate to the target.
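This relationship can be visualized with a scatter plot colored by the target; a minimal sketch:
# Internet usage vs time on site, colored by the target label
sns.scatterplot(data=df, x='Daily Internet Usage', y='Daily Time Spent on Site', hue='Clicked on Ad')
plt.title('Internet Usage vs Time Spent on Site')
plt.show()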
According to this plot, internet usage and time spent on a website split users into two groups: active users and non-active users.
These two groups differ strongly in whether someone chooses to click on an advertisement. As the visualization above shows, active users are less likely to click on an advertisement than inactive users.
In conclusion, we can focus our advertising on users who are not actively using the internet.
Correlation
The Pearson correlation
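A sketch of the correlation heatmap over the numeric features:
# Pearson correlation between the numeric features only
num_corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(8, 6))
sns.heatmap(num_corr, annot=True, cmap='coolwarm')
plt.show()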
Based on the correlation above, there is no multicollinearity (strong correlation between features), so we can use all of the features for modeling. However, Pearson correlation cannot tell us how each feature relates to the target, since it only captures linear relationships between numeric variables. To measure the relationship between the features and the target, we will use the Predictive Power Score (PPS) in the sections that follow.
Predictive Power Score (PPScore)
Based on the PPS matrix, we will focus solely on the scores against Clicked on Ad, since that variable is our target.
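The scores can be computed with the ppscore package (an assumption here: the graphic was built with it; install via pip install ppscore). A minimal sketch:
import ppscore as pps

# PPS of every feature as a predictor of the target
predictors = pps.predictors(df, 'Clicked on Ad')
print(predictors[['x', 'ppscore']])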
Features quite relevant to the target:
- Daily Internet Usage,
- Age,
- Area Income, and
- Daily Time Spent on Site.
This correlation graphic can serve as guidance for the modeling.
Data Preprocessing
For data preprocessing, we need clean data that can be fed into several machine learning models.
The steps we need to take are:
- Handle Missing Values
- Extract Datetime Data
- Split Target and Features
- One-Hot Encode Categorical Features
## UDF for Feature Extraction
def extract_day_of_week(time):
    # Parse the 'm/d/Y H:M' timestamp and return the weekday (0 = Monday)
    return dt.strptime(time, '%m/%d/%Y %H:%M').weekday()

def extract_day_of_month(time):
    # Day of the month (1-31)
    return dt.strptime(time, '%m/%d/%Y %H:%M').day

def extract_month(time):
    # Month number (1-12)
    return dt.strptime(time, '%m/%d/%Y %H:%M').month
Handle Missing Values
df['Daily Time Spent on Site'].fillna(df['Daily Time Spent on Site'].mean(),inplace=True)
df['Area Income'].fillna(df['Area Income'].mean(),inplace=True)
df['Daily Internet Usage'].fillna(df['Daily Internet Usage'].mean(),inplace=True)
df['Gender'].fillna(df['Gender'].mode()[0],inplace=True)
Extract Datetime Data
df['day_of_week'] = df['Timestamp'].apply(extract_day_of_week)
df['day_of_month'] = df['Timestamp'].apply(extract_day_of_month)
df['month'] = df['Timestamp'].apply(extract_month)
df = df.drop(labels=['Timestamp'],axis=1)
Split Target and Features
X = df.drop(labels=['Clicked on Ad'],axis=1)
y = np.where(df['Clicked on Ad']=='No',0,1)
Get Dummies for All Categorical Features
X_dummy = pd.get_dummies(X)
X_dummy
Build Model
Next comes the modeling stage, where we aim to build a model with high accuracy. Since the classes of our chosen target are balanced, we will use accuracy as the main metric.
The steps for modeling are as follows:
- Split the train and test datasets
- Train on the raw data (Experiment 1)
- Train on normalized data (Experiment 2)
Splitting Train and Test Dataset
X_train,X_test,y_train,y_test = train_test_split(X_dummy,y,test_size = 0.3,stratify=y,random_state = 123)
print('Train dimension:', X_train.shape)
print('Test dimension:', X_test.shape)
# Train dimension: (700, 2214)
# Test dimension: (300, 2214)
## UDF for experimenting with several classification models
def experiment(X_train, X_test, y_train, y_test):
    """
    Fit several classification models and collect their test metrics.

    Parameters
    ----------
    X_train : training data containing the features
    X_test  : testing data containing the features
    y_train : training target
    y_test  : testing target
    """
    result = defaultdict(list)
    knn = KNeighborsClassifier()
    logreg = LogisticRegression()
    dtc = DecisionTreeClassifier()
    rf = RandomForestClassifier()
    grad = GradientBoostingClassifier()
    list_model = [('K-Nearest Neighbor', knn),
                  ('Logistic Regression', logreg),
                  ('Decision Tree', dtc),
                  ('Random Forest', rf),
                  ('Gradient Boosting', grad)]
    for model_name, model in list_model:
        start = dt.now()  # time the fit to compare training cost across models
        model.fit(X_train, y_train)
        duration = (dt.now() - start).total_seconds()
        y_pred = model.predict(X_test)
        result['model_name'].append(model_name)
        result['model'].append(model)
        result['accuracy'].append(accuracy_score(y_test, y_pred))
        result['recall'].append(recall_score(y_test, y_pred))
        result['precision'].append(precision_score(y_test, y_pred))
        result['duration'].append(duration)
    return result
First Experiment
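A sketch of how the experiment UDF is invoked and tabulated (the model column holds the fitted estimator objects, so we drop it for display; the name result follows the result2 used in the evaluation later):
# Run all five models on the raw (unscaled) features
result = experiment(X_train, X_test, y_train, y_test)
pd.DataFrame(result).drop(columns='model')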
The outcome of modeling on the raw data (simple preprocessing only) is as follows. Based on these results, the decision tree classifier has the highest accuracy, while the random forest has the highest precision. Some models, such as logistic regression and k-nearest neighbors, do not achieve very good accuracy.
Second Experiment
from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler()
X_train_minmax = minmax_scaler.fit_transform(X_train)
X_test_minmax = minmax_scaler.transform(X_test)
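Rerunning the same UDF on the scaled features produces the result2 object used in the evaluation below; a sketch:
# Run all five models again on the min-max scaled features
result2 = experiment(X_train_minmax, X_test_minmax, y_train, y_test)
pd.DataFrame(result2).drop(columns='model')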
After applying the min-max scaler, we see considerable gains in several models. Based on these results, we select the random forest as the best model, since it has the highest accuracy and precision.
Evaluation
final_model = result2['model'][3]  # index 3 is the random forest in list_model
y_pred = final_model.predict(X_test_minmax)
#-------------------------------------------------
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=final_model.classes_)
fig, ax = plt.subplots(figsize=(13, 10))
disp.plot(ax=ax)  # draw on the sized axes; a bare plt.figure would be left empty
plt.show()
We now analyze our model's performance in depth, using the random forest as the final model.
The random forest yields a very good confusion matrix: the prediction errors (the off-diagonal purple cells) are very small, so accuracy, precision, and recall should all be strong, as the following results confirm.
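Those scores can be printed directly from the predictions, using the metrics already imported:
# Final test-set scores for the selected random forest
print('Accuracy :', accuracy_score(y_test, y_pred))
print('Recall   :', recall_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred))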
For complete code, you can visit my GitHub here.