Machine Learning: Classification Model
Classification is a type of supervised machine learning where the goal is to predict a categorical label (also known as a class) based on a set of features or predictors.
A classification model takes a set of input features and maps them to one of several possible outputs or class labels. The model is trained on a labeled dataset where the correct class labels are known, and then used to make predictions on new, unseen data. The model learns to associate certain feature values with specific class labels, so that when new data is presented to the model, it can predict which class the data belongs to.
There are many algorithms for building classification models, including logistic regression, k-nearest neighbors, decision trees, random forests, and support vector machines. The choice of algorithm depends on the problem, the type of data, and the computational resources available.
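As a minimal, illustrative sketch (not part of the original walkthrough), scikit-learn exposes all of these algorithms through the same fit/predict interface, so several candidates can be compared quickly with cross-validation; the synthetic dataset and every parameter below are assumptions for demonstration only:
# compare several classifiers with cross-validation (illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# small synthetic dataset, used only to compare the algorithms
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
candidates = {'logistic regression': LogisticRegression(max_iter=1000),
              'k-nearest neighbors': KNeighborsClassifier(),
              'decision tree': DecisionTreeClassifier(random_state=42),
              'random forest': RandomForestClassifier(random_state=42),
              'support vector machine': SVC()}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    print("{}: mean accuracy {:.2f}".format(name, scores.mean()))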
Classification models have several benefits for industry, including:
a. Automated predictions: Classification models can automate the process of making predictions, which can be time-consuming and error-prone when performed manually.
b. Improved accuracy: With proper training and evaluation, classification models can be highly accurate, reducing the number of incorrect predictions compared to manual methods.
c. Scalability: Classification models can handle large amounts of data, making them well suited for handling big data in industry.
d. Improved decision-making: By providing accurate and automated predictions, classification models can help improve decision-making in various industries.
Examples of applications of classification models in industry include:
- Fraud detection: Classification models can be used to identify fraudulent activities, such as credit card fraud, by detecting patterns in transaction data that are indicative of fraud.
- Customer segmentation: Classification models can be used to segment customers into different groups based on their demographics, purchase history, and other data. This information can be used to target marketing campaigns and improve customer retention.
- Medical diagnosis: Classification models can be used in the medical field to diagnose diseases based on symptoms, test results, and other data.
- Sentiment analysis: Classification models can be used in text analysis to determine the sentiment expressed in text data, such as social media posts or product reviews. This information can be used to understand customer sentiment and improve customer satisfaction.
Here are the general steps to create a classification model in machine learning:
- Define the problem: Start by defining the problem you want to solve and determining whether it is a classification problem.
- Collect and prepare the data: Gather the relevant data for your problem, and clean and preprocess the data as necessary. This may involve handling missing values, transforming variables, and scaling the data (see the short preprocessing sketch after this list).
- Split the data into training and test sets: Divide your data into two sets, one for training the model and one for testing its performance. It is important to keep the test set separate so that you can evaluate the model’s performance on unseen data.
- Choose a model: Select an appropriate model for the problem, such as logistic regression, decision trees, random forests, or support vector machines. Consider the type of data you are working with and the computational resources available.
- Train the model: Train the model on the training data by fitting the model to the data and optimizing its parameters.
- Evaluate the model: Use the test set to evaluate the performance of the model. This may involve computing metrics such as accuracy, precision, recall, and F1 score.
- Fine-tune the model: Based on the evaluation results, make any necessary changes to the model or the data to improve its performance. Repeat the training, evaluation, and fine-tuning steps until you are satisfied with the model's performance; this step is optional.
- Use the model to make predictions: Use the trained model to make predictions on new, unseen data.
- Deploy the model: Deploy the model in a production environment and monitor its performance over time. Make any necessary updates to keep the model up-to-date and accurate.
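The data-preparation step above mentions handling missing values and scaling; here is a minimal, illustrative sketch of what that can look like with scikit-learn (the tiny dataframe, the mean imputation, and the standard scaling are assumptions for demonstration, not part of the example below):
# illustrative preprocessing sketch: impute missing values, then scale
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
raw = pd.DataFrame({'feature_1': [1.0, 2.0, np.nan, 4.0],
                    'feature_2': [10.0, 12.0, 14.0, np.nan]})
# fill missing values with the column mean
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)
# scale features to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(filled), columns=filled.columns)
print(scaled)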
Code Example:
# Import Library
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# create_dataframe
df = pd.DataFrame({'categorical_column_1': ['A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E', 'A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E','A', 'B', 'C', 'D', 'E'],
'categorical_column_2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z','X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z','X', 'Y', 'Z'],
'numerical_column_1': np.random.rand(30),
'numerical_column_2': np.random.rand(30),
'numerical_column_3': np.random.rand(30),
'label': [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0]})
From the dataframe, it's clear that we want to predict, or classify, each row based on the label column, using the remaining columns as features.
# split the dataframe into X and y
X = df.drop('label', axis=1)
y = df['label']
Here we want to convert the categorical variables into dummy/indicator variables using pandas.get_dummies
# create dummies for the categorical variables in X
X = pd.get_dummies(X, columns=['categorical_column_1', 'categorical_column_2'])
Split Train & Test Data
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Train Model
We use RandomForestClassifier in this case. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting.
# train a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
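Because a random forest is an ensemble of decision trees, the fitted model also exposes the individual trees and per-feature importances; here is a small sketch of inspecting them (the printout format is our own, not from the original walkthrough):
# inspect the fitted ensemble (illustrative)
print("number of trees:", len(clf.estimators_))
for name, importance in zip(X_train.columns, clf.feature_importances_):
    print("{}: {:.3f}".format(name, importance))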
Make Predictions on the Test Data
# make predictions on the test data
y_pred = clf.predict(X_test)
Evaluation
# evaluate the model
from sklearn.metrics import accuracy_score, confusion_matrix
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
# Accuracy: 77.78%
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
# Confusion Matrix:
# [[4 2]
#  [0 3]]
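The steps above also mention precision, recall, and F1 score; a minimal sketch of computing them for the same predictions with scikit-learn's classification_report (the exact numbers will vary because the numerical columns are generated randomly):
# precision, recall, and F1 score for the same predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))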
The result looks good, which means the model has learned the pattern in the data reasonably well. If you want to improve the model, you can use hyperparameter tuning, but it's optional if you are satisfied with the result.
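If you do want to try hyperparameter tuning, here is a minimal sketch using GridSearchCV; the parameter grid below is just an assumed example, not a recommendation:
# hyperparameter tuning sketch (illustrative parameter grid)
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 3, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)
grid.fit(X_train, y_train)
print("best parameters:", grid.best_params_)
print("best cross-validation accuracy: {:.2f}%".format(grid.best_score_ * 100))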
Now it is time to use the model to predict new data without labels. If you are interested in reading and studying more about how to apply the model to new data, please visit my GitHub here.
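For completeness, here is a minimal sketch (ours, not the GitHub version) of saving the trained model and scoring new, unlabeled rows; re-encoding with get_dummies and reindexing against X.columns are assumptions made to keep the feature layout identical to training:
# sketch: persist the model and score new, unlabeled data (illustrative)
import joblib
joblib.dump(clf, 'rf_model.joblib')
model = joblib.load('rf_model.joblib')
new_data = pd.DataFrame({'categorical_column_1': ['A', 'C'],
                         'categorical_column_2': ['X', 'Z'],
                         'numerical_column_1': np.random.rand(2),
                         'numerical_column_2': np.random.rand(2),
                         'numerical_column_3': np.random.rand(2)})
# encode and align the columns to match the training features
new_X = pd.get_dummies(new_data, columns=['categorical_column_1', 'categorical_column_2'])
new_X = new_X.reindex(columns=X.columns, fill_value=0)
print(model.predict(new_X))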