Developing Classification Models for Accurate Oil Spill Detection in Ocean

Ahmad Firdaus
6 min read · Jul 8, 2024

Introduction

In this article, we investigate machine learning approaches for detecting oil spills in satellite images of ocean patches. The dataset provides a set of features extracted from those images with computer vision techniques, and we use them to classify each patch as containing an oil spill or being clean ocean. To improve predictive performance, we apply preprocessing steps such as standardisation and class rebalancing. We then train and evaluate several classification algorithms in a systematic way and select the best one. The aim is an automated approach that supports environmental monitoring by detecting oil spills in marine ecosystems so that their effects can be mitigated.

In this project, the Oil Spill Classification Dataset is employed.

The dataset was developed from satellite images of the ocean, some of which contain an oil spill and some of which do not. The images were split into sections and processed with computer vision algorithms to produce a feature vector describing the contents of each image section, or patch. The task is to predict, given the feature vector of a patch, whether that patch contains an oil spill, e.g. from the illegal or accidental dumping of oil in the ocean.

There are two classes and the goal is to distinguish between spill and non-spill using the features of a given ocean patch.

Non-Spill: negative case, or majority class.
Oil Spill: positive case, or minority class.

Dataset: https://www.kaggle.com/datasets/sudhanshu2198/oil-spill-detection/data

An initial inspection shows that the dataset contains 937 entries and 50 columns: 49 features plus the target. The columns are a mix of floating-point and integer types. Here are some key points:

  • Features: The dataset includes features such as f_1 to f_49, which appear to represent various characteristics derived from satellite images of ocean patches.
  • Target Variable: The target variable target is binary, with 0 indicating non-spill (majority class) and 1 indicating oil spill (minority class).
  • Data Quality: There are no missing values in any of the columns, which suggests the dataset is complete.
  • Class Imbalance: There is a significant class imbalance with 896 instances of non-spill (class 0) and only 41 instances of oil spill (class 1).
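
The checks behind these bullet points can be reproduced with a few pandas calls. The sketch below is only an illustration on a tiny made-up DataFrame, not the Kaggle data; the helper name `summarize_dataset` and the toy values are assumptions.

```python
import pandas as pd


def summarize_dataset(df: pd.DataFrame, target_col: str = "target") -> dict:
    """Return shape, missing-value count, and class counts for a labelled table."""
    class_counts = df[target_col].value_counts().to_dict()
    return {
        "n_rows": df.shape[0],
        "n_columns": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),
        "class_counts": class_counts,
        # Share of rows belonging to the rarest class
        "minority_share": min(class_counts.values()) / len(df),
    }


# Toy stand-in for the real data: 2 feature columns, 9 non-spill rows, 1 spill row
toy = pd.DataFrame({
    "f_1": range(10),
    "f_2": [0.5] * 10,
    "target": [0] * 9 + [1],
})
summary = summarize_dataset(toy)
print(summary)
```

With the figures reported above (937 rows, 41 of them spills), the minority share of the real dataset would be 41 / 937, or roughly 4.4%.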

The analysis proceeds through the following steps:

1. Data Standardization: Transforming data into a common format for comparison, often by scaling numerical values.
2. Model Selection: Choosing the best machine learning algorithm and its settings for a specific problem.
3. Imbalance Handling: Techniques to address unequal class distribution in datasets.
4. Model Training: Fitting a machine learning model to data to learn patterns and make predictions.
5. Best Model Selection: Choosing the model with the highest performance based on evaluation metrics.
6. Evaluation: Assessing how well a model performs on unseen data using metrics like accuracy or F1-score.

Data Standardization

from sklearn.preprocessing import StandardScaler

# Separate features and target variable
X = data.drop(columns=['target'])
y = data['target']

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
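
For intuition, `StandardScaler` replaces each feature value x with z = (x - mean) / std, using the per-column mean and the population standard deviation. A minimal sketch on made-up numbers (the array `X_demo` is hypothetical) checks this against a manual computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 4x2 feature matrix with columns on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 700.0]])

# Manual z-score: subtract the column mean, divide by the column standard deviation
# (np.std defaults to ddof=0, the same population formula StandardScaler uses)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, which matters for distance-based models such as KNN and SVM.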

Model Selection

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Define models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Evaluate models using cross-validation
model_scores = {}
for model_name, model in models.items():
    scores = cross_val_score(model, X_scaled, y, cv=5, scoring='accuracy')
    model_scores[model_name] = scores.mean()

# Print model scores
for model_name, score in model_scores.items():
    print(f'{model_name}: {score:.4f}')

Result:

Logistic Regression: 0.8985
Decision Tree: 0.8322
Random Forest: 0.9402
SVM: 0.9562
KNN: 0.9605
Gradient Boosting: 0.9156

Based on these results, KNN, SVM, and Random Forest achieved the highest cross-validated accuracy.
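
Accuracy is hard to interpret on data this imbalanced. As a reference point, a classifier that always predicts non-spill would already be right on 896 of the 937 patches. The short calculation below uses the class counts reported earlier and the cross-validation scores printed above; the variable names are illustrative.

```python
# Class counts reported for the dataset
n_non_spill = 896
n_spill = 41

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = n_non_spill / (n_non_spill + n_spill)
print(f"Majority-class baseline: {baseline_accuracy:.4f}")

# Cross-validated accuracies copied from the results above
reported = {
    "Logistic Regression": 0.8985,
    "Decision Tree": 0.8322,
    "Random Forest": 0.9402,
    "SVM": 0.9562,
    "KNN": 0.9605,
    "Gradient Boosting": 0.9156,
}

# Difference between each model and the trivial baseline
for name, score in reported.items():
    print(f"{name}: {score - baseline_accuracy:+.4f}")
```

The baseline works out to about 0.9562, so only KNN clearly exceeds it, and by a small margin. This is one reason to look at precision, recall, and F1 for the spill class rather than accuracy alone.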

Imbalance Handling

Imbalance handling refers to techniques used to address class imbalance in datasets where one class (minority class) is significantly underrepresented compared to another (majority class).

The class counts show that the data is heavily imbalanced (896 non-spill versus 41 spill instances), which needs to be addressed for better model performance. SMOTE (Synthetic Minority Over-sampling Technique) was applied to oversample the minority class. Rather than duplicating existing examples, SMOTE generates synthetic examples by interpolating between a minority sample and its nearest minority-class neighbours until the classes are balanced. The process was as follows:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance the classes
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_scaled, y)

# Check the new distribution of the target variable
new_target_distribution = pd.Series(y_resampled).value_counts()
print(new_target_distribution)

# Plot the new distribution of the target variable
plt.figure(figsize=(6, 4))
sns.countplot(x=y_resampled)
plt.title('Distribution of Target Variable after SMOTE')
plt.show()
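
The core idea of SMOTE is interpolation: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only that step with NumPy on made-up points. It is a simplification for intuition, not imbalanced-learn's implementation, and the function name `interpolate_sample` is hypothetical.

```python
import numpy as np


def interpolate_sample(x, neighbour, rng):
    """Create one synthetic point between x and a minority-class neighbour."""
    lam = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + lam * (neighbour - x)


rng = np.random.default_rng(42)

# Three hypothetical minority-class points in a 2-D feature space
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0]])

x = minority[0]
# SMOTE picks at random among the k nearest neighbours;
# for brevity this sketch takes the single nearest one
distances = np.linalg.norm(minority[1:] - x, axis=1)
neighbour = minority[1:][np.argmin(distances)]

synthetic = interpolate_sample(x, neighbour, rng)
print(synthetic)
```

Because the new point lies between two existing minority samples, it is a new example rather than a copy of either one.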

Model Training

Given the cross-validation results, we focused on training the top-performing models: SVM, KNN, and Random Forest. The goal of this step is to determine which model performs best on the balanced dataset for the oil spill classification task.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Split the resampled data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Define the models to be trained
selected_models = {
'Random Forest': RandomForestClassifier(random_state=42),
'SVM': SVC(random_state=42),
'KNN': KNeighborsClassifier()
}

# Train and evaluate each model
for model_name, model in selected_models.items():
    # Train the model
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate the model
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Print the evaluation metrics
    print(f'{model_name} Model Evaluation')
    print('---------------------------------')
    print(f'Accuracy: {accuracy:.4f}')
    print(f'Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}')
    print(f'F1 Score: {f1:.4f}')
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    print('-' * 30)

Result:

Random Forest Model Evaluation
---------------------------------
Accuracy: 0.9916
Precision: 0.9835
Recall: 1.0000
F1 Score: 0.9917
Confusion Matrix:
[[177   3]
 [  0 179]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       180
           1       0.98      1.00      0.99       179

    accuracy                           0.99       359
   macro avg       0.99      0.99      0.99       359
weighted avg       0.99      0.99      0.99       359

------------------------------
SVM Model Evaluation
---------------------------------
Accuracy: 0.9721
Precision: 0.9519
Recall: 0.9944
F1 Score: 0.9727
Confusion Matrix:
[[171   9]
 [  1 178]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.95      0.97       180
           1       0.95      0.99      0.97       179

    accuracy                           0.97       359
   macro avg       0.97      0.97      0.97       359
weighted avg       0.97      0.97      0.97       359

------------------------------
KNN Model Evaluation
---------------------------------
Accuracy: 0.9582
Precision: 0.9227
Recall: 1.0000
F1 Score: 0.9598
Confusion Matrix:
[[165  15]
 [  0 179]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.92      0.96       180
           1       0.92      1.00      0.96       179

    accuracy                           0.96       359
   macro avg       0.96      0.96      0.96       359
weighted avg       0.96      0.96      0.96       359

------------------------------
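
The summary metrics follow directly from each confusion matrix. With rows as actual classes and columns as predicted classes (the layout scikit-learn's `confusion_matrix` returns), the calculation below reproduces the Random Forest figures. The helper name `metrics_from_confusion` is illustrative.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Random Forest confusion matrix from the output above: [[177, 3], [0, 179]]
acc, prec, rec, f1 = metrics_from_confusion(tn=177, fp=3, fn=0, tp=179)
print(f"Accuracy:  {acc:.4f}")   # 0.9916
print(f"Precision: {prec:.4f}")  # 0.9835
print(f"Recall:    {rec:.4f}")   # 1.0000
print(f"F1 Score:  {f1:.4f}")    # 0.9917
```

Plugging in the SVM or KNN matrices the same way gives their reported values, for example a KNN precision of 179 / (179 + 15).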

Based on these results, the Random Forest model had the highest accuracy, precision, and F1 score, and it tied with KNN for perfect recall, making it the best-performing model among the three. The Random Forest is therefore refit on the training set and evaluated in more detail.

# Train the Random Forest model on the full training set
best_model = RandomForestClassifier(random_state=42)
best_model.fit(X_train, y_train)

# Predict on the test set
y_pred = best_model.predict(X_test)

# Evaluate the best model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

# Print the evaluation metrics
print('Best Model (Random Forest) Evaluation')
print('-------------------------------------')
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Plotting the Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues", xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Random Forest Model')
plt.show()

Result:

Best Model (Random Forest) Evaluation
-------------------------------------
Accuracy: 0.9916
Precision: 0.9835
Recall: 1.0000
F1 Score: 0.9917
Confusion Matrix:
[[177   3]
 [  0 179]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       180
           1       0.98      1.00      0.99       179

    accuracy                           0.99       359
   macro avg       0.99      0.99      0.99       359
weighted avg       0.99      0.99      0.99       359

The Random Forest model was highly effective at distinguishing oil spill patches from non-spill patches in the balanced dataset. It had:

  • High Accuracy: 99.16% of the predictions are correct (356 of 359 test samples).
  • High Precision: 98.35%, with only 3 false positives among the 180 non-spill test samples.
  • Perfect Recall for Positive Class: All 179 positive instances are correctly identified.
  • High F1 Score: 0.9917, indicating a balanced and strong performance across precision and recall.

The model's performance metrics and the confusion matrix show that it is well-suited for the oil spill classification task on the SMOTE-balanced test set, with only 3 of 359 test samples misclassified.

Ahmad Firdaus

Data science enthusiast passionate about uncovering insights and solving complex problems. Background in mathematics from Kyushu University. Skilled in Python, SQL, and Tableau.