Building an Email Spam Detection Model in Python

TechsBucket, November 23, 2024

Email spam detection is a fascinating application of machine learning that impacts nearly everyone using email. Most of us rely on spam filters daily, but have you ever wondered how these filters actually work? In this tutorial, we’ll dive deep into the process of building an email spam detection model in Python, using a clean dataset, modern libraries, and machine learning techniques.

Introduction

Spam emails are not just an annoyance—they’re a significant cybersecurity concern. Platforms like Gmail and Outlook use highly advanced machine learning algorithms to separate spam from legitimate emails. Today, you’ll learn how to create a similar spam detection system. With Python and a dataset of labeled emails, we’ll train a machine learning model to achieve high accuracy.

By the end of this guide, you’ll have a functioning email spam detection model and insights into how these systems operate.

Prerequisites

Before diving into the code, ensure your system is ready with the following installed:

Python: Download and install Python from the official website.
Jupyter Notebook: Included in the Anaconda distribution, which you can get here.
Editor: Any IDE like Visual Studio Code or PyCharm works.

Additionally, ensure familiarity with:

Basic Python programming
Fundamental machine learning concepts

We’ll also use the following libraries and scikit-learn classes:

pandas and numpy for data manipulation
scikit-learn for machine learning
CountVectorizer (from scikit-learn) for text preprocessing
MultinomialNB (from scikit-learn) for the model

Problem Statement

Our goal is to build a machine learning model that categorizes emails as spam or ham (not spam). Using a labeled dataset, we’ll preprocess the data, train a model, and evaluate its performance. The end product will allow real-time classification of emails.

Dataset

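The dataset for this project has two columns that we use: a label (spam or ham) and the message text. In the raw file they are named v1 and v2, and the loading code renames them to Category and Message. You can download the dataset here. Ensure that the data is clean, with no null values or inconsistencies.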

Building the Model

Let’s get started with the Python implementation. Below is a step-by-step breakdown.

Importing Libraries

Start by importing the necessary libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

Loading and Inspecting the Dataset

Load the dataset into a Pandas DataFrame and inspect its structure.

emailData = pd.read_csv("spam.csv", encoding='latin-1')
emailData = emailData[['v1', 'v2']]
emailData.columns = ['Category', 'Message']
emailData.head(10)

Output: A table with two columns—Category (spam or ham) and Message.
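If you want to try the loading step without the real file, the sketch below reads a tiny in-memory CSV instead. The three rows and the extra `Unnamed: 2` column are made up for illustration, and `usecols` plus `rename` is simply one alternative way to select and rename the two columns, not the approach used in the tutorial code.

```python
import io

import pandas as pd

# Hypothetical stand-in for spam.csv: same v1/v2 headers, invented rows,
# plus an extra empty column that we do not want to keep.
csv_text = (
    "v1,v2,Unnamed: 2\n"
    "ham,See you at lunch,\n"
    "spam,WIN a free prize now,\n"
    "ham,Are you home?,\n"
)

# usecols keeps only the label and text columns while reading.
sample = pd.read_csv(io.StringIO(csv_text), usecols=["v1", "v2"])

# rename() maps the old names to the new ones explicitly.
sample = sample.rename(columns={"v1": "Category", "v2": "Message"})

print(sample.shape)
print(sample.head())
```

Renaming with a dictionary keeps working even if the column order in the file changes, which assigning to `columns` directly does not guarantee.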

Data Cleaning

Check for missing values and clean the dataset.

print(emailData.shape)
print(emailData.isnull().sum())

Transform the Category column into numerical values:

• Spam → 1
• Ham → 0

emailData['Spam'] = emailData.Category.map({'spam': 1, 'ham': 0})
emailData.drop('Category', axis=1, inplace=True)
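One thing worth checking after the mapping step: `map` with a dictionary returns NaN for any label that is not a key, so a stray "Spam " with a capital letter or trailing space would quietly become a missing value. Here is a minimal sketch of that check on invented rows. The toy DataFrame and the normalization step are assumptions for illustration and may not be needed for your copy of the dataset.

```python
import pandas as pd

# Toy rows (invented); the last label is deliberately messy.
toy = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "Spam "],
    "Message": ["See you at 5", "WIN a prize now", "Lunch?", "Free entry!!"],
})

mapping = {"spam": 1, "ham": 0}

# Mapping the raw labels leaves NaN wherever the label is not a dictionary key.
raw_unmapped = int(toy["Category"].map(mapping).isnull().sum())

# Stripping whitespace and lowercasing first avoids that.
clean_labels = toy["Category"].str.strip().str.lower()
toy["Spam"] = clean_labels.map(mapping)
unmapped = int(toy["Spam"].isnull().sum())

# Class counts show how balanced (or not) the labels are.
class_counts = toy["Spam"].value_counts().to_dict()

print("Unmapped before cleanup:", raw_unmapped)
print("Unmapped after cleanup:", unmapped)
print("Class counts:", class_counts)
```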

Splitting the Dataset

Split the data into training (70%) and testing (30%) sets.

X = emailData.Message
y = emailData.Spam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
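Spam datasets are often imbalanced, so a plain random split can leave the two sets with different spam ratios. As a variation on the split above (not what the tutorial code does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both sets. The sketch below uses invented data with 20 ham and 10 spam messages.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: 20 ham and 10 spam messages.
messages = [f"ham message {i}" for i in range(20)] + [f"spam offer {i}" for i in range(10)]
labels = [0] * 20 + [1] * 10
X = pd.Series(messages)
y = pd.Series(labels)

# stratify=y asks for the same spam/ham ratio in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# The mean of a 0/1 label column is the share of spam.
train_spam_share = y_train.mean()
test_spam_share = y_test.mean()

print("Train size:", len(X_train), "Test size:", len(X_test))
print("Spam share (train):", round(train_spam_share, 3))
print("Spam share (test):", round(test_spam_share, 3))
```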

Text Preprocessing

Transform text data into a numerical format using CountVectorizer. This converts email content into a matrix of token counts, ignoring common stopwords.

vect = CountVectorizer(stop_words='english')
vect.fit(X_train)

X_train_transformed = vect.transform(X_train)
X_test_transformed = vect.transform(X_test)
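To see what the token-count matrix looks like, here is a small sketch on three made-up sentences. The sentences are assumptions chosen so the result is easy to read; the vectorizer settings match the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences for illustration.
train_docs = [
    "free prize claim the free prize",
    "you and the team meeting",
]
new_docs = ["claim the lottery prize"]

vect = CountVectorizer(stop_words="english")
vect.fit(train_docs)   # learns the vocabulary from the training text only

# Columns of the count matrix follow the alphabetically sorted vocabulary.
vocabulary = sorted(vect.vocabulary_)
train_counts = vect.transform(train_docs).toarray()

# Words never seen during fit ("lottery") are ignored at transform time.
new_counts = vect.transform(new_docs).toarray()

print(vocabulary)
print(train_counts)
print(new_counts)
```

Each row is one message and each column is one vocabulary word. Stop words such as "the", "you", and "and" are dropped, and fitting only on the training text means the test set cannot leak words into the vocabulary.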

Training the Model

Use the Multinomial Naive Bayes algorithm, a popular choice for text classification tasks.

mnb = MultinomialNB()
mnb.fit(X_train_transformed, y_train)
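If you are curious what the model learns during `fit`, the sketch below recomputes the per-class word probabilities by hand and compares them with the fitted model. The count matrix is invented, and the hand calculation assumes the default smoothing value `alpha=1.0` (Laplace smoothing).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Invented token counts: rows are messages, columns are three vocabulary words.
X = np.array([
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 3],
    [0, 0, 2],
])
y = np.array([0, 0, 1, 1])   # 0 = ham, 1 = spam

mnb = MultinomialNB(alpha=1.0)   # alpha=1.0 is the default
mnb.fit(X, y)

# By hand: P(word | class) = (word count + alpha) / (total count + alpha * vocabulary size)
manual = []
for cls in (0, 1):
    counts = X[y == cls].sum(axis=0)
    manual.append((counts + 1.0) / (counts.sum() + 1.0 * X.shape[1]))
manual = np.array(manual)

# The model stores these probabilities as logarithms.
matches = np.allclose(np.exp(mnb.feature_log_prob_), manual)

# A message containing only the third word looks like spam to this toy model.
prediction = mnb.predict(np.array([[0, 0, 1]]))

print("Manual probabilities match the model:", matches)
print("Prediction:", prediction)
```

Smoothing is what keeps a word that never appeared in one class from forcing that class's probability to zero.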

Evaluating the Model

Predict on the test data and calculate accuracy.

y_pred = mnb.predict(X_test_transformed)
# scikit-learn metrics take the true labels first, then the predictions
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
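Accuracy alone can look better than it is when most messages are ham. As a small addition to the evaluation step, the sketch below shows the confusion matrix, precision, and recall on invented labels where the model misses one of two spam messages. The label lists are assumptions for illustration, not results from the real dataset.

```python
from sklearn import metrics

# Invented labels: 8 ham, 2 spam; the model misses one spam message.
y_test = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

accuracy = metrics.accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred)

# Precision: of the messages flagged as spam, how many really were spam?
precision = metrics.precision_score(y_test, y_pred)

# Recall: of the real spam messages, how many did the model catch?
recall = metrics.recall_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Confusion matrix:")
print(cm)
print("Precision:", precision)
print("Recall:", recall)
```

Here accuracy is 90% even though half of the spam got through, which is why recall is worth checking for a spam filter.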

Testing with Real Emails

Test the model with real-world examples.

emails = [
    'Sounds great! Are you home now?',
    'Will u meet ur dream partner soon? Txt HORO followed by ur star sign.',
    'Hello, I need your urgent help to claim $12M. Contact me ASAP.'
]

email_1 = vect.transform(emails)
print(mnb.predict(email_1))

Output: [0 1 1], indicating ham, spam, and spam, respectively.
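For new messages it is often handy to see how confident the model is, not just the label. The sketch below wraps the vectorizer and classifier in a scikit-learn pipeline and adds a small helper. The training sentences, the `classify` function, and its `threshold` parameter are all hypothetical, introduced here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set: 0 = ham, 1 = spam.
train_messages = [
    "are you home now",
    "see you at lunch tomorrow",
    "meeting moved to monday",
    "win a free prize now",
    "claim your free cash prize",
    "urgent claim cash reward",
]
train_labels = [0, 0, 0, 1, 1, 1]

# The pipeline vectorizes raw text and classifies it in a single call.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_messages, train_labels)


def classify(messages, threshold=0.5):
    """Hypothetical helper: return (label, spam probability) for each message."""
    # Column 1 of predict_proba holds the probability of class 1 (spam).
    spam_probs = model.predict_proba(messages)[:, 1]
    return [
        ("spam" if p >= threshold else "ham", round(float(p), 3))
        for p in spam_probs
    ]


results = classify(["free cash prize", "lunch on monday"])
print(results)
```

Raising `threshold` makes the helper flag fewer messages as spam, which trades some missed spam for fewer legitimate emails landing in the spam folder.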

Results and Observations

Our email spam detection model achieved an accuracy of about 98% on the held-out test set. The Naive Bayes algorithm proved highly effective for this text classification task, and on the three sample messages above the model labeled both spam examples correctly.

We successfully built a spam detection model in Python, achieving a high accuracy rate. By understanding the step-by-step process, you can now experiment with different datasets and algorithms to further enhance the model. This is a practical demonstration of machine learning’s power to solve everyday problems.
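As a starting point for that kind of experiment, here is a rough sketch that compares two classifiers with cross-validation. The twelve messages are invented and far too few for meaningful scores, and logistic regression is just one possible alternative, chosen here as an assumption. Swap in the real dataset to get numbers you can trust.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy data: 6 ham and 6 spam messages.
messages = [
    "are you home now",
    "see you at lunch tomorrow",
    "meeting moved to monday",
    "call me when you land",
    "dinner at my place tonight",
    "thanks for the notes from class",
    "win a free prize now",
    "claim your free cash prize",
    "urgent claim cash reward",
    "free entry win cash today",
    "txt now to claim your prize",
    "congratulations you won free cash",
]
labels = [0] * 6 + [1] * 6

candidates = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Keeping the vectorizer inside the pipeline refits the vocabulary on each fold.
    pipeline = make_pipeline(CountVectorizer(stop_words="english"), estimator)
    scores = cross_val_score(pipeline, messages, labels, cv=3, scoring="accuracy")
    mean_scores[name] = float(scores.mean())

for name, score in mean_scores.items():
    print(f"{name}: mean accuracy {score:.2f}")
```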
