Unleash the Power of Text Classification with NLP

Turn Python into a Spam-Fighting Machine! 🛡️📧


A Beginner's Guide! 📝🌟

Hey there, future data wizards! 👋

Ready to dive into the fascinating world of Natural Language Processing (NLP) and learn how to classify text like a pro? In this tutorial, we'll explore text classification, a powerful technique that allows computers to automatically analyse and categorise text data. Whether you're interested in spam detection, sentiment analysis, or topic categorisation, understanding text classification is an essential skill in the realm of machine learning.

Let's get started on this exciting journey!

 

Why Learn Text Classification? 🤔

 

Text classification is everywhere in our digital world:

  • Spam Filtering: Identifying which emails are junk and which are important.

  • Sentiment Analysis: Understanding the mood or opinion expressed in text (e.g., positive or negative reviews).

  • Topic Classification: Categorising news articles, tweets, or customer feedback into relevant topics.

What You'll Learn 🔍

  • Preprocessing Text: Cleaning and preparing text data for analysis.

  • Feature Extraction: Converting text into numerical features that machine learning algorithms can understand.

  • Building a Classifier: Using machine learning models (like Naive Bayes or Support Vector Machines) to classify text.

  • Evaluating Performance: Assessing how well your classifier performs using metrics like accuracy and precision.

Our Project: Classifying Spam vs. Ham Messages 📧📊

For this tutorial, we'll focus on classifying messages as spam (unwanted messages) or ham (legitimate messages). Our goal is to build a classifier that can automatically detect spam messages based on their content.

 

Step-by-Step Guide to Text Classification with Python 🐍

 

1. Import Necessary Libraries 📚

First, we need to import the libraries that will help us process the text data and build our machine learning model.

 

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split

from sklearn import metrics

import pandas as pd

import urllib.request

import zipfile

import os

Explanation

  •  nltk: A powerful library for natural language processing.

  •  sklearn: A machine learning library for Python.

  •  pandas: A data manipulation library for handling datasets.

  •  urllib.request and zipfile: Libraries for downloading and extracting data files.

  •  os: A library to interact with the operating system.

 

2. Download and Extract the Dataset 📄

Next, we download the dataset, which is in a ZIP file, and extract its contents.

 

# Define the URL and the path to save the downloaded ZIP file

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

zip_path = "smsspamcollection.zip"

 

# Download the ZIP file

urllib.request.urlretrieve(url, zip_path)

 

# Extract the ZIP file

with zipfile.ZipFile(zip_path, 'r') as zip_ref:

    zip_ref.extractall()

 

# The extracted file name

file_name = "SMSSpamCollection"

 

# Read the extracted file into a pandas DataFrame

df = pd.read_csv(file_name, sep='\t', names=['label', 'message'], header=None)

 

# Display the first few rows

print(df.head())

Explanation

  • Step 1: We specify the URL of the dataset and the path where we want to save it.

  • Step 2: We download the ZIP file from the URL.

  • Step 3: We extract the contents of the ZIP file.

  • Step 4: We read the extracted file into a DataFrame so we can manipulate and analyse it.
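A small optional tweak (not in the original script): the os module we imported earlier lets us skip the download when the extracted file is already on disk.

# Optional: only download and extract if the data file isn't already present
if not os.path.exists("SMSSpamCollection"):
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall()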

 

3. Preprocess the Data 🧼

Preprocessing the text data involves cleaning it up and preparing it for analysis.

 

# Convert labels to a binary format

df['label'] = df['label'].map({'ham': 0, 'spam': 1})

 

# Tokenize and remove stop words

nltk.download('punkt')
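nltk.download('punkt_tab')  # newer NLTK releases also look for this tokenizer resource; harmless on older ones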

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

 

def preprocess_text(text):

    tokens = word_tokenize(text.lower())

    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

    return ' '.join(filtered_tokens)

 

df['processed_message'] = df['message'].apply(preprocess_text)

Explanation

  • Step 1: Convert the labels to a binary format (0 for ham, 1 for spam).

  • Step 2: Download the necessary NLTK resources for tokenizing text and removing stop words.

  • Step 3: Define a function preprocess_text to clean the text by:

    • Converting it to lowercase.

    • Tokenizing it (splitting it into words).

    • Removing stop words and non-alphanumeric tokens.

  • Step 4: Apply this preprocessing function to each message in the dataset.
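To see what the preprocessing does, here's a quick check on a made-up message (the text below is our own example, not from the dataset):

sample = "Congratulations! You have won a FREE ticket. Reply WIN to claim."
print(preprocess_text(sample))
# Prints roughly: congratulations free ticket reply win claim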

 

What are stop words?

Stop words are common words that are often removed in natural language processing tasks because they usually don't carry significant meaning and can clutter the analysis. Here are some examples of stop words in English:

  • a

  • an

  • and

  • are

  • as

  • at

  • be

  • by

  • for

  • from

  • has

  • have

  • he

  • in

  • is

  • it

  • its

  • of

  • on

  • that

  • the

  • to

  • was

  • were

  • will

  • with

These words are frequently used in sentences but typically don't provide much information about the content or context. Removing them helps in focusing on the more meaningful words, which can improve the performance of text analysis tasks like classification and clustering.
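If you're curious, you can inspect the list NLTK actually uses (the exact contents vary a little between NLTK versions):

print(len(stop_words))          # around 179 words in recent NLTK releases
print(sorted(stop_words)[:10])  # a small alphabetical sample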

 

4. Split the Dataset into Training and Testing Sets 📂

We split our dataset into two parts: one for training the model and one for testing its performance.

 

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(

    df['processed_message'], df['label'], test_size=0.2, random_state=42)

Explanation

  • Step 1: Use the train_test_split function to split the data into training (80%) and testing (20%) sets.

  • Step 2: X_train and X_test contain the processed messages, while y_train and y_test contain the labels.
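Because spam is the minority class, it's worth confirming that both splits keep a similar ham/spam mix. This quick check is an optional extra, not part of the original script:

# Show the proportion of ham (0) and spam (1) in each split
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))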

 

5. Build and Train the Classifier 🤖

Now, we create a machine learning pipeline to build and train our classifier.

 

# Create a pipeline that combines feature extraction and classification

classifier = make_pipeline(

    TfidfVectorizer(),

    MultinomialNB()

)

 

# Train the classifier

classifier.fit(X_train, y_train)

 

Explanation

  • Step 1: Create a pipeline that combines:

    •  TfidfVectorizer(): Converts text data into numerical features using Term Frequency-Inverse Document Frequency.

    •  MultinomialNB(): A Naive Bayes classifier for multinomially distributed data.

  • Step 2: Train the classifier using the training data (X_train and y_train).
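To build some intuition for what TfidfVectorizer produces, here is a tiny standalone sketch on two toy sentences (our own example, separate from the spam pipeline; get_feature_names_out needs scikit-learn 1.0 or newer):

demo = TfidfVectorizer()
matrix = demo.fit_transform(["free prize call now", "see you at lunch"])
print(demo.get_feature_names_out())  # the vocabulary learned from the two sentences
print(matrix.toarray().round(2))     # one row per sentence, one TF-IDF weight per word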

 

6. Evaluate the Model Performance 📊

Finally, we test our model on the testing data and evaluate its performance.

 

# Predict the labels for the test set

predicted = classifier.predict(X_test)

 

# Evaluate the accuracy of the classifier

accuracy = metrics.accuracy_score(y_test, predicted)

print(f"Accuracy: {accuracy:.2f}")

 

# Display the confusion matrix

confusion_matrix = metrics.confusion_matrix(y_test, predicted)

print("Confusion Matrix:\n", confusion_matrix)

 

# Display the classification report

classification_report = metrics.classification_report(y_test, predicted)

print("Classification Report:\n", classification_report)

Explanation 

  • Step 1: Predict the labels for the test set using the trained classifier.

  • Step 2: Calculate the accuracy of the model by comparing the predicted labels to the actual labels (y_test).

  • Step 3: Display the confusion matrix, which shows the number of correct and incorrect predictions.

  • Step 4: Display the classification report, which provides precision, recall, and F1-score for each class.
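Once trained, the pipeline can score brand-new text as well. The two messages below are invented examples; note that they go through the same preprocess_text step the training data saw:

new_messages = ["You have won a free prize! Call now!",
                "Are we still meeting for lunch today?"]
cleaned = [preprocess_text(m) for m in new_messages]
print(classifier.predict(cleaned))  # expect something like [1 0]: spam, then ham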

 

Output


The output you're seeing is a result of running a machine learning model to classify messages as either "ham" (non-spam) or "spam".

Here's a breakdown:

  • label and message: These are the columns of data your model is working with. The label column contains the actual classification of each message (ham or spam) and the message column contains the text of the message itself.

  • [nltk_data] ...: These lines indicate that your code is downloading the NLTK resources needed to analyse the text in the messages: "punkt", the tokenizer used to break text into sentences and words, and "stopwords", the list of common words that don't carry much meaning (like "the," "a," "an").

  • Accuracy: 0.97: This means your model correctly classified 97% of the messages in the test set.

  • Confusion Matrix: The confusion matrix helps visualise how well your model performed. It shows how many messages were correctly classified as "ham" or "spam" and how many were misclassified.

  • [[966 0] [ 31 118]]:

    • The top left value (966) means 966 ham messages were correctly classified as ham.

    • The top right value (0) means no ham messages were incorrectly classified as spam.

    • The bottom left value (31) means 31 spam messages were incorrectly classified as ham (false negatives).

    • The bottom right value (118) means 118 spam messages were correctly classified as spam.

  • Classification Report: This report provides more detailed metrics about the performance of your model:

    • precision: Measures how often the model is right when it predicts a given class. With the confusion matrix above, spam precision is actually 1.00 (118 correct out of 118 spam predictions), since no ham message was flagged as spam; the 0.97 figure is the ham precision (966 correct out of 997 ham predictions).

    • recall: Measures how well the model finds all messages of a class. A spam recall of 0.79 means it caught 79% of all actual spam messages (118 of the 149 in the test set); the arithmetic is spelled out in the short snippet after this list.

    • f1-score: A combined metric that takes both precision and recall into account. It's a good overall measure of the model's effectiveness.

    • support: The number of messages in each category (ham and spam).
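Here's that spam precision and recall arithmetic, reading the numbers straight off the confusion matrix above:

# precision (spam) = true positives / (true positives + false positives)
precision_spam = 118 / (118 + 0)
print(precision_spam)         # 1.0: every message flagged as spam really was spam

# recall (spam) = true positives / (true positives + false negatives)
recall_spam = 118 / (118 + 31)
print(round(recall_spam, 2))  # 0.79: the model misses 31 of the 149 actual spam messages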

 

In summary, this output shows that your machine learning model is performing quite well at classifying spam messages. Accuracy is high and spam precision is perfect, so there are no false positives (ham messages classified as spam); the main room for improvement is the 31 false negatives (spam messages classified as ham), which is why spam recall sits at 0.79.

 

Putting It All Together 🧩

 

import nltk

from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import make_pipeline

from sklearn.model_selection import train_test_split

from sklearn import metrics

import pandas as pd

import urllib.request

import zipfile

import os

 

# Define the URL and the path to save the downloaded ZIP file

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"

zip_path = "smsspamcollection.zip"

 

# Download the ZIP file

urllib.request.urlretrieve(url, zip_path)

 

# Extract the ZIP file

with zipfile.ZipFile(zip_path, 'r') as zip_ref:

    zip_ref.extractall()

 

# The extracted file name

file_name = "SMSSpamCollection"

 

# Read the extracted file into a pandas DataFrame

df = pd.read_csv(file_name, sep='\t', names=['label', 'message'], header=None)

 

# Display the first few rows

print(df.head())

 

# Convert labels to a binary format

df['label'] = df['label'].map({'ham': 0, 'spam': 1})

 

# Tokenize and remove stop words

nltk.download('punkt')
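nltk.download('punkt_tab')  # newer NLTK releases also look for this tokenizer resource; harmless on older ones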

nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

 

def preprocess_text(text):

    tokens = word_tokenize(text.lower())

    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

    return ' '.join(filtered_tokens)

 

df['processed_message'] = df['message'].apply(preprocess_text)

 

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(

    df['processed_message'], df['label'], test_size=0.2, random_state=42)

 

# Create a pipeline that combines feature extraction and classification

classifier = make_pipeline(

    TfidfVectorizer(),

    MultinomialNB()

)

 

# Train the classifier

classifier.fit(X_train, y_train)

 

# Predict the labels for the test set

predicted = classifier.predict(X_test)

 

# Evaluate the accuracy of the classifier

accuracy = metrics.accuracy_score(y_test, predicted)

print(f"Accuracy: {accuracy:.2f}")

 

# Display the confusion matrix

confusion_matrix = metrics.confusion_matrix(y_test, predicted)

print("Confusion Matrix:\n", confusion_matrix)

 

# Display the classification report

classification_report = metrics.classification_report(y_test, predicted)

print("Classification Report:\n", classification_report)

 

Conclusion 🌟

Congratulations! You've successfully built a text classifier for spam detection using Python and NLP techniques. You've learned how to preprocess text data, extract features, build a classifier, and evaluate its performance. Text classification opens the door to endless possibilities in analysing and understanding textual data.

Coding with a Smile 🤣😂

The Debugger Detective: Debugging is like being a detective in a crime movie where you're also the murderer. You have to investigate your own code, interrogate it, and finally solve the mystery of the missing semicolon or misbehaving loop. Just call yourself Sherlock Code!


What's Next 📅

In our next post, we'll dive into more advanced NLP techniques, such as named entity recognition and text summarisation. Get ready to explore how to make sense of even more complex textual data. Stay tuned and keep exploring the exciting world of data science and machine learning! 🚀📈

Ready for More Python Fun? 📬

Subscribe to our newsletter now and get a free Python cheat sheet! 📑 Dive deeper into Python programming with more exciting projects and tutorials designed just for beginners.

Keep exploring, keep coding, 👩‍💻👨‍💻 and enjoy your journey into artificial intelligence, machine learning, data analytics, data science and more with Python!

Stay tuned for our next exciting project in the following edition!

Happy coding! 🚀📊✨