CodeCraft by Dr. Christine Lee
Discover the Secret to Understanding Emotions

Python Sentiment Analysis Made Easy!

🌟 Dive into Machine Learning with Python: Sentiment Analysis! 📊

Ready to explore another exciting area of machine learning? Today, we’re going to venture into sentiment analysis! Ever wondered how companies determine public sentiment on social media? Or how movie reviews are categorised as positive or negative? Let’s unravel this magic with a fun and easy Python project! 🚀

What is Sentiment Analysis? 🧐

Sentiment analysis is a technique used to determine whether a piece of text is positive, negative, or neutral. It’s like teaching your computer to understand human emotions! ❤️😠

Let's Code! 💻

We'll be using the nltk and scikit-learn libraries in Python. If you don’t have them installed yet, you can get them by running:

pip install nltk scikit-learn

Step-by-Step Guide 🛠️

1. Importing Libraries 📦

First, we need to import the necessary libraries.

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Explanation

Here, we import the necessary libraries:

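nltk for natural language processing.
CountVectorizer and MultinomialNB from scikit-learn for text vectorization and Naive Bayes classification. The steps below train NLTK's own Naive Bayes classifier instead, so these two are only needed if you want to try the scikit-learn route later.
train_test_split to split the dataset.
accuracy_score to evaluate a scikit-learn model. This walkthrough measures accuracy with nltk.classify.accuracy instead.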

2. Downloading NLTK Data 📥

We need to download some data for NLTK to work properly.

nltk.download('movie_reviews')
nltk.download('punkt')

Explanation

We download essential datasets from NLTK:

movie_reviews: A collection of movie reviews used for training and testing the sentiment analysis model.
punkt: Pre-trained tokenizer models for splitting text into sentences. The code below reads the reviews as ready-made word lists, so punkt only comes into play if you tokenize your own raw text.

3. Loading the Dataset 📚

We’ll use a dataset of movie reviews from NLTK.

from nltk.corpus import movie_reviews
import random

# Build one (word list, category) pair per review file
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

Explanation

We load the movie_reviews dataset one category at a time, then shuffle the documents so that positive and negative reviews are mixed together rather than grouped by category:

movie_reviews.words(fileid) gets the words from a review.
movie_reviews.categories() returns the categories (positive/negative).
movie_reviews.fileids(category) gets the file IDs for a specific category.

4. Preparing the Data 🔧

We’ll prepare the data for our machine learning model.

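# Count how often each lowercased word appears across all reviews
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Keep the 2000 most frequent words as the feature vocabulary
word_features = [word for word, count in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership checks are fast
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]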

Explanation

We prepare the data:

all_words computes the frequency distribution of words in the reviews.
word_features selects the top 2000 most frequent words as features.
document_features function checks which of the top words are present in each document.
featuresets applies the document_features function to each document.
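
To see what these feature dictionaries look like without downloading anything, here is a minimal sketch on a made-up corpus. The toy reviews and the names toy_documents, toy_word_features and toy_document_features are invented for illustration, and Counter stands in for nltk.FreqDist.

```python
from collections import Counter

# Toy corpus: (word list, label) pairs shaped like the `documents` list in step 3.
# These reviews are invented for illustration only.
toy_documents = [
    (["wonderful", "film"], "pos"),
    (["wonderful", "outstanding", "cast"], "pos"),
    (["waste", "of", "time"], "neg"),
    (["wonderful", "idea", "total", "waste"], "neg"),
]

# Count every word, then keep the 2 most frequent as the feature vocabulary.
word_counts = Counter(w.lower() for words, _ in toy_documents for w in words)
toy_word_features = [word for word, _ in word_counts.most_common(2)]

def toy_document_features(document):
    """Map a word list to {'contains(word)': True/False} for each vocabulary word."""
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in toy_word_features}

toy_featuresets = [(toy_document_features(d), c) for d, c in toy_documents]
print(toy_word_features)
print(toy_featuresets[0])
```

Each review ends up as a dictionary of True/False answers plus its label, which is the input format the classifier in step 6 expects.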

5. Splitting the Data 📊

We split the data into training and testing sets.

train_set, test_set = train_test_split(featuresets, test_size=0.25, random_state=42)

Explanation

We split the dataset into training and testing sets:

train_set contains 75% of the data.
test_set contains 25% of the data.
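random_state=42 fixes which positions in the list go into each set. For fully reproducible results, also seed the shuffle in step 3 (for example, call random.seed(42) before random.shuffle), because that shuffle decides which review sits at each position.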
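
If you are curious what a seeded split does conceptually, the sketch below shuffles a copy of the data with a fixed seed and slices off the test share. It is a simplified stand-in written for this post, not scikit-learn's implementation, and simple_split is a hypothetical helper name.

```python
import math
import random

def simple_split(items, test_size=0.25, seed=42):
    """Shuffle a copy of items with a fixed seed, then slice off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed, same order every run
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

toy_items = list(range(8))
toy_train, toy_test = simple_split(toy_items)
print(len(toy_train), len(toy_test))  # 6 2
```

Calling simple_split again with the same seed returns the same two lists, which is the kind of repeatability random_state is meant to provide.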

6. Training the Model 🏋️

We’ll use a Naive Bayes classifier to train our model.

classifier = nltk.NaiveBayesClassifier.train(train_set)

Explanation

We train a Naive Bayes classifier using the training set.
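
To get a feel for what happens during training, here is a minimal from-scratch sketch of Naive Bayes over True/False features. It assumes add-one smoothing and uses invented toy data; NLTK's classifier uses its own probability estimator, so treat this as an illustration of the idea rather than a copy of the library's internals. The names train_naive_bayes and classify are hypothetical.

```python
import math
from collections import defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count how often each label and each (label, feature, value) occurs."""
    label_counts = defaultdict(int)
    feature_counts = defaultdict(int)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name, value)] += 1
    return dict(label_counts), dict(feature_counts)

def classify(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, feature_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # prior: how common the label is
        for name, value in features.items():
            count = feature_counts.get((label, name, value), 0)
            # Add-one smoothing over the two possible values (True/False)
            score += math.log((count + 1) / (n_label + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data in the same shape as train_set
toy_train_set = [
    ({"contains(wonderful)": True, "contains(waste)": False}, "pos"),
    ({"contains(wonderful)": True, "contains(waste)": False}, "pos"),
    ({"contains(wonderful)": False, "contains(waste)": True}, "neg"),
    ({"contains(wonderful)": False, "contains(waste)": True}, "neg"),
]
toy_model = train_naive_bayes(toy_train_set)
print(classify(toy_model, {"contains(wonderful)": True, "contains(waste)": False}))
```

The "naive" part is that each feature contributes to the score independently of the others, which keeps training down to simple counting.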

7. Evaluating the Model 📈

Now, let’s see how well our model performs!

accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy * 100:.2f}%")

classifier.show_most_informative_features(10)

Explanation

We evaluate the model's performance:

nltk.classify.accuracy calculates the accuracy of the classifier on the test set.
classifier.show_most_informative_features(10) displays the 10 most informative features that help in classification.
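
Accuracy itself is just the share of test examples the classifier labels correctly. The sketch below shows that calculation with a hypothetical rule-based classifier (rule_classifier) and an invented four-item test set; it illustrates the idea and is not NLTK's source code.

```python
def accuracy(classify_fn, labeled_featuresets):
    """Fraction of examples where the predicted label matches the true label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classify_fn(features) == label)
    return correct / len(labeled_featuresets)

def rule_classifier(features):
    """Hypothetical stand-in classifier: 'waste' means negative, otherwise positive."""
    return "neg" if features.get("contains(waste)") else "pos"

# Invented toy test set in the same shape as test_set
toy_test_set = [
    ({"contains(waste)": True}, "neg"),
    ({"contains(waste)": False}, "pos"),
    ({"contains(waste)": False}, "neg"),  # the rule gets this one wrong
    ({"contains(waste)": True}, "neg"),
]
toy_accuracy = accuracy(rule_classifier, toy_test_set)
print(f"Accuracy: {toy_accuracy * 100:.2f}%")  # Accuracy: 75.00%
```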

Complete Code Snippet 📝

Here’s the full code to get you started with sentiment analysis:

import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Download necessary NLTK data
nltk.download('movie_reviews')
nltk.download('punkt')

# Load the dataset
from nltk.corpus import movie_reviews
import random

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

# Prepare the data
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Keep the 2000 most frequent words as the feature vocabulary
word_features = [word for word, count in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership checks are fast
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]

# Split the data
train_set, test_set = train_test_split(featuresets, test_size=0.25, random_state=42)

# Train the model
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Evaluate the model
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Show the most informative features
classifier.show_most_informative_features(10)

Output

Accuracy: The line "Accuracy: 80.20%" means the model predicted the correct sentiment (positive or negative) for 80.2% of the movie reviews in the test set. Because the documents are shuffled randomly, your own figure will differ slightly from run to run.
Informative Features: The "Most Informative Features" section lists the features that most strongly separate positive reviews from negative ones. For example:
- "contains(outstanding) = True" means the word "outstanding" is a strong indicator of positive sentiment: it is 13.8 times more likely to appear in a positive review than in a negative one.
- "contains(seagal) = True" shows that the word "seagal" is associated with negative reviews: it is 12.8 times more likely to appear in a negative review than in a positive one.
- Similarly, the word "wonderfully" is 7.4 times more likely to appear in a positive review.
- And the word "waste" is 6.5 times more likely to appear in a negative review.

In essence, the output shows that your program successfully trained a sentiment analysis model using movie reviews and identified key features that predict positive or negative sentiments.
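
Ratios like these compare how often a feature shows up in one class versus the other. Here is a small sketch of that calculation using hypothetical counts that are invented for illustration and are not taken from the movie_reviews corpus. NLTK works from smoothed probability estimates, so its exact figures are computed a little differently.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely a feature is in class A than in class B."""
    return (count_a / total_a) / (count_b / total_b)

# Hypothetical counts, invented for illustration only:
# the word appears in 40 of 200 positive reviews and 5 of 200 negative ones.
pos_with_word, pos_total = 40, 200
neg_with_word, neg_total = 5, 200

ratio = likelihood_ratio(pos_with_word, pos_total, neg_with_word, neg_total)
print(f"pos : neg = {ratio:.1f} : 1.0")  # pos : neg = 8.0 : 1.0
```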

Let’s Recap! 🔄

We imported the necessary libraries.
We downloaded and loaded the dataset of movie reviews.
We prepared and shuffled the data.
We created a function to extract features from the text.
We split the data into training and testing sets.
We trained a Naive Bayes classifier.
We evaluated the model’s accuracy and displayed the most informative features.

Conclusion

You've just walked through a simple yet powerful machine learning project in Python! By understanding how sentiment analysis works, you can explore further and create more complex models and applications.

Why It’s Awesome 🌟

Sentiment analysis can be used for numerous applications:

Social Media Monitoring 📱
Customer Feedback Analysis 📝
Market Research 📊
Opinion Mining 🕵️‍♂️

Challenge Yourself! 🏆

Try experimenting with different datasets or models! Explore more advanced techniques like using word embeddings or deep learning for sentiment analysis. The possibilities are endless.
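
As one possible starting point, here is a minimal sketch of the scikit-learn route using the CountVectorizer, MultinomialNB and accuracy_score imports from step 1. The tiny in-memory reviews are invented for illustration, so the result says nothing about real-world performance; swap in your own texts and labels to experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Invented toy reviews, for illustration only
train_texts = [
    "a wonderful outstanding film",
    "wonderful acting and a great story",
    "a boring waste of time",
    "terrible plot and boring acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["an outstanding and wonderful story", "what a waste, boring and terrible"]
test_labels = ["pos", "neg"]

# Turn each text into a vector of word counts
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuse the vocabulary learned from training

# Train a multinomial Naive Bayes model on the counts and evaluate it
model = MultinomialNB()
model.fit(X_train, train_labels)
predicted = model.predict(X_test)
sk_accuracy = accuracy_score(test_labels, predicted)
print(list(predicted), sk_accuracy)
```

The main difference from the NLTK version is that the features are word counts in a matrix rather than True/False dictionaries, which makes it easy to swap in other scikit-learn classifiers.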

Coding with a Smile

Variable Naming Woes: Coming up with variable names can feel like naming your children. You start with meaningful names, then quickly resort to 'thing1', 'thing2', and eventually 'x', 'y', and 'z'. Just remember, 'naming things' is one of the two hard problems in computer science—right up there with 'off-by-one errors'!

Keep exploring, keep coding 👩‍💻👨‍💻, and enjoy your journey into artificial intelligence, machine learning, data analytics, data science, and more with Python!

Stay tuned for our next exciting project in the following edition!

Happy coding!🚀📊✨