- CodeCraft by Dr. Christine Lee
Discover the Secret to Understanding Emotions
Python Sentiment Analysis Made Easy!
Dive into Machine Learning with Python: Sentiment Analysis!
Ready to explore another exciting area of machine learning? Today, we're going to venture into sentiment analysis! Ever wondered how companies determine public sentiment on social media? Or how movie reviews are categorised as positive or negative? Let's unravel this magic with a fun and easy Python project!
What is Sentiment Analysis?
Sentiment analysis is a technique used to determine whether a piece of text is positive, negative, or neutral. It's like teaching your computer to understand human emotions!
Let's Code!
We'll be using the nltk and scikit-learn libraries in Python. If you don't have them installed yet, you can get them by running:
pip install nltk scikit-learn
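If you'd like to confirm the install worked before going further, here is a small sketch that checks whether both libraries can be found. The helper name missing_packages is made up for this illustration. Note that scikit-learn is imported under the name sklearn.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in import_names if find_spec(name) is None]


# scikit-learn is installed as "scikit-learn" but imported as "sklearn".
missing = missing_packages(["nltk", "sklearn"])
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("Both libraries are available.")
```

If anything is reported as missing, rerun the pip command above in the same environment you use to run Python.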
Step-by-Step Guide
1. Importing Libraries
First, we need to import the necessary libraries.
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
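Explanation
Here, we import the necessary libraries:
nltk: for natural language processing.
CountVectorizer and MultinomialNB: from scikit-learn, for text vectorization and Naive Bayes classification.
train_test_split: to split the dataset.
accuracy_score: to evaluate the model.
Note that the steps below train and score the model with NLTK's own NaiveBayesClassifier and nltk.classify.accuracy, so of the scikit-learn imports only train_test_split is actually called in this project.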
2. Downloading NLTK Data
We need to download some data for NLTK to work properly.
nltk.download('movie_reviews')
nltk.download('punkt')
Explanation
We download the NLTK data this project relies on:
movie_reviews: A collection of movie reviews, each labelled positive or negative, used for training and testing the sentiment analysis model.
punkt: A pretrained tokenizer model for splitting text into sentences, which NLTK uses whenever text has to be broken into sentences.
3. Loading the Dataset
We'll use a dataset of movie reviews from NLTK.
from nltk.corpus import movie_reviews
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
Explanation
We load the movie_reviews dataset and shuffle the documents to ensure a random distribution of data:
movie_reviews.words(fileid): gets the words from a review.
movie_reviews.categories(): returns the categories (positive/negative).
movie_reviews.fileids(category): gets the file IDs for a specific category.
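To make the shape of documents concrete, here is a minimal sketch that builds the same kind of list from a tiny made-up corpus (the toy_corpus dictionary and its file IDs are invented for illustration, not taken from NLTK). It also uses a seeded random.Random so the shuffle is repeatable, which the tutorial's plain random.shuffle is not.

```python
import random

# Hypothetical stand-in for the corpus: file ID -> (tokens, category).
toy_corpus = {
    "neg/r1.txt": (["a", "dull", "waste", "of", "time"], "neg"),
    "neg/r2.txt": (["boring", "and", "far", "too", "long"], "neg"),
    "pos/r1.txt": (["an", "outstanding", "film"], "pos"),
    "pos/r2.txt": (["wonderfully", "acted", "and", "moving"], "pos"),
}

# Each document is a (list_of_tokens, category) pair, as in the tutorial.
documents = [(list(tokens), category) for tokens, category in toy_corpus.values()]

# A seeded generator makes the shuffle repeatable between runs.
rng = random.Random(42)
rng.shuffle(documents)

for tokens, category in documents:
    print(category, tokens)
```

Each entry pairs one review's tokens with its label, which is exactly what the feature-extraction step expects.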
4. Preparing the Data
We'll prepare the data for our machine learning model.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
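# Keep the 2000 most frequent words; most_common() sorts by count,
# whereas list(all_words) would not guarantee frequency order.
word_features = [word for word, _ in all_words.most_common(2000)]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features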
featuresets = [(document_features(d), c) for (d, c) in documents]
Explanation
We prepare the data:
all_words: computes the frequency distribution of words in the reviews.
word_features: selects the top 2000 most frequent words as features.
document_features: checks which of the top words are present in each document.
featuresets: applies the document_features function to each document.
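Here is a scaled-down sketch of the same idea on two made-up reviews, using collections.Counter in place of nltk.FreqDist (FreqDist is a Counter subclass, so most_common behaves the same way). The toy documents and the vocabulary size of 2 are assumptions chosen to keep the output readable.

```python
from collections import Counter

# Two invented reviews standing in for the real corpus.
toy_documents = [
    (["great", "film", "great", "cast"], "pos"),
    (["a", "dull", "film", "not", "great"], "neg"),
]

# Count every token, then keep the two most frequent as the vocabulary.
all_words = Counter(w.lower() for tokens, _ in toy_documents for w in tokens)
word_features = [word for word, _ in all_words.most_common(2)]


def document_features(document):
    """Map each vocabulary word to True/False depending on its presence."""
    document_words = set(document)
    return {f"contains({word})": (word in document_words) for word in word_features}


print(word_features)
print(document_features(["a", "great", "story"]))
```

Every review ends up as a dictionary with the same keys, one per vocabulary word, which is the format the classifier trains on.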
5. Splitting the Data
We split the data into training and testing sets.
train_set, test_set = train_test_split(featuresets, test_size=0.25, random_state=42)
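Explanation
We split the dataset into training and testing sets:
train_set: contains 75% of the data.
test_set: contains 25% of the data.
random_state=42: makes the split itself reproducible. Because the earlier random.shuffle is not seeded, the reviews that land in each set can still differ between runs.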
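To see what a 75/25 split amounts to, here is a plain-Python sketch. It is not how scikit-learn implements train_test_split, and the function name simple_split is invented for this example. It uses 2000 items because the movie_reviews corpus contains 2000 reviews.

```python
import math
import random


def simple_split(items, test_size=0.25, seed=42):
    """Shuffle a copy of items and cut it into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


train, test = simple_split(range(2000))
print(len(train), len(test))  # 1500 500
```

So with the full corpus, the classifier learns from 1500 reviews and is scored on the remaining 500.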
6. Training the Model
We'll use a Naive Bayes classifier to train our model.
classifier = nltk.NaiveBayesClassifier.train(train_set)
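Explanation
We train a Naive Bayes classifier using the training set.
The classifier learns how often each feature appears in positive versus negative reviews, then combines those per-word probabilities under the "naive" assumption that the features are independent of one another given the label.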
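If you're curious what happens inside a Naive Bayes classifier, the following is a from-scratch sketch on four invented training examples. It is a simplified illustration, not NLTK's implementation: the function names are made up, and it uses add-one smoothing, which is an assumption chosen for simplicity.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)
    feature_names = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, sorted(feature_names)


def classify(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # prior probability of the label
        for name in feature_names:
            # Add-one smoothing keeps unseen combinations from scoring zero.
            p_true = (true_counts[label][name] + 1) / (n_label + 2)
            present = features.get(name, False)
            score += math.log(p_true if present else 1 - p_true)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training data in the same (features, label) format as train_set.
toy_train = [
    ({"contains(great)": True, "contains(waste)": False}, "pos"),
    ({"contains(great)": True, "contains(waste)": False}, "pos"),
    ({"contains(great)": False, "contains(waste)": True}, "neg"),
    ({"contains(great)": False, "contains(waste)": True}, "neg"),
]
model = train_naive_bayes(toy_train)
print(classify(model, {"contains(great)": True, "contains(waste)": False}))
```

Working in log-probabilities avoids multiplying thousands of tiny numbers together, which matters once you have 2000 features per review.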
7. Evaluating the Model
Now, let's see how well our model performs!
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy * 100:.2f}%")
classifier.show_most_informative_features(10)
Explanation
We evaluate the model's performance:
nltk.classify.accuracy: calculates the accuracy of the classifier on the test set.
classifier.show_most_informative_features(10): displays the 10 most informative features that help in classification.
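Accuracy is simply the fraction of test reviews whose predicted label matches the true one. Here is a minimal sketch with invented labels (the accuracy function below is written for this example and is not the NLTK one):

```python
def accuracy(predicted, gold):
    """Fraction of positions where the prediction matches the true label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


gold = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos"]
print(f"Accuracy: {accuracy(predicted, gold) * 100:.2f}%")  # Accuracy: 80.00%
```

Four of the five predictions match, giving 80%.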
Complete Code Snippet
Here's the full code to get you started with sentiment analysis:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Download necessary NLTK data
nltk.download('movie_reviews')
nltk.download('punkt')
# Load the dataset
from nltk.corpus import movie_reviews
import random
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
# Prepare the data
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Keep the 2000 most frequent words; most_common() sorts by count
word_features = [word for word, _ in all_words.most_common(2000)]
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features
featuresets = [(document_features(d), c) for (d, c) in documents]
# Split the data
train_set, test_set = train_test_split(featuresets, test_size=0.25, random_state=42)
# Train the model
classifier = nltk.NaiveBayesClassifier.train(train_set)
# Evaluate the model
accuracy = nltk.classify.accuracy(classifier, test_set)
print(f"Accuracy: {accuracy * 100:.2f}%")
# Show the most informative features
classifier.show_most_informative_features(10)
Output
Sentiment Analysis: The line "Accuracy: 80.20%" indicates the model correctly predicted the sentiment (positive or negative) of 80.2% of the movie reviews in the test set. Your exact figure will vary from run to run because random.shuffle is not seeded.
Informative Features: The "Most Informative Features" section highlights the words most strongly associated with positive or negative sentiment. Each ratio compares how likely a feature is to occur in reviews of one class versus the other. For example:
"contains(outstanding) = True" means the word "outstanding" was about 13.8 times more likely to appear in a positive training review than in a negative one, making it a strong indicator of positive sentiment.
"contains(seagal) = True" shows the word "seagal" was about 12.8 times more likely to appear in a negative review than in a positive one.
Similarly, the word "wonderfully" was 7.4 times more likely to appear in a positive review.
And the word "waste" was 6.5 times more likely to appear in a negative review.
In essence, the output shows that your program successfully trained a sentiment analysis model using movie reviews and identified key features that predict positive or negative sentiments.
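To show where a ratio like "13.8 : 1.0" can come from, here is a sketch that divides the estimated chance of a word appearing in a positive review by the chance of it appearing in a negative one. The counts are invented, the function name is made up, and adding 0.5 to each count is a smoothing assumption made for this example rather than a statement about NLTK's internals.

```python
def likelihood_ratio(pos_with_word, pos_total, neg_with_word, neg_total):
    """Ratio of P(word present | pos) to P(word present | neg)."""
    # Adding 0.5 to each count avoids dividing by zero for unseen words.
    p_pos = (pos_with_word + 0.5) / (pos_total + 1)
    p_neg = (neg_with_word + 0.5) / (neg_total + 1)
    return p_pos / p_neg


# Hypothetical counts: the word shows up in 40 of 750 positive training
# reviews and in 3 of 750 negative ones.
ratio = likelihood_ratio(40, 750, 3, 750)
print(f"pos : neg = {ratio:.1f} : 1.0")
```

A word that is rare overall but appears almost exclusively in one class gets a high ratio, which is why an actor's name can rank alongside obviously emotional words.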
Let's Recap!
We imported the necessary libraries.
We downloaded and loaded the dataset of movie reviews.
We prepared and shuffled the data.
We created a function to extract features from the text.
We split the data into training and testing sets.
We trained a Naive Bayes classifier.
We evaluated the model's accuracy and displayed the most informative features.
Conclusion
You've just walked through a simple yet powerful machine learning project in Python! By understanding how sentiment analysis works, you can explore further and create more complex models and applications.
Why It's Awesome
Sentiment analysis can be used for numerous applications:
Social Media Monitoring
Customer Feedback Analysis
Market Research
Opinion Mining
Challenge Yourself!
Try experimenting with different datasets or models! Explore more advanced techniques like using word embeddings or deep learning for sentiment analysis. The possibilities are endless.
Coding with a Smile
Variable Naming Woes: Coming up with variable names can feel like naming your children. You start with meaningful names, then quickly resort to 'thing1', 'thing2', and eventually 'x', 'y', and 'z'. Just remember, 'naming things' is one of the two hard problems in computer science, right up there with 'off-by-one errors'!
Keep exploring, keep coding, and enjoy your journey into artificial intelligence, machine learning, data analytics, data science and more with Python!
Stay tuned for our next exciting project in the following edition!
Happy coding!