- CodeCraft by Dr. Christine Lee
Master Data Storytelling with Seaborn
A Beginner’s Guide
Python Seaborn Tutorial: Start Visualising Data
Introduction to Seaborn
Seaborn is an open-source, BSD-licensed Python library that provides a high-level API for visualising data. It is built on top of Matplotlib and offers beautiful default styles and color palettes to make statistical plots more attractive. As a data analyst or scientist, mastering Seaborn can significantly enhance your ability to communicate insights effectively.
Key Features of Seaborn
User-Friendly Interface: Seaborn is designed to work seamlessly with Pandas dataframes, making it easy to visualise and explore data quickly.
Attractive Plots with Minimal Code: Seaborn generates visually appealing plots with minimal coding efforts. It provides default themes and color palettes that you can easily customise.
Advanced Statistical Analysis: Beyond basic plots, Seaborn supports advanced statistical analysis, including regression analysis, distribution plots, and categorical plots.
Multi-Plot Visualisations: Create complex grids of plots for easy comparison between multiple variables or subsets of data.
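To make the multi-plot idea concrete, here is a minimal sketch using sns.relplot, which returns a grid with one panel per value of the col variable. The data frame below is a hypothetical mini-sample (the column names total_bill, tip, and time are borrowed from the Tips dataset introduced in the next section); the values are for illustration only.

```python
import pandas as pd
import seaborn as sns

# Hypothetical mini-sample; column names follow the Tips dataset.
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 27.20, 22.76, 17.29, 19.44],
    "tip":        [1.01, 1.66, 3.50, 3.31, 4.00, 3.00, 2.71, 3.00],
    "time":       ["Dinner", "Dinner", "Dinner", "Dinner",
                   "Lunch", "Lunch", "Lunch", "Lunch"],
})

# relplot returns a FacetGrid: one scatter panel per value of `time`.
grid = sns.relplot(data=sample, x="total_bill", y="tip", col="time", kind="scatter")
grid.set_axis_labels("Total Bill ($)", "Tip ($)")
```

In a script, finish with plt.show() from matplotlib.pyplot to display the figure. Swapping col for row stacks the panels vertically instead.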
Key Functions and Snippets
Let’s explore seven essential Seaborn functions along with code snippets using the Tips dataset.
The Tips dataset is a small dataset that is widely available online and can be easily read into a Pandas DataFrame.
Here are the key points about this dataset:
Size and Structure:
The Tips dataset contains only 244 rows (observations) and 7 variables (columns).
Each row represents information about a tip received by a waiter over a period of a few months while working in a restaurant.
Variables:
The dataset includes the following variables:
total_bill: The total bill amount (numeric).
tip: The tip amount received (numeric).
sex: The gender of the customer (categorical: “Male” or “Female”).
smoker: Whether the customer is a smoker (categorical: “Yes” or “No”).
day: The day of the week (categorical: “Thur”, “Fri”, “Sat”, or “Sun”).
time: The time of day (categorical: “Lunch” or “Dinner”).
size: The size of the dining party (numeric).
Context:
The dataset provides insights into tipping behavior based on various factors such as gender, smoking status, day of the week, and mealtime.
It’s commonly used for exploring relationships between variables and practicing data analysis techniques.
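If you want to check the structure described above yourself, a sketch along these lines may help. It reads a few rows in the same seven-column layout from CSV text; the rows are illustrative sample values, not the full 244-row file, and the choice to read the text columns as categories is an assumption made for this example.

```python
import io
import pandas as pd

# A few illustrative rows in the same seven-column layout as the Tips dataset
# (hypothetical sample values, not the full 244-row file).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
20.65,3.35,Male,No,Sat,Dinner,3
27.20,4.00,Male,No,Thur,Lunch,4
"""

# Read the four text columns as pandas categories.
categorical = ["sex", "smoker", "day", "time"]
sample = pd.read_csv(io.StringIO(csv_text), dtype={c: "category" for c in categorical})

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # numeric vs. categorical columns

# Note: `sample.size` is the DataFrame's element count, so the party-size
# column has to be accessed with bracket syntax.
party_size = sample["size"]
```

With the real dataset, the same shape and dtypes checks work on the frame returned by sns.load_dataset("tips").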
1. Scatter Plot
Scatter plots are valuable for identifying trends, outliers, and potential relationships between variables.
In the example below, we visualise tipping behavior in a restaurant setting.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load sample data
tips = sns.load_dataset("tips")
# Create a scatter plot
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.title("Scatter Plot: Total Bill vs. Tip")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.show()
Explanation:
sns.scatterplot(x="total_bill", y="tip", data=tips)
creates a scatter plot with total bill on the x-axis and tip amount on the y-axis using the tips dataset.
plt.title("Scatter Plot: Total Bill vs. Tip")
sets the plot title.
Output:
The scatter plot compares the total bill amount (on the x-axis) with the tip amount received (on the y-axis).
Each point represents a specific bill and tip combination from the dataset.
Here’s what we can infer from the plot:
Positive Correlation: As the total bill amount increases, the tip amount tends to increase as well. This positive correlation suggests that customers tend to tip more when their bill is higher.
Spread of Data: The points are scattered across the graph, indicating variability in tip amounts for different bill amounts.
Outliers: Some points may deviate significantly from the general trend, representing outliers (e.g., unusually large tips for relatively small bills).
Noisy Data: The plot shows some noise, meaning that there isn’t a perfect linear relationship between total bill and tip.
Visual Clarity: The axes are labeled, and the title provides context for the plot.
Scatter Plot
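The "positive correlation" reading can also be checked numerically. The sketch below computes the Pearson correlation coefficient and a straight-line fit on a small hypothetical sample (made-up values, not the real Tips data); the variable names are chosen for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical bill/tip pairs, for illustration only.
sample = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    "tip":        [1.5, 2.0, 3.5, 3.0, 5.0, 6.0],
})

# Pearson correlation: values near +1 mean the points rise together.
r = sample["total_bill"].corr(sample["tip"])

# Least-squares line through the points (degree-1 polynomial fit).
slope, intercept = np.polyfit(sample["total_bill"], sample["tip"], deg=1)

print(f"Pearson r = {r:.2f}")
print(f"tip ~ {slope:.3f} * total_bill + {intercept:.2f}")
```

A coefficient well below 1 is the numerical counterpart of the "noisy data" point above: the trend is there, but the points do not sit on a single line.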
2. Bar Plot
Bar Plot Purpose:
A bar plot (or bar chart) is used to compare different categories or groups by representing their values using rectangular bars.
It’s particularly useful for showing how a numerical value varies across different categories.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Create a bar plot
sns.barplot(x="day", y="total_bill", data=tips)
plt.title("Average Total Bill by Day")
plt.show()
Explanation:
sns.barplot(x="day", y="total_bill", data=tips)
creates a bar plot showing the average total bill for each day of the week.
plt.title("Average Total Bill by Day")
sets the plot title.
Output:
In this bar chart, we are comparing the average total bill amount for each day of the week.
The bars indicate the average total bill for each day.
For example:
The bar for Thursday (Thur) represents the average total bill amount on Thursdays.
The bar for Friday (Fri) represents the average total bill amount on Fridays, and so on.
Bar Plot
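By default, sns.barplot uses the mean as its estimator, so each bar height is a per-category average, and the thin line on top of each bar is a 95% confidence interval for that average. A rough way to see the numbers behind the bars is a pandas groupby, sketched here on hypothetical values:

```python
import pandas as pd

# Hypothetical bills, two per day, for illustration only.
sample = pd.DataFrame({
    "day":        ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
    "total_bill": [12.0, 18.0, 15.0, 19.0, 20.0, 26.0, 21.0, 25.0],
})

# Average bill per day, arranged in weekday order.
day_order = ["Thur", "Fri", "Sat", "Sun"]
mean_bill = sample.groupby("day")["total_bill"].mean().reindex(day_order)

print(mean_bill)
```

Running the same groupby on the real tips frame gives the values that the bars in the example above represent.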
3. Box Plot
Box Plot Purpose:
A box plot (or box-and-whisker plot) visually summarizes the distribution of numerical data.
It provides insights into the central tendency, spread, and presence of outliers within a dataset.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Create a box plot
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Distribution of Total Bill by Day")
plt.show()
Explanation:
sns.boxplot(x="day", y="total_bill", data=tips)
creates a box plot showing the distribution of total bill amounts for each day.
plt.title("Distribution of Total Bill by Day")
sets the plot title.
Output:
In this box plot, we are comparing the distribution of total bill amounts for each day of the week.
The box represents the interquartile range (IQR), which contains the middle 50% of the total bill amounts for each day.
The line inside the box represents the median total bill amount.
The whiskers extend to the most extreme values that lie within 1.5 × IQR of the box (Seaborn's default), showing the range of typical data.
Any data points beyond the whiskers are potential outliers and are plotted individually as dots.
Box Plot
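The pieces of a box plot can be reproduced by hand. This sketch computes the quartiles, the IQR, and the 1.5 × IQR fences (the default whis=1.5 rule) for one hypothetical group of bills; the numbers are made up so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical bills for a single day; 60.0 is deliberately extreme.
bills = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 60.0],
                  name="total_bill")

q1 = bills.quantile(0.25)      # bottom edge of the box
median = bills.quantile(0.50)  # line inside the box
q3 = bills.quantile(0.75)      # top edge of the box
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Here the box spans 14 to 22, the fences sit at 2 and 34, and only the 60.0 bill falls outside them. The upper whisker therefore stops at 24.0, the largest value inside the fence.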
4. Heatmap
Heatmap Purpose:
A heatmap is a graphical representation of data using colors to visualize the value of a matrix.
It’s particularly useful for displaying relationships and patterns between variables.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Use only the numeric columns: pandas 2.0 and later raise an error if corr() meets text or categorical columns
numeric_data = tips.select_dtypes(include="number")
# Create a heatmap
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()
Explanation:
numeric_data = tips.select_dtypes(include="number")
keeps only the numeric columns (total_bill, tip, and size).
correlation_matrix = numeric_data.corr()
computes the correlation matrix for those columns.
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
creates a heatmap of the correlation values.
plt.title("Correlation Heatmap")
sets the plot title.
Output:
In this heatmap, we are visualising the correlation matrix of the Tips dataset.
The color intensity represents the strength and direction of the correlation between different numeric variables.
Here’s what we can infer:
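Positive Correlation: With the coolwarm palette, warm colors (shades of red) mark the highest values in the matrix (variables move together).
Negative Correlation: Cool colors (shades of blue) mark the lowest values. Because vmin, vmax, and center are not set in this example, the color scale spans only the values present in the matrix rather than -1 to 1, so a blue cell here is the weakest correlation shown, not necessarily a negative one. In the Tips dataset, total_bill, tip, and size are all positively correlated.
No Correlation: Neutral colors (white or light gray) indicate a correlation near zero only when the color scale is centered on zero.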
The annotations within each cell show the actual correlation coefficients.
Heatmap
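If you want red to always mean positive and blue to always mean negative, one option is to pin the color scale to the full correlation range. The sketch below does this with the vmin, vmax, and center parameters of sns.heatmap on a hypothetical sample frame (made-up values with Tips-style column names).

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample with Tips-style columns, for illustration only.
sample = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    "tip":        [1.5, 2.0, 3.5, 3.0, 5.0, 6.0],
    "size":       [1, 2, 2, 3, 4, 4],
    "day":        ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun"],
})

# Correlations are only defined for the numeric columns.
corr = sample.select_dtypes(include="number").corr()

# Fix the scale to [-1, 1] and center it on 0, so that the neutral color
# of the palette corresponds to "no correlation".
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, center=0, ax=ax)
ax.set_title("Correlation Heatmap (fixed -1 to 1 scale)")
```

Call plt.show() afterwards to display the figure. With a fixed scale, heatmaps of different datasets also become directly comparable.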
5. Pair Plot
Pair Plot Purpose:
A pair plot visualizes pairwise relationships in a dataset.
It shows scatter plots for each combination of numeric variables and histograms (or kernel density estimates) along the diagonal.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Create a pair plot
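g = sns.pairplot(tips, hue="sex")
# pairplot returns a grid of subplots, so set the title on the figure rather than with plt.title()
g.figure.suptitle("Pair Plot: Relationships between Numeric Variables", y=1.02)
plt.show()
Explanation:
g = sns.pairplot(tips, hue="sex")
creates a pair plot showing relationships between numeric variables in the tips dataset, with different colors for male and female data points.
g.figure.suptitle("Pair Plot: Relationships between Numeric Variables", y=1.02)
sets a title for the whole figure. Calling plt.title() here would only label the last subplot in the grid.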
Output:
The scatter plot between total bill and tip shows a positive correlation, indicating that higher total bills tend to result in larger tips.
The scatter plot between total bill and size suggests that larger groups (more people) tend to have higher total bills.
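The plots on the diagonal show the distribution of each variable (e.g., the distribution of total bills or tip amounts). Because hue is set, Seaborn draws them as one kernel density curve per group rather than as histograms.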
Pair Plot
6. Count Plot
A count plot visually displays the counts of observations in each categorical bin using bars.
It’s similar to a histogram but for categorical data instead of quantitative data.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Create a count plot
sns.countplot(x="day", data=tips)
plt.title("Number of Visits by Day")
plt.show()
Explanation:
sns.countplot(x="day", data=tips)
creates a count plot showing the number of visits for each day.
plt.title("Number of Visits by Day")
sets the plot title.
Output:
In this count plot, we are visualising the number of visits for each day of the week.
Each bar represents the count of visits on a specific day.
For example:
The bar for Thursday (Thur) represents the number of visits on Thursdays.
The bar for Friday (Fri) represents the number of visits on Fridays, and so on.
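Each bar in a count plot is simply the number of rows in that category, so the same numbers can be obtained with value_counts. A minimal sketch on hypothetical day labels:

```python
import pandas as pd

# Hypothetical day labels, one per recorded bill.
days = pd.Series(["Thur", "Fri", "Sat", "Sat", "Sun", "Sat", "Sun", "Thur", "Sat"],
                 name="day")

# Count rows per day and arrange them in weekday order.
day_order = ["Thur", "Fri", "Sat", "Sun"]
visits = days.value_counts().reindex(day_order)

print(visits)
```

Since every row of the Tips dataset is one bill, "number of visits" here means the number of bills recorded on that day.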
7. Distribution Plot
A distribution plot visualizes the distribution of a single variable, most commonly as a histogram.
It shows how the data is spread across different values.
Example:
import seaborn as sns
import matplotlib.pyplot as plt
# Load the Tips dataset
tips = sns.load_dataset("tips")
# Create a histogram (histplot replaces the deprecated distplot)
sns.histplot(tips["total_bill"], bins=20, kde=False)
plt.title("Distribution of Total Bill Amounts")
plt.show()
Explanation:
sns.histplot(tips["total_bill"], bins=20, kde=False)
creates a histogram of total bill amounts with 20 bins and no kernel density estimate (KDE). The older sns.distplot function has been deprecated since Seaborn 0.11, and histplot is its replacement for histograms.
Setting
kde=False
(the default for histplot) leaves out the KDE curve. If you set
kde=True
, the plot includes a smooth curve showing the estimated probability density function.
plt.title("Distribution of Total Bill Amounts")
sets the plot title.
Output:
In this plot, we are visualising the distribution of total bill amounts in the Tips dataset.
The height of each bar shows how many total bill amounts fall within that bin's range.
For example:
The tallest bar marks the most common range of total bill amounts.
The distribution appears slightly skewed to the right, indicating that higher total bill amounts are less frequent.
The parameter
bins=20
specifies that the data is divided into 20 bins (intervals).
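To see what "20 bins" means in numbers, the sketch below runs NumPy's histogram function on a handful of hypothetical bill amounts. An integer bins value splits the range from the smallest to the largest value into that many equal-width intervals, which is also how an integer bins argument behaves in histplot.

```python
import numpy as np

# Hypothetical bill amounts, for illustration only.
bills = np.array([8.77, 10.27, 10.34, 14.78, 15.04, 16.99,
                  21.01, 23.68, 24.59, 25.29, 26.88, 35.26])

# 20 equal-width bins need 21 edges; counts[i] is the number of bills
# that fall between edges[i] and edges[i + 1].
counts, edges = np.histogram(bills, bins=20)

bin_width = edges[1] - edges[0]
print(f"bin width: {bin_width:.2f}")
print(counts)
```

With so few values most bins are empty. On the full 244-row dataset the same call gives the bar heights of the histogram above.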
Summary
Seaborn is a powerful tool for creating informative and aesthetically pleasing statistical graphics. Whether you’re visualizing relationships, distributions, or comparisons, Seaborn simplifies the process and enhances your data storytelling.
Keep exploring, keep coding, and enjoy your journey into data analytics with Python!
Happy plotting! 🚀