Master Data Storytelling with Seaborn

A Beginner’s Guide

Python Seaborn Tutorial: Start Visualising Data

Introduction to Seaborn

Seaborn is an open-source, BSD-licensed Python library that provides a high-level API for visualising data. It is built on top of Matplotlib and offers beautiful default styles and color palettes to make statistical plots more attractive. As a data analyst or scientist, mastering Seaborn can significantly enhance your ability to communicate insights effectively.


Key Features of Seaborn

User-Friendly Interface: Seaborn is designed to work seamlessly with Pandas dataframes, making it easy to visualise and explore data quickly.
Attractive Plots with Minimal Code: Seaborn generates visually appealing plots with minimal coding efforts. It provides default themes and color palettes that you can easily customise.
Built-in Statistical Plots: Beyond basic charts, Seaborn draws regression fits, distribution plots, and categorical plots, and computes the summary statistics behind them for you.
Multi-Plot Visualisations: Create complex grids of plots for easy comparison between multiple variables or subsets of data.
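
As a rough illustration of the theming and multi-plot features, the sketch below switches the global theme and draws one scatter panel per category. The small DataFrame is invented for this example and only borrows column names from the Tips dataset; set_theme and relplot are assumed to be available (seaborn 0.11 or later).

```python
import pandas as pd
import seaborn as sns

# Invented rows for illustration only (not real data)
df = pd.DataFrame({
    "total_bill": [10.5, 22.0, 15.3, 30.1, 18.7, 25.4],
    "tip": [1.5, 3.5, 2.0, 5.0, 3.0, 4.0],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Lunch", "Dinner"],
})

# Switch the global style and color palette
sns.set_theme(style="whitegrid", palette="muted")

# Draw one scatter panel per value of "time"; relplot returns a FacetGrid
g = sns.relplot(data=df, x="total_bill", y="tip", col="time")
print(g.axes.shape)  # rows x columns of panels in the grid
```

In a script, add import matplotlib.pyplot as plt and call plt.show() at the end to display the figure.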

Key Functions and Snippets

Let’s explore seven essential Seaborn functions along with code snippets using the Tips dataset.

The Tips dataset is a small dataset that is widely available online and can be easily read into a Pandas DataFrame.

Here are the key points about this dataset:

Size and Structure:
- The Tips dataset contains only 244 rows (observations) and 7 variables (columns).
- Each row represents information about a tip received by a waiter over a period of a few months while working in a restaurant.
Variables:
- The dataset includes the following variables:
  - total_bill: The total bill amount (numeric).
  - tip: The tip amount received (numeric).
  - sex: The sex of the person paying the bill (categorical: “Male” or “Female”).
  - smoker: Whether the party included smokers (categorical: “Yes” or “No”).
  - day: The day of the week (categorical: “Thur”, “Fri”, “Sat”, or “Sun”).
  - time: The time of day (categorical: “Lunch” or “Dinner”).
  - size: The size of the dining party (numeric).
Context:
- The dataset provides insights into tipping behavior based on various factors such as gender, smoking status, day of the week, and mealtime.
- It’s commonly used for exploring relationships between variables and practicing data analysis techniques.
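
Before plotting, it can help to check which columns are numeric and which are categorical, since that determines which plots make sense. The sketch below uses a few hand-typed rows in the same seven-column layout (an assumption made so the example runs offline; the full dataset comes from sns.load_dataset("tips"), which downloads it on first use).

```python
import pandas as pd

# A few hand-typed rows in the Tips layout (for illustration only)
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "sex": ["Female", "Male", "Male", "Male"],
    "smoker": ["No", "No", "No", "No"],
    "day": ["Sun", "Sun", "Sun", "Sun"],
    "time": ["Dinner", "Dinner", "Dinner", "Dinner"],
    "size": [2, 3, 3, 2],
})

# Store the text columns as categoricals
for col in ["sex", "smoker", "day", "time"]:
    sample[col] = sample[col].astype("category")

# Split the columns by type
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(include="category").columns.tolist()

print(sample.shape)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

One practical note: access the party size with sample["size"], because sample.size is a built-in DataFrame attribute that returns the total number of cells.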

1. Scatter Plot

Scatter plots are valuable for identifying trends, outliers, and potential relationships between variables.

In the example below, we visualise tipping behavior in a restaurant setting.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load sample data
tips = sns.load_dataset("tips")

# Create a scatter plot
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.title("Scatter Plot: Total Bill vs. Tip")
plt.xlabel("Total Bill ($)")
plt.ylabel("Tip ($)")
plt.show()

Explanation:

sns.scatterplot(x="total_bill", y="tip", data=tips) creates a scatter plot with total bill on the x-axis and tip amount on the y-axis using the tips dataset.
The plt.title("Scatter Plot: Total Bill vs. Tip") sets the plot title.

Output:

The scatter plot compares the total bill amount (on the x-axis) with the tip amount received (on the y-axis).
Each point represents a specific bill and tip combination from the dataset.
Here’s what we can infer from the plot:
- Positive Correlation: As the total bill amount increases, the tip amount tends to increase as well. This positive correlation suggests that customers tend to tip more when their bill is higher.
- Spread of Data: The points are scattered across the graph, indicating variability in tip amounts for different bill amounts.
- Outliers: Some points may deviate significantly from the general trend, representing outliers (e.g., unusually large tips for relatively small bills).
- Noisy Data: The plot shows some noise, meaning that there isn’t a perfect linear relationship between total bill and tip.
- Visual Clarity: The axes are labeled, and the title provides context for the plot.
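
The visual impression of a positive correlation can be backed up with numbers. The following is a minimal sketch using NumPy on invented bill and tip values (not the real dataset); it computes the Pearson correlation coefficient and a least-squares trend line.

```python
import numpy as np

# Invented bill/tip pairs for illustration only
total_bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.5, 3.0, 3.5, 5.0, 5.5])

# Pearson correlation: values near +1 mean a strong positive linear relationship
r = np.corrcoef(total_bill, tip)[0, 1]

# Fit a straight line tip = slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

print(f"Pearson r = {r:.2f}")
print(f"tip ≈ {slope:.3f} * total_bill + {intercept:.2f}")
```

With the real data, the same two calls can be applied to tips["total_bill"] and tips["tip"].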

Scatter Plot

2. Bar Plot

Bar Plot Purpose:

A bar plot (or bar chart) is used to compare different categories or groups by representing their values using rectangular bars.
It’s particularly useful for showing how a numerical value varies across different categories.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Create a bar plot
sns.barplot(x="day", y="total_bill", data=tips)
plt.title("Average Total Bill by Day")
plt.show()

Explanation:

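sns.barplot(x="day", y="total_bill", data=tips) creates a bar plot showing the average (mean) total bill for each day of the week. The thin vertical line on each bar is an error bar, which by default shows a 95% confidence interval for the mean.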
The plt.title("Average Total Bill by Day") sets the plot title.

Output:

In this bar chart, we are comparing the average total bill amount for each day of the week.

The bars indicate the average total bill for each day.
For example:
- The bar for Thursday (Thur) represents the average total bill amount on Thursdays.
- The bar for Friday (Fri) represents the average total bill amount on Fridays, and so on.
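
The bar heights correspond to a per-group mean, which can be reproduced with a pandas groupby. The sketch below uses a tiny invented table so the arithmetic is easy to check by hand.

```python
import pandas as pd

# Invented bills for illustration only
df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 12.0, 18.0, 25.0, 35.0],
})

# Mean total bill per day: the quantity a bar plot draws by default
mean_by_day = df.groupby("day", sort=False)["total_bill"].mean()
print(mean_by_day)
```

To plot a different summary, barplot accepts an estimator argument, for example estimator=np.median (with NumPy imported as np).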

Bar Plot

3. Box Plot

Box Plot Purpose:

A box plot (or box-and-whisker plot) visually summarizes the distribution of numerical data.
It provides insights into the central tendency, spread, and presence of outliers within a dataset.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Create a box plot
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Distribution of Total Bill by Day")
plt.show()

Explanation:

sns.boxplot(x="day", y="total_bill", data=tips) creates a box plot showing the distribution of total bill amounts for each day.
The plt.title("Distribution of Total Bill by Day") sets the plot title.

Output:

In this box plot, we are comparing the distribution of total bill amounts for each day of the week.
The box represents the interquartile range (IQR), which contains the middle 50% of the total bill amounts for each day.
The line inside the box represents the median total bill amount.
The whiskers extend to the most extreme values that lie within 1.5 × IQR of the box edges (the default setting).
Any data points beyond the whiskers are treated as potential outliers and are plotted individually.
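
The elements of a box plot can be computed by hand, which makes the picture easier to read. This sketch uses NumPy on a short invented list of bills and assumes the default whisker rule of 1.5 × IQR.

```python
import numpy as np

# Invented bill amounts; 50.0 is deliberately extreme
bills = np.array([8.0, 10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 50.0])

# Box: first quartile, median, third quartile
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Fences: points outside them are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```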

Box Plot

4. Heatmap

Heatmap Purpose:

A heatmap is a graphical representation of data using colors to visualize the value of a matrix.
It’s particularly useful for displaying relationships and patterns between variables.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Keep only the numeric columns; pandas 2.0 and later raise an error
# if corr() is called while non-numeric columns are still present
numeric_data = tips.select_dtypes(include="number")

# Compute the correlation matrix and draw it as a heatmap
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Heatmap")
plt.show()

Explanation:

numeric_data = tips.select_dtypes(include="number") keeps only the numeric columns (total_bill, tip, and size).
correlation_matrix = numeric_data.corr() computes the pairwise correlation matrix for those columns.
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm") creates a heatmap of the correlation values.
The plt.title("Correlation Heatmap") sets the plot title.

Output:

In this heatmap, we are visualising the correlation matrix of the Tips dataset.
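The color of each cell encodes the correlation coefficient between a pair of numeric variables.
Here’s what we can infer:
- Positive Correlation: With the coolwarm palette, shades of red mark the high end of the color scale (variables that move together strongly).
- Negative Correlation: Shades of blue mark the low end of the scale. A truly negative coefficient would mean the variables move in opposite directions, but in this dataset the numeric variables are all positively correlated, so the blue cells indicate weaker positive correlations.
- No Correlation: By default the color scale spans only the values present in the matrix, so neutral colors sit at the middle of that range and not at zero. Passing vmin=-1 and vmax=1 to sns.heatmap makes neutral colors correspond to no correlation.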
The annotations within each cell show the actual correlation coefficients.
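
To see what the heatmap cells contain, the correlation matrix can be inspected directly. This is a minimal pandas sketch on invented rows; it assumes Pearson correlation, which is the default for DataFrame.corr().

```python
import pandas as pd

# Invented rows for illustration only; "day" is a text column
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [2.0, 3.0, 5.0, 6.0, 9.0],
    "size": [1, 2, 2, 3, 4],
    "day": ["Thur", "Fri", "Sat", "Sun", "Sun"],
})

# Drop non-numeric columns, then compute pairwise Pearson correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

The matrix is symmetric and its diagonal is 1, because every variable correlates perfectly with itself; that is why a correlation heatmap always shows a diagonal in the strongest color.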

Heatmap

5. Pair Plot

Pair Plot Purpose:

A pair plot visualizes pairwise relationships in a dataset.
It shows scatter plots for each combination of numeric variables and histograms (or kernel density estimates) along the diagonal.

Example:

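import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Create a pair plot; pairplot returns a grid object that spans several panels
g = sns.pairplot(tips, hue="sex")

# Title the whole figure (plt.title would label only a single panel)
# Note: older Seaborn versions expose the figure as g.fig
g.figure.suptitle("Pair Plot: Relationships between Numeric Variables", y=1.02)
plt.show()

Explanation:

g = sns.pairplot(tips, hue="sex") creates a pair plot showing relationships between numeric variables in the tips dataset, with different colors for male and female data points.
g.figure.suptitle("Pair Plot: Relationships between Numeric Variables", y=1.02) sets a title for the whole figure; y=1.02 places it just above the top row of panels.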

Output:

The scatter plot between total bill and tip shows a positive correlation, indicating that higher total bills tend to result in larger tips.
The scatter plot between total bill and size suggests that larger groups (more people) tend to have higher total bills.
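The diagonal panels show the distribution of each variable (e.g., the distribution of total bills or tip amounts). Because hue is set, Seaborn draws them by default as overlaid density curves, one per group, instead of histograms.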

Pair Plot

6. Count Plot

A count plot visually displays the counts of observations in each categorical bin using bars.
It’s similar to a histogram but for categorical data instead of quantitative data.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Create a count plot
sns.countplot(x="day", data=tips)
plt.title("Number of Visits by Day")
plt.show()

Explanation:

sns.countplot(x="day", data=tips) creates a count plot showing the number of visits for each day.
The plt.title("Number of Visits by Day") sets the plot title.

Output:

In this count plot, we are visualising the number of visits for each day of the week.
Each bar represents the count of visits on a specific day.
For example:
- The bar for Thursday (Thur) represents the number of visits on Thursdays.
- The bar for Friday (Fri) represents the number of visits on Fridays, and so on.
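
The bar heights in a count plot are simply the number of rows per category, which pandas can report directly. A small sketch on an invented list of days:

```python
import pandas as pd

# Invented visits for illustration only
days = pd.Series(["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Sun"], name="day")

# Count the rows in each category
counts = days.value_counts()

# Put the categories in calendar order instead of frequency order
ordered = counts.reindex(["Thur", "Fri", "Sat", "Sun"])
print(ordered)
```

In the plot itself, the bar order can be set in the same way with the order argument, for example sns.countplot(x="day", data=tips, order=["Thur", "Fri", "Sat", "Sun"]).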

7. Distribution Plot

A distribution plot visualizes the distribution of a single variable; the most common form is a histogram.
It shows how the data is spread across different values.

Example:

import seaborn as sns
import matplotlib.pyplot as plt

# Load the Tips dataset
tips = sns.load_dataset("tips")

# Create a histogram (histplot replaces the deprecated distplot function)
sns.histplot(data=tips, x="total_bill", bins=20, kde=False)
plt.title("Distribution of Total Bill Amounts")
plt.show()

Explanation:

sns.histplot(data=tips, x="total_bill", bins=20, kde=False) creates a histogram of total bill amounts with 20 bins and no kernel density estimate (KDE). It is used here in place of sns.distplot, which is deprecated in current Seaborn releases.
With kde=False (the default for histplot), no KDE curve is drawn. If you set kde=True, the plot would include a smooth curve showing the estimated shape of the distribution.
The plt.title("Distribution of Total Bill Amounts") sets the plot title.

Output:

In this plot, we are visualising the distribution of total bill amounts in the Tips dataset.
The height of each bar shows how many bills fall within that bar’s range of amounts.
For example:
- The tallest bar marks the most common range of total bill amounts.
- The distribution is skewed to the right: most bills are modest, and higher total bill amounts become progressively less frequent.
- The parameter bins=20 specifies that the range of the data is divided into 20 equal-width bins (intervals).
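
The binning behind a histogram can be reproduced with NumPy, which shows what each bar counts. This sketch uses synthetic right-skewed values (randomly generated, not the real bills) and is conceptually what a 20-bin histogram computes.

```python
import numpy as np

# Synthetic right-skewed values standing in for bill amounts
rng = np.random.default_rng(0)
bills = rng.gamma(shape=4.0, scale=5.0, size=244)

# Split the data range into 20 equal-width bins and count values per bin
counts, edges = np.histogram(bills, bins=20)

# 20 bins are delimited by 21 edges, and every value lands in exactly one bin
print(len(counts), len(edges), counts.sum())

# Range covered by the tallest bar
tallest = counts.argmax()
print(f"most common range: {edges[tallest]:.1f} to {edges[tallest + 1]:.1f}")
```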


Summary

Seaborn is a powerful tool for creating informative and aesthetically pleasing statistical graphics. Whether you’re visualizing relationships, distributions, or comparisons, Seaborn simplifies the process and enhances your data storytelling.

Keep exploring, keep coding, and enjoy your journey into data analytics with Python!

Happy plotting! 🚀