Python Magic

How to Turn Data into Art with Pandas, NumPy, and More

Data Analytics and Visualisation

Learn Python by Creating a Fun Data Visualisation Project

Introduction

Welcome to the world of Python programming! Today, we're going to dive into a fun and simple project that will introduce you to some powerful Python libraries: pandas, matplotlib, seaborn, and numpy. By the end of this tutorial, you'll have created a beautiful data visualisation and gained foundational skills in data manipulation, statistical analysis, and plotting.

What You'll Learn

Pandas: How to handle and manipulate data.
NumPy: Basic numerical operations and array manipulations.
Matplotlib: Creating simple plots and visualisations.
Seaborn: Enhancing your visualisations with advanced styling.

Step-by-Step Guide

Step 1: Setting Up Your Environment

First, ensure you have Python installed on your computer. You can download it from python.org. Next, install the required libraries. Open your terminal or command prompt and run:

pip install pandas numpy matplotlib seaborn

This command installs the pandas, numpy, matplotlib, and seaborn libraries, which are essential for data manipulation and visualisation.
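Before moving on, it can help to confirm that the install worked. The short check below is a sketch, not part of the original project: it imports each library and prints its version (the `versions` dictionary is just an illustrative name).

```python
# Print the installed version of each library to confirm the install worked.
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns

versions = {
    'pandas': pd.__version__,
    'numpy': np.__version__,
    'matplotlib': matplotlib.__version__,
    'seaborn': sns.__version__,
}

for name, version in versions.items():
    print(f'{name}: {version}')
```

If any import fails with ModuleNotFoundError, pip most likely installed into a different Python environment than the one running the script.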

Step 2: Importing Libraries

Start by importing the necessary libraries in your Python script or Jupyter notebook:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

Here, we're importing pandas as pd, numpy as np, matplotlib.pyplot as plt, and seaborn as sns. These short aliases are widely used conventions, so most code you find elsewhere will look the same, and they let us write pd.DataFrame instead of pandas.DataFrame.

Step 3: Creating a Simple Dataset

Let's create a simple dataset using NumPy. We'll generate random data for demonstration purposes.

np.random.seed(0) # For reproducibility

data = {

'Age': np.random.randint(18, 70, size=100),

'Salary': np.random.randint(30000, 120000, size=100),

'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100)

}

df = pd.DataFrame(data)

Explanation

np.random.seed(0): Ensures that the random numbers generated are the same each time you run the code.
np.random.randint(18, 70, size=100): Generates 100 random integers from 18 up to 69 for the 'Age' column. The upper bound (70) is exclusive.
np.random.randint(30000, 120000, size=100): Generates 100 random integers from 30000 up to 119999 for the 'Salary' column. Again, the upper bound is exclusive.
np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100): Randomly assigns one of the departments to each entry.
pd.DataFrame(data): Creates a DataFrame from the dictionary data.

Output

The output is a DataFrame with 100 rows and three columns: Age, Salary, and Department.
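NumPy's documentation recommends the newer Generator interface for new code. As a sketch of the same dataset built that way (the name `df_rng` and the choice of seed are assumptions made for illustration), note that a seeded Generator produces different numbers from np.random.seed(0), so the values will not match the ones used in the rest of this tutorial:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded Generator; the seed value is arbitrary

n_rows = 100
data = {
    # integers() excludes the upper bound by default, so ages fall in 18..69
    'Age': rng.integers(18, 70, size=n_rows),
    # endpoint=True makes the upper bound inclusive: 30000..120000
    'Salary': rng.integers(30000, 120000, size=n_rows, endpoint=True),
    'Department': rng.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=n_rows),
}
df_rng = pd.DataFrame(data)

print(df_rng.shape)  # (100, 3)
print(df_rng['Age'].min(), df_rng['Age'].max())
```

Printing the minimum and maximum is a quick way to check which bounds a random-integer call actually respects.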

Step 4: Exploring the Data with Pandas

Before we start visualising, let's explore our data.

print(df.head())

print(df.describe())

print(df['Department'].value_counts())

Explanation

df.head(): Displays the first five rows of the DataFrame.
df.describe(): Provides a summary of statistics for the numerical columns.
df['Department'].value_counts(): Counts the occurrences of each department.

Output

The output will show:

The first five rows of the DataFrame to give you an idea of what the data looks like.
Statistical summary including count, mean, std, min, 25%, 50%, 75%, and max for Age and Salary.
The count of entries in each department.
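A natural next step in exploration is to summarise one column per category. The sketch below rebuilds the same dataset so it runs on its own, then uses groupby to get the headcount, mean salary, and median salary for each department (`salary_by_dept` is an illustrative name, not something from the original project):

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this snippet runs on its own.
np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# One row per department: headcount, mean salary, and median salary.
salary_by_dept = (
    df.groupby('Department')['Salary']
      .agg(['count', 'mean', 'median'])
      .sort_values('mean', ascending=False)
)
print(salary_by_dept.round(0))
```

Because the salaries are random, any differences between departments here are noise rather than a real pattern.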

Step 5: Visualising Data with Matplotlib

We'll start with a simple scatter plot to see the relationship between Age and Salary.

plt.figure(figsize=(10, 6))

plt.scatter(df['Age'], df['Salary'], alpha=0.5)

plt.title('Age vs Salary')

plt.xlabel('Age')

plt.ylabel('Salary')

plt.show()

Explanation

plt.figure(figsize=(10, 6)): Sets the size of the figure.
plt.scatter(df['Age'], df['Salary'], alpha=0.5): Creates a scatter plot with Age on the x-axis and Salary on the y-axis. The alpha parameter sets the transparency of the points.
plt.title('Age vs Salary'): Sets the title of the plot.
plt.xlabel('Age') and plt.ylabel('Salary'): Label the x-axis and y-axis.
plt.show(): Displays the plot.

Output

The output will be a scatter plot showing the relationship between Age and Salary. Each point represents an individual in the dataset.
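The same plot can also be written with matplotlib's object-oriented interface, which many people find easier to manage once a script contains more than one figure. This is an alternative sketch, not a replacement for the code above; it rebuilds the dataset so it can run on its own:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this snippet runs on its own.
np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# plt.subplots() returns the figure and its axes as explicit objects.
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['Age'], df['Salary'], alpha=0.5)
ax.set_title('Age vs Salary')
ax.set_xlabel('Age')
ax.set_ylabel('Salary')
fig.tight_layout()  # trims unused margins around the axes
```

Call plt.show() afterwards to display the figure, or fig.savefig with a file name of your choice to write it to disk.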

Step 6: Enhancing Visualisations with Seaborn

Seaborn makes it easy to create attractive and informative statistical graphics. Let's use it to create a more detailed plot.

plt.figure(figsize=(10, 6))

sns.scatterplot(x='Age', y='Salary', hue='Department', data=df, palette='viridis')

plt.title('Age vs Salary by Department')

plt.show()

Explanation

sns.scatterplot(x='Age', y='Salary', hue='Department', data=df, palette='viridis'): Creates a scatter plot with Age on the x-axis and Salary on the y-axis. The hue parameter colors the points based on the Department, and palette='viridis' sets the color scheme.
plt.title('Age vs Salary by Department'): Sets the title of the plot.
plt.show(): Displays the plot.

Output

The output will be a scatter plot similar to the previous one but with points colored according to their department, making it easier to see any trends or patterns within departments.
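To back up what the coloured plot suggests with numbers, one option is to compute the Age and Salary correlation separately for each department. This is a sketch under the same dataset assumptions as above, and `per_dept_corr` is an illustrative name:

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this snippet runs on its own.
np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# Pearson correlation between Age and Salary within each department.
per_dept_corr = pd.Series(
    {dept: group['Age'].corr(group['Salary'])
     for dept, group in df.groupby('Department')},
    name='age_salary_corr',
)
print(per_dept_corr.round(2))
```

With only about 25 random rows per department, these values can stray from zero purely by chance, so treat them as a sanity check rather than a finding.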

Step 7: Creating a Box Plot

Box plots are great for visualising the distribution of data across different categories.

plt.figure(figsize=(10, 6))

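sns.boxplot(x='Department', y='Salary', hue='Department', data=df, palette='Set2', legend=False)

plt.title('Salary Distribution by Department')

plt.show()

Explanation

sns.boxplot(x='Department', y='Salary', hue='Department', data=df, palette='Set2', legend=False): Creates a box plot with Department on the x-axis and Salary on the y-axis. Assigning Department to hue as well lets palette='Set2' give each box its own color, and legend=False hides the redundant legend. This form needs seaborn 0.13 or later, where passing palette without hue is deprecated.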
plt.title('Salary Distribution by Department'): Sets the title of the plot.
plt.show(): Displays the plot.

Output

The output will be a box plot showing the distribution of salaries for each department. The box plot displays the median, quartiles, and any outliers in the data.
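The numbers behind each box can be computed directly with pandas. The sketch below assumes the plot uses the default whisker setting of 1.5 times the interquartile range (IQR): points beyond the "fences" it calculates are the ones drawn as outliers. Column names such as `lower_fence` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this snippet runs on its own.
np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# Quartiles per department: one row per department, one column per quantile.
quartiles = (
    df.groupby('Department')['Salary']
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
quartiles.columns = ['Q1', 'median', 'Q3']

# Outlier fences under the 1.5 * IQR rule.
iqr = quartiles['Q3'] - quartiles['Q1']
quartiles['lower_fence'] = quartiles['Q1'] - 1.5 * iqr
quartiles['upper_fence'] = quartiles['Q3'] + 1.5 * iqr
print(quartiles.round(0))
```

The box spans Q1 to Q3 with a line at the median, and each whisker reaches to the most extreme data point that still lies inside the fences.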

Step 8: Creating a Heatmap

Heatmaps are useful for visualising correlations between different numerical features.

plt.figure(figsize=(10, 6))

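numeric_data = df.select_dtypes(include='number')

corr = numeric_data.corr()

sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)

plt.title('Correlation Heatmap')

plt.show()

Explanation

numeric_data = df.select_dtypes(include='number'): Keeps only the numeric columns (Age and Salary), since correlation is not defined for the text column Department. This is the public pandas method for the job, unlike the private _get_numeric_data().
numeric_data.corr(): Calculates the pairwise Pearson correlation matrix of the numeric columns.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1): Creates a heatmap of the correlation matrix. The annot=True parameter annotates the heatmap with the correlation values, and cmap='coolwarm' sets the color scheme. Setting vmin=-1 and vmax=1 pins the color scale to the full range of possible correlations; without them, seaborn stretches the colors over whatever values happen to be in the matrix, which can make a weak correlation look strong.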
plt.title('Correlation Heatmap'): Sets the title of the plot.
plt.show(): Displays the plot.

Output

The output will be a heatmap showing the correlations between Age and Salary. The color intensity indicates the strength of the correlation, making it easy to identify strong correlations.

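The value at the intersection of "Age" and "Salary" is about -0.1, indicating a very weak negative correlation. This suggests that there is almost no linear relationship between Age and Salary in this dataset, which is what we should expect: the two columns were generated independently at random. The cells on the diagonal are always 1, because every variable is perfectly correlated with itself.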

Color Coding

The colors in the heatmap range from blue to red.
Red indicates a positive correlation (closer to 1).
Blue indicates a negative correlation (closer to -1).
The intensity of the color reflects the strength of the correlation. A deeper color means a stronger correlation.
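The number shown in the heatmap can be cross-checked without plotting anything. The sketch below computes the Age and Salary correlation twice, once with NumPy's np.corrcoef and once with pandas, and the two should agree (`r_numpy` and `r_pandas` are illustrative names):

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataset so this snippet runs on its own.
np.random.seed(0)
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_numpy = np.corrcoef(df['Age'], df['Salary'])[0, 1]

# Series.corr uses the Pearson method by default.
r_pandas = df['Age'].corr(df['Salary'])

print(round(r_numpy, 3), round(r_pandas, 3))
```

Both calls compute the Pearson coefficient, so any difference between them should be no larger than floating-point rounding.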

Summary

Congratulations! You've just completed a fun and educational Python project. You learned how to:

Generate and manipulate data using NumPy and pandas.
Create basic plots with matplotlib.
Enhance and style your visualisations with seaborn.

By mastering these libraries, you're well on your way to becoming proficient in data analysis and visualisation. Keep experimenting with different datasets and visualisations to deepen your understanding.

Keep exploring, keep coding, and enjoy your journey into data analytics with Python!

Happy coding!🚀