- CodeCraft by Dr. Christine Lee
Python Magic
How to Turn Data into Art with Pandas, NumPy, and More
Data Analytics and Visualisation
Learn Python by Creating a Fun Data Visualisation Project
Introduction
Welcome to the world of Python programming! Today, we're going to dive into a fun and simple project that will introduce you to some powerful Python libraries: pandas, matplotlib, seaborn, and numpy. By the end of this tutorial, you'll have created a beautiful data visualisation and gained foundational skills in data manipulation, statistical analysis, and plotting.
What You'll Learn
Pandas: How to handle and manipulate data.
NumPy: Basic numerical operations and array manipulations.
Matplotlib: Creating simple plots and visualisations.
Seaborn: Enhancing your visualisations with advanced styling.
Step-by-Step Guide
Step 1: Setting Up Your Environment
First, ensure you have Python installed on your computer. You can download it from python.org. Next, install the required libraries. Open your terminal or command prompt and run:
pip install pandas numpy matplotlib seaborn
This command installs the pandas, numpy, matplotlib, and seaborn libraries, which are essential for data manipulation and visualisation.
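To confirm that the installation worked, a quick check like the one below can help. It is only a minimal sketch: it imports each library and prints its version string, and the versions you see will depend on your own environment.

```python
# Import each library and report the installed version.
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns

versions = {
    "pandas": pd.__version__,
    "numpy": np.__version__,
    "matplotlib": matplotlib.__version__,
    "seaborn": sns.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any of the imports fails with a ModuleNotFoundError, the pip command above most likely ran against a different Python installation than the one running the script.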
Step 2: Importing Libraries
Start by importing the necessary libraries in your Python script or Jupyter notebook:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Here, we're importing pandas as pd, numpy as np, matplotlib.pyplot as plt, and seaborn as sns. These imports allow us to use the functions and features provided by these libraries.
Step 3: Creating a Simple Dataset
Let's create a simple dataset using NumPy. We'll generate random data for demonstration purposes.
np.random.seed(0) # For reproducibility
data = {
'Age': np.random.randint(18, 70, size=100),
'Salary': np.random.randint(30000, 120000, size=100),
'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100)
}
df = pd.DataFrame(data)
Explanation
np.random.seed(0): Ensures that the random numbers generated are the same each time you run the code.
np.random.randint(18, 70, size=100): Generates 100 random integers from 18 to 69 for the 'Age' column. The upper bound (70) is excluded.
np.random.randint(30000, 120000, size=100): Generates 100 random integers from 30000 to 119999 for the 'Salary' column. Again, the upper bound is excluded.
np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100): Randomly assigns one of the four departments to each entry.
pd.DataFrame(data): Creates a DataFrame from the dictionary data.
Output
The output is a DataFrame with 100 rows and three columns: Age, Salary, and Department.
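It can be worth checking the generated data directly rather than taking its shape on trust. The sketch below rebuilds the same dataset and inspects its shape, column types, and value ranges. It relies on np.random.randint excluding its upper bound, so Age should never reach 70 and Salary should never reach 120000.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

print(df.shape)   # (100, 3): 100 rows, 3 columns
print(df.dtypes)  # integer columns for Age and Salary, object/string for Department

# randint's upper bound is exclusive, so these ranges stay below 70 and 120000.
print(df['Age'].min(), df['Age'].max())
print(df['Salary'].min(), df['Salary'].max())
```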
Step 4: Exploring the Data with Pandas
Before we start visualising, let's explore our data.
print(df.head())
print(df.describe())
print(df['Department'].value_counts())
Explanation
df.head(): Displays the first five rows of the DataFrame.
df.describe(): Provides a summary of statistics for the numerical columns.
df['Department'].value_counts(): Counts the occurrences of each department.
Output
The output will show:
The first five rows of the DataFrame to give you an idea of what the data looks like.
Statistical summary including count, mean, std, min, 25%, 50%, 75%, and max for Age and Salary.
The count of entries in each department.
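A natural next step in exploring the data is to summarise one column per category. The sketch below is an optional extra rather than part of the original walkthrough: it uses groupby to compute the number of employees and the mean and median salary for each department. The variable name summary is just an illustrative choice.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# Group rows by department, then aggregate the Salary column three ways.
summary = (
    df.groupby('Department')['Salary']
      .agg(['count', 'mean', 'median'])
      .sort_values('mean', ascending=False)  # highest average salary first
)

print(summary.round(0))
```

Because the data is random, any differences between departments here are noise, but the same pattern works unchanged on a real dataset.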
Step 5: Visualising Data with Matplotlib
We'll start with a simple scatter plot to see the relationship between Age and Salary.
plt.figure(figsize=(10, 6))
plt.scatter(df['Age'], df['Salary'], alpha=0.5)
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.show()
Explanation
plt.figure(figsize=(10, 6)): Sets the size of the figure.
plt.scatter(df['Age'], df['Salary'], alpha=0.5): Creates a scatter plot with Age on the x-axis and Salary on the y-axis. The alpha parameter sets the transparency of the points.
plt.title('Age vs Salary'): Sets the title of the plot.
plt.xlabel('Age') and plt.ylabel('Salary'): Label the x-axis and y-axis.
plt.show(): Displays the plot.
Output
The output will be a scatter plot showing the relationship between Age and Salary. Each point represents an individual in the dataset.
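If you would rather save the plot to a file than open a window, Matplotlib's object-oriented interface is one way to do it. The sketch below draws the same scatter plot on an explicit Figure and Axes and writes it to a PNG. The file name age_vs_salary.png and the Agg backend line are assumptions made for this example, not part of the original code.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show() instead
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# Create the figure and axes explicitly, then draw on the axes object.
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(df['Age'], df['Salary'], alpha=0.5)
ax.set_title('Age vs Salary')
ax.set_xlabel('Age')
ax.set_ylabel('Salary')

# Hypothetical output file name; change it to suit your project.
output_path = Path('age_vs_salary.png')
fig.savefig(output_path, dpi=150, bbox_inches='tight')
plt.close(fig)  # free the figure's memory once it is saved
```

Working with fig and ax directly becomes more useful once a figure contains several subplots, because each call then says exactly which axes it applies to.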
Step 6: Enhancing Visualisations with Seaborn
Seaborn makes it easy to create attractive and informative statistical graphics. Let's use it to create a more detailed plot.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Age', y='Salary', hue='Department', data=df, palette='viridis')
plt.title('Age vs Salary by Department')
plt.show()
Explanation
sns.scatterplot(x='Age', y='Salary', hue='Department', data=df, palette='viridis'): Creates a scatter plot with Age on the x-axis and Salary on the y-axis. The hue parameter colors the points based on the Department, and palette='viridis' sets the color scheme.
plt.title('Age vs Salary by Department'): Sets the title of the plot.
plt.show(): Displays the plot.
Output
The output will be a scatter plot similar to the previous one but with points colored according to their department, making it easier to see any trends or patterns within departments.
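Patterns that look visible in a colored scatter plot can be checked numerically. As a rough sketch, the code below computes the Pearson correlation between Age and Salary separately for each department. The name per_dept_corr is illustrative, and because the data is random the values should all be close to zero.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# Loop over the department groups and correlate Age with Salary inside each one.
per_dept_corr = pd.Series({
    dept: group['Age'].corr(group['Salary'])
    for dept, group in df.groupby('Department')
})

print(per_dept_corr.round(2))
```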
Step 7: Creating a Box Plot
Box plots are great for visualising the distribution of data across different categories.
plt.figure(figsize=(10, 6))
sns.boxplot(x='Department', y='Salary', data=df, palette='Set2')
plt.title('Salary Distribution by Department')
plt.show()
Explanation
sns.boxplot(x='Department', y='Salary', data=df, palette='Set2'): Creates a box plot with Department on the x-axis and Salary on the y-axis. The palette='Set2' argument sets the color scheme.
plt.title('Salary Distribution by Department'): Sets the title of the plot.
plt.show(): Displays the plot.
Output
The output will be a box plot showing the distribution of salaries for each department. The box plot displays the median, quartiles, and any outliers in the data.
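The numbers behind a box plot can be reproduced with pandas. The sketch below computes the quartiles for each department, along with the interquartile range (IQR) and the 1.5 × IQR fences. By default, seaborn draws the whiskers out to the most extreme points inside those fences and marks anything beyond them as an outlier. The column names Q1, Median, Q3, IQR, LowerFence, and UpperFence are labels chosen for this example.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

# One row per department, one column per quantile (0.25, 0.5, 0.75).
quartiles = df.groupby('Department')['Salary'].quantile([0.25, 0.5, 0.75]).unstack()
quartiles.columns = ['Q1', 'Median', 'Q3']

# The box spans Q1 to Q3; points outside the fences count as outliers.
quartiles['IQR'] = quartiles['Q3'] - quartiles['Q1']
quartiles['LowerFence'] = quartiles['Q1'] - 1.5 * quartiles['IQR']
quartiles['UpperFence'] = quartiles['Q3'] + 1.5 * quartiles['IQR']

print(quartiles.round(0))
```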
Step 8: Creating a Heatmap
Heatmaps are useful for visualising correlations between different numerical features.
plt.figure(figsize=(10, 6))
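numeric_data = df.select_dtypes(include='number')
corr = numeric_data.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Heatmap')
plt.show()
Explanation
numeric_data = df.select_dtypes(include='number'): Keeps only the numeric columns (Age and Salary), so the text column Department is left out of the correlation calculation. select_dtypes is the public pandas method for this.
numeric_data.corr(): Calculates the correlation matrix of the numeric columns.
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1): Creates a heatmap of the correlation matrix. The annot=True parameter annotates the heatmap with the correlation values, and cmap='coolwarm' sets the color scheme. Setting vmin=-1 and vmax=1 pins the color scale to the full range of possible correlations, so that blue always means negative and red always means positive.
plt.title('Correlation Heatmap'): Sets the title of the plot.
plt.show(): Displays the plot.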
Output
The output will be a heatmap showing the correlations between Age and Salary. The color intensity indicates the strength of the correlation, making it easy to identify strong correlations.
The value at the intersection of "Age" and "Salary" is -0.1, indicating a very weak negative correlation. This suggests that there is almost no linear relationship between Age and Salary in this dataset.
Color Coding
The colors in the heatmap range from blue to red.
Red indicates a positive correlation (closer to 1).
Blue indicates a negative correlation (closer to -1).
The intensity of the color reflects the strength of the correlation. A deeper color means a stronger correlation.
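To see where the number in the heatmap comes from, the Pearson correlation can be computed by hand. The sketch below calculates it from the definition with NumPy and compares the result with np.corrcoef and the pandas corr method. The variable names are illustrative, and all three values should agree up to floating-point rounding.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # same seed as in the tutorial
df = pd.DataFrame({
    'Age': np.random.randint(18, 70, size=100),
    'Salary': np.random.randint(30000, 120000, size=100),
    'Department': np.random.choice(['HR', 'Tech', 'Marketing', 'Sales'], size=100),
})

age = df['Age'].to_numpy(dtype=float)
salary = df['Salary'].to_numpy(dtype=float)

# Pearson r: sum of products of deviations, divided by the product of their magnitudes.
age_dev = age - age.mean()
salary_dev = salary - salary.mean()
r_manual = (age_dev * salary_dev).sum() / np.sqrt((age_dev ** 2).sum() * (salary_dev ** 2).sum())

# The same quantity from NumPy and pandas.
r_numpy = np.corrcoef(age, salary)[0, 1]
r_pandas = df['Age'].corr(df['Salary'])

print(round(r_manual, 3), round(r_numpy, 3), round(r_pandas, 3))
```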
Summary
Congratulations! You've just completed a fun and educational Python project. You learned how to:
Generate and manipulate data using NumPy and pandas.
Create basic plots with matplotlib.
Enhance and style your visualisations with seaborn.
By mastering these libraries, you're well on your way to becoming proficient in data analysis and visualisation. Keep experimenting with different datasets and visualisations to deepen your understanding.
Keep exploring, keep coding, and enjoy your journey into data analytics with Python!
Happy coding!🚀