- CodeCraft by Dr. Christine Lee
See the Story Behind the Numbers
COVID-19 Data Analytics Made Easy
Panda performing data analytics on Covid-19
Exploring Data Analytics with COVID-19 Data
Welcome to another exciting edition of our newsletter! In this post, we’re diving into the world of data analytics using a real-world dataset: COVID-19 data. This project will help you understand how to manipulate, analyse, and visualise data to extract meaningful insights. By the end of this tutorial, you'll have a strong grasp of the fundamental concepts of data analytics, making your learning journey both interesting and motivating.
What You'll Learn:
Loading and Exploring Data: How to load a dataset and understand its structure.
Data Cleaning: How to handle missing values and prepare the data for analysis.
Data Analysis: How to perform basic analysis to extract meaningful insights.
Data Visualisation: How to visualise data using various types of plots.
Let’s get started!
Step 1: Set Up Your Environment
Before we dive into the data, make sure you have the necessary tools installed. We’ll use Python along with the following libraries:
pandas
numpy
matplotlib
seaborn
You can install these libraries using pip:
pip install pandas numpy matplotlib seaborn
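Before moving on, it can help to confirm that the installation worked. Here is a minimal sketch that uses only the standard library; the helper name missing_packages is our own invention for this post, not part of any library.

```python
import importlib.util

# The four libraries this tutorial relies on
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn"]


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

If anything is reported as missing, rerun the pip command above in the same environment you use to run Python.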
Step 2: Load the Dataset
We'll use a sample COVID-19 dataset. You can download it from Johns Hopkins University's GitHub repository.
import pandas as pd
# Load the dataset
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
data = pd.read_csv(url)
# Display the first few rows
print(data.head())
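Downloading the file every time you rerun the script can be slow. One possible approach, sketched below, is to keep a local copy after the first download. The helper name load_with_cache and the cache filename are assumptions made for this example, not something the dataset or pandas provides.

```python
from pathlib import Path

import pandas as pd


def load_with_cache(source, cache_path):
    """Read a CSV from `source` once, then reuse the local copy at `cache_path`."""
    cache = Path(cache_path)
    if cache.exists():
        # A local copy already exists, so skip the download
        return pd.read_csv(cache)
    df = pd.read_csv(source)
    df.to_csv(cache, index=False)  # save a copy for next time
    return df


# Example usage (hypothetical filename; downloads once, then reads the local copy):
# data = load_with_cache(url, "confirmed_global.csv")
```

Delete the cached file whenever you want to fetch a fresh copy.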
About the COVID-19 Dataset
The COVID-19 dataset used in this tutorial is sourced from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). This comprehensive dataset tracks the global spread of the COVID-19 virus, providing detailed information on confirmed cases, deaths, and recoveries across different countries and regions.
Key Features
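Date Range: The dataset includes daily records from 22 January 2020 until March 2023, when JHU CSSE stopped updating the repository, allowing for detailed time series analysis.
Geographical Coverage: Data is available for countries worldwide, with additional state/province granularity for some countries such as Canada, China, and Australia. In the global file used here, the United States appears as a single row; US state- and county-level figures are published in a separate file in the same repository.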
Metrics Tracked:
Confirmed Cases: The total number of confirmed COVID-19 cases.
Deaths: The total number of deaths attributed to COVID-19.
Recovered: The total number of patients who have recovered from COVID-19.
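Data Format: The dataset is structured in a time series format, with columns representing different dates and rows representing different geographical regions. The file loaded in Step 2 holds confirmed cases only; deaths and recoveries are published as separate files in the same repository.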
This dataset has been widely used by researchers, analysts, and public health officials to monitor the spread of the virus, analyse trends, and inform policy decisions. Its detailed daily records make it an invaluable resource for anyone looking to understand the impact of COVID-19 on a global scale.
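To make the time series layout concrete, here is a tiny hand-built table in the same shape as the JHU file. The numbers and coordinates are made up for illustration; only the column layout mirrors the real data.

```python
import pandas as pd

# Made-up values: one row per region, one column per date (as in the JHU file)
sample = pd.DataFrame({
    "Province/State": [None, "Ontario", "Quebec"],
    "Country/Region": ["Italy", "Canada", "Canada"],
    "Lat": [41.9, 51.3, 52.9],
    "Long": [12.6, -85.3, -73.5],
    "1/22/20": [0, 0, 0],
    "1/23/20": [0, 1, 0],
    "1/24/20": [2, 3, 1],
})

# The first four columns identify the region; every other column is a date
id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
date_columns = [col for col in sample.columns if col not in id_columns]

print(sample)
print("Date columns:", date_columns)
```

Notice that adding a new day to this layout means adding a new column, which is why the table gets very wide.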
Step 3: Explore the Data
Understanding the structure and content of your dataset is crucial. Let’s take a closer look at the data.
# Display basic information about the dataset
data.info()  # info() prints its summary itself and returns None, so no print() is needed
# Show basic statistics
print(data.describe())
# Display the column names
print(data.columns)
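A couple of extra checks can tell you more about how the rows are organised. The sketch below runs on a small made-up table with the same layout, so the values are illustrative only; you can apply the same calls to data.

```python
import pandas as pd

# Made-up sample with the same layout as the real file
sample = pd.DataFrame({
    "Province/State": [None, "Ontario", "Quebec"],
    "Country/Region": ["Italy", "Canada", "Canada"],
    "1/22/20": [0, 0, 0],
    "1/23/20": [2, 3, 1],
})

# How many rows (regions) and columns are there?
n_rows, n_cols = sample.shape

# How many distinct countries, and which ones are split across several rows?
n_countries = sample["Country/Region"].nunique()
rows_per_country = sample["Country/Region"].value_counts()
split_countries = rows_per_country[rows_per_country > 1].index.tolist()

print(f"{n_rows} rows, {n_cols} columns, {n_countries} countries")
print("Countries with more than one row:", split_countries)
```

Knowing which countries span several rows matters later: their rows have to be summed to get a country total.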
Step 4: Data Cleaning
Real-world data often requires cleaning. We’ll handle missing values and transform the data into a more usable format.
# Check for missing values
print(data.isnull().sum())
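# Province/State is empty for most countries, and a few rows may lack Lat/Long.
# We keep these columns because the melt step below uses them as identifiers.
# (Dropping them with data.dropna(axis=1, inplace=True) would break that step.)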
# Melt the dataset to have a better structure for analysis
data_melted = data.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"], var_name="Date", value_name="Confirmed")
data_melted["Date"] = pd.to_datetime(data_melted["Date"], format='%m/%d/%y')
# Display the first few rows of the cleaned data
print(data_melted.head())
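If you would rather fill the gaps than leave them, one option is sketched below on made-up data. The placeholder label "All" is our own choice for this example, not a convention of the dataset.

```python
import pandas as pd

# Made-up sample: Italy has no province, as is common in the real file
sample = pd.DataFrame({
    "Province/State": [None, "Ontario", "Quebec"],
    "Country/Region": ["Italy", "Canada", "Canada"],
    "1/22/20": [0, 0, 0],
    "1/23/20": [2, 3, 1],
})

# Show only the columns that actually contain missing values
missing_per_column = sample.isnull().sum()
only_missing = missing_per_column[missing_per_column > 0]
print(only_missing)

# Work on a copy so the original table stays untouched
cleaned = sample.copy()
cleaned["Province/State"] = cleaned["Province/State"].fillna("All")
print(cleaned)
```

Filling with a label keeps every row, which is safer here than dropping rows or columns.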
Understanding data.melt
The data.melt function converts a wide-format DataFrame into a long-format DataFrame. In wide format, each measurement has its own column (here, one column per date). In long format, the data is stacked so that each row is a single observation (here, one region on one date). This is particularly useful for time series data or data that needs to be grouped and aggregated.
Syntax
data.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)
Parameters
id_vars: The columns to keep as identifier variables. These columns are not unpivoted; they stay as they are.
value_vars: The columns to unpivot, that is, to convert from wide to long format. If not specified, every column not listed in id_vars is used.
var_name: The name of the new variable column. If not specified, the default name is variable.
value_name: The name of the new value column. The default name is value.
col_level: If the columns are multi-indexed, this specifies which level to melt.
ignore_index: If set to True, the original index is ignored. If set to False, the original index is retained.
Melt the COVID-19 dataset:
id_vars=["Province/State", "Country/Region", "Lat", "Long"]: We keep these columns as they are because they are identifiers for each record.
var_name="Date": We specify that the new variable column should be named Date.
value_name="Confirmed": We specify that the new value column should be named Confirmed.
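To see the effect of melt at a glance, here is a small worked example. The table is made up (two regions, two dates), but the call uses the same arguments as the step above.

```python
import pandas as pd

# Made-up wide table: 2 regions x 2 date columns
wide = pd.DataFrame({
    "Province/State": [None, "Ontario"],
    "Country/Region": ["Italy", "Canada"],
    "Lat": [41.9, 51.3],
    "Long": [12.6, -85.3],
    "1/22/20": [0, 0],
    "1/23/20": [2, 3],
})

# Unpivot the date columns into (Date, Confirmed) pairs
long = wide.melt(
    id_vars=["Province/State", "Country/Region", "Lat", "Long"],
    var_name="Date",
    value_name="Confirmed",
)
long["Date"] = pd.to_datetime(long["Date"], format="%m/%d/%y")

# Each region now appears once per date: 2 regions x 2 dates = 4 rows
print(long)
print(len(long), "rows")
```

A quick sanity check after melting is that the row count equals the number of regions multiplied by the number of date columns.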
Step 5: Data Analysis
Now that our data is clean, we can start analysing it. Let’s look at the global spread of COVID-19 over time.
# Group by Date to see the global confirmed cases over time
global_cases = data_melted.groupby("Date")["Confirmed"].sum().reset_index()
# Display the first few rows
print(global_cases.head())
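Grouping is easier to trust when you can check the sums by hand. The sketch below uses a made-up long-format table small enough to verify mentally, and also shows one way to rank countries on the most recent date. The variable names are our own.

```python
import pandas as pd

# Made-up long-format data: 3 regions x 3 dates
long = pd.DataFrame({
    "Country/Region": ["Italy", "Canada", "Canada"] * 3,
    "Date": pd.to_datetime(
        ["2020-01-22"] * 3 + ["2020-01-23"] * 3 + ["2020-01-24"] * 3
    ),
    "Confirmed": [0, 0, 0, 0, 1, 0, 2, 3, 1],
})

# Total across all regions for each date
sample_global = long.groupby("Date")["Confirmed"].sum().reset_index()
print(sample_global)

# The counts are cumulative, so the latest date holds each country's running total
latest = long[long["Date"] == long["Date"].max()]
by_country = (
    latest.groupby("Country/Region")["Confirmed"].sum().sort_values(ascending=False)
)
print(by_country)
```

On the real data, by_country.head(10) would list the ten countries with the most confirmed cases on the final date.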
Step 6: Data Visualisation
Visualising data helps in understanding trends and patterns. We’ll create some basic plots to visualise the spread of COVID-19.
import matplotlib.pyplot as plt
import seaborn as sns
# Apply seaborn's default styling to the matplotlib plots below
sns.set_theme()
# Plot the global confirmed cases over time
plt.figure(figsize=(10, 6))
plt.plot(global_cases["Date"], global_cases["Confirmed"], marker='o', linestyle='-')
plt.title('Global COVID-19 Confirmed Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.grid(True)
plt.show()
Output (figure): Global COVID-19 Confirmed Cases Over Time
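With totals in the millions, matplotlib may label the y-axis in scientific notation, which is hard to read. Here is a minimal sketch of one way to tidy the axes. The function name plot_cumulative and the demo numbers are assumptions made for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter


def plot_cumulative(cases, title):
    """Line plot of a DataFrame that has 'Date' and 'Confirmed' columns."""
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(cases["Date"], cases["Confirmed"])
    # Show 1,250,000 instead of 1.25e6 on the y-axis
    ax.yaxis.set_major_formatter(FuncFormatter(lambda value, _pos: f"{value:,.0f}"))
    # Tilt the date labels so they do not overlap
    fig.autofmt_xdate()
    ax.set_title(title)
    ax.set_xlabel("Date")
    ax.set_ylabel("Confirmed Cases")
    ax.grid(True)
    return fig, ax


# Made-up demo data with large values
demo = pd.DataFrame({
    "Date": pd.date_range("2020-01-22", periods=5, freq="D"),
    "Confirmed": [0, 1500, 42000, 1250000, 3400000],
})
fig, ax = plot_cumulative(demo, "Demo: cumulative confirmed cases")
```

To use it on the real data, pass global_cases from Step 5 and then call plt.show() to display the figure.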
Additional Analysis: Country-Specific Trends
We can also analyse the data for specific countries. Let’s look at the trend for the United States, which appears as "US" in the Country/Region column.
# Filter data for the United States
us_data = data_melted[data_melted["Country/Region"] == "US"]
# Group by Date to see the confirmed cases over time for the US
us_cases = us_data.groupby("Date")["Confirmed"].sum().reset_index()
# Plot the confirmed cases over time for the US
plt.figure(figsize=(10, 6))
plt.plot(us_cases["Date"], us_cases["Confirmed"], marker='o', linestyle='-', color='red')
plt.title('COVID-19 Confirmed Cases Over Time in the US')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.grid(True)
plt.show()
Output (figure): Confirmed cases over time for the US
Conclusion
Congratulations! You've just completed a basic data analytics project using COVID-19 data. Here’s what we covered:
Loading and Exploring Data: Understanding the dataset’s structure and basic statistics.
Data Cleaning: Handling missing values and transforming the data for analysis.
Data Analysis: Performing basic grouping and summing operations to extract insights.
Data Visualisation: Creating plots to visualise trends and patterns in the data.
By working through these steps, you’ve gained valuable skills in data manipulation, analysis, and visualisation. Keep exploring different datasets and applying these techniques to uncover more insights. Data analytics is a powerful tool, and you’re well on your way to mastering it!
Keep exploring, keep coding, 👩💻👨💻 and enjoy your journey into data analytics with Python!
Stay tuned for our next exciting project in the following edition!
Happy coding!🚀📊✨