See the Story Behind the Numbers

COVID-19 Data Analytics Made Easy

[Image: Panda performing data analytics on Covid-19]

Exploring Data Analytics with COVID-19 Data

 

Welcome to another edition of our newsletter! In this post, we’re diving into data analytics using a real-world dataset: COVID-19 case counts. This project will help you understand how to manipulate, analyse, and visualise data to extract meaningful insights. By the end of this tutorial, you'll have a working grasp of the fundamentals: loading a dataset, cleaning it, aggregating it, and plotting the results.

 

What You'll Learn:

  1. Loading and Exploring Data: How to load a dataset and understand its structure.

  2. Data Cleaning: How to handle missing values and prepare the data for analysis.

  3. Data Analysis: How to perform basic analysis to extract meaningful insights.

  4. Data Visualisation: How to visualise data using various types of plots.

 

Let’s get started!

 

Step 1: Set Up Your Environment

 

Before we dive into the data, make sure you have the necessary tools installed. We’ll use Python along with the following libraries:

  • pandas

  • numpy

  • matplotlib

  • seaborn

 

You can install these libraries using pip:

pip install pandas numpy matplotlib seaborn
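To confirm the installation worked, a quick check along these lines can help. This is a minimal sketch that relies only on the standard library's importlib.metadata; the helper name installed_versions is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["pandas", "numpy", "matplotlib", "seaborn"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED needs another run of the pip command above.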

 

Step 2: Load the Dataset

 

We'll use the global confirmed-cases time series from Johns Hopkins University's GitHub repository. pandas can read the CSV file directly from its URL, so there is nothing to download by hand.

 

import pandas as pd

 

# Load the dataset

url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"

data = pd.read_csv(url)

 

# Display the first few rows

print(data.head())
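Reading straight from a URL fails if the network is down or the file layout changes, so it can be worth wrapping the load in a small check. The sketch below is one possible approach, not part of the original tutorial: load_confirmed and EXPECTED_ID_COLUMNS are hypothetical names, and the sample CSV text contains invented numbers so the example runs offline.

```python
import io

import pandas as pd

# Identifier columns the rest of this tutorial relies on
EXPECTED_ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def load_confirmed(source):
    """Read the confirmed-cases CSV from a URL, file path, or file-like object."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_ID_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Unexpected file layout, missing columns: {missing}")
    return frame


# Invented numbers in the same layout as the real file
sample_csv = io.StringIO(
    "Province/State,Country/Region,Lat,Long,1/22/20,1/23/20\n"
    ",Exampleland,10.0,20.0,0,3\n"
    "North,Samplestan,30.0,40.0,1,5\n"
)
sample = load_confirmed(sample_csv)
print(sample)
```

The same function accepts the url variable from the step above, since pd.read_csv takes URLs, paths, and file-like objects alike.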

 

About the COVID-19 Dataset

The COVID-19 dataset used in this tutorial is sourced from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The repository tracks the global spread of COVID-19 with separate time series files for confirmed cases, deaths, and recoveries across countries and regions. The file loaded above contains confirmed cases only.

 

Key Features

  • Date Range: The dataset includes daily records from 22 January 2020 until 10 March 2023, when JHU CSSE stopped updating it, allowing for detailed time series analysis.

  • Geographical Coverage: Data is available for countries worldwide, with additional province/state rows for some countries such as Canada, China, and Australia. In the global file used here, the United States appears as a single country-level row.

  • Metrics Tracked:

    •  Confirmed Cases: The total number of confirmed COVID-19 cases.

    •  Deaths: The total number of deaths attributed to COVID-19.

    •  Recovered: The total number of patients who have recovered from COVID-19.

  • Data Format: The dataset is structured in a time series format, with columns representing different dates and rows representing different geographical regions.

This dataset has been widely used by researchers, analysts, and public health officials to monitor the spread of the virus, analyse trends, and inform policy decisions. It is no longer updated, but its detail makes it a valuable resource for anyone looking to understand the impact of COVID-19 on a global scale.
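To make the wide time series layout concrete, here is a small sketch with invented numbers. Only the column layout mirrors the real file; the country names and counts are placeholders chosen for illustration.

```python
import pandas as pd

# Invented numbers; only the column layout mirrors the real file
wide = pd.DataFrame(
    {
        "Province/State": [None, "North", "South"],
        "Country/Region": ["Exampleland", "Samplestan", "Samplestan"],
        "Lat": [10.0, 30.0, 31.0],
        "Long": [20.0, 40.0, 41.0],
        "1/22/20": [0, 1, 0],
        "1/23/20": [3, 5, 2],
        "1/24/20": [7, 9, 2],
    }
)

id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
date_columns = [col for col in wide.columns if col not in id_columns]

print("Regions (rows):", len(wide))
print("Dates (columns):", date_columns)

# Counts are cumulative, so the last date column holds each region's latest total
latest_totals = wide.groupby("Country/Region")[date_columns[-1]].sum()
print(latest_totals)
```

Because the counts are cumulative, a single date column already gives a snapshot of totals on that day, with no need to add columns together.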


Step 3: Explore the Data

 

Understanding the structure and content of your dataset is crucial. Let’s take a closer look at the data.

 

# Display basic information about the dataset (info() prints its report itself and returns None)

data.info()

 

# Show basic statistics

print(data.describe())

 

# Display the column names

print(data.columns)
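Beyond the generic summaries, a few targeted questions help with this particular layout: how many countries are there, and which ones are split across several rows? The sketch below uses a small stand-in frame with invented values (named sample, not the real data), so the numbers are illustrative only.

```python
import pandas as pd

# Stand-in for the real DataFrame; values are invented
sample = pd.DataFrame(
    {
        "Province/State": [None, "North", "South"],
        "Country/Region": ["Exampleland", "Samplestan", "Samplestan"],
        "Lat": [10.0, 30.0, 31.0],
        "Long": [20.0, 40.0, 41.0],
        "1/22/20": [0, 1, 0],
        "1/23/20": [3, 5, 2],
    }
)

n_countries = sample["Country/Region"].nunique()
rows_with_province = sample["Province/State"].notna().sum()

# Countries that appear on more than one row are reported per province/state
rows_per_country = sample["Country/Region"].value_counts()
multi_row = rows_per_country[rows_per_country > 1]

print("Countries:", n_countries)
print("Rows with a province/state:", rows_with_province)
print(multi_row)
```

Running the same three expressions on the real data shows which countries need to be summed across rows before comparing them with others.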


Step 4: Data Cleaning

 

Real-world data often requires cleaning. We’ll handle missing values and transform the data into a more usable format.

 

# Check for missing values

print(data.isnull().sum())

 

# Province/State is empty for most rows, so dropping columns with missing values (data.dropna(axis=1)) would remove it and break the melt below; we keep the column as-is

 

# Melt the dataset to have a better structure for analysis

data_melted = data.melt(id_vars=["Province/State", "Country/Region", "Lat", "Long"], var_name="Date", value_name="Confirmed")

data_melted["Date"] = pd.to_datetime(data_melted["Date"], format='%m/%d/%y')

 

# Display the first few rows of the cleaned data

print(data_melted.head())
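If the missing values need handling rather than just counting, filling is often safer than dropping. The following is a minimal sketch on invented data; the placeholder label "All" and the choice to drop rows without coordinates are assumptions for illustration, not steps from the original tutorial.

```python
import pandas as pd

# Invented rows: one country-level row and one row without coordinates
sample = pd.DataFrame(
    {
        "Province/State": [None, "North", "Unknown"],
        "Country/Region": ["Exampleland", "Samplestan", "Samplestan"],
        "Lat": [10.0, 30.0, None],
        "Long": [20.0, 40.0, None],
        "1/22/20": [0, 1, 0],
        "1/23/20": [3, 5, 2],
    }
)

missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Fill the gaps in Province/State with a placeholder label (the label is arbitrary)
cleaned = sample.copy()
cleaned["Province/State"] = cleaned["Province/State"].fillna("All")

# Drop rows without coordinates only if a later step actually needs Lat/Long
cleaned = cleaned.dropna(subset=["Lat", "Long"])
print(cleaned)
```

Filling matters mainly when grouping by Province/State later, because groupby skips rows whose key is missing by default.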

Understanding data.melt

The melt method (DataFrame.melt) converts a wide-format DataFrame into a long-format DataFrame. In wide format, each measurement has its own column; in our dataset, every date is a separate column. In long format, the data is stacked so that each row is a single observation: one region on one date. This is particularly useful for time series data or data that needs to be grouped and aggregated.

Syntax

data.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)

Parameters

  1. id_vars:

    • This parameter specifies the columns that you want to keep as identifier variables.

    • These columns will not be unpivoted (i.e., they will remain the same).

  2. value_vars:

    • This parameter specifies the columns that you want to unpivot.

    • These columns will be converted from wide to long format. If not specified, all columns not listed in id_vars are used.

  3. var_name:

    • This parameter sets the name of the new variable column.

    • If not specified, the default name will be variable.

  4. value_name:

    • This parameter sets the name of the new value column.

    • The default name is value.

  5. col_level:

    • If columns are multi-indexed, this parameter can specify which level to melt.

  6. ignore_index:

    • If set to True (the default), the original index is replaced with a new integer index.

    • If set to False, the original index will be retained.
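The effect of value_vars and ignore_index is easiest to see on a toy frame. The column names and index labels below are invented for this sketch.

```python
import pandas as pd

toy = pd.DataFrame(
    {"region": ["A", "B"], "day1": [1, 2], "day2": [3, 4], "day3": [5, 6]},
    index=["row_a", "row_b"],
)

# Only day1 and day2 are unpivoted; day3 is left out of the result entirely
subset = toy.melt(id_vars="region", value_vars=["day1", "day2"])
print(subset)

# ignore_index=False keeps the original labels, repeated once per melted column
kept = toy.melt(id_vars="region", value_vars=["day1", "day2"], ignore_index=False)
print(kept)
```

Since var_name and value_name are not given here, the new columns get the default names variable and value.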

Applied to the COVID-19 dataset:

  • id_vars=["Province/State", "Country/Region", "Lat", "Long"]: We keep these columns as they are because they are identifiers for each record.

  • var_name="Date": We specify that the new variable column should be named Date.

  • value_name="Confirmed": We specify that the new value column should be named Confirmed.
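To see what these arguments do to the shape of the data, here is the same melt applied to a two-row stand-in with invented numbers. The variable names wide and long_format are chosen for this sketch.

```python
import pandas as pd

# Two regions, three dates; values are invented
wide = pd.DataFrame(
    {
        "Province/State": [None, "North"],
        "Country/Region": ["Exampleland", "Samplestan"],
        "Lat": [10.0, 30.0],
        "Long": [20.0, 40.0],
        "1/22/20": [0, 1],
        "1/23/20": [3, 5],
        "1/24/20": [7, 9],
    }
)

id_columns = ["Province/State", "Country/Region", "Lat", "Long"]
long_format = wide.melt(id_vars=id_columns, var_name="Date", value_name="Confirmed")
long_format["Date"] = pd.to_datetime(long_format["Date"], format="%m/%d/%y")

print("Wide shape:", wide.shape)         # 2 regions x (4 id columns + 3 dates)
print("Long shape:", long_format.shape)  # 2 regions x 3 dates rows, 6 columns
print(long_format)
```

Each region now contributes one row per date, so the row count is the number of regions multiplied by the number of date columns.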


Step 5: Data Analysis

 

Now that our data is in long format, we can start analysing it. Let’s look at the global spread of COVID-19 over time.

 

# Group by Date to see the global confirmed cases over time

global_cases = data_melted.groupby("Date")["Confirmed"].sum().reset_index()

 

# Display the first few rows

print(global_cases.head())
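Because the confirmed counts are cumulative, one natural follow-up is to derive daily new cases from them. The sketch below assumes a frame shaped like global_cases but filled with invented numbers; the column names New and New_3d_avg and the 3-day window are choices made for this example.

```python
import pandas as pd

# Stand-in for global_cases with invented cumulative totals
global_cases = pd.DataFrame(
    {
        "Date": pd.date_range("2020-01-22", periods=5, freq="D"),
        "Confirmed": [10, 15, 15, 30, 42],
    }
)

# diff() subtracts the previous day's total; the first day has no predecessor,
# so its gap is filled with 0 here
global_cases["New"] = global_cases["Confirmed"].diff().fillna(0).astype(int)

# A short rolling mean smooths day-to-day reporting noise
global_cases["New_3d_avg"] = global_cases["New"].rolling(window=3).mean()

print(global_cases)
```

On the real data a 7-day window is a common choice, since it evens out weekday reporting patterns.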

 


Step 6: Data Visualisation

 

Visualising data helps in understanding trends and patterns. We’ll create some basic plots to visualise the spread of COVID-19.

 

import matplotlib.pyplot as plt

import seaborn as sns

sns.set_theme()  # apply seaborn's default styling to the matplotlib plots below

 

# Plot the global confirmed cases over time

plt.figure(figsize=(10, 6))

plt.plot(global_cases["Date"], global_cases["Confirmed"], marker='o', linestyle='-')

plt.title('Global COVID-19 Confirmed Cases Over Time')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

 

[Figure: Global COVID-19 Confirmed Cases Over Time]
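With several years of daily points, markers tend to merge into a thick band, and early growth is hard to see on a linear axis. The sketch below shows one possible variation using a log scale and matplotlib's concise date formatter. The data is invented (a smooth exponential curve standing in for global_cases), and the output file name is arbitrary.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented exponential-style growth, standing in for global_cases
dates = pd.date_range("2020-01-22", periods=120, freq="D")
confirmed = (100 * np.exp(0.05 * np.arange(120))).astype(int)

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(dates, confirmed, linestyle="-")  # no markers, so the line stays thin
ax.set_yscale("log")  # equal vertical steps now mean equal growth factors

# Let matplotlib pick tick positions and label them compactly
locator = mdates.AutoDateLocator()
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(mdates.ConciseDateFormatter(locator))

ax.set_title("Confirmed Cases Over Time (log scale, sample data)")
ax.set_xlabel("Date")
ax.set_ylabel("Confirmed Cases")
ax.grid(True)
fig.savefig("global_cases_log.png", dpi=150)
```

On a log scale, steady exponential growth appears as a straight line, which makes changes in the growth rate easier to spot.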

Additional Analysis: Country-Specific Trends

 

We can also analyse individual countries. Let’s look at the trend for the United States, which is labelled "US" in this dataset.

 

# Filter data for the United States

us_data = data_melted[data_melted["Country/Region"] == "US"]

 

# Group by Date to see the confirmed cases over time for the US

us_cases = us_data.groupby("Date")["Confirmed"].sum().reset_index()

 

# Plot the confirmed cases over time for the US

plt.figure(figsize=(10, 6))

plt.plot(us_cases["Date"], us_cases["Confirmed"], marker='o', linestyle='-', color='red')

plt.title('COVID-19 Confirmed Cases Over Time in the US')

plt.xlabel('Date')

plt.ylabel('Confirmed Cases')

plt.grid(True)

plt.show()

 

[Figure: Confirmed cases over time for the US]
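The same filter-and-group pattern extends to comparing several countries at once. Here is a minimal sketch on invented long-format data; the country names are placeholders, and by_country is a name chosen for this example.

```python
import pandas as pd

# Invented long-format data shaped like data_melted
long_format = pd.DataFrame(
    {
        "Province/State": [None, None, "North", "North", "South", "South"],
        "Country/Region": ["Exampleland"] * 2 + ["Samplestan"] * 4,
        "Date": pd.to_datetime(["2020-01-22", "2020-01-23"] * 3),
        "Confirmed": [0, 3, 1, 5, 2, 4],
    }
)

countries = ["Exampleland", "Samplestan"]
subset = long_format[long_format["Country/Region"].isin(countries)]

# Sum provinces within each country, then spread countries into columns
by_country = (
    subset.groupby(["Date", "Country/Region"])["Confirmed"]
    .sum()
    .unstack("Country/Region")
)
print(by_country)
```

With one column per country, calling by_country.plot(figsize=(10, 6)) draws one line per country on a shared date axis.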

Conclusion

Congratulations! You've just completed a basic data analytics project using COVID-19 data. Here’s what we covered:

  • Loading and Exploring Data: Understanding the dataset’s structure and basic statistics.

  • Data Cleaning: Handling missing values and transforming the data for analysis.

  • Data Analysis: Performing basic grouping and summing operations to extract insights.

  • Data Visualisation: Creating plots to visualise trends and patterns in the data.

By working through these steps, you’ve gained valuable skills in data manipulation, analysis, and visualisation. Keep exploring different datasets and applying these techniques to uncover more insights. Data analytics is a powerful tool, and you’re well on your way to mastering it!



Keep exploring, keep coding, and enjoy your journey into data analytics with Python! 👩‍💻👨‍💻

Stay tuned for our next exciting project in the following edition!

Happy coding!🚀📊✨