Why Polars is a Game-Changer for Data Analysts

A Beginner’s Guide

Unleash the Power of Polars: The Secret Weapon for Data Scientists

 

Welcome back to another exciting edition of the CodeCraft newsletter! Today, we're diving into a powerful and lightning-fast DataFrame library in Python—Polars. If you've been using pandas, you'll find Polars to be a breath of fresh air with its speed and efficiency, especially for large datasets. Let’s get started!

Sponsored
simple.ai - The Agent AI newsletterJoin 200,000+ others and learn how to use Agent AI to grow your career or business.

What is Polars?

 

Polars is an in-memory DataFrame library for Rust and Python. It provides fast and efficient manipulation of structured data, similar to pandas but with several performance enhancements. Polars is designed to handle large datasets that might slow down pandas.

 

Why Polars?

 

  1. Speed: Polars is designed to be fast, leveraging Rust's performance for heavy data processing tasks. Rust is one of the fastest growing programming languages in the world.

  2. Memory Efficiency: Polars uses the Apache Arrow memory format, which is optimised for columnar data.

  3. Parallel Execution: Polars can run operations in parallel, making it faster for large datasets.

 

Getting Started with Polars

 

Before we dive into the code, you’ll need to install Polars. You can do this via pip:

 

pip install polars

 

Real-Life Example: Analyzing Sales Data

 

Let's walk through a real-life example of how Polars can make your data analysis faster and more efficient. Imagine you have a large CSV file with sales data from an e-commerce store, and you want to analyse the monthly sales performance.

 

Structure of sales_data.csv

 

The sales_data.csv file contains sales transaction data from an e-commerce store. Here are the columns in this file:

 

  1. transaction_id: A unique identifier for each transaction.

  2. date: The date of the transaction (in YYYY-MM-DD format).

  3. customer_id: A unique identifier for each customer.

  4. product_id: A unique identifier for each product.

  5. quantity: The quantity of the product sold in the transaction.

  6. price: The price of a single unit of the product.

  7. sales: The total sales amount for the transaction (calculated as quantity * price).

 

Sample Data

 

Here’s an example of how the data might look:

(To create the file, you can copy and paste the content into Notepad and save as sales_data.csv)

 

transaction_id,date,customer_id,product_id,quantity,price,sales

1,2024-01-15,101,5001,2,19.99,39.98

2,2024-01-17,102,5002,1,49.99,49.99

3,2024-02-05,103,5001,3,19.99,59.97

4,2024-02-18,101,5003,1,29.99,29.99

5,2024-03-03,104,5002,2,49.99,99.98

 

Explanation of Sample Data

 

  • transaction_id: 1, 2, 3, 4, 5

    Unique identifiers for each transaction.

     

  • date: 2024-01-15, 2024-01-17, 2024-02-05, 2024-02-18, 2024-03-03

    The dates on which these transactions occurred.

     

  • customer_id: 101, 102, 103, 101, 104

    Unique identifiers for customers. Note that customer 101 made two purchases.

     

  • product_id: 5001, 5002, 5001, 5003, 5002

    Unique identifiers for products. Note that product 5001 was purchased twice by different customers.

     

  • quantity: 2, 1, 3, 1, 2

    The quantity of each product sold in the transactions.

     

  • price: 19.99, 49.99, 19.99, 29.99, 49.99

    The price per unit of the product.

     

  • sales: 39.98, 49.99, 59.97, 29.99, 99.98

    The total sales amount for each transaction, calculated as quantity * price.

 

Using this structure and sample data, you can start experimenting with Polars to perform various data analyses.

 

Loading Data

 

First, let’s load the CSV file into a Polars DataFrame:

 

import polars as pl

 

# Load the CSV file

df = pl.read_csv("sales_data.csv")

print(df)

Output

Data Exploration

 

Now, let’s explore our data. We’ll look at the first few rows to get an idea of what our data looks like:

 

print(df.head(5))

 

Output

Basic Analysis

 

Let’s say we want to calculate the total sales for each month. Here’s how you can do it in Polars:

 

import polars as pl

# Load the CSV file

df = pl.read_csv("sales_data.csv")

# Convert the date column to a proper date type

df = df.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

# Extract the month from the date

df = df.with_columns(pl.col("date").dt.month().alias("month"))

# Group by month and calculate total sales

monthly_sales = df.group_by("month").agg(pl.col("sales").sum().alias("total_sales"))

print(monthly_sales)

 

Output

Visualisation

 

To make our analysis more interesting, let's visualise the monthly sales using matplotlib:

 

import polars as pl

import pandas as pd

import matplotlib.pyplot as plt

# Load the CSV file

df = pl.read_csv("sales_data.csv")

# Convert the date column to a proper date type

df = df.with_columns(pl.col("date").str.strptime(pl.Date, "%Y-%m-%d"))

# Extract the month from the date

df = df.with_columns(pl.col("date").dt.month().alias("month"))

# Group by month and calculate total sales

monthly_sales = df.group_by("month").agg(pl.col("sales").sum().alias("total_sales"))

# Convert to pandas for plotting using a dictionary

monthly_sales_dict = monthly_sales.to_dict(as_series=False)

monthly_sales_pd = pd.DataFrame(monthly_sales_dict)

# Sort the DataFrame by the "month" column to ensure consistent plotting order

monthly_sales_pd = monthly_sales_pd.sort_values(by="month")

# Convert the month column to integer

monthly_sales_pd["month"] = monthly_sales_pd["month"].astype(int)

# Plot the data

plt.figure(figsize=(10, 6))

plt.plot(monthly_sales_pd["month"], monthly_sales_pd["total_sales"], marker='o')

plt.title("Monthly Sales Performance")

plt.xlabel("Month")

plt.ylabel("Total Sales")

plt.grid(True)

plt.show()

 

Output

Make sure to run the pip install pyarrow command in your environment to ensure pyarrow is installed.

Polars vs. Pandas

 

To give you a sense of how Polars stands out, let’s compare the performance of Polars with pandas on the same task:

 

import pandas as pd

import time

 

# Load the data using pandas

df_pandas = pd.read_csv("sales_data.csv")

 

# Convert the date column to datetime

df_pandas['date'] = pd.to_datetime(df_pandas['date'])

 

# Extract the month

df_pandas['month'] = df_pandas['date'].dt.month

 

# Group by month and calculate total sales

start_time = time.time()

monthly_sales_pandas = df_pandas.groupby('month')['sales'].sum().reset_index()

end_time = time.time()

 

print(f"Pandas Execution Time: {end_time - start_time:.4f} seconds")

Output

Using pandas: 0.0011 seconds

Compare this with the Polars execution time and you'll notice a significant speedup, especially with larger datasets.

Using polars: 0.0003 seconds

Fun with Polars

 

To keep things fun, let's create a small challenge. Using the same dataset, try to calculate the average sales per customer per month. Here’s a hint: you’ll need to use the .mean() function and group by both month and customer ID.

 

Conclusion

Polars is an excellent tool for anyone looking to speed up their data analysis tasks in Python. Its performance, combined with an easy-to-use syntax, makes it a great alternative to pandas, especially for large datasets.

 

Sponsored
simple.ai - The Agent AI newsletterJoin 200,000+ others and learn how to use Agent AI to grow your career or business.

Ready for More Python Fun? 📬

Subscribe to our newsletter now and get a free Python cheat sheet! 📑 Dive deeper into Python programming with more exciting projects and tutorials designed just for beginners.

Keep exploring, keep coding, and enjoy your journey into data analytics with Python!

Happy coding!🚀📊✨