Intro to Polars: A Pandas Alternative for Efficiently Working with Large Datasets

Prema Roman | Tuesday, Aug 22, 2023 |  Python Data

Dataframes are a powerful data structure for data processing, analytics, and ML. For many years, Pandas has been my go-to. But it can really slow you down when you’re working with big or high-dimensional data. Enter Polars!

Pandas Was a Game-Changer

To those of us whose early data science workflows consisted of writing SQL queries and then dumping the results into Excel for analysis, Pandas was a game-changer.

Pandas offered a one-stop shop that let us work with tabular data and write SQL-like functions. Still need to work with databases? Pandas also has very easy-to-use functions to read from and write to databases. I was recently reminded of how good the library is at doing all the work for you when I struggled to do the same with Redshift’s COPY command. For those who have had to do this, you know that if you are not sufficiently precise in defining your parameters and column names, you are in for a lot of headaches. Imagine you had to write to a table with a very long list of columns. Good luck with that!

Searching for a Drop-In Replacement for Pandas

But Pandas and other libraries like scikit-learn were born in an era when datasets were much smaller than they are now. Anyone who uses these tools with large datasets inevitably runs into performance issues. But before you ask your DevOps team for more compute resources, there may be some things you can do to ease the performance bottleneck.

An early example was Spark. I have used Spark and it does a really good job with big data, but it has a steep learning curve: not only do you have to learn a new API, you also have to learn the mechanics of distributed computing. The biggest gotcha that comes with Spark is lazy evaluation. You think everything is going great until you call collect() and then… everything slows to a crawl. By this point, you’ve probably written several transformations on the data and you have to figure out where you messed up. The bottleneck can have any number of causes, from a huge number of nulls in one column of one table causing a problem in a join, to a computation that’s happening on a driver node that’s too small to handle the large volume of data.

There have been many alternatives introduced over the years, from Koalas to Dask dataframes, but none ever quite clicked for my workflow. Luckily, there are new tools coming out all the time!

Testing Out Polars for the First Time

Recently, when I ran into the familiar Pandas performance problems, I decided to try out Polars. I had heard that Polars had an API that is just as accessible as Pandas, but more efficient. I was not disappointed!

Polars uses the Apache Arrow columnar memory format under the hood, which can make aggregations and analytics much faster than in Pandas, with a much smaller memory footprint. It also has built-in optimizations that allow it to make use of all the available CPU cores on your behalf.

I was very impressed by the developer experience of using Polars. The creators put in a lot of work to keep the API as similar to Pandas as possible. Some of the other syntax that I came across was also familiar to me because it is similar to Spark. So it seems that Polars is inspired by both Pandas and Spark.

Here are some ways Polars is similar to Pandas. Note that the syntax, options, and parameters may be slightly different depending on your version.

# Pandas                                    # Polars
import pandas as pd                         import polars as pl

# read csv file
df = pd.read_csv("file.csv")                df = pl.read_csv("file.csv")

# get the shape of a dataframe
df.shape                                    df.shape

# drop columns
df.drop(columns=["col1", "col2"])           df.drop(["col1", "col2"])


# write to a csv file
df.to_csv("file.csv")                       df.write_csv("file.csv")

To be sure, there are ways in which the API is different from Pandas. For those who have used Spark, some of these examples will look familiar:

# Pandas                                    # Polars

# add a new column with a constant value
df["new_col"] = "pandas"                    df.with_columns(pl.lit("polars")
                                              .alias("new_col"))

# filtering rows
df.loc[df.flower.isin(["rose", "tulip"])]   df.filter(pl.col("flower")
                                              .is_in(["rose", "tulip"]))

# select subset of columns
df[["first_name", "last_name"]]             df.select(["first_name", "last_name"])

# add a column based on a condition
df["adult"] = df["age"].apply(lambda x:     df.with_columns(
     False if x < 18 else True)                 pl.when(pl.col("age") < 18)
                                                  .then(pl.lit(False))
                                                  .otherwise(pl.lit(True))
                                                  .alias("adult"))

Key API Differences between Pandas and Polars

There are a couple of things to watch out for as you work with Polars.

No Mixed Types

Unlike Pandas, Polars is strict with data types — it does not allow for mixed data types in a single column.

Imagine you have a column called numbers with the following values: ["1", "2", "1A"]

Polars will throw an error if you attempt to cast these to int values by doing something like this:

df.with_columns(pl.col("numbers").cast(pl.Int8, strict=True))

If you set strict to False, Polars will not raise an error. Instead, it replaces the value 1A with null and keeps the row.

None != NaN

Another difference is how Polars handles null values.

Pandas treats None and NaN the same when it checks for missing data, while Polars treats None as a null value but does not treat NaN as null. Technically speaking, NaN (not a number) is a floating-point value that Pandas has traditionally used to mark missing data, while a null in Polars marks a missing value of any data type. If you want to read a much more detailed explanation of this behavior, I strongly recommend this excellent post by Yuki Kakegawa.

Lazy Loading that’s Easier to Reason About

Polars also offers a Spark-like Lazy API for even better performance. Here is an example from the docs. Unlike read_csv, the scan_csv command loads the file lazily instead of reading the entire file into memory at once:

q = (
    pl.scan_csv("iris.csv")
    .filter(pl.col("sepal_length") > 5)
    .group_by("species")  # named groupby in older Polars releases
    .agg(pl.all().sum())
)

df = q.collect()

In the above example, Polars applies predicate pushdown by filtering down to only the records where the sepal_length is greater than 5. It applies projection pushdown by only selecting the columns that are needed. As a result, only a subset of the data is loaded into memory.

Complex SQL Queries

As most ML engineers know, converting queries into pure SQL is one of the best ways to speed up a slow data science product.

Polars has a SQLContext for those who prefer running SQL queries for complex operations. The following example shows how to load a lazy dataframe and run queries against it:

lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", None, "z"]})
res = pl.SQLContext(frame=lf).execute(
    "SELECT b, a*2 AS two_a FROM frame WHERE b IS NOT NULL"
)

Conclusion: Use Pandas and Polars Together!

It’s usually not necessary to completely remove Pandas in order to make your data processing and modeling pipelines work. You can just swap out the parts that are slow and leave the rest of the code in Pandas. It is easy to switch back and forth between Polars and Pandas. You can convert a Pandas dataframe into a Polars dataframe as follows: df_pl = pl.from_pandas(df_pd). Similarly, you can convert a Polars dataframe into a Pandas dataframe: df_pd = df_pl.to_pandas(). This facilitates refactoring existing code and gives you more flexibility, because you can continue to use third-party libraries that only support Pandas.

As you can see, Polars is a very easy-to-use library that offers a lot of powerful features. If you are someone who “thinks in dataframes”, Polars is a great resource to help you build more efficient data science and machine learning models. Check out the docs for more details, features, and code snippets!


Photo by Hans-Jurgen Mager on Unsplash
