7/12/2022

Pandas vs Polar - A look at performance

#code#data-pipelines#rust

Intro

Pandas is the most famous Python library for working with structured data. It is used anywhere from one off analyses by scientists to full blown production data pipelines to automate data cleaning, manipulation and analyzing. Its popularity can be attributed to its ease of use and the fact that it is built on top of Python, one of the most popular programming languages with a relatively low learning curve.

With Pandas being so popular, it's not surprising that alternatives are popping up all over the place. One of the more interesting ones is Polars. It aims to be a familiar library to anyone who has experience with Pandas, but offers a better performance due to being built on top of Rust and Apache Arrow2. This also means it is compatible with other tools in the Arrow ecosystem, potentially interesting for enterprise users. It's available as both a library in Python and in Rust.

The goal of this blog is to look at some real world use cases to see which library performs better. The Pandas version used was 1.4.2 with Python version 3.9.12. For Polars the Rust version has been chosen instead of the Python version to get the most performance. Specifically version 0.22.8 and Rustc version 1.62.0. After all, when deciding what to use in your data pipelines or big data analysis, performance is what you want.

Head to head

The data is set used during the tests is a modified version of the New York City Taxi Trip Duration set from Kaggle which originally contains 1.462.644 rows. In order to simulate a large data set it is duplicated 25 times so that it consists of 36.566.100 rows. This final file is about 5gb in size.

Hardware

The tests were run on a 2015 model MacBook Pro with the following relevant specs:

  • 3,1 GHz Dual-Core Intel Core i7
  • 8 GB 1867 MHz DDR3

All test times are an average of 3 consecutive runs.

Use cases

The following tests were run to mimic real life use cases:

  1. Opening the file and show the shape of the DataFrame
  2. Opening the file and show the first 5 rows
  3. Opening the file and get the length of all strings in the id column
  4. Opening the file and apply a function to the trip_duration column to divide the number by 60 to go from the second value to a minute value
  5. Opening the file and filtering out all rows with a trip duration shorter than 500 seconds
  6. Opening the file, filtering out all the rows with a Y value in the store_and_fwd_flag column, group by ID and calculate the mean duration time

Timing code

The code used to time the individual use cases looks like this:

def main():
   # Perform one of the above described use cases
   perform_use_case()
 
 
if __name__ == "__main__":
   # Start the time
   start_time = timer()
  
   # Call main function to perform use case
   main()
  
   # End the timer
   end_time = timer()
  
   # Print the elapsed time in seconds
   print(f"The program took {end_time - start_time} seconds to run")
 
fn main() {
   // Start the timer
   let start_time: Instant = Instant::now();
  
   // Perform one of the above described use cases
   perform_use_case();
  
   // End the timer
   let elpased_time: Duration = start_time.elapsed();
  
   // Print of the file time in seconds
   println!("The program took {:?} seconds to run", elpased_time)
}
 

1. Opening the file and show the shape of the DataFrame

The first use case is a very simple one. Say you receive a CSV file either from a colleague or as output from another process and you want to verify it is about as large as you expect it to be. One easy way of doing this is to check the shape of the DataFrame (A DataFrame is what Pandas and Polars use to represent the data. Think of it as a spreadsheet). This will return the number of columns and number of rows.

In Pandas the code looks like this:

# Utility function to read the CSV file and convert it to a DataFrame
def generate_df():
   df = pd.read_csv(test_file)
   return df
 
def print_df_shape():
   df = generate_df()
   print(df.shape)

As you can see this is quite easy. When reading a CSV file Pandas converts it to a DataFrame and provides shape as an attribute that can print. We get the following result:

(36466100, 11)

About 36 million rows and 11 columns, just as expected!

Now, the Polars version of the code looks like this:

fn generate_lf() -> LazyFrame {
   LazyCsvReader::new(TEST_FILE.into())
       .has_header(true)
       .finish()
       .expect("Error opening file")
}
 
fn print_df_shape() -> () {
   let df = generate_lf().collect().expect("error");
   println!("{:?}", df.shape())
}

As you can see the generate_lf() utility function looks a bit different. This is because Polars not only support DataFrame, but also LazyFrames. These are essentially the same thing as DataFrames except that they wait to execute all the functions you called on them to the last possible second. This way the LazyFrame knows exactly what you want to do to it and finds the most optimal way of doing this.

Other than that, Polars also has a built-in shape() function that can be called on a DataFrame. The results are:

(36466100, 11)

Great, same result from both contenders! Now let's look at the interesting part, performance:

TaskPython PandasRust PolarsDifference
Opening the file and show the shape of the DataFrame274,75 Seconds38,97 Seconds235,78 seconds

Wow! We expected Polars to be faster but more than 7x faster is quite a lot. I suspect that reading the file itself is the bottleneck here though. That being said, it's not something we can just ignore. The data needs to come from somewhere.

2. Opening the file and show the first 5 rows

The second use case is also a simple one and continues from where we left off. You have now verified the DataFrame contains the correct number of rows and columns and you want to actually see the data that you are dealing with. However, knowing that we have more than 36 million rows, you only want to see the first 5 rows. Or in other words: you want to print the head of the DataFrame.

Here is how we would do this in Pandas:

def print_df_head():
   df = generate_df()
   print(df.head())

Which will print the following results:

idvendor_idpickup_datetime...trip_duration
id287542122016-03-14 17:24:55...455
id237739412016-06-12 00:43:35...663
id385852922016-01-19 11:35:24...2124
id350467322016-04-06 19:32:31...429
id218102822016-03-26 13:30:55...435

Looks good. Now for Polars, the code looks like:

fn print_df_head() -> () {
   let df = generate_lf().collect().expect("error");
   println!("{}", df.head(Some(5)))
}

Which in turn returns the following results:

id
---
str
vendor_id
---
i64
pickup_datetime
---
str
...trip_duration
---
i64
id287542122016-03-14 17:24:55...455
id237739412016-06-12 00:43:35...663
id385852922016-01-19 11:35:24...2124
id350467322016-04-06 19:32:31...429
id218102822016-03-26 13:30:55...435

Same table as returned by Pandas, good! But if you look closely you'll see a small difference. Since we are using Rust for our Polars examples, typing matters. Rust is a statically typed language which means that for example if a function expects a string type and you pass it a number, Rust will give you an error while compiling. The code won't even run. Python on the other hand is dynamically typed and will error out during runtime or in some cases not error out but produce unexpected results.

Both have their pros and cons and are not the focus of this blog. This is the reason the headers in our Polars example indicate the type of the values in the column.

Now, back to where we came for. The performance:

TaskPython PandasRust PolarsDifference
Opening the file and show the first 5 rows250,52 Seconds39,98 Seconds210,54 seconds

Pandas was a little bit faster than when getting the shape of the DataFrame and Polars scores about the same. Polars wins again.

3. Opening the file and get the lenght of all strings in the "id" column

Now that we have explored the shape and the first few rows, let's do some actual data manipulation. In this use case we are creating a new column which lists the lengths of all the values in the id column.

Here is how we would do this in Pandas:

def print_length_string_in_column():
   df = generate_df()
   df["vendor_id_length"] = df["id"].str.len()
   print(df.head())

The output looks like:

idvendor_idpickup_datetime...vendor_id_length
id287542122016-03-14 17:24:55...9
id237739412016-06-12 00:43:35...9
id385852922016-01-19 11:35:24...9
id350467322016-04-06 19:32:31...9
id218102822016-03-26 13:30:55...9

As you can see a new column has been added at the end: vendor_id_length which shows that at least for the first 5 rows we printed the IDs are all 9 characters long.

In Polars the code is in this case a little more complicated than Pandas:

fn print_length_strings_in_column() -> () {
   let lf = generate_lf();
   let df = lf
       .with_column(col("id").apply(
           |value| Ok(value.utf8()?
               .str_lengths()
               .into_series()),
               GetOutput::from_type(DataType::Int32),)
               .alias("vendor_id_lengths"),
       )
       .collect()
       .unwrap();
   println!("{:?}", df.head(Some(5)))
}

While it looks more complicated, most of it has to do with how Rust handles ownership of variables in order to prevent memory errors.

Now, if we look at the output here we will see the following:

id
---
str
vendor_id
---
i64
pickup_datetime
---
str
...vendor_id_length
---
u32
id287542122016-03-14 17:24:55...9
id237739412016-06-12 00:43:35...9
id385852922016-01-19 11:35:24...9
id350467322016-04-06 19:32:31...9
id218102822016-03-26 13:30:55...9

Polars also added the same column to the end of the DataFrame and as expected added the correct type, u32, to the column header.

Now for performance:

TaskPython PandasRust PolarsDifference
Opening the file and get the lenght of all strings in the "id" column262,33 Seconds39,75 Seconds222,58 seconds

Again what we expect to see, Pandas takes a lot longer than Polars but this might mostly be attributed to opening the CSV file.

Let's take opening the CSV file out of the equation for this and the following use cases since we are now actually manipulating data and no longer doing simple inspections.

TaskPython PandasRust PolarsDifference
Show the shape of the DataFrame16,83 Seconds1,77 Seconds14,61 seconds

There we go! Opening the file was indeed a bottleneck for both contenders. Now we get a better idea of how long it takes to actually perform the data manipulation part of the use case. As we can see, Polars still blows Pandas out of the water with a 9x speed-up.

4. Opening the file and apply a function to the "trip_duration" to devide the number by 60 to go from the second value to a minute value

Alright, next use case. One of the columns lists the trip duration of the taxi rides in seconds. Let's divide these numbers by 60 so that we can create a new column with the time in minutes.

In Pandas this is how you would do it:

def convert_trip_duration_to_minutes():
 
df = generate_df()
df["trip_duration_minutes"] = df["trip_duration"].apply(
   lambda duration_seconds: duration_seconds / 60
)
print(df.head())

The printed results indeed show a newly added column with the time in minutes as expected.

idvendor_id...trip_durationtrip_duration_minutes
id28754212...4557.583333
id23773941...66311.050000
id38585292...212435.400000
id35046732...4297.150000
id21810282...4357.250000

Now, for Polars:

fn convert_trip_duration_to_minutes() -> () {
let lf = generate_lf();
let df = lf
   .with_column(
       col("trip_duration").apply(|t| {
           Ok(
           (t.cast(&DataType::Float64).unwrap().f64().unwrap() / 60 as f64)
               .into_series(),
           )
}, GetOutput::from_type(DataType::Float64),)
   .alias("trip_duration_minutes"),)
   .collect()
   .unwrap();
 
println!("{:?}", df.head(Some(5)))
}

Again, the code looks a bit more complicated than the Pandas example but it is mostly due to how Rust works. The output is exactly what we expected though:

id
---
str
vendor_id
---
i64
...trip_duration
---
i64
trip_duration_minutes
---
f64
id28754212...4557.583333
id23773941...66311.05
id38585292...212435.4
id35046732...4297.15
id21810282...4357.25

Now for the performance. Let's start with the tests including opening the file:

TaskPython PandasRust PolarsDifference
Opening the file and apply a function to generate a minute duration column280,09 Seconds38,6 Seconds241,49 seconds

And the timing of just the data manipulation without counting opening the file:

TaskPython PandasRust PolarsDifference
Apply a function to generate a minute duration column14,07 Seconds1,12 Seconds12,95 seconds

As expected, Polars beats Pandas in this test too.

5. Opening the file and filtering out all rows with a trip duration shorther than 500 seconds

Okay, next up we are going to look at getting a subset of data from our complete dataset. This is a very common use case and something that might happen multiple times within data pipelines in order to accommodate different goals.

In Pandas the code looks quite simple:

def filter_out_trip_duration_500_seconds():
   df = generate_df()
   filtered_df = df[df["trip_duration"] >= 500]
   print(filtered_df.head())

And as we can see in the trip_duration column, all of the values are above 500.

idvendor_idpickup_datetime...trip_duration
id237739412016-06-12 00:43:35...663
id385852922016-01-19 11:35:24...2124
id132460322016-05-21 07:54:58...1551
id001289122016-03-10 21:45:01...1225
id143637122016-05-10 22:08:41...1274

This time the Polars code is also quite simple:

fn filter_out_trip_duration_500_seconds() -> () {
   let count: i64 = 500;
   let lf = generate_lf();
   let df = lf
       .filter(col("trip_duration").gt_eq(lit(count)))
       .collect()
       .unwrap();
   println!("{:?}", df.head(Some(5)));
}

The trip_duration columns shows here the same expected results:

id
---
str
vendor_id
---
i64
pickup_datetime
---
str
...trip_duration
---
i64
id237739412016-06-12 00:43:35...663
id385852922016-01-19 11:35:24...2124
id132460322016-05-21 07:54:58...1551
id001289122016-03-10 21:45:01...1225
id143637122016-05-10 22:08:411274

Now for the performance including opening the file:

TaskPython PandasRust PolarsDifference
Opening the file and filtering out all rows with a trip duration shorter than 500 seconds314,89 Seconds27,34 Seconds287,55 seconds

And without opening the file:

TaskPython PandasRust PolarsDifference
Filtering out all rows with a trip duration shorter than 500 seconds51,34 Seconds15,01 Seconds36,33 seconds

Here we see quite a large difference between Polars and Pandas when including the time to open the file. The Pandas average lies higher than the previous use cases and the Polars one lower. Why this is exactly I'm not quite sure but I suspect it has to do with the way the respective libraries handle scanning all the rows in a DataFrame.

6. Opening the file, filtering out all the rows with a "Y" store_and_fwd_flag value, group by ID and calculate the mean duration time

The final use case is the most complex one within the context of what we have been looking at. We are filtering out a subset of the data, group on id column and finally calculate the mean taxi trip duration time.

As all other examples, Pandas makes this pretty easy:

def filter_group_and_mean():
   df = generate_df()
   df = df[df["store_and_fwd_flag"] != "Y"]
   df_mean = df.groupby(["id"])["trip_duration"].mean()
   print(df_mean.head())

The output also look as expected:

idtrip_duration
id00000011105.0
id00000031046.0
id0000005368.0
id0000008303.0
id0000009547.0

Now Polars also makes this pretty straightforward:

fn filter_group_and_mean() -> () {
   let lf = generate_lf();
   let df = lf
       .filter(col("store_and_fwd_flag").eq(lit("N")))
       .groupby([col("id")])
       .agg([col("trip_duration").mean()])
       .sort("id", Default::default())
       .collect()
       .unwrap();
   println!("{:?}", df.head(Some(5)))
}

And returns the same results:

id
---
str
trip_duration
---
f64
id00000011105.0
id00000031046.0
id0000005368.0
id0000008303.0
id0000009547.0

The performance for this use cases is inline with what we have seen. Timing including opening the file:

TaskPython PandasRust PolarsDifference
Opening the file, filtering out all the rows with a "Y" store_and_fwd_flag value, group by ID and calculate the mean duration time330,75 Seconds18,30 Seconds287,55 seconds

And without opening the file:

TaskPython PandasRust PolarsDifference
Filtering out all the rows with a "Y" store_and_fwd_flag value, group by ID and calculate the mean duration time88,40 Seconds7,22 Seconds287,55 seconds

So, even the most complex use case we've looked at is no match for Polars. As a matter of fact, it is one of the faster ones if we look at the time it takes including opening the file.

Conclusion

We've looked at six different use cases varying in complexity. Each one made it clear that Polars is a lot more performant than Pandas. But this is of course not the whole story.

Code complexity

While working with Polars is pretty similar to working with Pandas, there is a lot of functionality that works the same way, the code was almost never just as easy as in Pandas. This of course has to do with the fact that we used Rust in the examples which simply requires more code to correctly handle things like variable ownership.

This does come with the added bonus of making your code more robust: Most errors are caught during build time, not runtime when it might already be in production.

So, familiarity with the library should be a consideration before picking either.

Pandas speed up

As some readers might already have noticed, the Pandas examples mostly use built in functions that are focused on ease of use and not performance. There are a number of ways to speed up Pandas to get closer (or maybe even surpass?) Polars. You can parallelize query execution with Dask, use the built in to_numpy() function to convert a DataFrame to an int64 NumPy object and even use Cython directly.

While there are plenty of options out there, the reality is that often when something has been built with the standard functions, there is not much incentive to look back.

This is what makes Polars nice, you can use the familiar functions but get the performance from the start without resorting to tricks.

What should I choose?

It depends.

While Polars is more performant, picking one is not that simple. If you process a large volume of data on a regular basis, for example sensor data, continuous research data or financial data, then it might be worth looking into Polars to increase the effectiveness of your data pipelines. Especially in an enterprise setting the runtime of a program can literally translate back to money saved. Usually due to fewer cloud resources needed.

On the other hand, if you need a one off script to analyze a dataset so you can continue with other tasks, learning how to work with Polars might not be worth it. Especially in the scientific community where a lot of analyses are done only once or twice Pandas is a great tool.

There is also a compromise: the earlier mentioned Python version of the Polars library. In an upcoming blog post we will look at this in depth.

So, whatever you choose I would recommend taking a look at Polars anyway. It's an interesting tool and when you get the hang of Rust a joy to work with!