Sunday, February 14, 2021

Python: improving time required to load 1.5G CSV file

In the previous post we discussed how to search for a specific item in a 1.5G CSV file with 132M records, and we improved the search from 73625.309 ms (more than a minute) to just ~0.005 ms - almost 15 million times faster. That is a pretty impressive improvement, as far as I'm concerned.

But there is still one bottleneck that can be improved: the time required for the initial scan of the file. Let's try to improve it in this article.

All source files can be found in JFF-Bohdan/item_lookup

Baseline

As you probably remember (more details can be found in the initial post), we are trying to parse a CSV file with a really simple two-column structure. The columns are the passport series and the passport number from the publicly available list of expired passports from Russia. Feel free to use any other dataset that you may want to use.

During preprocessing of our file we need to clean out invalid entries, following the definition of valid passport numbers.
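Just as a rough illustration of that rule (the real checks live in the repository; the helper name and the exact digit counts below are my assumptions, based on the usual 4-digit series / 6-digit number format of Russian internal passports), a record could be considered valid when both columns are numeric and have the expected length:

def is_valid_record(series: str, number: str) -> bool:
    # Hypothetical validator: both columns must be numeric and have
    # the expected length (4-digit series, 6-digit number).
    return (
        len(series) == 4 and series.isdigit()
        and len(number) == 6 and number.isdigit()
    )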

As usual, you may want to use virtualenv to create a virtual environment and install the required dependencies using:

pip install -r requirements-dev.txt

Please note that instead of pip and python you may need to use pip3 and python3 on Linux.

We will also use nice packages like tqdm and terminaltables, which will make our output a bit prettier.

To run the test, please run (source can be found here):

python baseline_initial_loading_time.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

Assuming that your input file is located at e:/tmp/ru_passports/list_of_expired_passports.csv.

As you can see, it shows pretty nice information about its progress while working (iterations per second and elapsed time), thanks to the tqdm package. You may want to learn a bit more about this package: it can draw a progress bar (in case the number of items to be processed is known) and can also be easily integrated with Jupyter notebooks. Here you can find a pretty good introduction to Jupyter notebooks.
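As a minimal sketch (not the exact code from the repository), wrapping the reading loop with tqdm is enough to get the live counters shown above:

from tqdm import tqdm

lines_count = 0
with open("list_of_expired_passports.csv", encoding="utf-8") as input_file:
    for line in tqdm(input_file, unit="lines"):
        lines_count += 1  # the real per-line processing would go here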

It shows up to 170k records processed per second and takes 790958.699 ms (~13 minutes) to process the entire file. We also get a pretty nice table with execution stats, provided to us by terminaltables:

+-----------------------+--------------+
| Parameter             | Value        |
+-----------------------+--------------+
| Valid entries count   | 132034208    |
| Invalid entries count | 10094        |
| Execution time (ms)   | 790958.699   |
| Execution time (ns)   | 790958698600 |
| Current memory usage  | 29.3 MB      |
+-----------------------+--------------+
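The stats table itself is trivial to produce with terminaltables; here is a minimal sketch using its AsciiTable class (the values are simply copied from the run above):

from terminaltables import AsciiTable

table_data = [
    ["Parameter", "Value"],
    ["Valid entries count", "132034208"],
    ["Invalid entries count", "10094"],
    ["Execution time (ms)", "790958.699"],
]
print(AsciiTable(table_data).table)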

Let's use this result as our baseline and improve it.

Using pipeline with generators and iterators

We spend a lot of time calling procedures and going back and forth in our logic. What if our code used yield and created a generator for each stage of data processing?

For example, we can yield lines when reading from the file and feed that generator to the stage that checks whether our input contains two columns. That filtering stage creates one more generator, yielding only valid records, and we chain it with the next stage, and so on.

So, our data processing pipeline may look like this:

  1. Yield lines from the file;
  2. Check whether each line contains two columns and yield only valid lines;
  3. Check whether both columns represent valid passport attributes (series and number) and yield only valid lines;
  4. Convert both columns into numbers and yield a tuple with two numbers;
  5. Iterate through the items and count the valid ones.

For steps 2-4 we also need to count the invalid lines that violate the validation rules. Will it improve performance? The good news is that it does!
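Here is a minimal sketch of such a pipeline (the real implementation is in pipelined_loading_time.py; the function names, the comma separator and the simplified validity check are my assumptions, and the invalid-line counters are omitted for brevity):

def read_lines(path):
    # Stage 1: yield raw lines from the file
    with open(path, encoding="utf-8") as input_file:
        for line in input_file:
            yield line.rstrip("\n")


def split_rows(lines):
    # Stage 2: yield only lines that split into exactly two columns
    for line in lines:
        columns = line.split(",")
        if len(columns) == 2:
            yield columns


def validate_rows(rows):
    # Stage 3: yield only rows where both columns look like numbers
    for series, number in rows:
        if series.isdigit() and number.isdigit():
            yield series, number


def convert_rows(rows):
    # Stage 4: convert both columns to integers
    for series, number in rows:
        yield int(series), int(number)


def count_items(items):
    # Stage 5: iterate through the chained generators and count items
    return sum(1 for _ in items)


path = "list_of_expired_passports.csv"
valid_count = count_items(convert_rows(validate_rows(split_rows(read_lines(path)))))

Every stage pulls items lazily from the previous one, so the whole chain processes the file line by line and never materializes it in memory.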

To run the example, please use (source can be found here):

python pipelined_loading_time.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

Execution stats:

+---------------------------------+--------------+
| Parameter                       | Value        |
+---------------------------------+--------------+
| Valid entries count             | 132034208    |
| Invalid entries count           | 10094        |
| Execution time (ms)             | 323936.175   |
| Execution time (ns)             | 323936175300 |
| Execution time (human readable) | 5 minutes    |
| Current memory usage            | 29.6 MB      |
+---------------------------------+--------------+

It takes 323936.175 ms (approx 5 minutes) to process the entire file, at about 410k items per second. This gives us a solution that is approx 2.5 times faster, without too much code complication and with no additional memory required.

Can we improve it? Yes we can, a bit :)

Tuned pipeline with generators and iterators

Let's talk about the EAFP (Easier to Ask for Forgiveness than Permission) vs LBYL (Look Before You Leap) approaches in Python. If we follow LBYL, we first of all need to check that the data is valid before doing any computation. With EAFP we just try to do what we want and, in case of a problem, catch the exception. Will it help us improve our data processing pipeline? Yep!
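Here is a tiny, generic illustration of the difference (not the repository's code) when converting a column to a number:

def parse_lbyl(text):
    # LBYL: validate first, then convert
    if text.isdigit():
        return int(text)
    return None


def parse_eafp(text):
    # EAFP: convert right away and handle the failure if it happens
    try:
        return int(text)
    except ValueError:
        return None

For mostly valid data the EAFP version avoids paying for an extra check on every single line, which is exactly our case: only ~10k out of 132M lines are invalid.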

Try to execute (source can be found here):

python tuned_pipelined_loading_time.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

As you can see from the source code, our pipeline has changed a bit and now looks like this:

  1. Yield lines from the file;
  2. Check whether each line contains two columns and yield only valid lines;
  3. Try to convert both columns into numbers and yield a tuple with two numbers. If an exception is caught, we just increment the counter of invalid lines and continue processing;
  4. Iterate through the items and count the valid ones.
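The only stage that really changes is the conversion one; below is a minimal sketch of an EAFP-style version (the names and the shared stats dict are illustrative, the real code is in tuned_pipelined_loading_time.py):

def convert_rows(rows, stats):
    # EAFP: attempt the conversion right away; a row that fails just
    # bumps the invalid counter and is skipped
    for series, number in rows:
        try:
            converted = int(series), int(number)
        except ValueError:
            stats["invalid"] += 1
            continue
        yield converted

Here stats is just a mutable dict like {"invalid": 0} shared with the caller, so the generator can report how many lines were dropped.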

Execution stats:

+---------------------------------+--------------+
| Parameter                       | Value        |
+---------------------------------+--------------+
| Valid entries count             | 132034208    |
| Invalid entries count           | 10094        |
| Execution time (ms)             | 272203.452   |
| Execution time (ns)             | 272203452400 |
| Execution time (human readable) | 4 minutes    |
| Current memory usage            | 29.5 MB      |
+---------------------------------+--------------+

It takes 272203.452 ms (approx 4 minutes) to process the entire file, at approx 485k items per second. This gives us a solution that is approx 1.19 times faster than the previous one and ~2.9 times faster than the baseline, with not too much code complication and no additional memory required.

Summary

As you can see, we can make the solution up to 2.5 times faster (up to 60% improvement over the baseline) just by switching to a pipeline with generators and iterators (instead of calling functions), and up to ~2.9 times faster (up to 65% over the baseline) when the pipeline is combined with the EAFP approach.

I'm not a big fan of EAFP but sometimes it can be useful.

Enjoy!


