Sunday, February 14, 2021

Python: playing with big lists (132M records), checking if item in list

Let's imagine a situation where we need to check whether an item is in a list and our list is pretty big. For example, we may have a file with hundreds of millions of records and we need to develop a solution which can quickly say whether a specific item is in that list or not.

Let's start with a naive implementation and then try to improve it iteratively.

This post was inspired by the article https://habr.com/ru/post/538358/

Let's start with the same file that was proposed in the mentioned article: the open list of blocked/expired passports from Russia. This CSV file has a pretty simple structure with just two columns and approx. 132M lines. File structure:

  • passport series;
  • passport number.

Few lines from target file:

PASSP_SERIES,PASSP_NUMBER
6502,655257
6399,127957
0501,353752
6502,960356
7301,336183
8003,409451

As you can see, the structure is pretty simple and easy to understand. The file can be downloaded here (use the download link in the middle of the page).

Let's implement multiple solutions, going from naive to more and more sophisticated, trying to improve all parts of the system, including:

  • RAM footprint;
  • time complexity;
  • space complexity for the additional data structures on the disk (if required).

In this article we will discuss:

  • Linear search using non-sorted file on disk
  • Usage of SQLite database
  • Usage of LMDB (Lightning Memory-Mapped Database) database
  • Usage of the in-memory builtin Python set data structure
  • Redis in-memory database (with bitfields)
  • Roaring bitmaps

All source code can be found at JFF-Bohdan/item_lookup

Linear search using non-sorted file on disk

Let's start with a very naive implementation that opens the file and performs a linear search, scanning the file sequentially and breaking out of the loop once the entry is found.

The implementation may look like this:

def is_passport_expired(
        input_file: str,
        passport_series: str,
        passport_number: str
) -> bool:
    with open(input_file, "r", encoding="utf-8") as file:
        for index, line in enumerate(file):
            # skip the CSV header line
            if not index:
                continue

            items = [item.strip() for item in line.split(",")]
            if len(items) != 2:
                continue

            # stop scanning as soon as the record is found
            if (items[0] == passport_series) and (items[1] == passport_number):
                return True

    return False

Easy, right? The main problem with this solution is that it's slow, really slow. The time complexity is linear O(N) and the space complexity is O(1). For each request we need to scan the file from the beginning to the end. The implementation can be found in linear_search.py in the JFF-Bohdan/item_lookup repository, where all the other solutions can be found as well.

To execute the provided example, please use:

python linear_search.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

This assumes that the input file is located at e:/tmp/ru_passports/list_of_expired_passports.csv

Please note that on Linux you may need to use python3 instead of python.

Let's see how it performs by running four requests with a passport series and passport number:

  • 8003,409451 - 0.268 ms for lookup, record from the beginning of the file;
  • 0104,167806 - 32636.131 ms (almost half a minute) for the record somewhere in the middle of the file;
  • 6009,699906 - 70153.136 ms (more than a minute) for the record somewhere at the end of the file;
  • 1234,567890 - 73625.309 ms (more than a minute) for the record which is not in the file.

Memory used by the application (CPython 3.9 on Windows): 28.2 MB

The lookup time is really not impressive, especially considering that an NVMe SSD, which is a pretty fast drive, was used.

So, let's try to develop something that may be a bit faster, using the gathered time and memory consumption as a baseline.

Data cleanup

According to the Wiki page describing Russian passports, we should expect the series to be a 4-digit number and the number to be a 6-digit number.

We may find that the provided file contains invalid numbers with a different number of digits or some unexpected symbols.

Let's clean up our input file before moving to the next stage and convert all records to tuples with a pair of numbers (passport series and passport number).

The provided dataset contains ~10K wrong entries. You may find the logic that was used for the cleaning here.
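Conceptually, the cleanup boils down to something like the sketch below (a simplified version; the actual logic in the repository may differ): keep only records where the series is exactly 4 digits and the number is exactly 6 digits, and convert both to integers.

from typing import Iterator, Tuple


def iter_clean_records(input_file: str) -> Iterator[Tuple[int, int]]:
    with open(input_file, "r", encoding="utf-8") as file:
        next(file, None)  # skip the CSV header

        for line in file:
            parts = [part.strip() for part in line.split(",")]
            if len(parts) != 2:
                continue

            series, number = parts
            # the series must be exactly 4 digits, the number exactly 6 digits
            if len(series) != 4 or not series.isdigit():
                continue
            if len(number) != 6 or not number.isdigit():
                continue

            yield int(series), int(number)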

SQLite database

Generating database

Let's try to use a SQLite database to store our dataset. All data will be cleaned during ingestion into the database.

We may want to use a very simple table structure like:

CREATE TABLE IF NOT EXISTS expired_passports
(
    passport_series integer,
    passport_number integer,
    PRIMARY KEY (passport_series, passport_number)
);

As you can see, we will convert both the passport series and the passport number to integers during data ingestion. Also, as we agreed above, we will clean the data of wrong records.
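A rough sketch of batched ingestion may look like this (the helper name and the batch size are illustrative, not necessarily what create_sqlite_database.py does):

import itertools
import sqlite3

BATCH_SIZE = 100_000


def load_into_sqlite(db_path: str, records) -> None:
    connection = sqlite3.connect(db_path)
    try:
        connection.execute(
            "CREATE TABLE IF NOT EXISTS expired_passports ("
            "passport_series integer, passport_number integer, "
            "PRIMARY KEY (passport_series, passport_number))"
        )

        records = iter(records)
        while True:
            batch = list(itertools.islice(records, BATCH_SIZE))
            if not batch:
                break

            # INSERT OR IGNORE silently skips duplicates that would violate the primary key
            connection.executemany(
                "INSERT OR IGNORE INTO expired_passports VALUES (?, ?)",
                batch
            )
            connection.commit()
    finally:
        connection.close()

Here records would be an iterable of cleaned (series, number) tuples, for example produced by the cleanup sketch above.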
 

To load all the data into the database, please use the command:

python create_sqlite_database.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

This assumes that the data file is available at e:/tmp/ru_passports/list_of_expired_passports.csv

The resulting database has a size of approx. 4 GB and is stored in ./db/passports.sqlite. On my machine, database generation took approx. 44 minutes. The resulting dataset contains 132034208 records (as of 2021-02-13).

Querying database

To check how it works, we may try the same inputs that were used for the linear search case:

  • 8003,409451 - 2.477 ms for lookup, record from the beginning of the file;
  • 0104,167806 - 2.456 ms for the record somewhere in the middle of the file;
  • 6009,699906 - 2.809 ms for the record somewhere at the end of the file;
  • 1234,567890 - 2.571 ms for the record which is not in the file.

As expected, there is no correlation between the position in the file and the execution time.

To execute the example (which can be found in sqlite_search.py), please use:

python sqlite_search.py

Please note that on Linux you may need to use python3 instead of python.

Memory used by the application is 28.9 MB. Since we used both the passport series and the passport number as a composite primary key, we can assume that under the hood a B-Tree is used as the index, so the time complexity is O(log N), the additional disk space complexity is linear O(N), and the RAM complexity should be O(1).
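The lookup itself is a point query against the composite primary key; a minimal sketch (the function name is illustrative) may look like:

import sqlite3


def is_passport_expired_sqlite(
        connection: sqlite3.Connection,
        series: int,
        number: int
) -> bool:
    # the composite primary key lets SQLite answer this with an index lookup
    cursor = connection.execute(
        "SELECT 1 FROM expired_passports "
        "WHERE passport_series = ? AND passport_number = ? "
        "LIMIT 1",
        (series, number)
    )
    return cursor.fetchone() is not None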

As you can see, the difference is pretty impressive: 73625 ms for the linear search vs. 2.5 ms when the SQLite database is used. It looks like the SQLite implementation is about 29450 times faster than the linear search.

LMDB (Lightning Memory-Mapped Database) database

Generating database

Let's try to use LMDB to store all our cleaned records. To create the LMDB database, please use the command below:

python create_lmdb_database.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

This will create the database in the file ./db/passports.lmdb

I configured the database to use no more than 6E9 bytes (approx. 6 GB).

On my machine, database generation took approx. 37 minutes. The resulting dataset contains 132034208 records (as of 2021-02-13).
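Since LMDB is a plain key-value store operating on bytes, each record has to be serialized into a key somehow. A sketch of one possible approach (the actual key encoding in create_lmdb_database.py may differ) is to pack the pair into a fixed-size binary key with an empty value:

import struct

import lmdb  # pip install lmdb


def pack_key(series: int, number: int) -> bytes:
    # 2 bytes for the series (values up to 65535)
    # and 4 bytes for the number (values up to ~4.2 billion)
    return struct.pack(">HI", series, number)


def load_into_lmdb(db_path: str, records) -> None:
    env = lmdb.open(db_path, map_size=int(6e9))
    try:
        # a single write transaction keeps the sketch short;
        # in practice you may want to commit in batches
        with env.begin(write=True) as txn:
            for series, number in records:
                txn.put(pack_key(series, number), b"")
    finally:
        env.close()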

Querying database

To check how it works, let's try the inputs used before:

  • 8003,409451 - 0.069 ms;
  • 0104,167806 - 0.049 ms;
  • 6009,699906 - 0.044 ms;
  • 1234,567890 - 0.043 ms for the record which is not in the file.

The same holds for LMDB as for SQLite: there is no correlation between the position in the file and the execution time.

To execute the example (which can be found in lmdb_search.py), please use:

python lmdb_search.py

Please note that on Linux you may need to use python3 instead of python.

Memory used by the application is 28.2 MB.

According to the Wiki, LMDB internally uses a B+ tree, so we should expect O(log N) search time and linear space complexity O(N).
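A minimal lookup sketch, assuming the same fixed-size key encoding as in the generation sketch above:

import struct

import lmdb


def is_passport_expired_lmdb(env: lmdb.Environment, series: int, number: int) -> bool:
    key = struct.pack(">HI", series, number)
    with env.begin() as txn:
        # get() returns None when the key is absent
        return txn.get(key) is not None


# env = lmdb.open("./db/passports.lmdb", readonly=True, lock=False)
# print(is_passport_expired_lmdb(env, 8003, 409451))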

If we assume that the average search time is 0.055 ms for LMDB and 2.5 ms for SQLite, it looks like LMDB is about 45 times faster than SQLite.

Can we do faster? Let's try.

set data structure

What if we just upload all the entries into a single set data structure? We can expect to consume a lot of memory, because initially we had a 1.5 GB input file with 132034208 records. We also improved this a bit by converting the passport series and passport number into numbers. So, how much memory will we consume, also considering that each Tuple(int, int) is an object? Let's see.
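A minimal sketch of the idea (the actual set_search.py may differ in details):

import csv
from typing import Set, Tuple


def build_lookup_set(input_file: str) -> Set[Tuple[int, int]]:
    expired = set()
    with open(input_file, "r", encoding="utf-8", newline="") as file:
        reader = csv.reader(file)
        next(reader, None)  # skip the CSV header

        for row in reader:
            if len(row) != 2:
                continue

            series, number = row[0].strip(), row[1].strip()
            if series.isdigit() and number.isdigit():
                expired.add((int(series), int(number)))

    return expired


# expired = build_lookup_set("e:/tmp/ru_passports/list_of_expired_passports.csv")
# print((8003, 409451) in expired)  # average O(1) membership test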

To run the example (the source can be found in set_search.py), please run:

python set_search.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

It took up to 18 minutes to clean, parse and load the entire dataset into memory, and it was pretty fun to watch how the application consumes more and more memory.

Let's use the same inputs for testing:

  • 8003,409451 - 0.003 ms;
  • 0104,167806 - 0.004 ms;
  • 6009,699906 - 0.006 ms;
  • 1234,567890 - 0.004 ms for the record which is not in the file.

Memory used is 21.4 GB (not a joke). If we assume that 0.005 ms is the average search time (O(1) time complexity), this solution is about 11 times faster than using LMDB. But this solution is definitely not the best in terms of memory footprint.

We can also play with splitting our set into partitions by creating a set for each unique passport series (the provided dataset contains up to 3389 unique series), which will speed up the search in case we don't find the passport series at all; see the sketch below. But this solution will also be really bad in terms of memory consumption. Can we do better? Yes, let's stick with in-memory storage but use something that consumes less memory.
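A sketch of that partitioning idea: one set of passport numbers per passport series, so an unknown series is rejected without touching any of the per-series sets.

from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_partitioned_sets(records: Iterable[Tuple[int, int]]) -> Dict[int, Set[int]]:
    # one set of passport numbers per passport series
    partitions: Dict[int, Set[int]] = defaultdict(set)
    for series, number in records:
        partitions[series].add(number)
    return partitions


def is_expired(partitions: Dict[int, Set[int]], series: int, number: int) -> bool:
    numbers = partitions.get(series)
    return numbers is not None and number in numbers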

Redis in-memory database (with bitfields)

We can use the Redis in-memory database to store our dataset. We could use hashes where K would be the passport series and V would be the passport number. But let's try to consume even less memory by using bitfields. I prefer to use Docker to start Redis. You can use:

docker run --name test-redis -d --rm -p 6379:6379 redis

to start a Redis instance that will operate on port 6379.

To upload all records as bitfields into Redis, you can use:

python create_redis_database.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

We also sped up data uploading a bit by grouping operations into chunks of 250000 commands using pipelines; see the sketch below.
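One possible bit-level layout (a sketch; the key naming and exact commands in create_redis_database.py may differ): a separate key per passport series, with the passport number used as a bit offset, and SETBIT commands grouped into pipelines.

import redis  # pip install redis

CHUNK_SIZE = 250_000


def upload_to_redis(client: redis.Redis, records) -> None:
    pipe = client.pipeline(transaction=False)
    pending = 0

    for series, number in records:
        # one bitmap key per series; the passport number is the bit offset
        pipe.setbit(f"passports:{series:04d}", number, 1)
        pending += 1

        # flush the pipeline once the chunk is full
        if pending >= CHUNK_SIZE:
            pipe.execute()
            pending = 0

    if pending:
        pipe.execute()


def is_passport_expired_redis(client: redis.Redis, series: int, number: int) -> bool:
    return bool(client.getbit(f"passports:{series:04d}", number))

With up to one million bits (roughly 122 KB) per series key and a few thousand series, such a layout lands in the same ballpark as the ~411 MB reported by Redis below.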

Data uploading took up to 37 minutes.

Memory consumption (checked using redis-cli):

127.0.0.1:6379[1]> info memory
# Memory
used_memory:431019888
used_memory_human:411.05M

To try the search, let's start the example from the command line:

python redis_search.py

  • 8003,409451 - 5.314 ms;
  • 0104,167806 - 5.628 ms;
  • 6009,699906 - 5.858 ms;
  • 1234,567890 - 5.607 ms for the record which is not in the file.

These results surprised me a lot, because even SQLite is faster, and LMDB is faster still. Anyway, we should also consider that we are operating not with an HDD but with an SSD.

Can we try something else that is an in-memory solution, fast, and doesn't consume a lot of memory? Yes, we can. We can try roaring bitmaps and the Python library for using them, PyRoaringBitMap.

Roaring bitmaps

You can find more information about the theory here. To launch the example, execute:

python roaring_bitmap_search.py --input-file e:/tmp/ru_passports/list_of_expired_passports.csv

At the beginning of execution, the application loads the entire dataset into memory using the roaring bitmap data structure. The loading time is 14 minutes.
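A minimal sketch using PyRoaringBitMap (pip install pyroaring). A single 32-bit roaring bitmap cannot hold series * 1000000 + number for the largest series values, so this sketch keeps one bitmap of passport numbers per passport series; the encoding used in roaring_bitmap_search.py may differ.

from typing import Dict, Iterable, Tuple

from pyroaring import BitMap


def build_bitmaps(records: Iterable[Tuple[int, int]]) -> Dict[int, BitMap]:
    # one roaring bitmap of passport numbers per passport series
    bitmaps: Dict[int, BitMap] = {}
    for series, number in records:
        bitmaps.setdefault(series, BitMap()).add(number)
    return bitmaps


def is_passport_expired_roaring(bitmaps: Dict[int, BitMap], series: int, number: int) -> bool:
    bitmap = bitmaps.get(series)
    return bitmap is not None and number in bitmap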

Let's check how it works:

  • 8003,409451 - 0.005 ms (5000 ns);
  • 0104,167806 - 0.005 ms (5200 ns);
  • 6009,699906 - 0.007 ms (7000 ns);
  • 1234,567890 - 0.003 ms (3000 ns) for the record which is not in the file.

Memory consumption is 88.2 MB. It gives pretty much the same response time of 0.005 ms that we had with the set data structure (10-11x faster than LMDB) but uses just a fraction of the memory required by the set structure: 88.2 MB instead of 21.4 GB.

What can we improve here? Definitely the loading time.

Summary and conclusion

Value/Solution | Linear search on disk | SQLite | LMDB | set data structure | Redis bitfields | Roaring bitmaps
Pre-computation of additional storage | N/A | 44 minutes | 37 minutes | N/A | 37 minutes | N/A
Additional disk space complexity | N/A | O(N) | O(N) | N/A | N/A | N/A
Application start time | 0 s | 0 s | 0 s | 18 minutes | 0 s | 14 minutes
Lookup time | 73625.309 ms | ~2.5 ms | ~0.055 ms | ~0.005 ms | ~5 ms | ~0.005 ms
Time complexity | O(N) | O(log N) | O(log N) | O(1) | O(1) | O(1)
RAM used | 28.2 MB | 28.9 MB | 28.2 MB | 21.4 GB | 411 MB | 88.2 MB
RAM complexity | O(1) | O(1) | O(1) | O(N) | ? | ?

  1. SQLite can work pretty fast for this case if an SSD is used (demonstrated ~2.5 ms lookup time).
  2. LMDB works really well, with sub-millisecond search (0.055 ms). It should work really well for the case when a precomputed database needs to be used and the application needs to start really quickly.
  3. If you need something really fast, you may try roaring bitmaps (which also come with a really small memory footprint) stored in memory.

What can we improve? We can definitely improve the speed of loading data from disk. We will discuss this topic in the next blog post. Stay tuned.

Enjoy!