Researchers, please replace SQLite with DuckDB now

If you are a researcher with computational workloads, chances are you are using SQLite for some tasks. Please drop it now and switch to DuckDB; it is much faster and easier to use.

You may be a researcher, perhaps in the life or social sciences, and your organization gives you access to a fast machine with many CPUs and plenty of memory, possibly as part of a CPU cluster. You are familiar with Linux and use Python/Pandas and/or R to manipulate and analyze your data. You often use SQLite, for example, to pre-filter large datasets, but also as an option for storing datasets longer term.

Some colleagues and IT folks have told you that SQLite is a toy database engine for testing (which isn't true) and that you should switch to something faster and more mature, such as PostgreSQL or other databases. But you love the fact that SQLite does not require its own server, and that you can simply create a SQL database in your project folder alongside all the other files you are using.

Don't change that approach, even though a modern PostgreSQL on a modern laptop will likely be faster. Instead, you should switch from SQLite to DuckDB because it is much faster and easier to use; these are the main benefits:

  • DuckDB is designed from day one to use all the CPU cores in your machine
  • DuckDB is optimized for complex analytical queries, whereas SQLite and most other SQL databases are optimized more for writing multiple records at the same time. You can read up on the details here.
  • DuckDB supports multiple fast file formats out of the box and can read many data files in parallel.

As a test system we use a middle-of-the-road 32-core server (Intel Gold 6326 CPU @ 2.90GHz) with 768 GB memory and fast local flash storage. The system is connected to a fast shared POSIX file system with more than 4 GB/s read/write throughput, as confirmed by the scratch-dna benchmark.

[dp@node]$ scratch-dna -v -p 4 1024 104857600 3 ./gscratch

Building random DNA sequence of 100.0 MB...
Writing 1024 files with filesizes between 100.0 MB and 300.0 MB...
Files Completed: 21, Data Written: 3.3GiB, Files Remaining: 1005, Cur FPS: 21, Throughput: 3400 MiB/s
Files Completed: 44, Data Written: 7.8GiB, Files Remaining: 982, Cur FPS: 22, Throughput: 4000 MiB/s
Files Completed: 68, Data Written: 12.7GiB, Files Remaining: 958, Cur FPS: 22, Throughput: 4333 MiB/s
Files Completed: 89, Data Written: 16.7GiB, Files Remaining: 937, Cur FPS: 22, Throughput: 4275 MiB/s

and a local flash disk with at least 1.2 GB/s read/write throughput

[dp@node]$ scratch-dna -v -p 4 1024 104857600 3 $TMPDIR

Building random DNA sequence of 100.0 MB...
Writing 1024 files with filesizes between 100.0 MB and 300.0 MB...
Files Completed: 4, Data Written: 1.1GiB, Files Remaining: 1022, Cur FPS: 4, Throughput: 1100 MiB/s
Files Completed: 12, Data Written: 2.6GiB, Files Remaining: 1014, Cur FPS: 6, Throughput: 1350 MiB/s
Files Completed: 19, Data Written: 3.6GiB, Files Remaining: 1007, Cur FPS: 6, Throughput: 1233 MiB/s

Now, let's test SQLite and DuckDB with a moderately sized CSV file: nothing too huge, but big enough to show the difference. If you work in research, you likely have access to a fast POSIX filesystem attached to your Linux machine. This filesystem may contain many millions or even billions of files, and we would like to harvest that metadata to do some analysis of your files, such as sizes, file types, and so on. You can use pwalk to generate a CSV file of your folder metadata and then make sure with 'iconv' that it has the right encoding to be used with a database engine.

pwalk --NoSnap --header "/your/dept-folder" > dept-folder-tmp.csv
iconv -f ISO-8859-1 -t UTF-8 dept-folder-tmp.csv > dept-folder.csv
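If pwalk is not available on your system, a rough Python sketch along the lines below can collect a subset of the same metadata. The column names are chosen to roughly match the pwalk schema shown later and the output path is only an example, so treat this as an illustration rather than a drop-in replacement for pwalk.

import csv
import os

root = "/your/dept-folder"       # folder to scan
out_path = "./db/x-dept.csv"     # example output path, not the pwalk output

with open(out_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["filename", "fileExtension", "UID", "GID",
                     "st_size", "st_atime", "st_mtime", "st_ctime"])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)   # don't follow symlinks
            except OSError:
                continue              # skip files that vanish or are unreadable
            ext = os.path.splitext(name)[1].lstrip(".")
            writer.writerow([path, ext, st.st_uid, st.st_gid, st.st_size,
                             int(st.st_atime), int(st.st_mtime), int(st.st_ctime)])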

Testing SQLite

Let's import this CSV file into SQLite. We'll use standalone SQLite version 3.38 and Python version 3.10, which ships with SQLite version 3.39. In addition, we'll use Pandas as a helper to detect whether the columns of the CSV file contain numbers, text, or dates, a capability SQLite lacks. Pandas operates in memory and can insert its table directly into a SQLite database. However, Pandas does not come with Python and needs to be installed separately.

#!/usr/bin/env python3
import pandas, sqlite3

# File paths
csv_file_path='./db/x-dept.csv'
sqlite_db_path='./db/x-dept.db'

# Read CSV file into dataframe and copy the dataframe to SQLite
df=pandas.read_csv(csv_file_path)

# Connect to SQLite database (it will be created if it doesn’t exist)
conn=sqlite3.connect(sqlite_db_path)

# Insert data from Pandas Data Frame to a new 'meta_table' in sqlite db
df.to_sql('meta_table', conn, index=False, if_exists='replace')

# Commit and close
conn.commit()
conn.close()

Then we execute the script on our shared file system, and it takes more than 43 minutes.

time ./sqlite-import.py

real 43m25.215s
user 28m23.660s
sys 15m2.266s

That is quite a lot… but all the columns appear to have been imported correctly:

sqlite3 ./db/x-dept.db ".schema"

CREATE TABLE IF NOT EXISTS "meta_table" (
"inode" INTEGER,
"parent-inode" INTEGER,
"directory-depth" INTEGER,
"filename" TEXT,
"fileExtension" TEXT,
"UID" INTEGER,
"GID" INTEGER,
"st_size" INTEGER,
"st_dev" INTEGER,
"st_blocks" INTEGER,
"st_nlink" INTEGER,
"st_mode" INTEGER,
"st_atime" INTEGER,
"st_mtime" INTEGER,
"st_ctime" INTEGER,
"pw_fcount" INTEGER,
"pw_dirsum" INTEGER
);

Now, let's run a simple query on our shared POSIX file system to check whether all 283 million records were inserted. We don't need Python for this; it's simple enough that we can just do it in our Bash shell. We just count the number of records in the table:

time sqlite3 ./db/x-dept.db "select count(*) from meta_table;"
283587401
real 20m30.382s
user 0m14.746s
sys 1m41.663s

More than 20 minutes is quite long; could something be wrong with our fast shared filesystem? Let's copy the database to local disk and try again:

time cp ./db/x-dept.db $TMPDIR/

real 1m19.890s
user 0m0.024s
sys 1m16.456s

time sqlite3 $TMPDIR/x-dept.db "select count(*) from meta_table;"
283587401

real 1m3.460s
user 0m6.061s
sys 0m57.392s

Well, since we've copied the file, it may have been cached, but that alone cannot explain the difference. Re-running the query on the shared file system, even when cached, is faster, but it is still slow and takes almost 11 minutes:

time sqlite3 ./db/x-dept.db "select count(*) from meta_table;"
283587401
real 10m44.268s
user 0m12.983s
sys 1m40.303s

Let's return to our local file and run another simple, but slightly more involved, query. We'd like to know the total disk consumption of all files in bytes, so we generate a sum() of the 'st_size' column:

time sqlite3 $TMPDIR/x-dept.db "select sum(st_size) from meta_table;"
591263908311685

real 1m40.343s
user 0m37.363s
sys 1m2.965s

The slightly more complex query takes more than 50% longer than the simple one, which is plausible, and returns about 538 TiB (591263908311685 bytes) of disk consumption.

Testing DuckDB

One of the first things we notice is that DuckDB can use CSV files directly without importing them first. We can use the 'read_csv_auto' function to automatically figure out the data types in each column:

time duckdb -c "select count(*) from read_csv_auto('./db/x-dept.csv');"
100% ▕████████████████████████████████████████████████████████████▏
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 283587401 │
└──────────────┘

real 3m16.446s
user 18m11.720s
sys 2m53.538s

It took 3 minutes to run a SQL query directly on a CSV file, and we can see that the CPU utilization is more than 800%, which means that 8–9 CPU cores were busy. We also see that DuckDB has a nice progress bar.
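The same query can also be run from Python with the duckdb package (pip install duckdb). This is a minimal sketch under the assumption that the CSV sits at the same path, not the exact code used for the measurements above:

import duckdb

csv_path = "./db/x-dept.csv"

# DuckDB scans the CSV in parallel and infers the column types automatically
count = duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{csv_path}')").fetchone()[0]
print(count)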

Now, working with CSV files directly is slow, as queries cannot be optimized well on this data interchange format, but we can counteract that with raw compute power. However, let's explore what other options we have. We could import the CSV file into DuckDB's native format, which is presumably very fast… but DuckDB is quite new, and few other tools can read the DuckDB format, so let's check which other formats are supported. We see Apache Arrow, which is gaining a lot of popularity, but we also see the good old Parquet format, which can be read by many tools out there. Let's try that:

time duckdb -c "COPY (SELECT * FROM
read_csv_auto('./db/x-dept.csv')) TO './db/x-dept.parquet';"

real 5m9.249s
user 31m56.600s
sys 5m43.471s
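For comparison, importing the CSV into DuckDB's native database format, the option mentioned above, would look roughly like this from Python. The database file name 'x-dept.duckdb' and the table name are assumptions for this sketch:

import duckdb

# Persist the CSV into a single-file DuckDB database (file/table names are assumptions)
con = duckdb.connect("./db/x-dept.duckdb")
con.execute("CREATE OR REPLACE TABLE meta_table AS "
            "SELECT * FROM read_csv_auto('./db/x-dept.csv')")
con.close()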

OK, the Parquet conversion took just over 5 minutes, which is not bad. Now let's see what we can do with 'x-dept.parquet' when we read it through our shared POSIX file system. Let's run our previous queries:

time duckdb -c "select count(*) from './db/x-dept.parquet';"

┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 283587401 │
└──────────────┘

real 0m0.641s
user 0m0.578s
sys 0m0.155s

Oops, 0.6 seconds instead of more than 20 minutes? That is roughly 2000 times faster on a shared filesystem and about 100 times faster than SQLite on local flash/SSD! And we didn't even use DuckDB's internal database format yet. Let's verify these results with our second query:

time duckdb -c "select sum(st_size) from './db/x-dept.parquet';"
┌─────────────────┐
│ sum(st_size) │
│ int128 │
├─────────────────┤
│ 591263908311685 │
└─────────────────┘

real 0m1.559s
user 0m6.254s
sys 0m1.877s

Equally impressive. Now, let's try something a bit more complex. We'd like to know the percentage share of each file type, as identified by the file extension, as well as the total disk space each file type consumes. In this case, we'll put the SQL statement in a text file called 'extension-summary.sql':

WITH TotalSize AS (
    SELECT SUM(st_size) AS totalSize
    FROM './db/x-dept.parquet'
)
SELECT
    fileExtension AS fileExt,
    ROUND((SUM(st_size) * 100.0 / totalSize), 1) AS pct,
    ROUND((SUM(st_size) / (1024.0 * 1024.0 * 1024.0)), 3) AS GiB
FROM
    './db/x-dept.parquet',
    TotalSize
GROUP BY
    fileExtension, totalSize
HAVING
    (SUM(st_size) * 100.0 / totalSize) > 0.1
ORDER BY
    pct DESC
LIMIT 6

and then we ask DuckDB to run it against our Parquet file:

time duckdb < extension-summary.sql

┌─────────┬────────┬────────────┐
│ fileExt │ pct │ GiB │
│ varchar │ double │ double │
├─────────┼────────┼────────────┤
│ gz │ 23.4 │ 128928.05 │
│ bam │ 20.9 │ 114925.732 │
│ czi │ 14.1 │ 77548.979 │
│ fq │ 7.6 │ 41954.356 │
│ sam │ 5.1 │ 27812.136 │
│ tif │ 5.0 │ 27457.25 │
├─────────┴────────┴────────────┤
│ 6 rows 3 columns │
└───────────────────────────────┘

real 0m3.320s
user 0m24.991s
sys 0m4.127s

This was done on a different server from the previous one to make sure that the cache was cold. 3.3 seconds is impressive, but obviously we now want to find an even more complex query to test DuckDB's true capabilities. Let's try to identify all files that are likely duplicates because they have the same name, file size, and modification date but are stored in different directories. We don't really have in-depth knowledge of SQL, so let's ask ChatGPT for some help. Given that DuckDB is compatible with PostgreSQL syntax, once we describe the table structure and explain what we want to achieve, it gives us a working solution that we copy into a text file called 'dedup.sql'.

SELECT
    -- Extract the filename without path
    SUBSTRING(
        filename FROM LENGTH(filename) - POSITION('/' IN REVERSE(filename)) + 2
        FOR
        LENGTH(filename) - POSITION('/' IN REVERSE(filename)) - POSITION('.' IN REVERSE(filename)) + 1
    ) AS plain_file_name,
    st_mtime,
    st_size,
    COUNT(*) AS duplicates_count,
    ARRAY_AGG(filename) AS duplicate_files -- Collect the full paths of all duplicates
FROM
    './db/x-dept.parquet'
WHERE
    filename NOT LIKE '%/miniconda3/%' AND
    filename NOT LIKE '%/miniconda2/%' AND -- let's ignore all that miniconda stuff
    st_size > 1024*1024 -- we only look at files > 1 MB
GROUP BY
    plain_file_name,
    st_mtime,
    st_size
HAVING
    COUNT(*) > 1 -- Only groups with more than one file are duplicates
ORDER BY
    duplicates_count DESC;
time duckdb < dedup.sql

real 0m21.780s
user 2m40.679s
sys 1m34.142s

In this case 13 CPU cores ran for 22 seconds to execute a fairly complex query. Rerunning this query after copying the Parquet file to a local disk gives us this:

time duckdb < dedup.sql

real 0m21.838s
user 2m30.352s
sys 1m34.448s

In other words, there is no performance difference between the shared file system and the local flash disk.

Now let's run this comparison again with SQLite and our file 'extension-summary.sql' (with the FROM clause pointed at the 'meta_table' table instead of the Parquet file):

time cat extension-summary.sql | sqlite3 $TMPDIR/x-dept.db
gz|23.4|128928.05
bam|20.9|114925.732
czi|14.1|77548.979
fq|7.6|41954.356
sam|5.1|27812.136
tif|5.0|27457.25

real 6m15.232s
user 4m6.303s
sys 2m8.892s

OK, 6 minutes and 15 seconds versus 3.3 seconds means that DuckDB is 113 times faster than SQLite, at least in this case.

And how does our complex 'dedup.sql' query fare? After asking ChatGPT again to convert the PostgreSQL-compatible query into one that works with SQLite, we can run it:

time cat dedup.sql | sqlite3 $TMPDIR/x-dept.db
real 5m20.571s
user 2m39.911s
sys 1m23.271s

In this case DuckDB is only 15 times faster than SQLite, and the performance difference can probably be attributed largely to the multiple CPU cores that DuckDB uses.

And finally, a word about wildcards

Research datasets are often spread across multiple files. This is frequently because the datasets are too large to be stored in a single file; at other times, it's because researchers, working globally and distributed, manage their own datasets, and the data only has to be merged occasionally for analytics purposes. In many database systems, this would mean you have multiple tables that need to be combined with a 'union' query, but with DuckDB this process is very simple. Imagine you had three files, './data/prj-asia.csv', './data/prj-africa.csv', and './data/prj-europe.csv', all with the same schema (column structure). You can simply use a wildcard to read all three files as a single table:

duckdb -c "select count(*) from read_csv_auto('./data/prj-*.csv');"
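The same wildcard also works from Python, where the result can be pulled straight into a Pandas DataFrame. A minimal sketch using the hypothetical file names from above:

import duckdb

# All matching CSV files are read in parallel and exposed as one table
df = duckdb.sql("SELECT count(*) AS n FROM read_csv_auto('./data/prj-*.csv')").df()
print(df)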

Summary of Benefits of DuckDB vs SQLite for Analytics

  • DuckDB is much faster than SQLite, in some cases by orders of magnitude
  • DuckDB has much more data import functionality built in; no external Python packages are needed
  • DuckDB does not suffer from performance bottlenecks on shared POSIX filesystems, which are common in most research environments
  • DuckDB uses PostgreSQL syntax, perhaps the most prevalent SQL dialect among data scientists and new open-source database projects
  • DuckDB has built-in support for writing and reading the Parquet and Apache Arrow data formats
  • Python and R packages are an integral part of the DuckDB project (see the sketch after this list)
  • Support for wildcards enables researchers to work with many files in parallel
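To illustrate the Python integration mentioned in the list, DuckDB can also query an in-memory Pandas DataFrame directly by its variable name. The tiny DataFrame here is made up purely for the sketch:

import duckdb
import pandas as pd

# Made-up example data, not the file metadata from above
sizes = pd.DataFrame({"fileExtension": ["gz", "bam", "gz"],
                      "st_size": [100, 250, 50]})

# The DataFrame 'sizes' is picked up by name via DuckDB's replacement scan
result = duckdb.sql(
    "SELECT fileExtension, sum(st_size) AS total_bytes "
    "FROM sizes GROUP BY fileExtension ORDER BY total_bytes DESC"
).df()
print(result)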

Where you should continue to use SQLite

SQLite is probably the most widely used database on planet Earth and is also one of the most reliable, with millions of smartphone apps using it internally. There are many good reasons to use SQLite, and they are laid out on the SQLite website. Researchers love SQLite because of its flexibility and the fact that no database server is needed. However, as DuckDB does a better job of supporting this use case, researchers should consider switching today.

Other resources

  • I hope to write a second article about DuckDB soon, geared more towards the use of DuckDB in High Performance Computing (HPC) systems. A discussion of the wildcard feature will be part of it.
  • There is a similar article for data scientists here, but that one uses more technical language, and I felt the need to write something that is more inclusive of all researchers.
  • If you are interested in analyzing your filesystem metadata with a more professional and feature-rich tool, you should consider https://starfishstorage.com/