Trying chDB, an embeddable ClickHouse engine
chDB is an embeddable, in-process SQL OLAP engine powered by ClickHouse. It's as if SQLite and ClickHouse had an offspring (no offence to either party). chDB takes up ≈100mb of disk space, runs on smaller machines (even on a 64mb RAM container), and provides language bindings for Python, Node.js, Go, Rust and C/C++.
Let's get a taste of chDB with some interactive examples (you can run/edit them without leaving the browser or installing anything).
Using chDB • SQL dialect • Reading data • Writing data • User-defined functions • Python DB API • Current status
Using chDB
It's as simple as pip install chdb
and then:
import chdb
res = chdb.query("select 42")
print(res, end="")
Using a database engine to select the number 42
is probably not very exciting, but bear with me.
SQL dialect
chDB is a wrapper around ClickHouse, so it supports exactly the same SQL syntax, including joins, CTEs, set operations, aggregations and window functions.
For example, let's create a sampled table of 10000 random numbers and calculate the mean and 95th percentile:
from chdb.session import Session
db = Session()
db.query("create database db")
db.query("use db")
db.query("""
create table data (id UInt32, x UInt32)
engine MergeTree order by id sample by id
as
select number+1 as id, randUniform(1, 100) as x
from numbers(10000);
""")
query_sql = """
select
avg(x) as "avg",
round(quantile(0.95)(x), 2) as p95
from data
sample 0.1;
"""
res = db.query(query_sql, "PrettyCompactNoEscapes")
print(res, end="")
Note a couple of things here:
Session
provides a stateful database connection (the data is stored in the temporary folder and discarded when the connection is closed).- The second argument to the
query
method specifies the output format. There are many supported formats such asCSV
,SQLInsert
,JSON
andXML
(try changing the format in the above example and re-running the code). The default one isCSV
.
Reading data
As with output formats, chDB supports any input format supported by ClickHouse.
For example, we can read a dataset from CSV:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "PrettyCompactNoEscapes")
print(
f"{res.rows_read()} rows | "
f"{res.bytes_read()} bytes | "
f"{res.elapsed()} seconds"
)
Or work with an external dataset as if it were a database table:
query_sql = """
select distinct city
from 'employees.csv'
"""
res = chdb.query(query_sql, "CSV")
print(res, end="")
We can even query Pandas dataframes as if they were tables:
import chdb.dataframe as cdf
import pandas as pd
employees = pd.read_csv("employees.csv")
departments = pd.read_csv("departments.csv")
query_sql = """
select
emp_id, first_name,
dep.name as dep_name,
salary
from __emp__ as emp
join __dep__ as dep using(dep_id)
order by salary desc;
"""
res = cdf.query(sql=query_sql, emp=employees, dep=departments)
print(res, end="")
Writing data
The easiest way to export data is to use the output format (the second parameter in the query
method), and then write the data to disk:
from pathlib import Path
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Parquet")
# export to Parquet
path = Path("/tmp/employees.parquet")
path.write_bytes(res.bytes())
# import from Parquet
query_sql = "select * from '/tmp/employees.parquet' limit 5"
res = chdb.query(query_sql, "PrettyCompactNoEscapes")
print(res, end="")
We can also easily convert the chDB result object into a PyArrow table:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Arrow")
table = chdb.to_arrowTable(res)
print(table.schema)
Or Pandas dataframe:
query_sql = "select * from 'employees.csv'"
res = chdb.query(query_sql, "Arrow")
frame = chdb.to_df(res)
frame.info()
To persist a chDB session to a specific folder on disk, use the path
constructor parameter. This way you can restore the session later:
from chdb.session import Session
# create a persistent session
db = Session(path="/tmp/employees")
# create a database and a table
db.query("create database db")
db.query("""
create table db.employees (
emp_id UInt32 primary key,
first_name String, last_name String,
birth_dt Date, hire_dt Date,
dep_id String, city String,
salary UInt32,
) engine MergeTree;
""")
# load data into the table
db.query("""
insert into db.employees
select * from 'employees.csv'
""")
# ...
# restore the session later
db = Session(path="/tmp/employees")
# query the data
res = db.query("select count(*) from db.employees")
print(res, end="")
User-defined functions
We can define a function in Python and use it in chDB SQL queries.
Here is a split_part
function that splits a string on the given separator and returns the resulting field with the given index (counting from one):
from chdb.udf import chdb_udf
@chdb_udf()
def split_part(s, sep, idx):
idx = int(idx)-1
return s.split(sep)[idx]
second = chdb.query("select split_part('a;b;c', ';', 2)")
print(second, end="")
And here is a sumn
function that calculates a sum from 1 to N:
from chdb.udf import chdb_udf
@chdb_udf(return_type="Int32")
def sumn(n):
n = int(n)
return n*(n+1)//2
sum20 = chdb.query("select sumn(20)")
print(sum20, end="")
Currently chDB only supports scalar functions that take strings as parameters. If the function returns a type other than string, we should pass it as return_type
to the chdb_udf
decorator.
Python Database API
The chDB Python package adheres to the Python DB API (PEP 249), so you can use it just like you'd use stdlib's sqlite3
module:
from contextlib import closing
from chdb import dbapi
print(f"chdb version: {dbapi.get_client_info()}")
with closing(dbapi.connect()) as conn:
with closing(conn.cursor()) as cur:
cur.execute("select version()")
print("description:", cur.description)
print("data:", cur.fetchone())
Current status
chDB reached 1.0 in December 2023. It has the full power of ClickHouse under the hood, a team of passionate maintainers and a growing user base. The future looks bright and I wish chDB all the best!
──
P.S. Interactive examples in this post are powered by codapi — an open source tool I'm building. Use it to embed live code snippets into your product docs, online course or blog.
★ Subscribe to keep up with new posts.