Anton ZhiyanovEverything about SQLite, Python, open data and awesome software.https://antonz.org/https://antonz.org/assets/favicon/favicon.pngAnton Zhiyanovhttps://antonz.org/Hugo -- gohugo.ioen-usWed, 14 Sep 2022 08:50:00 +0000SQLite Weeklyhttps://antonz.org/sqlite-weekly/Wed, 14 Sep 2022 08:50:00 +0000https://antonz.org/sqlite-weekly/Weekly SQLite news, articles and extensions.SQLite is getting a lot of attention lately. That’s great!

So I decided what we are missing is a weekly newsletter. And immediately started one. Every week I will publish important news, notable articles, handy extensions, tricky questions, useful code samples and other awesome SQLite-related stuff.

The first issue is already out.

I really don’t like newsletter services, but I do love open source. That’s why SQLite Weekly is hosted on GitHub. I see this as a win-win for everyone:

  • Easy for me to write.
  • Comfortable for others to read.
  • Enables collaboration.
  • No privacy issues.
  • Email notifications and a built-in RSS feed.

Watch the repo to get an email when the next issue is published. Maybe star it too if you like the idea:

SQLite Weekly

]]>
User-defined functions in SQLitehttps://antonz.org/sqlean-define/Thu, 08 Sep 2022 15:30:00 +0000https://antonz.org/sqlean-define/Write functions in plain SQL.Write functions in plain SQL using the ‘define’ extension.

SQLite does not directly support user-defined functions. Sure, one can write a function in C or Python and register it within SQLite. But not in SQL itself.

Luckily for us, SQLite provides an extension mechanism. One of such extensions — define — allows writing functions in regular SQL.

With define writing a custom function becomes as easy as:

select define('sumn', ':n * (:n + 1) / 2');

And then using it as any built-in function:

select sumn(5);
-- 15

User-defined functions can take multiple parameters and call other functions.

Generate a random N such that a ≤ N ≤ b:

select define('randint', ':a + abs(random()) % (:b - :a + 1)');
select randint(10, 99);
-- 42
select randint(10, 99);
-- 17
select randint(10, 99);
-- 29

List user-defined functions:

select * from sqlean_define;

Delete a function:

select undefine('sumn');

There is even a way to return multiple values from a function!

See the docs for details

]]>
JSON Lineshttps://antonz.org/json-lines/Thu, 04 Aug 2022 18:30:00 +0000https://antonz.org/json-lines/CSV on steroids.Worked with the JSON Lines format the other day. It’s a CSV on steroids:

  • each entry is a separate line, as in CSV;
  • at the same time it is a full-fledged JSON.

For example:

{ "id":11, "name":"Diane", "city":"London", "department":"hr", "salary":70 }
{ "id":12, "name":"Bob", "city":"London", "department":"hr", "salary":78 }
{ "id":21, "name":"Emma", "city":"London", "department":"it", "salary":84 }
{ "id":22, "name":"Grace", "city":"Berlin", "department":"it", "salary":90}
{ "id":23, "name":"Henry", "city":"London", "department":"it", "salary":104}

Great stuff:

  • Suitable for objects of complex structure (unlike csv);
  • Easy to stream read without loading the entire file into memory (unlike json);
  • Easy to append new entries to an existing file (unlike json).

JSON can also be streamed. But look how much easier it is with JSON Lines:

import json
from typing import Iterator


def jl_reader(fname: str) -> Iterator[dict]:
    with open(fname) as file:
        for line in file:
            obj = json.loads(line.strip())
            yield obj


if __name__ == "__main__":
    reader = jl_reader("employees.jl")
    for employee in reader:
        id = employee["id"]
        name = employee["name"]
        dept = employee["department"]
        print(f"#{id} - {name} ({dept})")
#11 - Diane (hr)
#12 - Bob (hr)
#21 - Emma (it)
#22 - Grace (it)
#23 - Henry (it)

playground

Great fit for logs and data processing pipelines.

]]>
Expressive Ellipsis in Pythonhttps://antonz.org/ellipsis/Fri, 03 Jun 2022 10:50:00 +0000https://antonz.org/ellipsis/What is Ellipsis and how it is used.One of the lesser-known things in Python is the ellipsis:

class Flyer:
    def fly(self):
        ...

This code works. The ... (aka Ellipsis) is a real object that can be used in code.

Ellipsis is the only instance of the EllipsisType type (similar to how None is the only instance of the NoneType type):

>>> ... is Ellipsis
>>> True
>>> Ellipsis is ...
>>> True

Python core devs mostly use ... to show that a type, method, or function has no implementation — as in the fly() example.

And in type hints:

It is possible to declare the return type of a callable without specifying the call signature by substituting a literal ellipsis for the list of arguments in the type hint: Callable[..., ReturnType]

To specify a variable-length tuple of homogeneous type, use literal ellipsis, e.g. Tuple[int, ...]. A plain Tuple is equivalent to Tuple[Any, ...], and in turn to tuple.

# numbers  is a tuple of integer numbers
# summator is a function that accepts arbitrary parameters
#          and returns an integer
def print_sum(numbers: tuple[int, ...], summator: Callable[..., int]):
    total = summator(numbers)
    print(total)

print_sum((1, 2, 3), sum)
# 6

Other developers use Ellipsis for all sorts of bizarre things ツ

]]>
Flying Pig, or Protocols in Pythonhttps://antonz.org/protocol/Tue, 31 May 2022 17:00:00 +0000https://antonz.org/protocol/Structural subtyping using protocols.Let’s say you’ve developed a utility that sends everything flying:

def launch(thing):
    thing.fly()

Well, not exactly everything. Things with the fly() method, to be precise. With a single handy function we launch Frank (he’s a pigeon), an airplane, and even Superman:

class Frank:
    def fly(self):
        print("💩")

class Plane:
    def fly(self):
        print("Flight delayed")

class Superman:
    def fly(self):
        print("ε===(っ≧ω≦)っ")

Whoosh:

f = Frank()
launch(f)
# 💩

p = Plane()
launch(p)
# Flight delayed

s = Superman()
launch(s)
# ε===(っ≧ω≦)っ

It’s not that our heroes are particularly successful at coping with the task, but the launch works for them.

So far, so good. But sometimes (especially when the program grows) the developer wants to add a little rigor. Make it clear that the thing parameter in launch() is not any object, but necessarily a flying thing with the fly() method. What is the best way to do this?

Using a description

If you prefer to avoid types, then you will go with a variable name or a docstring:

def launch(flyer):
    """Launces a flyer (an object with a `fly()` method)"""
    flyer.fly()

The problem is that the more complex the code, the more often the “descriptive” approach fails.

Using a base class

Thanks to some 1990s java programming skills, you end up with a small hierarchy:

class Flyer:
    def fly():
        ...

class Frank(Flyer):
    # ...

class Plane(Flyer):
    # ...

class Superman(Flyer):
    # ...
def launch(thing: Flyer):
    thing.fly()

This method works:

$ mypy flyer.py
Success: no issues found in 1 source file

But, as the Python devs say, it is terribly “unpythonic”:

The problem is that a class has to be explicitly marked, which is unpythonic and unlike what one would normally do in idiomatic dynamically typed Python code.

Indeed. Not only have we modified three classes instead of one function. Not only have we introduced an inheritance hierarchy to our code. But also Frank, the plane and Superman are now burdened by the shared knowledge that they are Flyers. They never asked for this, you know.

Using a protocol

The quote above is from PEP 544 (Python Enhancement Proposal), which was implemented in Python 3.8. Starting with this version, Python recieved protocols.

Protocols describe behavior. Here is our Flyer:

from typing import Protocol

class Flyer(Protocol):
    def fly(self):
        ...

We use a protocol to specify that an object should have a specific behavior. The launch() function can only launch Flyers:

def launch(thing: Flyer):
    thing.fly()

The objects themselves do not need to know about the protocol. It is enough that they implement the right behavior:

class Frank:
    def fly(self):
        # ...

class Plane:
    def fly(self):
        # ...

class Superman:
    def fly(self):
        # ...

Protocols are static duck typing:

  • the interface is explicitly described in the protocol: a flyer has the fly() method;
  • but it is implemented implicitly, according to the “duck” principle: Superman has the fly() method — so he’s a flyer.

Let’s check:

$ mypy flyer.py
Success: no issues found in 1 source file

Perfect!

Summary

If your code should work consistently with different types, find their common behavior and specify it in the protocol. Use the protocol type for static code validation using mypy.

Avoid pigeons, planes, and superheroes whenever possible. They are nothing but problems.

]]>
Random numbers and sequences in Pythonhttps://antonz.org/random/Mon, 23 May 2022 15:30:00 +0000https://antonz.org/random/There is much more than just randint()Everybody knows about random.randint(a, b) in Python, which returns a ≤ n ≤ b:

random.randint(10, 99)
# 59

But the random module has so much more to offer.

Like selecting a number from a range with a step:

random.randrange(10, 99, 3)
# 91

Or a random sequence element:

numbers = [7, 9, 13, 42, 64, 99]
random.choice(numbers)
# 42

Or multiple elements:

numbers = range(99, 10, -1)
random.choices(numbers, k=3)
# [32, 62, 76]

How about choosing some elements more often than others? Sure:

numbers = [7, 9, 13, 42, 64, 99]
weights = [10, 1, 1, 1, 1, 1]

random.choices(numbers, weights, k=3)
# [42, 13, 7]

random.choices(numbers, weights, k=3)
# [7, 7, 7]

random.choices(numbers, weights, k=3)
# [13, 7, 7]

Wanna see a sample without repetitions? No problem:

numbers = [7, 9, 13, 42, 64, 99]
random.sample(numbers, k=3)
# [42, 99, 7]

Or even shuffle the whole sequence:

numbers = [1, 2, 3, 4, 5]
random.shuffle(numbers)
# [3, 2, 1, 5, 4]

There are also countless real-valued distributions like uniform(), gauss(), expovariate(), paretovariate() and many more. Not gonna get into the specifics now — see for yourself if your are a statistics fan.

Last but not least. When testing, seed the generator with a constant so that it gives reproducible results:

random.seed(42)

On the contrary, use seed() without arguments in production. Python will then use the sources of randomness provided by the operating system.

]]>
Common Table Expressions in SQLhttps://antonz.org/cte/Fri, 20 May 2022 16:50:00 +0000https://antonz.org/cte/Use them instead of subqueries.Rule #1 for writing well-readable SQL queries is to use common table expressions (CTE). People are afraid of them, but they really shouldn’t. Let’s learn CTEs in three minutes, so you don’t have to read a weighty SQL book or take an online course.

Problem

Let’s say we have a table with monthly sales for two years:

┌──────┬───────┬───────┬──────────┬─────────┐
│ year │ month │ price │ quantity │ revenue │
├──────┼───────┼───────┼──────────┼─────────┤
│ 2019 │ 1     │ 60    │ 200      │ 12000   │
│ 2019 │ 2     │ 60    │ 660      │ 39600   │
│ 2019 │ 3     │ 60    │ 400      │ 24000   │
│ 2019 │ 4     │ 60    │ 300      │ 18000   │
│ 2019 │ 5     │ 60    │ 440      │ 26400   │
│ 2019 │ 6     │ 60    │ 540      │ 32400   │
│ 2019 │ 7     │ 60    │ 440      │ 26400   │
│ 2019 │ 8     │ 60    │ 440      │ 26400   │
│ 2019 │ 9     │ 60    │ 250      │ 15000   │
│ 2019 │ 10    │ 60    │ 420      │ 25200   │
│ ...  │ ...   │ ...   │ ...      │ ...     │
└──────┴───────┴───────┴──────────┴─────────┘

playground

We want to select only those months for which revenue exceeded the monthly average for the year.

To begin with, let’s calculate the average monthly revenue by year:

select
  year,
  avg(revenue) as avg_rev
from sales
group by year;
┌──────┬─────────┐
│ year │ avg_rev │
├──────┼─────────┤
│ 2019 │ 25125.0 │
│ 2020 │ 48625.0 │
└──────┴─────────┘

Now we can select only those records in which revenue is not less than avg_rev:

select
  sales.year,
  sales.month,
  sales.revenue,
  round(totals.avg_rev) as avg_rev
from sales
  join (
    select
      year,
      avg(revenue) as avg_rev
    from sales
    group by year
  ) as totals
  on sales.year = totals.year
where sales.revenue >= totals.avg_rev;
┌──────┬───────┬─────────┬─────────┐
│ year │ month │ revenue │ avg_rev │
├──────┼───────┼─────────┼─────────┤
│ 2019 │ 2     │ 39600   │ 25125.0 │
│ 2019 │ 5     │ 26400   │ 25125.0 │
│ 2019 │ 6     │ 32400   │ 25125.0 │
│ 2019 │ 7     │ 26400   │ 25125.0 │
│ ...  │ ...   │ ...     │ ...     │
└──────┴───────┴─────────┴─────────┘

We solved the task using a subquery:

  • the inner query calculates the average monthly revenue;
  • the outer query joins with it and filters the results.

The query as a whole turned out to be a bit complicated. If you revisit it in a month, you’ll probably spend some time “unraveling” things. The problem is that such nested queries have to be read from the inside out:

  • find the innermost query and comprehend it;
  • join it with the next outer query;
  • join them with the next outer query;
  • and so on.

It is OK when there are only two levels, as in our example. In practice, I often encounter three- and four-level subqueries. A pain to read and understand.

Solution

Instead of a subquery, we can use a common table expression (CTE). Every subquery X:

select a, b, c
from (X)
where e = f

Can be converted to CTE:

with cte as (X)
select a, b, c
from cte
where e = f

In our example:

with totals as (
  select
    year,
    avg(revenue) as avg_rev
  from sales
  group by year
)

select
  sales.year,
  sales.month,
  sales.revenue,
  round(totals.avg_rev) as avg_rev
from sales
  join totals on totals.year = sales.year
where sales.revenue >= totals.avg_rev;

With a table expression, the query becomes flat — it’s much easier to perceive it this way. Besides, we can reuse the table expression as if it were a regular table:

with totals as (...)
select ... from sales_ru join totals ...
union all
select ... from sales_us join totals ...

SQL table expressions are somewhat similar to functions in a regular programming language — they reduce the overall complexity:

  • You can write an unreadable sheet of code, or you can break the code into understandable individual functions and compose a program out of them.
  • You can build a tower of nested subqueries, or you can extract them into CTEs and reference from the main query.

CTE vs subquery

There is a myth that “CTEs are slow”. It came from old versions of PostgreSQL (11 and earlier), which always materialized CTE — calculated the full result of a table expression and stored it until the end of the query.

This is usually a good thing: the engine calculates the result once, and then uses it several times during the main query. But sometimes materialization prevented the engine from optimizing the query:

with cte as (select * from foo)
select * from cte where id = 500000;

The query selects exactly one record by ID, but materialization creates a copy of the entire table in memory. Because of this, the query is terribly slow.

PostgreSQL 12+ and other modern DBMS have become smarter and no longer do so. Materialization is used when it does more good than harm. Plus, many DBMSs allow you to explicitly control this behavior through the MATERIALIZED / NOT MATERIALIZED instructions.

So CTEs work no slower than subqueries. And if in doubt, you can try both — a subquery and a table expression — and compare the query plan and execution time.

How does one know when to use a subquery and when to use CTE? I came up with a simple rule that has never failed me yet:

Always use CTE

That’s what I wish you.

P.S. There are also recursive CTEs, famous for their complexity and terrible naming (they have almost nothing in common with regular CTEs). Let’s talk about them some other day.

]]>
Temporary tables in SQLitehttps://antonz.org/temp-tables/Wed, 18 May 2022 20:30:00 +0000https://antonz.org/temp-tables/For faster exploratory data analysis.Sometimes you want to combine data from multiple tables into one and query the results. For example, join vacancies together with employers and regions:

Combine, then query

select v.*, e.name, a.name
from vacancy as v
  join employer as e on e.id = v.employer_id
  join area as a on a.id = v.area_id

The question is how to reference the combined dataset in further queries. There are three ways of doing that:

  1. Common Table Expressions (CTEs)
  2. Views
  3. Temporary tables

CTE

A Common Table Expression is basically a named subquery:

with combined_cte as (
  select v.*, e.name, a.name
  from vacancy as v
    join employer as e on e.id = v.employer_id
    join area as a on a.id = v.area_id
)
select ...
from combined_cte
where ...
group by ...
order by ...

The CTE is repeated in each query and computed on the fly. So if the subquery for the combined dataset is slow, the entire query will be even slower.

View

A view works like a CTE, but you can reference it by name and not repeat the subquery every time. Views are computed on the fly, similar to CTEs.

-- 1) create once
create view combined_view as
select v.*, e.name, a.name
from vacancy as v
  join employer as e on e.id = v.employer_id
  join area as a on a.id = v.area_id;

-- 2) use everywhere
select ...
from combined_view
where ...
group by ...
order by ...

PostgreSQL and others have materialized views, which store data on disk. But not SQLite.

Temporary table

A temporary table is like a real table: it stores data on disk, and you can build indexes. But it exists only while the database connection is open.

-- 1) create once
create temp table combined_temp as
select v.*, e.name, a.name
from vacancy as v
  join employer as e on e.id = v.employer_id
  join area as a on a.id = v.area_id;

-- 2) use everywhere
select ...
from combined_temp
where ...
group by ...
order by ...

Technically, SQLite stores temporary tables in a separate temp database. It keeps that database in a separate file on disk, visible only to the current database connection. The temporary database is deleted automatically as soon as the connection is closed.

Temporary database location

On unix-like systems, the directory for storing the temp database can be one of the following:

  1. The directory set by PRAGMA temp_store_directory (deprecated)
  2. The SQLITE_TMPDIR environment variable
  3. The TMPDIR environment variable
  4. /var/tmp
  5. /usr/tmp
  6. /tmp
  7. The current working directory (.)

SQLite picks the first one with both write and execute permissions.

To store the temp database in memory, set PRAGMA temp_store = MEMORY.

documentation

Temporary tables are great for experimenting when you’re just getting to know the data. Do whatever you want — everything will be forgotten after disconnecting from the database ツ

]]>
JSON and virtual columns in SQLitehttps://antonz.org/json-virtual-columns/Sun, 15 May 2022 11:25:00 +0000https://antonz.org/json-virtual-columns/Faster JSON handling in a relational database.Generated columns have another great use case.

JSON data

Let’s say you decide to keep a log of events that occur in the system. There are different types of events, each with its own set of fields. For example, sign-in:

{
    "timestamp": "2022-05-15T09:31:00Z",
    "object": "user",
    "object_id": 11,
    "action": "login",
    "details": {
        "ip": "192.168.0.1"
    }
}

Or account deposit:

{
    "timestamp": "2022-05-15T09:32:00Z",
    "object": "account",
    "object_id": 12,
    "action": "deposit",
    "details": {
        "amount": "1000",
        "currency": "USD"
    }
}

JSON functions

You decide to store the raw JSON, as normalization is non-trivial. You create an events table with a single value field:

select value from events;
{"timestamp":"2022-05-15T09:31:00Z","object":"user","object_id":11,"action":"login","details":{"ip":"192.168.0.1"}}
{"timestamp":"2022-05-15T09:32:00Z","object":"account","object_id":12,"action":"deposit","details":{"amount":"1000","currency":"USD"}}
{"timestamp":"2022-05-15T09:33:00Z","object":"company","object_id":13,"action":"edit","details":{"fields":["address","phone"]}}

And select events for a specific object:

select
  json_extract(value, '$.object') as object,
  json_extract(value, '$.action') as action
from events
where json_extract(value, '$.object_id') = 11;
┌────────┬────────┐
│ object │ action │
├────────┼────────┤
│ user   │ login  │
└────────┴────────┘

So far, so good. But json_extract() parses the text on each call, so for hundreds of thousands of records the query is slow. What should you do?

JSON columns

Define virtual columns:

alter table events
add column object_id integer
as (json_extract(value, '$.object_id'));

alter table events
add column object text
as (json_extract(value, '$.object'));

alter table events
add column action text
as (json_extract(value, '$.action'));

Build an index:

create index events_object_id on events(object_id);

Now the query works instantly:

select object, action
from events
where object_id = 11;

Thanks to virtual columns, we almost have a NoSQL database ツ

Playground

]]>
Compact objects in Pythonhttps://antonz.org/compact-objects/Fri, 13 May 2022 20:25:00 +0000https://antonz.org/compact-objects/Tuple vs dataclass, until numpy interferesPython is an object language. This is nice and cozy until you are out of memory holding 10 million objects at once. Let’s talk about how to reduce appetite.

Visit the Playground to try out the code samples

Tuples

Imagine you have a simple Pet object with the name (string) and price (integer) attributes. Intuitively, it seems that the most compact representation is a tuple:

("Frank the Pigeon", 50000)

Let’s measure how much memory this beauty eats:

import random
from pympler.asizeof import asizeof

def fields():
    name_gen = (random.choice(string.ascii_uppercase) for _ in range(10))
    name = "".join(name_gen)
    price = random.randint(10000, 99999)
    return (name, price)

def measure(name, fn, n=10_000):
    pets = [fn() for _ in range(n)]
    size = round(asizeof(pets) / n)
    print(f"Pet size ({name}) = {size} bytes")
    return size

baseline = measure("tuple", fields)
Pet size (tuple) = 161 bytes

161 bytes. Let’s use it as a baseline for further comparison.

Dataclasses vs named tuples

But who works with tuples these days? You would probably choose a dataclass:

from dataclasses import dataclass

@dataclass
class PetData:
    name: str
    price: int

fn = lambda: PetData(*fields())
measure("dataclass", fn)
Pet size (dataclass) = 257 bytes
x1.60 to baseline

Thing is, it’s 1.6 times larger than a tuple.

Let’s try a named tuple then:

from typing import NamedTuple

class PetTuple(NamedTuple):
    name: str
    price: int


fn = lambda: PetTuple(*fields())
measure("named tuple", fn)
Pet size (named tuple) = 161 bytes
x1.00 to baseline

Looks like a dataclass, works like a tuple. Perfect. Or not?

Slots

Python 3.10 received dataclasses with slots:

@dataclass(slots=True)
class PetData:
    name: str
    price: int


fn = lambda: PetData(*fields())
measure("dataclass w/slots", fn)
Pet size (dataclass w/slots) = 153 bytes
x0.95 to baseline

Wow! Slots magic creates special skinny objects without an underlying dictionary, unlike regular Python objects. Such dataclass is even lighter than a tuple.

What if 3.10 is out of the question yet? Use NamedTuple. Or add a slots dunder manually:

@dataclass
class PetData:
    __slots__ = ("name", "price")
    name: str
    price: int

Slot objects have their own shortcomings. But they are great for simple cases (without inheritance and other complex stuff).

numpy arrays

The real winner, of course, is the numpy array:

import string
import numpy as np

PetNumpy = np.dtype([("name", "S10"), ("price", "i4")])
generator = (fields() for _ in range(n))
pets = np.fromiter(generator, dtype=PetNumpy)
size = round(asizeof(pets) / n)
Pet size (numpy array) = 14 bytes
x0.09 to baseline

This is not a flawless victory, though. If names are unicode (U type instead of S), the advantage is not so impressive:

PetNumpy = np.dtype([("name", "U10"), ("price", "i4")])
Pet size (numpy U10) = 44 bytes
x0.27 to baseline

If the name length is not strictly 10 characters, but varies, say, up to 50 characters (U50 instead of U10) — the advantage disappears completely:

def fields():
    name_len = random.randint(10, 50)
    name_gen = (random.choice(string.ascii_uppercase) for _ in range(name_len))
    # ...

PetNumpy = np.dtype([("name", "U50"), ("price", "i4")])
Pet size (tuple) = 179 bytes

Pet size (numpy U50) = 204 bytes
x1.14 to baseline

Others

Let’s consider alternatives for completeness.

A regular class is no different than a dataclass:

class PetClass:
    def __init__(self, name: str, price: int):
        self.name = name
        self.price = price
Pet size (class) = 257 bytes
x1.60 to baseline

And a frozen (immutable) dataclass too:

@dataclass(frozen=True)
class PetDataFrozen:
    name: str
    price: int
Pet size (frozen dataclass) = 257 bytes
x1.60 to baseline

A dict is even worse:

names = ("name", "price")
fn = lambda: dict(zip(names, fields()))
measure("dict", fn)
Pet size (dict) = 355 bytes
x1.98 to baseline

Pydantic model sets an anti-record (no wonder, it uses inheritance):

from pydantic import BaseModel

class PetModel(BaseModel):
    name: str
    price: int
Pet size (pydantic) = 385 bytes
x2.39 to baseline

⌘ ⌘ ⌘

Compact (and not so compact) objects in Python:

Tuple
Dataclass
Named tuple
Dataclass with slots
Manual slots
numpy array
]]>