Anton Zhiyanov — Everything about SQLite, Python, open data and awesome software.

The ultimate SQLite extension set, 04 Jan 2022

Math, file IO and over 100 other functions. I really like SQLite. It’s a miniature embedded database, perfect both for exploratory data analysis and as storage for small apps (I’ve blogged about that previously).

It has a minor drawback though. There are few built-in functions compared to PostgreSQL or Oracle. Fortunately, the authors provided an extension mechanism, which allows doing almost anything. As a result, there are a lot of SQLite extensions out there, but they are incomplete, inconsistent and scattered across the internet.

I wanted more consistency. So I started the sqlean project, which brings the extensions together, neatly packaged into domain modules, documented, tested, and built for Linux, Windows and macOS. Something like a standard library in Python or Go, only for SQLite.

I plan to write in detail about each module in a separate article, but for now — here’s a brief overview.

The main set

These are the most popular functions missing in SQLite:

  • crypto: cryptographic hashes like MD5 or SHA-256.
  • fileio: read and write files and catalogs.
  • fuzzy: fuzzy string matching and phonetics.
  • ipaddr: IP address manipulation.
  • json1: JSON functions.
  • math: math functions.
  • re: regular expressions.
  • stats: math statistics — median, percentiles, etc.
  • text: string functions.
  • unicode: Unicode support.
  • uuid: Universally Unique IDentifiers.
  • vsv: CSV files as virtual tables.

There are precompiled binaries for Windows, Linux and macOS.

The incubator

These extensions haven’t yet made their way to the main set. They may be too broad, too narrow, or lacking a well-thought-out API. I’m gradually refactoring and merging them into the main set:

  • array: one-dimensional arrays.
  • besttype: convert string value to numeric.
  • bloom: a fast way to tell if a value is already in a table.
  • btreeinfo, memstat, recsize and stmt: various database introspection features.
  • cbrt and math2: additional math functions and bit arithmetics.
  • classifier: binary classifier via logistic regression.
  • closure: navigate hierarchic tables with parent/child relationships.
  • compress and sqlar: compress / uncompress data.
  • cron: match dates against cron patterns.
  • dbdump: export database as SQL.
  • decimal, fcmp and ieee754: decimal and floating-point arithmetic.
  • define: create scalar and table-valued functions from SQL.
  • envfuncs: read environment variables.
  • eval: run arbitrary SQL statements.
  • isodate: additional date and time functions.
  • pearson: Pearson correlation coefficient between two data sets.
  • pivotvtab: pivot tables.
  • prefixes: generate string prefixes.
  • rotate: string obfuscation.
  • spellfix: search a large vocabulary for close matches.
  • stats2 and stats3: additional math statistics functions.
  • text2: additional string functions.
  • uint: natural string sorting and comparison.
  • unhex: reverse for hex().
  • unionvtab: union similar tables into one.
  • xmltojson: convert XML to JSON string.
  • zipfile: read and write zip files.
  • zorder: map multidimensional data to a single dimension.

Vote for your favorites! Popular ones will make their way into the main set faster.

Incubator extensions are also available for download.

How to load an extension

There are three ways to do it. If you are using the SQLite CLI (sqlite3.exe):

sqlite> .load ./stats
sqlite> select median(value) from generate_series(1, 99);

If you are using a tool like DB Browser for SQLite, SQLite Expert or DBeaver:

select load_extension('c:\Users\anton\sqlite\stats.dll');
select median(value) from generate_series(1, 99);

If you are using Python (other languages provide similar means):

import sqlite3

connection = sqlite3.connect(":memory:")
connection.enable_load_extension(True)
connection.load_extension("./stats")
connection.execute("select median(value) from generate_series(1, 99)")

Next steps

If you feel that you are missing some function in SQLite, check the sqlean repository — you’ll probably find one.

If you want to participate, submit your own or third-party extensions.

I keep adding new extensions to the incubator. I also refactor the extensions from the incubator and merge them into the main set. I plan to write a separate article for each main module, so stay tuned.


Follow @ohmypy on Twitter to keep up with new posts 🚀

What's new in SQLite 3.37, 28 Nov 2021

Strict tables, any type and a new pragma.

Unlike 3.35, release 3.37 didn’t bring many changes. But among them is one of the most important in the history of SQLite: the “strict” table mode, in which the engine makes sure that the data in the column matches the type.

Perhaps now SQLite will no longer be called “the JavaScript of the DBMS world” ツ But let’s take it one piece at a time.

The problem with types

SQLite supports 5 data types:

  • INTEGER — integers,
  • REAL — real numbers,
  • TEXT — strings,
  • BLOB — binary data,
  • NULL — empty value.

But, unlike other DBMSs, SQLite can store any type of data in a given cell — regardless of the column type.

SQLite stores the type not only on the column itself, but also on each value in that column. That is why a given column can store values of different types without any problems. The type on the column is used as a hint: when inserting, SQLite tries to cast the value to the column type, but when it fails, it will save the value “as is”.
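This flexible typing is easy to see from Python’s built-in sqlite3 module. A quick demo of the behavior described above (mine, not code from the article):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t(x integer)")
# SQLite tries to cast each value to the column type;
# when the cast fails, the value is stored as is
conn.execute("insert into t values (42)")       # stored as integer
conn.execute("insert into t values ('85')")     # cast to integer
conn.execute("insert into t values ('hello')")  # kept as text
rows = conn.execute("select x, typeof(x) from t").fetchall()
print(rows)  # [(42, 'integer'), (85, 'integer'), ('hello', 'text')]
```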

On the one hand, it is convenient for exploratory data analysis — you can import everything first, and then use SQL to deal with problematic values. Any other DBMS will give an error when importing and force you to crunch the data with scripts or manually.

On the other hand, it causes a constant flow of criticism against SQLite: you can write things into the production database that you will never be able to sort out.

And now, in version 3.37, the problem is solved!

STRICT tables

Now the table can be declared “strict”. Strict tables do not allow saving arbitrary data:

create table employees (
    id integer primary key,
    name text,
    salary integer
) strict;

insert into employees (id, name, salary)
values (22, 'Emma', 'hello');
-- Error: stepping, cannot store TEXT value in INTEGER column employees.salary (19)

Emma clearly has a problem with her salary, which is what SQLite indicates. Someone has been waiting for this for twenty years ツ

At the same time, the engine still tries to convert the value to the column type, and if it succeeds — there will be no error:

insert into employees (id, name, salary)
values (22, 'Emma', '85');

select * from employees;
 id  name   salary 
 22  Emma   85     

See STRICT Tables for details.

The ANY datatype

ANY type provides the means to save arbitrary values into STRICT tables:

create table employees (
    id integer primary key,
    name text,
    stuff any
) strict;

insert into employees (id, name, stuff)
values
(21, 'Emma', 84),
(22, 'Grace', 'hello'),
(23, 'Henry', randomblob(8));

select id, name, typeof(stuff) from employees;
 id  name   typeof(stuff) 
 21  Emma   integer       
 22  Grace  text          
 23  Henry  blob          

The STRICT table stores ANY value without any transformations. In a regular table, ANY works almost the same way, but converts strings to numbers whenever possible.

See The ANY datatype for details.

table_list pragma

table_list pragma statement lists tables and views in the database:

pragma table_list;
 schema         name         type   ncol  wr  strict 
 main    expenses            table  4     0   0      
 main    employees           table  5     0   0      
 main    sqlite_schema       table  5     0   0      
 temp    sqlite_temp_schema  table  5     0   0      

Previously, one had to query the sqlite_schema table for this. The pragma is more convenient.
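For comparison, here is the previous way as seen from Python, whose bundled SQLite may still predate 3.37. It queries the schema table directly and works on any version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employees(id integer primary key, name text)")
conn.execute("create view names as select name from employees")
# the pre-3.37 way: read the schema table
# (called sqlite_master in older versions, sqlite_schema since 3.33)
rows = conn.execute(
    "select name, type from sqlite_master where type in ('table', 'view')"
).fetchall()
print(rows)  # e.g. [('employees', 'table'), ('names', 'view')]
```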

See PRAGMA table_list for details.

CLI changes

The CLI tool (sqlite3.exe) now supports switching between multiple database connections using the dot command .connection:

sqlite> .connection
ACTIVE 0: :memory:
sqlite> .connection 1
sqlite> .open employees.en.db
sqlite> .connection
   0: :memory:
ACTIVE 1: employees.en.db

See Working With Multiple Database Connections for details.

Also, there is now a --safe launch option. It disables commands that can make changes anywhere other than a specific database. Safe mode disables .open, .shell, .import and other “dangerous” commands.

See The --safe command-line option for details.

And a few more little things

  • The query planner ignores order by on subqueries unless they change the overall semantics of the query.
  • Function generate_series(start, stop, step) always requires the start parameter (stop and step remain optional).
  • Some changes in C API.

Overall, a great release! Strict tables offer a long-awaited alternative to flexible typing, any type makes flexibility explicit, and table_list pragma is just nice to have.

Official release notes | Download

Follow @ohmypy on Twitter to keep up with new posts 🚀

How Python list really works, 12 Nov 2021

Why some methods take constant time while others take linear.

This post is largely about arrays — the #1 data structure in the world. If you are not a data structure guru yet, I guarantee that you will better understand Python lists, their advantages and limitations. If you already know everything — there is no harm in refreshing the key points.

Everybody knows how to work with lists in Python:

>>> guests = ["Frank", "Claire", "Zoe"]
>>> guests[1]
'Claire'

Surely you know that selecting an item by index — guests[idx] — works instantly even on a million elements list. More precisely, selection by index takes constant time O(1) — that is, it does not depend on the number of items in the list.

Do you know why it works so fast? Let’s find out.

List = array?

The list is based on an array. An array is a set of elements ① of the same size, ② located in memory one after another, without gaps.

Since elements are the same size and placed contiguously, it is easy to get an array item by index. All we need is the memory address of the very first element (the “head” of the array).

Let’s say the head is located at the address 0x00001234, and each item occupies 8 bytes. Then the element with the idx index is located at 0x00001234 + idx*8:

List = array

Since the “get value by address” memory operation takes constant time, selecting an array item by index also takes O(1).

Roughly speaking, this is how Python list works. It stores a pointer to the head of the array and the number of items in the array. The item count is stored separately so that the len() function also performs in O(1) time, and does not have to count the elements each time.

So far so good. But there are a couple of problems:

  • All array elements are the same size, but the list should be able to store items of different sizes (true/false, numbers, strings of different lengths).
  • The array has a fixed length, but the list should be able to store an arbitrary number of items.

We’ll tackle them a bit later.

A very primitive list

The best way to master a data structure is to implement it from scratch. Unfortunately, Python is not well suited for such low-level structures as arrays, because it doesn’t support explicit pointers (addresses in memory).

This is probably as close as we can get:

import ctypes

class OhMyList:
    def __init__(self):
        self.length = 0
        self.capacity = 8
        self.array = (self.capacity * ctypes.py_object)()

    def append(self, item):
        self.array[self.length] = item
        self.length += 1

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return self.array[idx]

Our custom list has a fixed capacity (capacity = 8 items) and stores the elements in the array array.

The ctypes module gives access to the low-level structures on which the standard library is built. In this case, we use it to create a C-style array of capacity elements.

List = array of pointers

The list instantly retrieves an item by index, because it has an array inside. And the array is so fast because all the elements are the same size.

But list items can be of different sizes:

guests = ["Frank", "Claire", "Zoe", True, 42]

To solve this problem, someone came up with the idea of storing item pointers instead of item values. Each element of the array is a memory address, and if you follow this address — you will get the actual value:

List = array of pointers
The array stores pointers adjacently. But the values they refer to can be stored anywhere in memory.

Since pointers are fixed size (8 bytes on modern 64-bit processors), everything works fine. Instead of one operation (get the value from the array cell), we’ve now got two:

  1. Get the address from the array cell.
  2. Get the value at that address.

But it’s still constant time O(1).
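The reference semantics is easy to observe from Python itself. A tiny demo (mine, not from the article):

```python
x = [1, 2, 3]
lst = [x, "abc", 42]
x.append(4)
# the list cell holds a pointer to the same object, not a copy,
# so the change made through x is visible through lst as well
print(lst[0])       # [1, 2, 3, 4]
print(lst[0] is x)  # True
```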

List = dynamic array

If there are empty spaces left in the array underneath the list, then the .append(item) runs in constant time. All it takes is to write a new value to a free cell and increase the element counter by 1:

def append(self, item):
    self.array[self.length] = item
    self.length += 1

But what if the array is already full?

Python has to allocate memory for a new, bigger array and copy all the old items to the new one:

List = dynamic array
When there is no more space in the old array, it's time to create a new one.

Here we go:

def append(self, item):
    if self.length == self.capacity:
        self._resize(self.capacity * 2)
    self.array[self.length] = item
    self.length += 1

def _resize(self, new_cap):
    new_arr = (new_cap * ctypes.py_object)()
    for idx in range(self.length):
        new_arr[idx] = self.array[idx]
    self.array = new_arr
    self.capacity = new_cap

._resize() is a costly operation, so the new array should be significantly larger than the old one. In the example above, the new array is twice as large. Python uses a more modest coefficient — about 1.12.

If you remove more than half of the items from the list via .pop(), Python will shrink it. It’ll allocate a new, smaller array and move the elements into it.

Thus, the list juggles arrays all the time so that we don’t have to do it ツ
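You can watch this juggling with sys.getsizeof, which reports the list’s current allocation. The exact numbers depend on the interpreter version, so none are claimed here:

```python
import sys

lst = []
prev = sys.getsizeof(lst)
for i in range(32):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != prev:
        # a jump means Python allocated a bigger array and copied the items
        print(f"len={len(lst)}: {prev} -> {size} bytes")
        prev = size
```

Most appends change nothing: they land in the spare capacity allocated earlier.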

Appending an item to the list

Selecting from the list by index takes O(1) time — we have sorted that out. The .append(item) method is also O(1) until Python has to extend the array under the list. But array extension is an O(n) operation. So how long does .append() take after all?

It would be wrong to measure a single append — as we found out, sometimes it takes O(1), sometimes O(n). So computer scientists came up with amortized analysis. To get an amortized operation time, one estimates the total time that a sequence of K operations will take, then divides it by K.

Without going into details, I will say that the amortized time for .append(item) turns out to be constant — O(1). So appending to the list works very fast.

Why amortized append time is O(1)

Let's say the list is empty and we want to append n items. For simplicity, we'll use the expansion factor of 2. Let's count the number of atomic operations:

  • 1st item: 1 (copy) + 1 (insert)
  • another 2: 2 (copy) + 2 (insert)
  • another 4: 4 (copy) + 4 (insert)
  • another 8: 8 (copy) + 8 (insert)
  • ...

For n items there will be n insertions.

As for copy:

1 + 2 + 4 + ... + 2**log(n) =
= 2 * 2**log(n) - 1 =
= 2n - 1


So for n items there will be 3n - 1 atomic operations.

O((3n - 1) / n) = O(1)
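The same arithmetic can be checked with a tiny simulation that counts one operation per insert and one per copied item, assuming the doubling factor from the example above:

```python
def total_ops(n):
    """Count atomic operations for n appends with capacity doubling."""
    ops, length, capacity = 0, 0, 1
    for _ in range(n):
        if length == capacity:
            ops += length  # copy everything into the bigger array
            capacity *= 2
        ops += 1           # insert the new item
        length += 1
    return ops

for n in (100, 1_000, 10_000):
    print(n, round(total_ops(n) / n, 2))  # stays below 3: O(1) per append
```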

Summing up, the following operations are guaranteed to be fast:

lst[idx]          # O(1)
len(lst)          # O(1)
lst.append(item)  # amortized O(1)

As we found out, these operations are O(1):

  • select an item by index lst[idx]
  • count items len(lst)
  • add an item to the end of the list .append(item)
  • remove an item from the end of the list .pop()

Other operations are “slow”:

  • Insert or delete an item by index. .insert(idx, item) and .pop(idx) take linear time O(n) because they shift all the elements after the target one.
  • Search or delete an item by value. item in lst, .index(item) and .remove(item) take linear time O(n) because they iterate over all the elements.
  • Select a slice of k elements. lst[from:to] takes O(k).

Does this mean that you should not use “slow” operations? Of course not. If you have a list of 1000 items, the difference between O(1) and O(n) for a single operation is insignificant.

On the other hand, if you perform a “slow” operation on a list of 1000 items a million times — this is quite significant. The same goes if you invoke a single “slow” operation on a list of a million items.

Therefore, it is useful to know which list methods take constant time and which take linear time — to make a conscious decision in a specific situation.

I hope you’ll see Python lists in a new way after this article. Thanks for reading!

Follow @ohmypy on Twitter to keep up with new posts 🚀

Comments on Hacker News

SQLite playground in the browser, 04 Oct 2021

I have built an online SQL playground with vanilla JS and a bit of GitHub API. Here's how.

What I’ve always lacked is something similar to JSFiddle, but for SQLite. An online playground to quickly test an SQL query and share it with others.

Here is what I wanted:

  • Binary database import, not just SQL schema.
  • Support both local and remote databases (by url).
  • Save the database and queries in the cloud.
  • Free of charge, no sign-up required.
  • The latest version of SQLite.
  • Minimalistic and mobile friendly.

So I’ve built SQLime — an online SQLite playground for debugging and sharing SQL snippets.

SQLime - SQLite Playground

First I’ll show the results, then describe how everything works:

Now the details.

SQLite in the browser

All browsers — both mobile and desktop — have an excellent DBMS already built in — SQLite. It implements the SQL-92 standard (and a large part of later standards). It seems only logical to access it through the browser API.

Many browser vendors thought so at the end of the 00s. That’s how Web SQL standard appeared, supported by Apple (Safari), Google (Chrome), and Opera (popular at the time). Not by Mozilla (Firefox), though. As a result, Web SQL was killed in 2010. After that, browser data storage went along the NoSQL path (Indexed Database, Cache API).

In 2019, Ophir Lojkine compiled SQLite sources into WebAssembly (the ‘native’ browser binary format) for the sql.js project. It is a full-fledged SQLite instance that works in the browser (and quite a small one — the binary takes about 1Mb).

sql.js is the perfect engine for an online playground. So I used it.

Loading the database from a file

Get the file from the user via input[type=file], read it with the FileReader, convert into an 8-bit array, and upload to SQLite:

const file = fileInput.files[0];
const reader = new FileReader();
reader.onload = function () {
    const arr = new Uint8Array(reader.result);
    return new SQL.Database(arr);
};
reader.readAsArrayBuffer(file);

Loading the database by URL

Upload the file using fetch(), read the answer into ArrayBuffer, then proceed as with a regular file:

const resp = await fetch(url);
const buffer = await resp.arrayBuffer();
const arr = new Uint8Array(buffer);
return new SQL.Database(arr);

Works equally well with local and remote URLs. Also handles databases hosted on GitHub — just use the raw.githubusercontent.com domain instead of github.com.

Querying the database

Perhaps the simplest part, as sql.js provides a convenient query API:

// execute one or more queries
// and return the last result
const result = db.exec(sql);
if (!result.length) {
    return null;
}
return result[result.length - 1];

Exporting the database to SQL

It is not hard to get the binary database content — sql.js provides a method:

const buffer = db.export();
const blob = new Blob([buffer]);
const link = document.createElement("a");
link.href = window.URL.createObjectURL(blob);
// ...

But I wanted a full SQL script with table schema and contents instead of a binary file. Such script is easier to understand and upload to PostgreSQL or another DBMS.

To export the database, I used the algorithm from the sqlite-dump project. The code is not very concise, so I will not show it here (see dumper.js if interested). In short:

  1. Get a list of tables from the system sqlite_schema table, extract create table... queries.
  2. For each table, get a list of columns from the virtual table table_info(name).
  3. Select data from each table and generate insert into... queries.

It produces a readable script:

create table if not exists employees (
    id integer primary key,
    name text,
    city text,
    department text,
    salary integer
);
insert into "employees" values(11,'Diane','London','hr',70);
insert into "employees" values(12,'Bob','London','hr',78);
insert into "employees" values(21,'Emma','London','it',84);

Saving to the cloud

The database and queries need to be stored somewhere so that you can share a link to the prepared playground. The last thing I wanted was to implement the backend with authorization and storage. That way the service could not stay free, not to mention an extra signup headache.

Fortunately, there is a GitHub Gist API that perfectly fits all criteria:

  • many developers already have GitHub accounts;
  • API allows CORS (allowed to make requests from my domain);
  • nice user interface;
  • free and reliable.

I integrated the Gist API via the ordinary fetch(): GET to load the gist, POST to save it.

// produce an SQL script with db schema and contents
const data = exportDb(db);
// save as gist
fetch("https://api.github.com/gists", {
    method: "post",
    headers: {
        Accept: "application/json",
        "Content-Type": "application/json",
        Authorization: `Token ${token}`,
    },
    body: JSON.stringify(data),
});
All the user needs is to specify the GitHub API token. Conveniently, the token is scoped exclusively to work with gists — it has no access to repositories, so it is guaranteed to do no harm.

User Interface

Modern frontend projects are full of tooling and infrastructure stuff. Honestly, I’m not interested in it at all (I’m not a JS developer). So I deliberately did not use UI frameworks and did everything with vanilla HTML + CSS + JS. It seems to be quite acceptable for a small project.

SQLime on mobile

I took care of the mobile layout: the playground is perfectly usable on the phone. And there are command shortcuts for the desktop.

At the same time, the code turned out to be quite modular, thanks to native JS modules and web components — they are supported by all modern browsers. A real frontend developer will probably wince, but I’m fine.

The playground is hosted on GitHub Pages, and the deployment is a basic git push. Since there is no build stage, I didn’t even have to set up GitHub Actions.


Try SQLime for yourself — see if you find it useful. Or, perhaps, you’ll adopt the approach of creating serverless tools with vanilla JS and GitHub API. Constructive critique is also welcome, of course ツ

Follow @ohmypy on Twitter to keep up with new posts 🚀

Comments on Hacker News

Good Code Criterion, 02 Jun 2021

Optimize T, keep an eye on R.

Good code is understandable and non-greedy. Let’s talk about it.

Time to understanding

The main criterion for good code is the time T it takes for a non-author to understand the code. Not “I sorta get it”, but understand deep enough to make changes and not break anything.

The smaller the T, the better the code.

Let’s say Alice and Bob implemented the same feature, and you want to modify it. If you understand Alice’s code in 10 minutes, and Bob’s code in 30 minutes - Alice’s code is better. It doesn’t matter how layered Bob’s architecture is, whether he used a functional approach, a modern framework, etc.

The T-metric is different for a beginner and an experienced programmer. Therefore, it makes sense to focus on the average level of devs who will use the code. If you have a team of people working for 10+ years, and everyone writes compilers in their spare time - even very complex code will have a low T. If you have a huge turnover and hire yesterday’s students — the code should be rather primitive so that T does not shoot through the roof.

It’s not easy to measure T directly, so usually, teams track secondary metrics that affect T:

  • code style (black for Python),
  • code smells (pylint, flake8),
  • cyclomatic complexity (mccabe),
  • module dependencies (import-linter).

Plus code review.

Resource usage

The second criterion for good code is the amount of resources R it consumes (time, CPU, memory, disk). The smaller the R, the better the code.

If Alice and Bob implemented a feature with the same T, but Alice’s code time complexity is O(n), and Bob’s is O(n²) (with the same consumption of other resources) - Alice’s code is better.

Note about the notorious “sacrifice readability for efficiency”. For each task, there is a resource consumption threshold R0, which the solution should not exceed. If R < R0, do not degrade T for the sake of further reducing R.

If a non-critical service processes a request in 50ms, you don’t need to rewrite it from Python to C to reduce the time to 5ms. The thing is already fast enough.

If the code has a high T and a low R, in most cases you can reduce T while keeping R < R0.

But sometimes, if resources are limited, or the input data is huge, it may not be possible to reach R < R0 without degrading T. Then you really have to sacrifice clarity. But make sure that:

  1. This is the last resort when all the other options have failed.
  2. The code sections where T is traded for R are well isolated.
  3. There are few such sections.
  4. They are well-documented.


Here is the mnemonics for good code:

T↓ R<R0

Optimize T, keep an eye on R. Your team will thank you.

Thanks for reading! Follow @ohmypy on Twitter to keep up with new posts 🚀

Data Visualization Guide, 10 Apr 2021

A design guide for presentations, reports, and dashboards.

Today I’ve come across a perfect information graphics / data visualization guide:

  • Based on the works of Edward Tufte and Stephen Few.
  • Comprehensive yet not too wordy (150 pages).
  • Highly practical and with lots of examples (197 figures).

The book provides advice on designing clear, concise, and actionable reports and dashboards:

  1. How to articulate the message.
  2. How to choose an appropriate chart type.
  3. How to design specific chart elements.
  4. How to avoid clutter and increase information density.
  5. How to make everything clear and consistent.

Sample report

While the guide itself is great, the authors - IBCS Association - made some questionable choices:

  1. They vaguely and somewhat misleadingly called it ‘International Business Communication Standards’.
  2. They presented it on the IBCS website in a way that is barely readable (in my opinion).

Fortunately, IBCS published the guide under the permissive CC BY-SA license. So with some hard work and a bunch of Python I’ve created a web version, EPUB and PDF.

IBCS Association put a lot of thought into the guide, and the result really impressed me. So I encourage you to try it out.

Thanks for reading! Follow @ohmypy on Twitter to keep up with new posts 🚀

How to make an awesome Python package in 2021, 06 Apr 2021

The one to be proud of.

If you are like me, every once in a while you write a useful Python utility and want to share it with your colleagues. The best way to do this is to make a package: it is easy to install and saves you from copy-pasting.

If you are like me, you might be thinking that creating packages is a real headache. Well, that’s not the case anymore. And I am going to prove it with this step-by-step guide. Just three main steps (and a bunch of optional ones) accompanied by a few GitHub links. See for yourself:

1. Stub

We will create podsearch - a utility that searches for podcasts in iTunes. Let’s create a directory and a virtual environment:

$ mkdir podsearch
$ cd podsearch
$ python3 -m venv env
$ . env/bin/activate

Define a minimal package structure:

├── .gitignore
└── podsearch
    └── __init__.py

And the podsearch/__init__.py stub:

"""Let's find some podcasts!"""

__version__ = "0.1.0"


def search(name, count=5):
    """Search podcast by name."""
    raise NotImplementedError()

2. Test package

Creating a package in Python used to be a troublesome task. Fortunately, nowadays there is a great little flit utility which simplifies everything. Let’s install it:

pip install flit

And create package description:

$ flit init
Module name [podsearch]:
Author [Anton Zhiyanov]:
Author email []:
Home page []:
Choose a license (see for more info)
1. MIT - simple and permissive
2. Apache - explicitly grants patent rights
3. GPL - ensures that code based on this is shared with the same terms
4. Skip - choose a license later
Enter 1-4 [1]: 1

Written pyproject.toml; edit that file to add optional extra info.


Flit has created pyproject.toml - the project metadata file. It already has everything you need to publish the package to the public repository - PyPI.

Sign up for TestPyPi (test repository) and PyPI (the main one). They are completely independent, so you will need two accounts.

Set up access to the repositories in ~/.pypirc:

[distutils]
index-servers =
    pypi
    pypitest

[pypi]
username: nalgeon  # replace with your PyPI username

[pypitest]
repository: https://test.pypi.org/legacy/
username: nalgeon  # replace with your TestPyPI username

And publish the package to the test repository:

$ flit publish --repository pypitest
Found 4 files tracked in git
Package is at

Done! The package is available on TestPyPi.

3. Public package

Let’s improve the code so that it actually searches for podcasts:

# ...


@dataclass
class Podcast:
    """Podcast metadata."""

    id: str
    name: str
    author: str
    url: str
    feed: Optional[str] = None
    category: Optional[str] = None
    image: Optional[str] = None

def search(name: str, limit: int = 5) -> List[Podcast]:
    """Search podcast by name."""
    params = {"term": name, "limit": limit, "media": "podcast"}
    response = _get(url=SEARCH_URL, params=params)
    return _parse(response)
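The elided part above hides the imports and helper functions. Here is one possible sketch of them using only the standard library. The iTunes response field names (collectionId, collectionName and so on) and the helper bodies are my assumptions, not code from the article:

```python
import json
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_URL = "https://itunes.apple.com/search"


@dataclass
class Podcast:
    """Podcast metadata (mirrors the class from the article)."""
    id: str
    name: str
    author: str
    url: str
    feed: Optional[str] = None


def _get(url: str, params: dict) -> dict:
    """Perform a GET request and decode the JSON response."""
    with urlopen(url + "?" + urlencode(params)) as resp:
        return json.load(resp)


def _parse(response: dict) -> List[Podcast]:
    """Convert the raw API response into Podcast objects."""
    return [
        Podcast(
            id=str(item["collectionId"]),
            name=item["collectionName"],
            author=item["artistName"],
            url=item["collectionViewUrl"],
            feed=item.get("feedUrl"),
        )
        for item in response.get("results", [])
    ]
```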

And publish to the main repository - PyPI. Perform this step only after your package has working code that does something useful. Do not publish non-working packages and stubs.

flit publish

Done! Time to share it with colleagues.

To make the package a pleasure to use, I recommend that you follow a few more steps.

A. Readme and changelog

No one likes to write documentation. But without docs, it is unlikely that people will want to install your package, so let’s add README.md and CHANGELOG.md.

Add readme to the pyproject.toml, so that PyPI shows it on the package page:

description-file = "README.md"

Also specify the minimal supported Python version:

requires-python = ">=3.7"

Update the version in __init__.py and publish the package via flit publish:

Package on PyPi

Mmm, nice.

B. Linters and tests

Let’s take care of formatting (black), test coverage (coverage), code quality (flake8, pylint, mccabe), and static analysis (mypy). We will run everything through tox.

$ pip install black coverage flake8 mccabe mypy pylint pytest tox

Create tox configuration in tox.ini:

[tox]
isolated_build = True
envlist = py37,py38,py39

[testenv]
deps =
    black
    coverage
    flake8
    mccabe
    mypy
    pylint
    pytest
commands =
    black podsearch
    flake8 podsearch
    pylint podsearch
    mypy podsearch
    coverage erase
    coverage run --include=podsearch/* -m pytest -ra
    coverage report -m


And run all the checks:

$ tox -e py39
py39 run-test: commands[0] | black podsearch
All done! ✨ 🍰 ✨
py39 run-test: commands[2] | pylint podsearch
Your code has been rated at 10.00/10 (previous run: 10.00/10, +0.00)
py39 run-test: commands[6] | coverage report -m
TOTAL 100%
py39: commands succeeded
congratulations :)

Lovely! The linters are happy, the tests passed, the coverage is 100%.

C. Cloud build

Every solid open-source project runs cloud tests after each commit, so we will too. A nice side effect is having beautiful badges in the readme ツ

Let’s build the project with GitHub Actions, check test coverage with Codecov and code quality with Code Climate.

You will have to sign up for Codecov and Code Climate (both support GitHub login) and enable package repository in the settings.

After that, add the GitHub Actions build config to .github/workflows/build.yml:

# ...
jobs:
    build:
        runs-on: ubuntu-latest
        strategy:
            matrix:
                python-version: [3.7, 3.8, 3.9]

        env:
            USING_COVERAGE: "3.9"

        steps:
            - name: Checkout sources
              uses: actions/checkout@v2

            - name: Set up Python
              uses: actions/setup-python@v2
              with:
                  python-version: ${{ matrix.python-version }}

            - name: Install dependencies
              run: |
                  python -m pip install --upgrade pip
                  python -m pip install black coverage flake8 flit mccabe mypy pylint pytest tox tox-gh-actions

            - name: Run tox
              run: |
                  python -m tox

            - name: Upload coverage to Codecov
              uses: codecov/codecov-action@v1
              if: contains(env.USING_COVERAGE, matrix.python-version)
              with:
                  fail_ci_if_error: true

GitHub runs tests via tox - just as we did. The tox-gh-actions package and the USING_COVERAGE setting ensure that tox uses the same Python version as GitHub Actions itself, as required by strategy.matrix (I learned this clever trick from Hynek Schlawack).
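For the record, tox-gh-actions needs a version map in tox.ini telling it which tox environment corresponds to which GitHub Actions Python. This section follows the package's documented format:

```ini
[gh-actions]
python =
    3.7: py37
    3.8: py38
    3.9: py39
```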

The last step sends test coverage to Codecov. Code Climate does not need a separate step - it spots repository changes automatically.

Now commit, push, and enjoy the result in a minute. And let everyone enjoy as well - add badges to the readme:

[![PyPI Version][pypi-image]][pypi-url]
[![Build Status][build-image]][build-url]
[![Code Coverage][coverage-image]][coverage-url]
[![Code Quality][quality-image]][quality-url]


<!-- Badges -->


Aren’t they cute?

Readme badges

D. Task automation

tox is fine, but not very handy for development. It’s faster to run individual commands like pylint, coverage etc. But they are quite verbose, so we’ll automate the boring stuff.

Let’s create short aliases for frequent actions with Makefile:

.PHONY: coverage deps help lint push test

coverage:  ## Run tests with coverage
	coverage erase
	coverage run --include=podsearch/* -m pytest -ra
	coverage report -m

deps:  ## Install dependencies
	pip install black coverage flake8 mccabe mypy pylint pytest tox

help:  ## Show help message
	@grep -E '^[a-zA-Z_-]+:.*## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*## "}; {printf "%-20s %s\n", $$1, $$2}'

lint:  ## Lint and static-check
	flake8 podsearch
	pylint podsearch
	mypy podsearch

push:  ## Push code with tags
	git push && git push --tags

test:  ## Run tests
	pytest -ra

Here are our tasks:

$ make help
Usage: make [task]

task                 help
------               ----
coverage             Run tests with coverage
deps                 Install dependencies
lint                 Lint and static-check
push                 Push code with tags
test                 Run tests
help                 Show help message

To make the code more DRY, replace raw build.yml steps with make calls:

- name: Install dependencies
  run: |
            make deps

- name: Run tox
  run: |
            make tox
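For make tox to work, the Makefile needs a matching target - it is not in the Makefile above, so here is a minimal one:

```make
tox:  ## Run tox
	python -m tox
```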

E. Cloud publish

GitHub is quite capable of running flit publish for us. Let’s create a separate workflow:

name: publish

on:
    release:
        types: [created]

jobs:
    publish:
        runs-on: ubuntu-latest
        steps:
            - name: Checkout sources
              uses: actions/checkout@v2

            - name: Set up Python
              uses: actions/setup-python@v2
              with:
                  python-version: "3.9"

            - name: Install dependencies
              run: |
                  make deps

            - name: Publish to PyPI
              env:
                  FLIT_USERNAME: ${{ secrets.PYPI_USERNAME }}
                  FLIT_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
              run: |
                  make publish

PYPI_USERNAME and PYPI_PASSWORD are set in the repository settings (Settings > Secrets > New repository secret). Use your PyPI username and password, or better yet - an API token.
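The make publish call in the workflow also assumes a Makefile target; a minimal sketch:

```make
publish:  ## Publish to PyPI
	flit publish
```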

Now GitHub will automatically publish the package as soon as you create a new release. Sweet!

⌘ ⌘ ⌘

Your perfect package is ready! It has everything one could dream of: clean code, clear documentation, tests, and cloud builds. Time to tell your colleagues and friends.

These settings will make your package awesome.

Thanks for reading! Follow @ohmypy on Twitter to keep up with new posts 🚀

Comments on Hacker News

SQLite is not a toy database, 25 Mar 2021 09:00:00 +0000

Whether you are a developer, data analyst, QA engineer, DevOps person, or product manager - SQLite is a perfect tool for you. Here is why.

A few well-known facts to get started:

  • SQLite is the most common DBMS in the world, shipped with all popular operating systems.
  • SQLite is serverless.
  • For developers, SQLite is embedded directly into the app.
  • For everyone else, there is a convenient database console (REPL), provided as a single file (sqlite3.exe on Windows, sqlite3 on Linux / macOS).

Console, import, and export

The console is a killer SQLite feature for data analysis: more powerful than Excel and simpler than pandas. One can import CSV data with a single command, and the table is created automatically:

> .import --csv city.csv city
> select count(*) from city;

The console supports basic SQL features and shows query results in a nice ASCII-drawn table. Advanced SQL features are also supported, but more on that later.

select
  century || ' century' as dates,
  count(*) as city_count
from history
group by century
order by century desc;
│   dates    │ city_count │
│ 21 century │ 1          │
│ 20 century │ 263        │
│ 19 century │ 189        │
│ 18 century │ 191        │
│ 17 century │ 137        │
│ ...        │ ...        │

Data could be exported as SQL, CSV, JSON, even Markdown and HTML. Takes just a couple of commands:

.mode json
.output city.json
select city, foundation_year, timezone from city limit 10;
.shell cat city.json
[
    { "city": "Amsterdam", "foundation_year": 1300, "timezone": "UTC+1" },
    { "city": "Berlin", "foundation_year": 1237, "timezone": "UTC+1" },
    { "city": "Helsinki", "foundation_year": 1548, "timezone": "UTC+2" },
    { "city": "Monaco", "foundation_year": 1215, "timezone": "UTC+1" },
    { "city": "Moscow", "foundation_year": 1147, "timezone": "UTC+3" },
    { "city": "Reykjavik", "foundation_year": 874, "timezone": "UTC" },
    { "city": "Sarajevo", "foundation_year": 1461, "timezone": "UTC+1" },
    { "city": "Stockholm", "foundation_year": 1252, "timezone": "UTC+1" },
    { "city": "Tallinn", "foundation_year": 1219, "timezone": "UTC+2" },
    { "city": "Zagreb", "foundation_year": 1094, "timezone": "UTC+1" }
]

If you are more of a BI than a console person - popular data exploration tools like Metabase, Redash, and Superset all support SQLite.

Native JSON

There is nothing more convenient than SQLite for analyzing and transforming JSON. You can select data directly from a file as if it were a regular table. Or import data into the table and select from there.

select
  json_extract(value, '$.iso.code') as code,
  json_extract(value, '$.iso.number') as num,
  json_extract(value, '$.name') as name,
  json_extract(value, '$') as unit
from json_each(readfile('currency.sample.json'));
│ code │ num │      name       │   unit   │
│ ARS  │ 032 │ Argentine peso  │ peso     │
│ CHF  │ 756 │ Swiss Franc     │ franc    │
│ EUR  │ 978 │ Euro            │ euro     │
│ GBP  │ 826 │ British Pound   │ pound    │
│ INR  │ 356 │ Indian Rupee    │ rupee    │
│ JPY  │ 392 │ Japanese yen    │ yen      │
│ MAD  │ 504 │ Moroccan Dirham │ dirham   │
│ RUR  │ 643 │ Russian Rouble  │ rouble   │
│ SOS  │ 706 │ Somali Shilling │ shilling │
│ USD  │ 840 │ US Dollar       │ dollar   │

Doesn’t matter how deep the JSON is - you can extract any nested object:

select
  json_extract(value, '$.id') as id,
  json_extract(value, '$.name') as name
from json_tree(...)
where path like '$[%].industries';
│   id   │         name         │
│ 7.538  │ Internet provider    │
│ 7.539  │ IT consulting        │
│ 7.540  │ Software development │
│ 9.399  │ Mobile communication │
│ 9.400  │ Fixed communication  │
│ 9.401  │ Fiber-optics         │
│ 43.641 │ Audit                │
│ 43.646 │ Insurance            │
│ 43.647 │ Bank                 │

CTEs and set operations

Of course, SQLite supports Common Table Expressions (WITH clause) and JOINs, I won’t even give examples here. If the data is hierarchical (the table refers to itself through a column like parent_id) - WITH RECURSIVE will come in handy. Any hierarchy, no matter how deep, can be ‘unrolled’ with a single query.

with recursive tmp(id, name, level) as (
  select id, name, 1 as level
  from area
  where parent_id is null

  union all

  select area.id,
    tmp.name || ', ' || area.name as name,
    tmp.level + 1 as level
  from area
    join tmp on area.parent_id = tmp.id
)
select * from tmp;
│  id  │           name           │ level │
│ 93   │ US                       │ 1     │
│ 768  │ US, Washington DC        │ 2     │
│ 1833 │ US, Washington           │ 2     │
│ 2987 │ US, Washington, Bellevue │ 3     │
│ 3021 │ US, Washington, Everett  │ 3     │
│ 3039 │ US, Washington, Kent     │ 3     │
│ ...  │ ...                      │ ...   │

Sets? No problem: UNION, INTERSECT, EXCEPT are at your service.

select employer_id
from employer_area
where area_id = 1

intersect

select employer_id
from employer_area
where area_id = 2;

Calculate one column based on several others? Enter generated columns:

alter table vacancy
add column salary_net integer as (
  -- the original tax formula is elided; a plausible one:
  case when salary_gross = true
    then round(salary / 1.13)
    else salary
  end
);

Generated columns can be queried in the same way as ‘normal’ ones:

select
  substr(name, 1, 40) as name,
  salary_net
from vacancy
where
  salary_currency = 'JPY'
  and salary_net is not null
limit 10;

Math statistics

Descriptive statistics? Easy: mean, median, percentiles, standard deviation, you name it. You’ll have to load an extension, but it’s also a single command (and a single file).

.load sqlite3-stats

select
  count(*) as book_count,
  cast(avg(num_pages) as integer) as mean,
  cast(median(num_pages) as integer) as median,
  mode(num_pages) as mode,
  percentile_90(num_pages) as p90,
  percentile_95(num_pages) as p95,
  percentile_99(num_pages) as p99
from books;
│ book_count │ mean │ median │ mode │ p90 │ p95 │ p99  │
│ 1483       │ 349  │ 295    │ 256  │ 640 │ 817 │ 1199 │

A note on extensions: SQLite is missing a lot of functions compared to other DBMSs like PostgreSQL. They are easy to add though, which is what people do - and the result is quite a mess.

Therefore, I decided to make a consistent set of extensions, divided by domain area and compiled for major operating systems. There are only a few of them so far, but more are on the way:

sqlean @ GitHub

More fun with statistics. You can plot the data distribution right in the console. Look how cute it is:

with slots as (
  select
    num_pages/100 as slot,
    count(*) as book_count
  from books
  group by slot
),
max as (
  select max(book_count) as value
  from slots
)
select
  slot,
  book_count,
  printf('%.' || (book_count * 30 / max.value) || 'c', '*') as bar
from slots, max
order by slot;
│ slot │ book_count │              bar               │
│ 0    │ 116        │ *********                      │
│ 1    │ 254        │ ********************           │
│ 2    │ 376        │ ****************************** │
│ 3    │ 285        │ **********************         │
│ 4    │ 184        │ **************                 │
│ 5    │ 90         │ *******                        │
│ 6    │ 54         │ ****                           │
│ 7    │ 41         │ ***                            │
│ 8    │ 31         │ **                             │
│ 9    │ 15         │ *                              │
│ 10   │ 11         │ *                              │
│ 11   │ 12         │ *                              │
│ 12   │ 2          │ *                              │


SQLite works with hundreds of millions of records just fine. Regular INSERTs show about 240K records per second on my laptop. And if you connect a CSV file as a virtual table (there is an extension for that), inserts become twice as fast.

.load sqlite3-vsv

create virtual table temp.blocks_csv using vsv(
    schema="create table x(network text, geoname_id integer, registered_country_geoname_id integer, represented_country_geoname_id integer, is_anonymous_proxy integer, is_satellite_provider integer, postal_code text, latitude real, longitude real, accuracy_radius integer)"
);

.timer on
insert into blocks
select * from blocks_csv;

Run Time: real 5.176 user 4.716420 sys 0.403866
select count(*) from blocks;

Run Time: real 0.095 user 0.021972 sys 0.063716
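The insert throughput is easy to check yourself. A rough sketch in Python (the table and row count here are illustrative; exact numbers depend on your machine):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("create table blocks(network text, geoname_id integer)")

# generate 100K fake rows
rows = [("192.0.2.%d/24" % (i % 256), i) for i in range(100_000)]

start = time.perf_counter()
conn.executemany("insert into blocks values (?, ?)", rows)
conn.commit()
elapsed = time.perf_counter() - start

print("%.0f records/second" % (len(rows) / elapsed))
```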

There is a popular opinion among developers that SQLite is not suitable for the web, because it doesn’t support concurrent access. This is a myth. In write-ahead log mode (WAL, available since 2010), there can be as many concurrent readers as you want. There can be only one concurrent writer, but often one is enough.
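Turning WAL on takes a single pragma. The setting is stored in the database file, so it only needs to be done once (the file name below is just an example):

```python
import os
import sqlite3
import tempfile

# WAL only applies to file-based databases, so create one
path = os.path.join(tempfile.mkdtemp(), "app.db")
conn = sqlite3.connect(path)

# switch the journal to write-ahead log mode;
# the pragma returns the mode actually in effect
mode = conn.execute("pragma journal_mode=wal").fetchone()[0]
print(mode)  # wal
```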

SQLite is a perfect fit for small websites and applications. My own website uses SQLite as a database, not bothering with optimization (≈200 queries per page). It handles 700K visits per month and serves pages faster than 95% of the websites I’ve seen.

SQLite supports partial indexes and indexes on expressions, as ‘big’ DBMSs do. You can build indexes on generated columns and even turn SQLite into a document database. Just store raw JSON and build indexes on json_extract()-ed columns:

create table currency(
  body text,
  code text as (json_extract(body, '$.code')),
  name text as (json_extract(body, '$.name'))
);

create index currency_code_idx on currency(code);

insert into currency
select value
from json_each(readfile('currency.sample.json'));

explain query plan
select name from currency where code = 'EUR';

`--SEARCH TABLE currency USING INDEX currency_code_idx (code=?)

You can also use SQLite as a graph database. A bunch of complex WITH RECURSIVE will do the trick, or maybe you’ll prefer to add a bit of Python:

simple-graph @ GitHub

Full-text search works out of the box:

create virtual table books_fts
using fts5(title, author, publisher);

insert into books_fts
select title, author, publisher from books;

select
  author,
  substr(title, 1, 30) as title,
  substr(publisher, 1, 10) as publisher
from books_fts
where books_fts match 'ann'
limit 5;
│       author        │             title              │ publisher  │
│ Ruby Ann Boxcar     │ Ruby Ann's Down Home Trailer P │ Citadel    │
│ Ruby Ann Boxcar     │ Ruby Ann's Down Home Trailer P │ Citadel    │
│ Lynne Ann DeSpelder │ The Last Dance: Encountering D │ McGraw-Hil │
│ Daniel Defoe        │ Robinson Crusoe                │ Ann Arbor  │
│ Ann Thwaite         │ Waiting for the Party: The Lif │ David R. G │

Maybe you need an in-memory database for intermediate computations? That’s a single line of Python code:

db = sqlite3.connect(":memory:")

You can even access it from multiple connections:

db = sqlite3.connect("file::memory:?cache=shared", uri=True)
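A quick sketch of two connections sharing the same in-memory database (note the uri=True flag, which makes Python’s sqlite3 parse the file: URI; the table is hypothetical):

```python
import sqlite3

uri = "file::memory:?cache=shared"
first = sqlite3.connect(uri, uri=True)
second = sqlite3.connect(uri, uri=True)

first.execute("create table counters(name text, value integer)")
first.execute("insert into counters values ('visits', 42)")
first.commit()

# the second connection sees data written by the first one
value = second.execute("select value from counters").fetchone()[0]
print(value)  # 42
```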

And so much more

There are fancy window functions (just like in PostgreSQL). UPSERT, UPDATE FROM, and generate_series(). R-Tree indexes. Regular expressions, fuzzy-search, and geo. In terms of features, SQLite can compete with any ‘big’ DBMS.
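A quick taste of window functions (requires SQLite 3.25+, which ships with recent Python builds) - a running total over a hypothetical sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sales(year integer, amount integer);
    insert into sales values (2019, 100), (2020, 150), (2021, 250);
""")

# sum(...) over (order by ...) turns a plain aggregate
# into a cumulative total
rows = conn.execute("""
    select year, sum(amount) over (order by year) as total
    from sales
    order by year
""").fetchall()
print(rows)  # [(2019, 100), (2020, 250), (2021, 500)]
```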

There is also great tooling around SQLite. I especially like Datasette - an open-source tool for exploring and publishing SQLite datasets. And DBeaver is an excellent open-source database IDE with support for the latest SQLite versions.

I hope this article will inspire you to try SQLite. Thanks for reading!

Follow @ohmypy on Twitter to keep up with new posts 🚀

Comments on Hacker News

How to create a 1M record table with a single query, 21 Mar 2021 21:15:00 +0000

Using a recursive CTE and randomized data.

Let’s say you want to check how a query behaves on a large table - but there is no such table at hand. This is not a problem if your DBMS supports SQL recursion: lots of data can be generated with a single query. The WITH RECURSIVE clause comes to the rescue.

I’m going to use SQLite, but the same (or similar) queries will work for PostgreSQL and other DBMSs. Specifically, WITH RECURSIVE is supported in MariaDB 10.2+, MySQL 8.0+, PostgreSQL 8.4+ and SQLite 3.8+. Oracle 11.2+ and SQL Server 2005+ support recursive queries, but without the RECURSIVE keyword.

Random numbers

Let’s create a table with 1 million random numbers:

create table random_data as
with recursive tmp(x) as (
    select random()
    union all
    select random() from tmp
    limit 1000000
)
select * from tmp;

Or, if your database supports generate_series() (and does not support limit in recursive queries, like PostgreSQL):

create table random_data as
select random() as x
from generate_series(1, 1000000);


sqlite> select count(*) from random_data;

sqlite> select avg(x) from random_data;

Numeric sequence

Let’s fill the table with numbers from one to a million instead of random numbers:

create table seq_data as
with recursive tmp(x) as (
    select 1
    union all
    select x+1 from tmp
    limit 1000000
)
select * from tmp;

Or with generate_series():

create table seq_data as
select value as x
from generate_series(1, 1000000);


sqlite> select count(*) from seq_data;

sqlite> select avg(x) from seq_data;

sqlite> select min(x) from seq_data;

sqlite> select max(x) from seq_data;

Randomized data

Numbers are fine, but what if you need a large table filled with customer data? No sweat!

Let’s agree on some rules:

  • customer has an ID, name, and age;
  • ID is filled sequentially from 1 to 1000000;
  • name is randomly selected from a fixed list;
  • age is a random number from 1 to 80.

Let’s create a table of names:

create table names (
    id integer primary key,
    name text
);

insert into names(id, name)
values
(1, 'Ann'),
(2, 'Bill'),
(3, 'Cindy'),
(4, 'Diane'),
(5, 'Emma');

And generate some customers:

create table person_data as
with recursive tmp(id, idx, name, age) as (
    select 1, 1, 'Ann', 20
    union all
    select tmp.id + 1 as id,
        abs(random() % 5) + 1 as idx,
        (select name from names where id = idx) as name,
        abs(random() % 80) + 1 as age
    from tmp
    limit 1000000
)
select id, name, age from tmp;

Or with generate_series():

create table person_data as
with tmp as (
    select
        value as id,
        abs(random() % 5) + 1 as idx,
        abs(random() % 80) + 1 as age
    from generate_series(1, 1000000)
)
select
    id,
    (select name from names where id = idx) as name,
    age
from tmp;

Everything is according to the rules here:

  • id is calculated as the previous value + 1;
  • idx field contains a random number from 1 to 5;
  • name is selected from the names table according to idx value;
  • age is calculated as a random number from 1 to 80.

Check it out:

sqlite> select count(*) from person_data;

sqlite> select * from person_data limit 10;
│ id │ name  │ age │
│ 1  │ Ann   │ 20  │
│ 2  │ Ann   │ 33  │
│ 3  │ Ann   │ 26  │
│ 4  │ Ann   │ 4   │
│ 5  │ Diane │ 20  │
│ 6  │ Diane │ 76  │
│ 7  │ Bill  │ 42  │
│ 8  │ Cindy │ 35  │
│ 9  │ Diane │ 6   │
│ 10 │ Ann   │ 29  │

A single query has brought us a million customers. Not bad! It would be great to achieve such results in sales, wouldn’t it? ツ

Follow @ohmypy on Twitter to keep up with new posts!

Comments on Hacker News

Automate your Python project with Makefile, 16 Mar 2021 17:15:00 +0000

Makefile is not just some relic from the 70s.

When working on a library or application, certain tasks tend to show up over and over again:

  • checking the code with linters,
  • running tests with coverage,
  • deploying with Docker, and so on.

JS developers are lucky (ha!): their package.json has a special scripts section for this stuff:

{
    "scripts": {
        "format": "prettier --write \"src/**/*.ts\"",
        "lint": "tslint -p tsconfig.json",
        "test": "jest --coverage --config jestconfig.json"
    }
}

Nothing like this is provided with Python. You can, of course, make a .sh script for each task. But it litters the project directory, and it’s better to keep all such tasks together. Installing a separate task runner or using the one built into IDE also seems weird.

Good news: Linux and macOS already have a great task automation tool for any project - Makefile.

Makefile for task automation

Perhaps you, like me, thought that Makefile is a relic from the 70s, useful only for compiling C programs. True. But it is also perfectly suitable for automating any tasks in general.

Here’s what it might look like in a python project. Create a file named Makefile:

coverage:  ## Run tests with coverage
	coverage erase
	coverage run --include=podsearch/* -m pytest -ra
	coverage report -m

deps:  ## Install dependencies
	pip install black coverage flake8 mypy pylint pytest tox

lint:  ## Lint and static-check
	flake8 podsearch
	pylint podsearch
	mypy podsearch

push:  ## Push code with tags
	git push && git push --tags

test:  ## Run tests
	pytest -ra

And run linter with tests, for example:

$ make lint coverage

flake8 podsearch
pylint podsearch
mypy podsearch
coverage erase
coverage run --include=podsearch/* -m pytest -ra
coverage report -m
Name                    Stmts   Miss  Cover   Missing
podsearch/       2      0   100%
podsearch/          17      0   100%
podsearch/      51      0   100%
TOTAL                      70      0   100%


Task steps

A task can include multiple steps, like lint in the example above:

lint:  ## Lint and static-check
	flake8 podsearch
	pylint podsearch
	mypy podsearch

Each step is executed in a separate subprocess. To run a chain of actions (for example, cd and git pull) combine them through &&:

push:  ## Push code with tags
	git push && git push --tags

Task dependencies

Consider the test task, which should first perform linting, and then run the tests. Specify lint as a dependency for test, and you’re done:

test: lint
	pytest -ra

You can specify multiple space-separated dependencies. Or tasks can explicitly call each other:

lint:
	flake8 podsearch
	pylint podsearch
	mypy podsearch

test:
	pytest -ra

all:
	make lint
	make test

Task parameters

Consider the serve task which serves a static site, with IP and port specified as parameters. No problem:

serve:
	python -m http.server $(port) --bind $(bind) --directory dist

Run task with parameters:

$ make serve bind=localhost port=3000

You can specify default parameter values:

bind ?= localhost
port ?= 3000

serve:
	python -m http.server $(port) --bind $(bind) --directory dist

Now they are optional when running make:

$ make serve bind=
$ make serve port=8000
$ make serve

And so much more

If basic features are not enough, there are some great in-depth guides:

In the wild

Here is a Makefile from one of my projects (podcast search tool):


Makefiles are great for automating routine tasks, regardless of the language you prefer. Use them!

And follow @ohmypy on Twitter to keep up with new posts