Compact objects in Python

Python is an object-oriented language. This is all nice and cozy until you run out of memory holding 10 million objects at once. Let's talk about how to reduce Python's appetite.

Tuples

Imagine you have a simple Pet object with the name (string) and price (integer) attributes. Intuitively, it seems that the most compact representation is a tuple:

("Frank the Pigeon", 50000)

Let's measure how much memory this beauty eats:

import random
import string
from pympler.asizeof import asizeof

def fields():
    name_gen = (random.choice(string.ascii_uppercase) for _ in range(10))
    name = "".join(name_gen)
    price = random.randint(10000, 99999)
    return (name, price)

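def measure(name, fn, n=10_000, baseline=None):
    # Build n objects and report the average deep size of one of them
    pets = [fn() for _ in range(n)]
    size = round(asizeof(pets) / n)
    print(f"Pet size ({name}) = {size} bytes")
    if baseline is not None:
        # Show how this variant compares to the plain-tuple baseline
        print(f"x{size/baseline:.2f} to baseline")
    return size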

baseline = measure("tuple", fields)

161 bytes. Let's use that as a baseline for further comparison.
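
To see roughly where those bytes go, one can add up the parts of a single record with sys.getsizeof. This is only a rough sketch: it assumes CPython on a 64-bit build, the 8-byte list slot is an assumption about pointer size, and the exact numbers differ between Python versions (so none are hard-coded here). asizeof may report a slightly different total than this sum.

```python
import sys

# Same shape as the records produced by fields(): 10-char name, 5-digit price
pet = ("ABCDEFGHIJ", 50000)
name, price = pet

parts = {
    "tuple shell": sys.getsizeof(pet),    # the tuple itself, holding two pointers
    "name (str)": sys.getsizeof(name),    # the string object
    "price (int)": sys.getsizeof(price),  # the integer object
    "list slot": 8,                       # assumption: one 8-byte pointer in the outer list
}

for label, size in parts.items():
    print(f"{label:12} {size:3} bytes")
print(f"{'total':12} {sum(parts.values()):3} bytes")
```

The point is that the tuple shell is only part of the cost: the string and the integer are separate objects, each with its own header.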

Dataclasses vs named tuples

But who works with tuples these days? You would probably choose a dataclass:

from dataclasses import dataclass

@dataclass
class PetData:
    name: str
    price: int

fn = lambda: PetData(*fields())

base = measure("baseline", fields)
measure("dataclass", fn, baseline=base)

The thing is, a dataclass instance is 1.6 times larger than a tuple.

Let's try a named tuple then:

from typing import NamedTuple

class PetTuple(NamedTuple):
    name: str
    price: int


fn = lambda: PetTuple(*fields())

base = measure("baseline", fields)
measure("named tuple", fn, baseline=base)
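
Here is a small sketch of what a named tuple gives you on top of a plain tuple. It reuses the PetTuple class from the snippet above; the pet values are just illustrative.

```python
from typing import NamedTuple

class PetTuple(NamedTuple):
    name: str
    price: int

pet = PetTuple("Frank the Pigeon", 50000)

# Dataclass-like: fields are accessible by name
print(pet.name, pet.price)

# Tuple-like: indexing, unpacking and comparison with plain tuples all work
name, price = pet
print(pet[0] == name, pet == ("Frank the Pigeon", 50000), isinstance(pet, tuple))

# Also tuple-like: instances are immutable
try:
    pet.price = 60000
except AttributeError as exc:
    print("cannot modify:", exc)
```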

Looks like a dataclass, works like a tuple. Perfect. Or not?

Slots

Python 3.10 received dataclasses with slots:

@dataclass(slots=True)
class PetData:
    name: str
    price: int


fn = lambda: PetData(*fields())

base = measure("baseline", fields)
measure("dataclass w/slots", fn, baseline=base)

Wow! Slots magic creates special skinny objects without an underlying dictionary, unlike regular Python objects. Such a dataclass is even lighter than a tuple.
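
The missing dictionary is easy to see with plain classes. The sketch below uses two hypothetical classes, PetPlain and PetSlots, and sys.getsizeof (shallow sizes only, CPython assumed), so the absolute numbers will not match the asizeof results above and depend on the Python version.

```python
import sys

class PetPlain:
    def __init__(self, name, price):
        self.name = name
        self.price = price

class PetSlots:
    __slots__ = ("name", "price")

    def __init__(self, name, price):
        self.name = name
        self.price = price

plain = PetPlain("Frank the Pigeon", 50000)
slotted = PetSlots("Frank the Pigeon", 50000)

print(hasattr(plain, "__dict__"))    # True: attributes live in a per-instance dict
print(hasattr(slotted, "__dict__"))  # False: attributes live in fixed slots

# Shallow footprint: the object itself, plus its dict if it has one
plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slots_size = sys.getsizeof(slotted)
print(f"plain = {plain_size} bytes, slotted = {slots_size} bytes")
```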

What if upgrading to 3.10 is not an option yet? Use NamedTuple. Or add the __slots__ dunder manually:

@dataclass
class PetData:
    __slots__ = ("name", "price")
    name: str
    price: int

Slot objects have their own shortcomings. But they are great for simple cases (without inheritance and other complex stuff).
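
Two of those shortcomings, as an illustration (not an exhaustive list). The sketch reuses the manually slotted PetData; the PetWithDefault class and the errors list are hypothetical names added only for the demo.

```python
from dataclasses import dataclass

@dataclass
class PetData:
    __slots__ = ("name", "price")
    name: str
    price: int

pet = PetData("Frank the Pigeon", 50000)
errors = []

# 1. No instance dictionary means no ad-hoc attributes
try:
    pet.owner = "Alice"
except AttributeError as exc:
    errors.append(type(exc).__name__)
    print("AttributeError:", exc)

# 2. A manual __slots__ clashes with field defaults: the default would be
#    stored as a class attribute under the same name as the slot
try:
    class PetWithDefault:
        __slots__ = ("name", "price")
        name: str
        price: int = 0
except ValueError as exc:
    errors.append(type(exc).__name__)
    print("ValueError:", exc)
```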

numpy arrays

The real winner, of course, is the numpy array:

import numpy as np

PetNumpy = np.dtype([("name", "S10"), ("price", "i4")])

n = 10_000
generator = (fields() for _ in range(n))
pets = np.fromiter(generator, dtype=PetNumpy)

size = round(asizeof(pets) / n)
base = measure("baseline", fields)

print(f"Pet size (numpy array) = {size} bytes\nx{size/base:.2f} to baseline")

This is not a flawless victory, though. If the names are Unicode (U type instead of S), the advantage is not so impressive:

import numpy as np

PetNumpy = np.dtype([("name", "U10"), ("price", "i4")])

n = 10_000
generator = (fields() for _ in range(n))
pets = np.fromiter(generator, dtype=PetNumpy)

size = round(asizeof(pets) / n)
base = measure("baseline", fields)

print(f"Pet size (numpy U10) = {size} bytes\nx{size/base:.2f} to baseline")

If the name length is not fixed at 10 characters but varies, say, up to 50 characters (U50 instead of U10), the advantage disappears completely:

import random
import string
import numpy as np

def fields_var_name():
    name_len = random.randint(10, 50)
    name_gen = (random.choice(string.ascii_uppercase) for _ in range(name_len))
    name = "".join(name_gen)
    price = random.randint(10000, 99999)
    return (name, price)

PetNumpy = np.dtype([("name", "U50"), ("price", "i4")])

n = 10_000
generator = (fields_var_name() for _ in range(n))
pets = np.fromiter(generator, dtype=PetNumpy)

size = round(asizeof(pets) / n)
base = measure("baseline", fields)

print(f"Pet size (numpy U50) = {size} bytes\nx{size/base:.2f} to baseline")
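
The three results follow from simple arithmetic, because numpy stores fixed-width records: an S field takes 1 byte per character, a U field takes 4 bytes per character (UCS-4), and every name is padded to the declared length. A back-of-the-envelope sketch in plain Python, where record_size is a hypothetical helper and the packed (unaligned) layout is an assumption matching the dtypes above:

```python
# Bytes per character in numpy fixed-width string types
BYTES_PER_CHAR = {"S": 1, "U": 4}
PRICE_BYTES = 4  # "i4" is a 4-byte integer

def record_size(kind, length):
    """Size of one (name, price) record for a name field like S10 or U50."""
    return BYTES_PER_CHAR[kind] * length + PRICE_BYTES

for kind, length in [("S", 10), ("U", 10), ("U", 50)]:
    print(f"{kind}{length}: {record_size(kind, length)} bytes per record")
```

So a U50 record costs 204 bytes even when the actual name is only 10 characters long, which is more than the 161-byte tuple baseline.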

Others

Let's consider alternatives for completeness.

A regular class is no different than a dataclass:

class PetClass:
    def __init__(self, name: str, price: int):
        self.name = name
        self.price = price

fn = lambda: PetClass(*fields())

base = measure("baseline", fields)
measure("class", fn, baseline=base)

And a frozen (immutable) dataclass too:

@dataclass(frozen=True)
class PetDataFrozen:
    name: str
    price: int

fn = lambda: PetDataFrozen(*fields())

base = measure("baseline", fields)
measure("frozen dataclass", fn, baseline=base)
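
A likely reason freezing does not help: it only blocks assignment, while the values are still kept in an ordinary per-instance dictionary. A minimal sketch with illustrative pet values:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class PetDataFrozen:
    name: str
    price: int

pet = PetDataFrozen("Frank the Pigeon", 50000)

# Assignment is rejected...
try:
    pet.price = 60000
except FrozenInstanceError as exc:
    print("FrozenInstanceError:", exc)

# ...but the fields still live in a regular __dict__
print(pet.__dict__)
```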

A dict is even worse:

names = ("name", "price")
fn = lambda: dict(zip(names, fields()))

base = measure("baseline", fields)
measure("dict", fn, baseline=base)

The Pydantic model sets an anti-record (no wonder, it uses inheritance):

from pydantic import BaseModel

class PetModel(BaseModel):
    name: str
    price: int

names = ("name", "price")
fn = lambda: PetModel(**dict(zip(names, fields())))

base = measure("baseline", fields)
measure("pydantic", fn, baseline=base)

Summary

Here are some Python object implementations, ranked from more compact to less compact:

  1. numpy array (specific use cases only).
  2. Slotted dataclass.
  3. Named tuple / ordinary tuple.
  4. Dataclass / regular class.
  5. Dictionary.
  6. Pydantic model.
