Porting Go's strings package to C
Creating a subset of Go that translates to C was never my end goal. I liked writing C code with Go, but without the standard library it felt pretty limited. So, the next logical step was to port Go's stdlib to C.
Of course, this isn't something I could do all at once. I started with the io package, which provides core abstractions like Reader and Writer, as well as general-purpose functions like Copy. But io isn't very interesting on its own, since it doesn't include specific reader or writer implementations. So my next choices were naturally bytes and strings — the workhorses of almost every Go program. This post is about how the porting process went.
Bits and UTF-8 • Bytes • Allocators • Buffers and builders • Benchmarks • Optimizing search • Optimizing builder • Wrapping up
Bits and UTF-8
Before I could start porting bytes, I had to deal with its dependencies first:
- math/bits implements bit counting and manipulation functions.
- unicode/utf8 implements functions for UTF-8 encoded text.
Both of these packages are made up of pure functions, so they were pretty easy to port. The only minor challenge was the difference in operator precedence between Go and C — specifically, bit shifts (<<, >>). In Go, bit shifts have higher precedence than addition and subtraction. In C, they have lower precedence:
// Go: shift has HIGHER precedence than +
var x uint32 = 1<<2 + 3 // (1 << 2) + 3 == 7
// C: shift has LOWER precedence than +
uint32_t x = 1 << 2 + 3; // 1 << (2 + 3) == 32
The simplest solution was to just use parentheses everywhere shifts are involved:
// Go: Mul64 returns the 128-bit product of x and y: (hi, lo) = x * y
func Mul64(x, y uint64) (hi, lo uint64) {
const mask32 = 1<<32 - 1
x0 := x & mask32
x1 := x >> 32
y0 := y & mask32
y1 := y >> 32
w0 := x0 * y0
t := x1*y0 + w0>>32
// ...
}
// C: Mul64 returns the 128-bit product of x and y: (hi, lo) = x * y
so_Result bits_Mul64(uint64_t x, uint64_t y) {
const so_int mask32 = ((so_int)1 << 32) - 1;
uint64_t x0 = (x & mask32);
uint64_t x1 = (x >> 32);
uint64_t y0 = (y & mask32);
uint64_t y1 = (y >> 32);
uint64_t w0 = x0 * y0;
uint64_t t = x1 * y0 + (w0 >> 32);
// ...
}
With bits and utf8 done, I moved on to bytes.
Bytes
The bytes package provides functions for working with byte slices:
// Count counts the number of non-overlapping instances of sep in s.
func Count(s, sep []byte) int
// Equal reports whether a and b are the
// same length and contain the same bytes.
func Equal(a, b []byte) bool
// Index returns the index of the first instance
// of sep in s, or -1 if sep is not present in s.
func Index(s, sep []byte) int
// Repeat returns a new byte slice consisting of count copies of b.
func Repeat(b []byte, count int) []byte
// and others
Some of them were easy to port, like Equal. Here's how it looks in Go:
// Equal reports whether a and b are the
// same length and contain the same bytes.
func Equal(a, b []byte) bool {
// Neither cmd/compile nor gccgo allocates for these string conversions.
return string(a) == string(b)
}
And here's the C version:
// bytes_string reinterprets a byte slice as a string (zero-copy).
#define so_bytes_string(bs) ({ \
so_Slice _bs = (bs); \
(so_String){(const char*)_bs.ptr, _bs.len}; \
})
// string_eq returns true if two strings are equal.
static inline bool so_string_eq(so_String s1, so_String s2) {
return s1.len == s2.len &&
(s1.len == 0 || memcmp(s1.ptr, s2.ptr, s1.len) == 0);
}
// Equal reports whether a and b are the
// same length and contain the same bytes.
bool bytes_Equal(so_Slice a, so_Slice b) {
return so_string_eq(so_bytes_string(a), so_bytes_string(b));
}
Just like in Go, the so_bytes_string ([]byte → string) macro doesn't allocate memory; it just reinterprets the byte slice's underlying storage as a string. The so_string_eq function (which works like == in Go) is easy to implement using memcmp from the libc API.
Another example is the IndexByte function, which looks for a specific byte in a slice. Here's the pure-Go implementation:
// IndexByte returns the index of the first instance
// of c in b, or -1 if c is not present in b.
func IndexByte(b []byte, c byte) int {
for i, x := range b {
if x == c {
return i
}
}
return -1
}
And here's the C version:
// IndexByte returns the index of the first instance
// of c in b, or -1 if c is not present in b.
so_int bytes_IndexByte(so_Slice b, so_byte c) {
for (so_int i = 0; i < so_len(b); i++) {
so_byte x = so_at(so_byte, b, i);
if (x == c) {
return i;
}
}
return -1;
}
I used a regular C for loop to mimic Go's for-range:
- Loop over the slice indexes with for (so_len is a macro that returns b.len, similar to Go's len built-in).
- Access the i-th byte with so_at (a bounds-checking macro that returns *((so_byte*)b.ptr + i)).
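A sketch of what these two macros might look like, assuming a minimal so_Slice layout (the real definitions in the project may differ):

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef int64_t so_int;
typedef uint8_t so_byte;

// Minimal slice header: pointer + length + capacity, like Go's.
typedef struct {
    void*  ptr;
    so_int len;
    so_int cap;
} so_Slice;

// so_len returns the slice length, like Go's len built-in.
#define so_len(s) ((s).len)

// so_index_check aborts on an out-of-range index, mimicking
// Go's bounds checking, and returns the element's address.
static void* so_index_check(void* ptr, size_t elem, so_int i, so_int len) {
    if (i < 0 || i >= len) {
        fprintf(stderr, "so_at: index out of range\n");
        abort();
    }
    return (char*)ptr + (size_t)i * elem;
}

// so_at evaluates to the i-th element of type T.
#define so_at(T, s, i) (*(T*)so_index_check((s).ptr, sizeof(T), (i), (s).len))
```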
But Equal and IndexByte don't allocate memory. What should I do with Repeat, since it clearly does? I had a decision to make.
Allocators
The Go runtime handles memory allocation and deallocation automatically. In C, I had a few options:
- Use a reliable garbage collector like Boehm GC to closely match Go's behavior.
- Allocate memory with libc's malloc and have the caller free it later with free.
- Introduce allocators.
An allocator is a tool that reserves memory (typically on the heap) so a program can store its data structures there. See Allocators from C to Zig if you want to learn more about them.
For me, the winner was clear. Modern systems programming languages like Zig and Odin clearly showed the value of allocators:
- It's obvious whether a function allocates memory or not: if it has an allocator as a parameter, it allocates.
- It's easy to use different allocation methods: you can use malloc for one function, an arena for another, and a stack allocator for a third.
- It helps with testing and debugging: you can use a tracking allocator to find memory leaks, or a failing allocator to test error handling.
An Allocator is an interface with three methods: Alloc, Realloc, and Free. In C, it translates to a struct with function pointers:
// Allocator defines the interface for memory allocators.
typedef struct {
void* self;
so_Result (*Alloc)(void* self, so_int size, so_int align);
so_Result (*Realloc)(void* self, void* ptr,
so_int oldSize, so_int newSize, so_int align);
void (*Free)(void* self, void* ptr, so_int size, so_int align);
} mem_Allocator;
As I mentioned in the post about porting the io package, this interface representation isn't as efficient as using a static method table, but it's simpler. If you're interested in other options, check out the post on interfaces.
By convention, if a function allocates memory, it takes an allocator as its first parameter. So Go's Repeat:
// Repeat returns a new byte slice consisting of count copies of b.
func Repeat(b []byte, count int) []byte
Translates to this C code:
// Repeat returns a new byte slice consisting of count copies of b.
//
// If the allocator is nil, uses the system allocator.
// The returned slice is allocated; the caller owns it.
so_Slice bytes_Repeat(mem_Allocator a, so_Slice b, so_int count)
If the caller doesn't care about using a specific allocator, they can just pass an empty allocator, and the implementation will use the system allocator — calloc, realloc, and free from libc.
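Resolving the empty allocator is a one-liner inside each allocating function. A sketch under assumed names (mem_or_system is a hypothetical helper, and the interface is trimmed to a single Alloc method to stay self-contained):

```c
#include <stdlib.h>
#include <stdint.h>

typedef int64_t so_int;

// Trimmed-down allocator interface: state pointer + one method.
typedef struct {
    void* self;
    void* (*Alloc)(void* self, so_int size);
} mem_Allocator;

// sys_alloc allocates zeroed memory, like the real system allocator.
static void* sys_alloc(void* self, so_int size) {
    (void)self;
    return calloc(1, (size_t)size);
}

static mem_Allocator mem_System = {NULL, sys_alloc};

// mem_or_system falls back to the system allocator when the caller
// passes an empty (zero-initialized) allocator.
static mem_Allocator mem_or_system(mem_Allocator a) {
    return a.Alloc == NULL ? mem_System : a;
}
```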
Here's a simplified version of the system allocator (I removed safety checks to make it easier to read):
// SystemAllocator uses the system's malloc, realloc, and free functions.
// It zeros out new memory on allocation and reallocation.
typedef struct {} mem_SystemAllocator;
so_Result mem_SystemAllocator_Alloc(void* self, so_int size, so_int align) {
void* ptr = calloc(1, (size_t)(size));
if (ptr == NULL) {
return (so_Result){.val.as_ptr = NULL, .err = mem_ErrOutOfMemory};
}
return (so_Result){ .val.as_ptr = ptr, .err = NULL};
}
so_Result mem_SystemAllocator_Realloc(void* self, void* ptr, so_int oldSize,
so_int newSize, so_int align) {
void* newPtr = realloc(ptr, (size_t)(newSize));
if (newPtr == NULL) {
return (so_Result){.val.as_ptr = NULL, .err = mem_ErrOutOfMemory};
}
if (newSize > oldSize) {
// Zero new memory beyond the old size.
memset((char*)newPtr + oldSize, 0, (size_t)(newSize - oldSize));
}
return (so_Result){.val.as_ptr = newPtr, .err = NULL};
}
void mem_SystemAllocator_Free(void* self, void* ptr, so_int size, so_int align) {
free(ptr);
}
The system allocator is stateless, so it's safe to have a global instance:
// System is an instance of a memory allocator that uses
// the system's malloc, realloc, and free functions.
mem_Allocator mem_System = {
.self = &(mem_SystemAllocator){},
.Alloc = mem_SystemAllocator_Alloc,
.Free = mem_SystemAllocator_Free,
.Realloc = mem_SystemAllocator_Realloc};
Here's an example of how to call Repeat with an allocator:
so_Slice src = so_string_bytes(so_str("abc"));
so_Slice got = bytes_Repeat(mem_System, src, 3);
so_String gotStr = so_bytes_string(got);
if (so_string_ne(gotStr, so_str("abcabcabc"))) {
so_panic("want Repeat(abc) == abcabcabc");
}
mem_FreeSlice(so_byte, mem_System, got);
Way better than hidden allocations!
Buffers and builders
Besides pure functions, bytes and strings also provide types like bytes.Buffer, bytes.Reader, and strings.Builder. I ported them using the same approach as with functions.
For types that allocate memory, like Buffer, the allocator becomes a struct field:
// A Buffer is a variable-sized buffer of bytes
// with Read and Write methods.
typedef struct {
mem_Allocator a;
so_Slice buf;
so_int off;
} bytes_Buffer;
// Usage example.
bytes_Buffer buf = bytes_NewBuffer(mem_System, (so_Slice){0});
bytes_Buffer_WriteString(&buf, so_str("hello"));
bytes_Buffer_WriteString(&buf, so_str(" world"));
so_String str = bytes_Buffer_String(&buf);
if (so_string_ne(str, so_str("hello world"))) {
so_panic("Buffer.WriteString failed");
}
bytes_Buffer_Free(&buf);
The code is pretty wordy — most C developers would dislike using bytes_Buffer_WriteString instead of something shorter like buf_writestr. My solution to this problem is to automatically translate Go code to C (which is actually what I do when porting Go's stdlib). If you're interested, check out the post about this approach — Solod: Go can be a better C.
Types that don't allocate, like bytes.Reader, need no special treatment — they translate directly to C structs without an allocator field.
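Go's bytes.Reader, for instance, is just a byte slice plus a read offset, so its C counterpart is a plain struct. A sketch of what the type and its Read method might look like (simplified: the real port returns a Result and reports EOF as an error):

```c
#include <string.h>
#include <stdint.h>

typedef int64_t so_int;
typedef uint8_t so_byte;

typedef struct {
    void*  ptr;
    so_int len;
    so_int cap;
} so_Slice;

// A Reader reads from a byte slice. No allocator field is needed,
// because a Reader never allocates.
typedef struct {
    so_Slice s; // the underlying data
    so_int   i; // current reading index
} bytes_Reader;

// Read copies up to p.len bytes into p and advances the offset.
// Returns the number of bytes read, or 0 at EOF.
static so_int bytes_Reader_Read(bytes_Reader* r, so_Slice p) {
    if (r->i >= r->s.len) {
        return 0; // EOF
    }
    so_int n = r->s.len - r->i;
    if (n > p.len) {
        n = p.len;
    }
    memcpy(p.ptr, (so_byte*)r->s.ptr + r->i, (size_t)n);
    r->i += n;
    return n;
}
```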
The strings package is the twin of bytes, so porting it was uneventful. Here's a strings.Builder usage example in Go and C side by side:
// go
var sb strings.Builder
sb.WriteString("Hello")
sb.WriteByte(',')
sb.WriteRune(' ')
sb.WriteString("world")
s := sb.String()
if s != "Hello, world" {
panic("want sb.String() == 'Hello, world'")
}
// c
strings_Builder sb = {.a = mem_System};
strings_Builder_WriteString(&sb, so_str("Hello"));
strings_Builder_WriteByte(&sb, ',');
strings_Builder_WriteRune(&sb, U' ');
strings_Builder_WriteString(&sb, so_str("world"));
so_String s = strings_Builder_String(&sb);
if (so_string_ne(s, so_str("Hello, world"))) {
so_panic("want sb.String() == 'Hello, world'");
}
strings_Builder_Free(&sb);
Again, the C code is just a more verbose version of Go's implementation, plus explicit memory allocation.
Benchmarks
What's the point of writing C code if it's slow, right? I decided it was time to benchmark the ported C types and functions against their Go versions.
To do that, I ported the benchmarking part of Go's testing package. Surprisingly, the simplified version was only 300 lines long and included everything I needed:
- Figuring out how many iterations to run.
- Running the benchmark function in a loop.
- Recording metrics (ns/op, MB/s, B/op, allocs/op).
- Reporting the results.
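The first item, figuring out the iteration count, works like Go's ramp-up: run a small number of iterations, measure, and scale up toward a fixed time budget (1 second by default in Go). A simplified sketch of the scaling step, with hypothetical names:

```c
#include <stdint.h>

// next_n predicts how many iterations to run in the next round so the
// total run time approaches target_ns, given that the previous round
// of prev_n iterations took elapsed_ns. Simplified from Go's testing
// package: grow by at most 100x per round, and always by at least 1.
static int64_t next_n(int64_t prev_n, int64_t elapsed_ns, int64_t target_ns) {
    if (elapsed_ns <= 0) elapsed_ns = 1;
    int64_t per_op = elapsed_ns / prev_n;
    if (per_op <= 0) per_op = 1;
    int64_t n = target_ns / per_op;
    if (n > prev_n * 100) n = prev_n * 100; // cap growth at 100x
    if (n <= prev_n) n = prev_n + 1;        // but always make progress
    return n;
}
```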
Here's a sample benchmark for the strings.Builder type:
static so_String someStr = so_str("some string sdljlk jsklj3lkjlk djlkjw");
static const so_int numWrite = 16;
volatile so_String sink = {0};
void main_WriteString_AutoGrow(testing_B* b) {
mem_Allocator a = testing_B_Allocator(b);
for (; testing_B_Loop(b);) {
strings_Builder sb = strings_NewBuilder(a);
for (so_int i = 0; i < numWrite; i++) {
strings_Builder_WriteString(&sb, someStr);
}
sink = strings_Builder_String(&sb);
strings_Builder_Free(&sb);
}
}
// more benchmarks...
Reads almost like Go's benchmarks.
To monitor memory usage, I created Tracker — a memory allocator that wraps another allocator and keeps track of allocations:
// A Stats records statistics about the memory allocator.
typedef struct {
uint64_t Alloc;
uint64_t TotalAlloc;
uint64_t Mallocs;
uint64_t Frees;
} mem_Stats;
// A Tracker wraps an Allocator and tracks all
// allocations and deallocations made through it.
typedef struct {
mem_Allocator Allocator;
mem_Stats Stats;
} mem_Tracker;
so_Result mem_Tracker_Alloc(void* self, so_int size, so_int align) {
mem_Tracker* t = self;
so_Result res = t->Allocator.Alloc(t->Allocator.self, size, align);
// ...
t->Stats.Alloc += (uint64_t)(size);
t->Stats.TotalAlloc += (uint64_t)(size);
t->Stats.Mallocs++;
return (so_Result){.val.as_ptr = res.val.as_ptr, .err = NULL};
}
void mem_Tracker_Free(void* self, void* ptr, so_int size, so_int align) {
mem_Tracker* t = self;
t->Allocator.Free(t->Allocator.self, ptr, size, align);
t->Stats.Alloc -= (uint64_t)(size);
t->Stats.Frees++;
}
The benchmark gets an allocator through the testing_RunBenchmarks function and wraps it in a Tracker to keep track of allocations:
int main(void) {
so_Slice benchs = {(testing_Benchmark[4]){
{.Name = so_str("WriteS_AutoGrow"), .F = main_WriteString_AutoGrow},
{.Name = so_str("WriteS_PreGrow"), .F = main_WriteString_PreGrow},
{.Name = so_str("WriteB_AutoGrow"), .F = main_Write_AutoGrow},
{.Name = so_str("WriteB_PreGrow"), .F = main_Write_PreGrow}},
4, 4};
testing_RunBenchmarks(mem_System, benchs);
}
There's no auto-discovery, but the manual setup is quite straightforward.
Optimizing search
With the benchmarking setup ready, I ran benchmarks on the strings package. Some functions did well — about 1.5-2x faster than their Go equivalents:
go
Benchmark_Clone-8 12143073 98.50 ns/op 1024 B/op 1 allocs/op
Benchmark_Fields-8 791077 1524 ns/op 288 B/op 1 allocs/op
Benchmark_Repeat-8 9197040 127.3 ns/op 1024 B/op 1 allocs/op
c
Benchmark_Clone 27935466 41.84 ns/op 1024 B/op 1 allocs/op
Benchmark_Fields 1319384 907.7 ns/op 272 B/op 1 allocs/op
Benchmark_Repeat 18445929 64.11 ns/op 1024 B/op 1 allocs/op
But Index (searching for a substring in a string) was a total disaster — it was nearly 20 times slower than in Go:
go
Benchmark_Index-8 47874408 25.14 ns/op 0 B/op 0 allocs/op
c
Benchmark_Index 483787 483.1 ns/op 0 B/op 0 allocs/op
The problem was caused by the IndexByte function we looked at earlier:
// IndexByte returns the index of the first instance
// of c in b, or -1 if c is not present in b.
func IndexByte(b []byte, c byte) int {
for i, x := range b {
if x == c {
return i
}
}
return -1
}
This "pure" Go implementation is just a fallback. On most platforms, Go uses a specialized version of IndexByte written in assembly.
For the C version, the easiest solution was to use memchr, which is also optimized for most platforms:
static inline so_int bytealg_IndexByte(so_Slice b, so_byte c) {
void* at = memchr(b.ptr, (int)c, b.len);
if (at == NULL) return -1;
return (so_int)((char*)at - (char*)b.ptr);
}
With this fix, the benchmark results changed drastically:
go
Benchmark_Index-8 47874408 25.14 ns/op 0 B/op 0 allocs/op
Benchmark_IndexByte-8 54982188 21.98 ns/op 0 B/op 0 allocs/op
c
Benchmark_Index 33552540 35.21 ns/op 0 B/op 0 allocs/op
Benchmark_IndexByte 36868624 32.81 ns/op 0 B/op 0 allocs/op
Still not quite as fast as Go, but it's close. Honestly, I don't know why the memchr-based implementation is still slower than Go's assembly here, but I decided not to pursue it any further.
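For completeness, the substring Index built on top of IndexByte is simple: find each candidate position of sep's first byte, then confirm the rest with memcmp. A simplified sketch (Go's real Index additionally switches to Rabin-Karp when the brute-force pass misses too often):

```c
#include <string.h>
#include <stdint.h>

typedef int64_t so_int;
typedef uint8_t so_byte;

typedef struct {
    void*  ptr;
    so_int len;
    so_int cap;
} so_Slice;

// IndexByte: memchr-based single-byte search, as shown above.
static so_int bytes_IndexByte(so_Slice b, so_byte c) {
    if (b.len == 0) return -1;
    void* at = memchr(b.ptr, (int)c, (size_t)b.len);
    if (at == NULL) return -1;
    return (so_int)((char*)at - (char*)b.ptr);
}

// Index: first occurrence of sep in s, or -1 if absent.
static so_int bytes_Index(so_Slice s, so_Slice sep) {
    if (sep.len == 0) return 0;
    if (sep.len > s.len) return -1;
    so_byte first = ((so_byte*)sep.ptr)[0];
    so_int i = 0;
    while (i <= s.len - sep.len) {
        // Skip ahead to the next match of sep's first byte.
        so_Slice rest = {(so_byte*)s.ptr + i, s.len - i, 0};
        so_int j = bytes_IndexByte(rest, first);
        if (j < 0 || i + j > s.len - sep.len) return -1;
        i += j;
        if (memcmp((so_byte*)s.ptr + i, sep.ptr, (size_t)sep.len) == 0) {
            return i;
        }
        i++;
    }
    return -1;
}
```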
After running the rest of the strings function benchmarks, the ported versions won all of them except for two:
| Benchmark | Go | C (mimalloc) | C (arena) | Winner |
|---|---|---|---|---|
| Clone | 99ns | 42ns | 34ns | C - 2.4x |
| Compare | 47ns | 36ns | 36ns | C - 1.3x |
| Fields | 1524ns | 908ns | 912ns | C - 1.7x |
| Index | 25ns | 35ns | 34ns | Go - 0.7x |
| IndexByte | 22ns | 33ns | 33ns | Go - 0.7x |
| Repeat | 127ns | 64ns | 67ns | C - 1.9x |
| ReplaceAll | 243ns | 200ns | 203ns | C - 1.2x |
| Split | 1899ns | 1399ns | 1423ns | C - 1.3x |
| ToUpper | 2066ns | 1602ns | 1622ns | C - 1.3x |
| Trim | 501ns | 373ns | 375ns | C - 1.3x |
Optimizing builder
strings.Builder is a common way to compose strings from parts in Go, so I tested its performance too. The results were worse than I expected:
go
Benchmark_WriteS_AutoGrow-8 5385492 224.0 ns/op 1424 B/op 5 allocs/op
Benchmark_WriteS_PreGrow-8 10692721 112.9 ns/op 640 B/op 1 allocs/op
c
Benchmark_WriteS_AutoGrow 5659255 212.9 ns/op 1147 B/op 5 allocs/op
Benchmark_WriteS_PreGrow 9811054 122.1 ns/op 592 B/op 1 allocs/op
Here, the C version performed about the same as Go, but I expected it to be faster. Unlike Index, Builder is written entirely in Go, so there's no reason the ported version should lose in this benchmark.
The WriteString method looked almost identical in Go and C:
// WriteString appends the contents of s to b's buffer.
// It returns the length of s and a nil error.
func (b *Builder) WriteString(s string) (int, error) {
b.buf = append(b.buf, s...)
return len(s), nil
}
static so_Result strings_Builder_WriteString(void* self, so_String s) {
strings_Builder* b = self;
strings_Builder_grow(b, so_len(s));
b->buf = so_extend(so_byte, b->buf, so_string_bytes(s));
return (so_Result){.val.as_int = so_len(s), .err = NULL};
}
Go's append automatically grows the backing slice, while strings_Builder_grow does it manually (so_extend, by contrast, doesn't grow the slice — it's merely a memcpy wrapper). So, there shouldn't be any difference. I had to investigate.
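As an aside, the growth policy in strings_Builder_grow is the part that append hides in Go: when the buffer lacks room for n more bytes, double the capacity, or jump straight to the required size if doubling isn't enough. A sketch on a minimal slice, using libc realloc instead of the project's allocator interface:

```c
#include <stdlib.h>
#include <stdint.h>

typedef int64_t so_int;

typedef struct {
    void*  ptr;
    so_int len;
    so_int cap;
} so_Slice;

// builder_grow ensures buf has capacity for n more bytes, doubling
// the capacity (or jumping to len+n if doubling isn't enough).
// Overflow checks omitted for brevity.
static so_Slice builder_grow(so_Slice buf, so_int n) {
    if (buf.cap - buf.len >= n) {
        return buf; // enough room already
    }
    so_int newCap = buf.cap * 2;
    if (newCap < buf.len + n) {
        newCap = buf.len + n;
    }
    void* p = realloc(buf.ptr, (size_t)newCap);
    if (p == NULL) {
        abort(); // real code would report mem_ErrOutOfMemory
    }
    buf.ptr = p;
    buf.cap = newCap;
    return buf;
}
```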
Looking at the compiled binary, I noticed a difference in how the functions returned results. Go returns multiple values in separate registers, so (int, error) uses three registers: one for 8-byte int, two for the error interface (implemented as two 8-byte pointers). But in C, so_Result was a single struct made up of two so_Value unions and a so_Error pointer:
typedef union {
bool as_bool; // 1 byte
so_int as_int; // 8 bytes
int64_t as_i64; // 8 bytes
so_String as_string; // 16 bytes (ptr + len)
so_Slice as_slice; // 24 bytes (ptr + len + cap)
void* as_ptr; // 8 bytes
// ... other types
} so_Value;
typedef struct {
so_Value val; // 24 bytes
so_Value val2; // 24 bytes
so_Error err; // 8 bytes
} so_Result;
Of course, this 56-byte monster can't be returned in registers — the C calling convention passes it through memory instead. Since WriteString is on the hot path in the benchmark, I figured this had to be the issue. So I switched from a single monolithic so_Result type to signature-specific types for multi-return pairs:
- so_R_bool_err for (bool, error);
- so_R_int_err for (so_int, error);
- so_R_str_err for (so_String, error);
- etc.
Now, the Builder.WriteString implementation in C looked like this:
typedef struct {
so_int val;
so_Error err;
} so_R_int_err;
static so_R_int_err strings_Builder_WriteString(void* self, so_String s) {
// ...
}
so_R_int_err is only 16 bytes — small enough to be returned in two registers. Problem solved! But it wasn't — the benchmark only showed a slight improvement.
After looking into it more, I finally found the real issue: unlike Go, the C compiler wasn't inlining WriteString calls. Adding inline and moving strings_Builder_WriteString to the header file made all the difference:
go
Benchmark_WriteS_AutoGrow-8 5385492 224.0 ns/op 1424 B/op 5 allocs/op
Benchmark_WriteS_PreGrow-8 10692721 112.9 ns/op 640 B/op 1 allocs/op
c
Benchmark_WriteS_AutoGrow 10344024 115.9 ns/op 1147 B/op 5 allocs/op
Benchmark_WriteS_PreGrow 41045286 28.74 ns/op 592 B/op 1 allocs/op
2-4x faster. That's what I was hoping for!
Wrapping up
Porting bytes and strings was a mix of easy parts and interesting challenges. The pure functions were straightforward — just translate the syntax and pay attention to operator precedence. The real design challenge was memory management. Using allocators turned out to be a good solution, making memory allocation clear and explicit without being too difficult to use.
The benchmarks showed that the C versions outperformed Go in most cases, sometimes by 2-4x. The only exceptions were Index and IndexByte, where Go relies on hand-written assembly. The strings.Builder optimization was an interesting challenge: what seemed like a return-type issue was actually an inlining problem, and fixing it gave a nice speed boost.
There's a lot more of Go's stdlib to port. In the next post, we'll cover time — a very unique Go package. In the meantime, if you'd like to write Go that translates to C — with no runtime and manual memory management — I invite you to try Solod. The bytes and strings packages are included, of course.