Native threading and multiprocessing in Go
As you probably know, the only way to run tasks concurrently in Go is by using goroutines. But what if we bypass the runtime and run tasks directly on OS threads or even processes? I decided to give it a try.
To safely manage threads and processes in Go, I'd normally need to modify Go's internals. But since this is just a research project, I chose to (ab)use cgo and syscalls instead. That's how I created multi — a small package that explores unconventional ways to handle concurrency in Go.
Features • Goroutines • Threads • Processes • Benchmarks • Final thoughts
Features
Multi offers three types of "concurrent groups". Each one has an API similar to sync.WaitGroup, but they work very differently under the hood:
goro.Group runs Go functions in goroutines that are locked to OS threads. Each function executes in its own goroutine. Safe to use in production, although unnecessary, because the regular non-locked goroutines work just fine.

pthread.Group runs Go functions in separate OS threads using POSIX threads. Each function executes in its own thread. This implementation bypasses Go's runtime thread management. Calling Go code from threads not created by the Go runtime can lead to issues with garbage collection, signal handling, and the scheduler. Not meant for production use.

proc.Group runs Go functions in separate OS processes. Each function executes in its own process forked from the main one. This implementation uses process forking, which is not supported by the Go runtime and can cause undefined behavior, especially in programs with multiple goroutines or complex state. Not meant for production use.
All groups offer an API similar to sync.WaitGroup.
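For reference, here is the start-and-wait pattern used throughout this post, written with a plain sync.WaitGroup (this is also the baseline the benchmarks below compare against):

var wg sync.WaitGroup
ch := make(chan int, 2)
for i := 0; i < 2; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        // do something
        ch <- 42
    }()
}
wg.Wait()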
goro.Group
Runs Go functions in goroutines that are locked to OS threads.
ch := make(chan int, 2)
g := goro.NewGroup()
g.Go(func() error {
    // do something
    ch <- 42
    return nil
})
g.Go(func() error {
    // do something
    ch <- 42
    return nil
})
g.Wait()
goro.Group starts a regular goroutine for each Go call and locks it to its own OS thread. Here's a simplified implementation:
// Thread represents a goroutine locked to an OS thread.
type Thread struct {
    f    func()
    done chan struct{}
}

// NewThread constructs a Thread ready to be started.
func NewThread(f func()) *Thread {
    return &Thread{f: f, done: make(chan struct{})}
}

// Start launches the goroutine and locks it to its OS thread.
func (t *Thread) Start() {
    go func() {
        runtime.LockOSThread()
        defer runtime.UnlockOSThread()
        t.f()
        close(t.done)
    }()
}

// Wait blocks until the goroutine completes.
func (t *Thread) Wait() {
    <-t.done
}
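A group can then be a thin wrapper over these threads. Here's a minimal sketch of the idea (not the actual multi code; error handling is omitted):

// Group starts one Thread per function and waits for all of them.
type Group struct {
    threads []*Thread
}

// Go runs f in its own thread-locked goroutine.
func (g *Group) Go(f func()) {
    t := NewThread(f)
    g.threads = append(g.threads, t)
    t.Start()
}

// Wait blocks until all started functions complete.
func (g *Group) Wait() {
    for _, t := range g.threads {
        t.Wait()
    }
}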
You can use channels and other standard concurrency tools inside the functions managed by the group.
pthread.Group
Runs Go functions in separate OS threads using POSIX threads.
ch := make(chan int, 2)
g := pthread.NewGroup()
g.Go(func() error {
    // do something
    ch <- 42
    return nil
})
g.Go(func() error {
    // do something
    ch <- 42
    return nil
})
g.Wait()
pthread.Group creates a native OS thread for each Go call. It uses cgo to start and join threads. Here is a simplified implementation:
/*
#include <pthread.h>

extern void* threadFunc(void*);
*/
import "C"

// Thread represents a Go function executed in a separate OS thread.
type Thread struct {
    tid C.pthread_t
    f   func()
}

// NewThread constructs a Thread ready to be started.
func NewThread(f func()) *Thread {
    return &Thread{f: f}
}

// Start launches the thread. It passes t to C as a cgo.Handle because
// handing C a raw Go pointer would violate cgo's pointer-passing rules
// (Thread holds a Go function value).
func (t *Thread) Start() {
    h := cgo.NewHandle(t) // ensure t is kept alive for the thread lifetime
    C.pthread_create(&t.tid, nil, (*[0]byte)(C.threadFunc), unsafe.Pointer(h))
}

// Wait blocks until the thread completes.
func (t *Thread) Wait() {
    C.pthread_join(t.tid, nil)
}

//export threadFunc
func threadFunc(arg unsafe.Pointer) unsafe.Pointer {
    h := cgo.Handle(arg)
    t := h.Value().(*Thread)
    t.f()
    h.Delete()
    return nil
}
You can use channels and other standard concurrency tools inside the functions managed by the group.
proc.Group
Runs Go functions in separate OS processes forked from the main one.
ch := proc.NewChan[int]()
defer ch.Close()
g := proc.NewGroup()
g.Go(func() error {
    // do something
    ch.Send(42)
    return nil
})
g.Go(func() error {
    // do something
    ch.Send(42)
    return nil
})
g.Wait()
proc.Group forks the main process for each Go call. It uses syscalls to fork processes and wait for them to finish. Here is a simplified implementation:
// Process represents a Go function executed in a forked OS process.
type Process struct {
    f   func()
    pid int
}

// Start launches the process.
func (p *Process) Start() {
    pid, _, _ := syscall.Syscall(syscall.SYS_FORK, 0, 0, 0)
    if pid != 0 {
        // In parent.
        p.pid = int(pid)
        return
    }
    // Fork succeeded, now in child.
    p.f()
    os.Exit(0)
}

// Wait blocks until the process completes.
func (p *Process) Wait() {
    var ws syscall.WaitStatus
    syscall.Wait4(p.pid, &ws, 0, nil)
}
You can only use proc.Chan to exchange data between processes, since regular Go channels and other concurrency tools don't work across process boundaries.
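One plausible way to build such a channel is a pipe plus gob encoding: the pipe's file descriptors survive the fork, so the parent and the children can talk through it. This is only a sketch, not the actual proc.Chan implementation, and the Recv method is a hypothetical helper:

// Chan is a sketch of a cross-process channel backed by a pipe.
// Values must be gob-encodable.
type Chan[T any] struct {
    r, w *os.File
    enc  *gob.Encoder
    dec  *gob.Decoder
}

// NewChan creates the underlying pipe.
func NewChan[T any]() *Chan[T] {
    r, w, _ := os.Pipe()
    return &Chan[T]{r: r, w: w, enc: gob.NewEncoder(w), dec: gob.NewDecoder(r)}
}

// Send writes an encoded value to the pipe.
func (c *Chan[T]) Send(v T) error {
    return c.enc.Encode(v)
}

// Recv reads the next value from the pipe (hypothetical helper).
func (c *Chan[T]) Recv() (T, error) {
    var v T
    err := c.dec.Decode(&v)
    return v, err
}

// Close closes both ends of the pipe.
func (c *Chan[T]) Close() error {
    c.w.Close()
    return c.r.Close()
}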
Benchmarks
Running some CPU-bound workload (with no allocations or I/O) on Apple M1 gives these results:
goos: darwin
goarch: arm64
ncpu: 8
gomaxprocs: 8
workers: 4
sync.WaitGroup: n=100 t=60511 µs/exec
goro.Group: n=100 t=60751 µs/exec
pthread.Group: n=100 t=60791 µs/exec
proc.Group: n=100 t=61640 µs/exec
And here are the results from GitHub Actions:
goos: linux
goarch: amd64
ncpu: 4
gomaxprocs: 4
workers: 4
sync.WaitGroup: n=100 t=145256 µs/exec
goro.Group: n=100 t=145813 µs/exec
pthread.Group: n=100 t=148968 µs/exec
proc.Group: n=100 t=147572 µs/exec
One execution here means a group of 4 workers each doing 10 million iterations of generating random numbers and adding them up. See the benchmark code for details.
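For a rough idea, each worker runs something along these lines (a sketch; the actual benchmark code lives in the repo):

// work sums pseudo-random numbers in a tight loop:
// purely CPU-bound, with no allocations or I/O inside the loop.
func work() int {
    rnd := rand.New(rand.NewSource(42))
    sum := 0
    for i := 0; i < 10_000_000; i++ {
        sum += rnd.Intn(1000)
    }
    return sum
}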
As you can see, the default concurrency model (sync.WaitGroup in the results, using standard goroutine scheduling without meddling with threads or processes) works just fine and doesn't add any noticeable overhead. You probably already knew that, but it's always good to double-check, right?
Final thoughts
I don't think anyone will find these concurrent groups useful in real-world situations, but it's still interesting to look at possible (even if flawed) implementations and compare them to Go's default (and only) concurrency model.
Check out the nalgeon/multi repo if you're interested in thread- and process-based concurrent group implementations.
──
P.S. Want to learn more about concurrency? Check out my interactive book