Benchmarks
Tulpar’s tagline is “as easy as Python, as fast as C”. On CPU it sits in
Rust/Go territory; on HTTP its multi-core listen_pool server out-throughputs
Go’s net/http and leaves FastAPI far behind — all from a single self-contained
binary with no runtime to ship.
CPU benchmarks
benchmarks/fib.tpr uses the typed AOT path (explicit : int return type) for
native LLVM i64 codegen. The recursion depth is read from an env var so LLVM
can’t partially evaluate the result at compile time.
| Environment | C | Tulpar AOT | Ratio |
|---|---|---|---|
| Windows 11 / MinGW64, fib(35), best-of-3 | 83 ms | 114 ms | 1.37× C |
| Linux WSL2 / gcc 15 -O2, fib(40), best-of-5 | 140 ms | 261 ms | 1.86× C |
The ratio depends on the C compiler as much as on Tulpar. Both Tulpar builds
are native LLVM AOT output; the gap is wider against gcc 15 -O2 on Linux (an
aggressively optimising compiler) than against MinGW’s older GCC on Windows.
Either way Tulpar lands in the ~1.4–1.9× C band, the same neighbourhood as
Rust and Go, and orders of magnitude ahead of interpreted Python.
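To illustrate the opaque-input technique on the C side of the comparison, here is a minimal sketch. It is not the repository's actual benchmark source: the variable name FIB_N, the fallback depth of 35, and the helper names depth_from_env and run_fib_benchmark are assumptions made for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Naive doubly recursive Fibonacci: the call-heavy workload being timed.
   long long mirrors the 64-bit integer used by the typed Tulpar path. */
long long fib(long long n) {
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Read the recursion depth from an environment variable at run time.
   Because the value is unknown at compile time, the optimiser cannot
   evaluate fib(n) ahead of time and replace the call with a constant.
   Returns `fallback` when the variable is unset, empty, or not a
   non-negative integer. */
long long depth_from_env(const char *var, long long fallback) {
    const char *text = getenv(var);
    if (text == NULL || *text == '\0')
        return fallback;
    char *end = NULL;
    long long n = strtoll(text, &end, 10);
    if (*end != '\0' || n < 0)
        return fallback;
    return n;
}

/* Hypothetical driver: FIB_N is an assumed variable name. Printing the
   result keeps the computation observable, so it cannot be discarded. */
void run_fib_benchmark(void) {
    long long n = depth_from_env("FIB_N", 35);
    printf("fib(%lld) = %lld\n", n, fib(n));
}
```

Calling run_fib_benchmark from main and launching the program with FIB_N=40 in the environment would exercise the same depth as the Linux row above; timing is left to an external harness.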
HTTP throughput
benchmarks/loadtest (a native C load generator) hammers each server with
GET / returning JSON {"hello":"world"} over keep-alive connections,
concurrency swept 1–12 (kept under the box’s core count so the load generator
never starves the server), 4 s per level, best run confirmed by a second pass.
Box: 14-vCPU WSL2. Each runtime in its recommended single-process config.
| Server | req/sec | p50 latency | Configuration |
|---|---|---|---|
| Tulpar listen_pool | ~36k | 0.32 ms | all 14 cores, 1 process |
| Go net/http | ~30k | 0.38 ms | all cores (default), 1 process |
| Node.js http | ~8.7k | 1.06 ms | 1 thread (default) |
| FastAPI (uvicorn) | ~3.5k | 3.31 ms | 1 worker (default) |
| Tulpar listen | ~4–4.7k | 0.22 ms | 1 thread, serial accept loop |
Threading models differ and are labelled above: listen_pool and Go’s
net/http use every core out of the box, while Node and a single uvicorn worker
default to one. Tulpar’s single-thread listen() is a serial accept loop —
it has the lowest per-request latency (0.22 ms p50) but serialises keep-alive
connections, so for throughput use listen_pool (or listen_async). Notably,
single-thread listen() trails single-thread Node here; Tulpar’s lead comes
from listen_pool scaling cleanly across cores (p50 stays at 0.32 ms at 36k
req/s, with a clean sub-millisecond tail).
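As a rough illustration of how the req/sec and p50 columns can be derived from raw per-request measurements, the sketch below uses the nearest-rank percentile method. That choice is an assumption for illustration; benchmarks/loadtest may compute its statistics differently, and the function names here are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int compare_doubles(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile of `count` latency samples (count > 0,
   pct in 1..100). Sorts the array in place. pct = 50 gives the p50. */
double percentile_ms(double *latencies_ms, size_t count, unsigned pct) {
    qsort(latencies_ms, count, sizeof *latencies_ms, compare_doubles);
    /* 1-based rank = ceil(count * pct / 100), done in integer arithmetic. */
    size_t rank = (count * pct + 99) / 100;
    if (rank < 1)
        rank = 1;
    if (rank > count)
        rank = count;
    return latencies_ms[rank - 1];
}

/* Throughput over one measurement window. */
double requests_per_second(size_t completed, double window_seconds) {
    return (double)completed / window_seconds;
}
```

For scale: at roughly 36k req/s, a single 4 s concurrency level corresponds to about 144,000 completed requests, so the p50 is taken over a large sample.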
Versus FastAPI specifically, Tulpar also wins decisively on latency (~10× lower p50) and footprint — see the dedicated Wings vs FastAPI writeup (p50 0.31 ms vs 28 ms under load, 6.7 MB vs 54 MB RSS, 2 MB self-contained binary vs Python + ~50 MB of deps).
What got us here
Hot-path optimisations applied:
- call(handler_name) dlsym cache (256-slot FNV-1a hash): eliminates the symbol-table walk per request.
- TCP_NODELAY on accept: removes Nagle’s 40 ms batching delay on small JSON responses (+13% req/sec).
- Static thread-local recv buffer: drops a 64 KB malloc/free pair per keep-alive request.
- Per-request arena reset + per-request malloc region: bounded memory on long-running servers without leaking.
- Thread-local scratch buffers in built-ins: non-TLS statics raced under listen_pool (a toString buffer caused ~1.1% spurious 404s until fixed).
- Real break/continue codegen: was silently no-op’d before; prevents LLVM from emitting suboptimal phi nodes around induction variables.
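The first item in the list can be sketched as a small direct-mapped cache keyed by a 32-bit FNV-1a hash. This is an illustrative sketch, not Tulpar's runtime code: the struct layout, the 64-byte name limit, the overwrite-on-collision policy, and the demo_resolve stand-in for a dlsym-based lookup are all assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 256   /* power of two, so (hash & 255) selects a slot */

/* Slow-path resolver; in a real runtime this would wrap dlsym(). */
typedef void *(*resolve_fn)(const char *name);

struct cache_slot {
    char  name[64];       /* empty string marks an unused slot */
    void *addr;
};

struct symbol_cache {
    struct cache_slot slots[CACHE_SLOTS];
    resolve_fn        resolve;
};

/* Standard 32-bit FNV-1a: XOR each byte in, then multiply by the prime. */
uint32_t fnv1a_32(const char *s) {
    uint32_t h = 2166136261u;             /* FNV offset basis */
    for (; *s != '\0'; ++s) {
        h ^= (unsigned char)*s;
        h *= 16777619u;                   /* FNV prime */
    }
    return h;
}

/* Return the address for `name`, consulting the cache before the resolver. */
void *cached_lookup(struct symbol_cache *c, const char *name) {
    size_t len = strlen(name);
    if (len == 0 || len >= sizeof c->slots[0].name)
        return c->resolve(name);          /* not cacheable: always slow path */

    struct cache_slot *s = &c->slots[fnv1a_32(name) & (CACHE_SLOTS - 1)];
    if (strcmp(s->name, name) == 0)
        return s->addr;                   /* hit: no symbol-table walk */

    void *addr = c->resolve(name);        /* miss: resolve once ... */
    if (addr != NULL) {
        memcpy(s->name, name, len + 1);   /* ... then remember (may evict) */
        s->addr = addr;
    }
    return addr;
}

/* Hypothetical stand-in for dlsym so the sketch is testable in isolation. */
static int demo_index_handler, demo_health_handler;
int demo_resolve_calls = 0;

void *demo_resolve(const char *name) {
    demo_resolve_calls++;
    if (strcmp(name, "index") == 0)  return &demo_index_handler;
    if (strcmp(name, "health") == 0) return &demo_health_handler;
    return NULL;
}
```

After the first request for a given handler name, later lookups cost one hash and one string compare. The sketch is single-threaded; under a multi-threaded server such a cache would need to be per-thread or otherwise synchronised, in the same spirit as the thread-local scratch-buffer fix listed above.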
Trade-offs we explicitly skipped (analysed, low value):
- Object-key inline caching (req["method"], ~0.3% of HTTP path).
- String concat coalescing (a + b + c, ~0.1% of HTTP path).
Methodology
- CPU: gcc -O2 (Linux gcc 15 / Windows MinGW64). Best-of-N wall-clock. The workload input is opaque to the compiler (read at runtime) so -O2 can’t constant-fold it, and N is large enough that process startup is negligible.
- HTTP: native benchmarks/loadtest, single box, keep-alive, concurrency kept ≤ core count so the load generator and server don’t fight for CPU. Toolchains: Go 1.23, Node 22, Python 3.14 + FastAPI/uvicorn. Servers run in their default single-process configuration (core usage labelled per row).
- Absolute numbers are box-specific; treat the ratios and latency as the portable signal. Reproduce CPU via benchmarks/run_benchmarks.sh; the HTTP servers + driver used here are minimal equivalents returning the same JSON.
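A best-of-N wall-clock measurement and the resulting ratio against a baseline can be sketched as follows. This is not the contents of benchmarks/run_benchmarks.sh; the helper names and the count_call stand-in workload are hypothetical, and the C11 timespec_get clock is used only for portability (a monotonic clock is preferable where one is available).

```c
#include <time.h>

/* Wall-clock timestamp in milliseconds via C11 timespec_get. */
double now_ms(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1e3 + (double)ts.tv_nsec / 1e6;
}

/* Run work(arg) `runs` times and return the fastest elapsed time in ms.
   Taking the minimum filters out runs disturbed by scheduling noise.
   Returns -1.0 if runs < 1. */
double best_of_n_ms(void (*work)(void *), void *arg, int runs) {
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        double start = now_ms();
        work(arg);
        double elapsed = now_ms() - start;
        if (i == 0 || elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Slowdown relative to a baseline, e.g. Tulpar time divided by C time. */
double ratio_vs_baseline(double candidate_ms, double baseline_ms) {
    return candidate_ms / baseline_ms;
}

/* Trivial stand-in workload: counts how many times it was invoked. */
void count_call(void *arg) {
    (*(int *)arg)++;
}
```

Applied to the Windows row of the CPU table, ratio_vs_baseline(114.0, 83.0) gives about 1.37, matching the reported ratio.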