Python 3.15’s interpreter for Windows x86-64 should hopefully be 15% faster | Ken Jin

By lumpa | December 25, 2025 | Hacker News: Front Page

24 December 2025

Some time ago I posted an apology piece for Python’s tail calling results. I apologized for communicating performance results without noticing a compiler bug had occurred.

I can proudly say today that I am partially retracting that apology, but only for two platforms: macOS AArch64 (Xcode Clang) and Windows x86-64 (MSVC).

In our own experiments, the tail calling interpreter for CPython was found to beat the computed goto interpreter by 5% on pyperformance on AArch64 macOS using Xcode Clang, and by roughly 15% on pyperformance on Windows using an experimental internal version of MSVC. The Windows comparison is against a switch-case interpreter, but in theory this shouldn’t matter too much; more on that in the next section.

This is, of course, a hopefully accurate result. I tried to be more diligent here, but I am of course not infallible. However, I have found that sharing early and making a fool of myself often works well, as it has led to people catching bugs in my code, so I shall continue doing so :).

Also, this assumes the change doesn’t get reverted later in Python 3.15’s development cycle.

Brief background on interpreters

Just a recap. There are currently two popular ways of writing C-based interpreters.

Switch-cases:

    switch (opcode) {
        case INST_1: ...
        case INST_2: ...
    }

Where we just switch-case to the correct instruction handler.

And the other popular way is a GCC/Clang extension called labels-as-values/computed gotos.

    goto *dispatch_table[opcode];
    INST_1: ...
    INST_2: ...

Which is basically the same idea, but instead jumps directly to the address of the next label. Traditionally, the key optimization here is that it needs only one jump to go to the next instruction, while in the switch-case interpreter, a naive compiler would need two jumps (one back to the top of the switch, and one from there to the handler).
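To make the two dispatch styles above concrete, here is a minimal sketch of both on a toy accumulator machine. The opcodes (`OP_INC`, `OP_DOUBLE`, `OP_HALT`) and the function names are invented for illustration and have nothing to do with CPython's real bytecode. The second function relies on the GCC/Clang labels-as-values extension, so it is not expected to build with MSVC.

```c
#include <assert.h>

/* Hypothetical three-instruction bytecode, invented for this sketch. */
enum { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Switch-case dispatch: every handler returns to one shared switch. */
int run_switch(const unsigned char *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return -1;  /* unknown opcode */
        }
    }
}

/* Computed-goto dispatch: every handler ends with its own indirect jump
   straight to the next handler, so there is no shared dispatch point. */
int run_goto(const unsigned char *code) {
    static void *const dispatch_table[OP_COUNT] = {
        [OP_INC]    = &&inst_inc,
        [OP_DOUBLE] = &&inst_double,
        [OP_HALT]   = &&inst_halt,
    };
    int acc = 0;

/* Jumping through the table with an out-of-range opcode would be undefined,
   so the sketch checks the opcode first. */
#define DISPATCH() \
    do { assert(*code < OP_COUNT); goto *dispatch_table[*code++]; } while (0)

    DISPATCH();
inst_inc:
    acc += 1;
    DISPATCH();
inst_double:
    acc *= 2;
    DISPATCH();
inst_halt:
    return acc;
#undef DISPATCH
}
```

Running the program `{ OP_INC, OP_INC, OP_DOUBLE, OP_INC, OP_HALT }` through either function should yield the same accumulator value; only the way control reaches the next handler differs.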
With modern compilers, however, the benefit of computed gotos is much smaller, mainly because both compilers and hardware have gotten better. In Nelson Elhage’s excellent investigation on the next kind of interpreter, the speedup of computed gotos over switch-case on modern Clang was only in the low single digits on pyperformance.

A third way, suggested decades ago but not entirely feasible, is the call/tail-call threaded interpreter. In this scheme, each bytecode handler is its own function, and we tail-call from one handler to the next in the instruction stream:

    return dispatch_table[opcode](...);

    PyObject *INST_1(...) { }
    PyObject *INST_2(...) { }

This wasn’t too feasible in C for one main reason: tail call optimization was merely an optimization. It’s something the C compiler might do, or might not do. This means that if you’re unlucky and the C compiler chooses not to perform the tail call, your interpreter might overflow the stack!

Some time ago, Clang introduced __attribute__((musttail)), which mandates that a call must be tail-called; otherwise, compilation fails. To my knowledge, the first time this was popularized for use in a mainstream interpreter was in Josh Haberman’s...
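The tail-call scheme described above can be sketched on a toy accumulator machine as follows. This is only an illustration under stated assumptions: the opcodes, the handler signature, and the `MUSTTAIL` and `DISPATCH` macros are hypothetical and are not taken from CPython. The `MUSTTAIL` macro expands to `__attribute__((musttail))` when the compiler reports support for it and to nothing otherwise, in which case the tail call is back to being merely an optimization.

```c
#include <assert.h>

/* Hypothetical three-instruction bytecode, invented for this sketch. */
enum { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Use musttail where available; otherwise fall back to a plain call that
   the compiler may or may not turn into a jump. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

/* Every handler shares one signature, which musttail requires. */
typedef int handler_fn(const unsigned char *code, int acc);

handler_fn inst_inc, inst_double, inst_halt;

static handler_fn *const dispatch_table[OP_COUNT] = {
    [OP_INC]    = inst_inc,
    [OP_DOUBLE] = inst_double,
    [OP_HALT]   = inst_halt,
};

/* Tail-call the handler for the next opcode. Handlers assume the bytecode
   is valid; only the entry point checks the first opcode. */
#define DISPATCH(code, acc) \
    MUSTTAIL return dispatch_table[*(code)]((code) + 1, (acc))

int inst_inc(const unsigned char *code, int acc)    { acc += 1; DISPATCH(code, acc); }
int inst_double(const unsigned char *code, int acc) { acc *= 2; DISPATCH(code, acc); }
int inst_halt(const unsigned char *code, int acc)   { (void)code; return acc; }

/* Entry point: an ordinary call into the first handler. */
int run_tail(const unsigned char *code) {
    assert(*code < OP_COUNT);
    return dispatch_table[*code](code + 1, 0);
}
```

Interpreter state (here just `code` and `acc`) travels as function arguments rather than as locals of one large function, so each handler is a small function the compiler can optimize on its own.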
