Fanael's random ruminations

The legend of "x86 CPUs decode instructions into RISC form internally"

Topics: microarchitectural-musings, mythbusting

There is a widespread idea that modern high-performance x86 processors work by decoding the "complex" x86 instructions into "simple" RISC-like instructions that the rest of the pipeline then operates on. But how close is this idea to how the processors actually work internally?

To answer this question, let's analyze how different x86 processors, ranging from the first "modern" Intel microarchitecture, P6, to their current designs, handle the following simple loop (the code is 32-bit just to allow us to discuss very old x86 processors):

x86 assembly
.loop:
    add [edx], eax
    add edx, 4
    sub eax, 1
    jnz .loop
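To make the loop's effect concrete, here is a minimal C equivalent (my own sketch, assuming edx holds a pointer to an array of 32-bit words and eax holds a non-zero initial counter):

```c
#include <assert.h>
#include <stdint.h>

/* C equivalent of the x86 loop: 'p' plays the role of edx,
   'n' the role of eax. Each iteration adds the current counter
   value to *p, advances p by one 32-bit word (4 bytes),
   decrements the counter, and repeats while the result of the
   decrement is non-zero. */
static void loop(uint32_t *p, uint32_t n) {
    do {
        *p += n;      /* add [edx], eax */
        p += 1;       /* add edx, 4 */
        n -= 1;       /* sub eax, 1 */
    } while (n != 0); /* jnz .loop */
}
```

The do-while matches the assembly exactly: the body executes at least once, and the branch tests the result of the decrement, just as jnz tests the flags set by sub.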

RISC-V: setting the stage

First things first, though: what would that code look like on a RISC architecture? We need this to establish a target to compare to. There are many different RISC architectures in the wild, so let's just arbitrarily choose RISC-V, because it's free and open:

RISC-V assembly
.loop:
    lw a2, 0(a0)
    add a2, a2, a1
    sw a2, 0(a0)
    addi a1, a1, -1
    addi a0, a0, 4
    bnez a1, .loop

This is pretty much what we'd expect on any architecture that closely follows RISC principles: the "complex" operation of adding a register to a memory location is split into three instructions, namely a memory load into a temporary register, then an integer addition that operates entirely on registers, and finally a memory store. The other three x86 instructions (addition of a constant to a register, subtraction of a constant, and a conditional branch) are already simple enough, and are therefore almost identical in RISC-V.

There's a difference in the branches between RISC-V and x86 here: x86 provides conditional branches based on flag bits set by arithmetic operations, like the sub in our example, whereas RISC-V's branches work by directly comparing the values of two registers, here a1 and the always-zero x0. The difference doesn't matter much in this case: since we're testing the result of an arithmetic operation against zero, it comes down to RISC-V needing to be told which register to compare to zero, while on x86 the value tested is implicitly the result of the preceding arithmetic operation.

P6: long, long ago

Introduced in 1995 with the Pentium Pro, P6 was the first out-of-order x86 microarchitecture. It was later used with minor changes in the Pentium II and III, and is a direct ancestor of Intel's modern high-performance microarchitectures. How does it handle our loop?

The first instruction, add [edx], eax, is decoded into the following four micro-operations:

  1. Load a 32-bit value from the address contained in edx into a temporary unnamed register.
  2. Add the value in eax to the value loaded by the first micro-operation, affecting flags.
  3. Send the result of the addition to the memory store unit. Note that the address of the store is not used here!
  4. Send the address contained in edx to the memory store unit.

The following three instructions are all simple, decoding directly into one micro-operation each.

We end up with seven micro-operations total. That's actually one more than our RISC-V code has instructions, because the memory store is split into two! The reason behind splitting the memory store into two micro-operations is a design quirk: each micro-operation in P6 can have at most two inputs. The x86 architecture supports addressing modes of the form register + register * scale + constant, which have two register inputs. The data to store to memory is another input, bringing us to three… which is one more than we can encode. Therefore, stores are split into a "store data" micro-operation with one input, and a "store address" micro-operation with two.
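The two-input constraint can be illustrated with a toy model (my own sketch, not Intel's actual internal encoding): a micro-operation carries at most two register inputs, so a store with a base register, an index register, and a data register simply doesn't fit in one.

```c
#include <assert.h>

/* Toy model of a P6-style micro-operation: an opcode plus at
   most two register inputs. Field names and values are
   illustrative, not Intel's real internal format. */
enum opcode { LOAD, ALU_ADD, STORE_DATA, STORE_ADDR };

struct uop {
    enum opcode op;
    int input1; /* first source register, -1 if unused */
    int input2; /* second source register, -1 if unused */
};

/* A store to [base + index*scale] has three inputs: base,
   index, and the data to store. With only two input slots per
   micro-operation, P6 must emit two micro-operations. */
static int uops_for_store(int base, int index, int data,
                          struct uop out[2]) {
    out[0] = (struct uop){ STORE_DATA, data, -1 };
    out[1] = (struct uop){ STORE_ADDR, base, index };
    return 2;
}
```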

Verdict: the legend is absolutely true for P6. The micro-operations are very RISC-like, arguably more RISC-like than an actual RISC architecture due to an implementation quirk.

Pentium M: introduction of micro-fusion

The successor of P6 was Pentium M, where the "M" likely stood for "mobile". It was used primarily in laptops, where the contemporary NetBurst wasn't feasible due to its excessive power consumption and thermal requirements. There were a few Pentium M-based server processors, and there were socket adapters that let one use these processors in standard desktop socket 478 motherboards, but they weren't very common.

Pentium M introduced micro-operation fusion, or micro-fusion for short, where some pairs of micro-operations decoded from the same instruction could be joined together. These fused pairs were kept together in as much of the pipeline as possible: they were generated as one by the instruction decoders, they were treated as one micro-operation by the register renaming mechanism, they used a single entry in the reorder buffer, and they were treated as one at retirement. Just about the only place where they weren't treated as one micro-operation was in the execution units themselves, as for example the memory load unit wouldn't know what to do with integer addition, so it never received that portion of the micro-fused pair. Therefore, arguably, micro-fused pairs were single micro-operations for all intents and purposes, and splitting was just an implementation detail of the execution units.
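A rough way to picture this (again a toy model of my own, following the article's description rather than any official documentation) is to count how many tracking entries the front end allocates versus how many operations the execution units ultimately see for add [edx], eax:

```c
#include <assert.h>

/* Toy accounting of 'add [edx], eax'. The numbers follow the
   descriptions in this article; they are not vendor-documented
   figures. */
struct decode_cost {
    int rob_entries; /* entries in the reorder buffer */
    int exec_ops;    /* operations the execution units see */
};

static struct decode_cost add_mem_reg_p6(void) {
    /* load + add + store-data + store-address, nothing fused */
    return (struct decode_cost){ 4, 4 };
}

static struct decode_cost add_mem_reg_pentium_m(void) {
    /* store-data and store-address micro-fused: one ROB entry
       fewer, but the store unit still executes both halves */
    return (struct decode_cost){ 3, 4 };
}
```

The key point is that micro-fusion shrinks the bookkeeping cost (rename, reorder buffer, retirement) without changing what the execution units actually do.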

The only instruction that was decoded into multiple micro-operations in P6 was the addition to a memory location. Can it benefit from micro-fusion in Pentium M? Indeed it can: it's now decoded into three micro-operations that pretty much exactly match how our RISC-V code performs the same operation:

  1. Load a 32-bit value from the address contained in edx into a temporary unnamed register.
  2. Add the value in eax to the value loaded by the first micro-operation, affecting flags.
  3. Send the result of the addition and the address to the memory store unit.

The addition, the subtraction and the conditional branch all remain single micro-operations in Pentium M.

Verdict: we have a perfect match between our RISC-V code and Pentium M micro-operations in this particular case.

Core: improved micro-fusion

The immensely successful Core architecture was first released in 2006, when Intel finally realized that NetBurst was a dead-end, and took their mobile-focused Pentium M, which was derived from P6, and improved it further, giving us the legendary Core 2 Solo/Duo/Quad processors.

The improvements over Pentium M were many, such as AMD64 support, an additional instruction decoder and proper 128-bit vector execution units, but the one we're interested in is improved micro-fusion.

In Pentium M, there were two cases where micro-fusion applied: the two parts of a store could be fused together, and memory loads could be fused together with common arithmetic operations in instructions like add eax, [edx]. Unfortunately, when the memory location was the destination operand, Pentium M could only fuse the two parts of the store. Core, however, lifted this restriction, allowing it to perform both kinds of micro-fusion at once. Thus on Core the first instruction of our loop is decoded into just two micro-operations:

  1. Load a 32-bit value from the address contained in edx into a temporary register, and add the value in eax to it, affecting flags.
  2. Send the result of the addition and the address in edx to the memory store unit.

As before, the remaining three instructions were just decoded into one micro-operation each.

Verdict: things are getting murky here: "add the value from a memory location to a register" is not really a very RISC-like operation.

Sandy Bridge: (improved) branch fusion

2011's Sandy Bridge was the first P6-derived design based on a physical register file, finally solving the problem of permanent register read stalls that plagued P6 and its descendants up to this point. In many other ways, it was a significant yet merely evolutionary improvement over the preceding Core and Nehalem microarchitectures.

What's important to us here though is the improvements to branch fusion that Sandy Bridge introduced over its predecessors.

Branch fusion, often referred to by the more general term macro-fusion — but in all currently existing x86 processors it's restricted to branches, so I'll stick with the more precise term — is the act of fusing together a branch and a preceding arithmetic instruction, typically a comparison. In x86 it was actually first introduced in Core, but there it was restricted to fusing cmp and test instructions with the immediately following conditional branch in certain cases. What we have is a regular subtraction, so Core wasn't able to fuse it.

Sandy Bridge, however, recognizes more patterns as eligible for branch fusion. Our pattern, a subtraction of an immediate from a register followed by a jump if the result was not zero, is among these new patterns. On Sandy Bridge and newer processors our entire loop thus decodes to four micro-operations:

  1. Load a 32-bit value from the address contained in edx into a temporary register, and add the value in eax to it, affecting flags.
  2. Send the result of the addition and the address in edx to the memory store unit.
  3. Add 4 to edx, affecting flags.
  4. Subtract 1 from eax, affecting flags, jumping back to the start of the loop if the result is not zero.

In fact, on Sandy Bridge and newer sub with register destination can be fused with a following branch even if the second operand is a memory location, meaning the sequence sub eax, [rdx] followed by jz .foo can be fused into a single micro-operation!

Verdict: on Sandy Bridge, and newer high-performance Intel processors, the correspondence between micro-operations and RISC instructions is completely lost. In fact, we have as many micro-operations as the "complex" x86 instructions we started with.

What about AMD processors and Intel Atoms? Or NetBurst (shudders)?

So far, this article focused entirely on P6-derived microarchitectures, but there are several other out-of-order x86 microarchitecture families enjoying, or having enjoyed, significant use: NetBurst (which was atrocious), the Intel Atom family, AMD K7/K8/K10, AMD Bobcat/Jaguar/Puma, AMD's Bulldozer family (which was as bad as NetBurst, but in different ways), and last but definitely not least AMD Zen (which I consider amazing).

Let's focus on everything on that list but NetBurst first. The reason for this grouping is simple: none of these microarchitectures ever split instructions like add [edx], eax into multiple micro-operations. And none of them are capable of fusing a branch with a preceding subtraction — the Bulldozer and Zen families can perform branch fusion, but only cmp and test instructions are eligible. If we feed any of these processors our loop of four instructions, they will decode it into four micro-operations, each corresponding directly to one of the original instructions.

As for NetBurst, it behaved very much like P6 in this regard, with the exception that stores which used addressing modes with only one input register were kept as one micro-operation, only complex addressing modes required splitting into store-data and store-address; our example loop was thus decoded to six micro-operations corresponding directly to the RISC-V code.
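Pulling together the counts from the sections above, here is a small summary sketch of how many micro-operations each family's decoders produce for our four-instruction loop (the numbers restate this article's claims, not vendor-documented figures):

```c
#include <assert.h>
#include <string.h>

/* Micro-operation counts for the example loop, per the text:
   P6 splits the store in two (7); Pentium M fuses the store
   (6); Core also fuses the load with the add (5); Sandy Bridge
   additionally fuses sub+jnz (4). NetBurst keeps this simple
   store whole (6); the Atom and AMD families listed decode
   each instruction into exactly one micro-operation (4). */
static int loop_uops(const char *uarch) {
    if (strcmp(uarch, "P6") == 0)           return 7;
    if (strcmp(uarch, "Pentium M") == 0)    return 6;
    if (strcmp(uarch, "Core") == 0)         return 5;
    if (strcmp(uarch, "Sandy Bridge") == 0) return 4;
    if (strcmp(uarch, "NetBurst") == 0)     return 6;
    /* Atom, K7/K8/K10, Bobcat/Jaguar/Puma, Bulldozer, Zen */
    return 4;
}
```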

Final verdict

There is some truth to the story that x86 processors decode instructions into RISC-like form internally. This was, in fact, pretty much how P6 worked; later improvements, however, made the correspondence tenuous at best. Some microarchitecture families, on the other hand, never did anything of the sort, so for them the statement was never anywhere near true.