Blog archives for topic mythbusting - Fanael's random ruminations

Dependency-breaking zeroing XOR in P6

Published on the 8th of April 2021

Topics: microarchitecture-archeology, mythbusting

In x86 assembly language, a common idiom for setting the value of a register to 0 is to use the exclusive-or instruction with both operands being the same register, such as xor eax, eax. It was originally intended as a size optimization: the obvious mov eax, 0 is encoded as five bytes, of which four are used to store the constant 0, while the exclusive-or solution needs merely two, and is equally as fast, so it quickly became widespread.

By the time the P6 microarchitecture was being designed, the xor zeroing idiom was already nigh-universal in compiler output and hand-written assembly alike, so it was specifically recognized as a zeroing idiom for the purpose of avoiding partial register stalls in code such as this:

x86 assembly
    xor eax, eax
    mov al, [ecx]
    ; use eax

In code tuned for the original Pentium or earlier processors, this was the usual way of zero-extending an 8-bit (or 16-bit with ax instead of al) value into the full 32-bit register, as the movzx instruction was slower. P6, starting from the very first Pentium Pro, recognized that after a xor of a register with itself, the register held 0, which avoided the partial register stall that would otherwise occur when modifying a low part of a register followed by operations on the full 32 bits.

Unfortunately, the Pentium Pro as originally designed was too ambitious to be realized using then-available lithography technology without making the chip too big — and thus too prone to manufacturing defects — so some features had to go. Segment register renaming and beefier secondary decoders were some of the notable victims of that process.

I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.

Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.

Whatever the case may be, there's only one way to make sure: test it!

Read the full article…

The legend of "x86 CPUs decode instructions into RISC form internally"

Published on the 30th of June 2020

Topics: microarchitectural-musings, mythbusting

There is a widespread idea that modern high-performance x86 processors work by decoding the "complex" x86 instructions into "simple" RISC-like instructions that the rest of the pipeline then operates on. But how close is this idea to how the processors actually work internally?

To answer this question, let's analyze how different x86 processors, ranging from the first "modern" Intel microarchitecture, P6, to their current designs, handle the following simple loop (the code is 32-bit just to allow us to discuss very old x86 processors):

x86 assembly
.loop:
    add [edx], eax
    add edx, 4
    sub eax, 1
    jnz .loop

Read the full article…