Dependency-breaking zeroing XOR in P6
Published on the
Topics: microarchitecture-archeology, mythbusting
In x86 assembly language, a common idiom for setting the value of a register to 0 is to use the exclusive-or instruction with both operands being the same register, such as xor eax, eax
. It was originally intended as a size optimization: the obvious mov eax, 0
is encoded as five bytes, of which four are used to store the constant 0, while the exclusive-or solution needs merely two, and is equally as fast, so it quickly became widespread.
By the time the P6 microarchitecture was being designed, the xor
zeroing idiom was already nigh-universal in compiler output and hand-written assembly alike, so it was specifically recognized as a zeroing idiom for the purpose of avoiding partial register stalls in code such as this:
x86 assembly
xor eax, eax
mov al, [ecx]
; use eax
In code tuned for the original Pentium or earlier processors, this was the usual way of zero-extending an 8-bit (or 16-bit with ax
instead of al
) value into the full 32-bit register, as the movzx
instruction was slower. P6, starting from the very first Pentium Pro, recognized that after a xor
of a register with itself, the register held 0, which avoided the partial register stall that would otherwise occur when modifying a low part of a register followed by operations on the full 32 bits.
Unfortunately, the Pentium Pro as originally designed was too ambitious to be realized using then-available lithography technology without making the chip too big — and thus too prone to manufacturing defects — so some features had to go. Segment register renaming and beefier secondary decoders were some of the notable victims of that process.
I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.
Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.
Whatever the case may be, there's only one way to make sure: test it!
Read the full article…