Blog archives for topic microarchitecture-archeology

Dependency-breaking zeroing XOR in P6

Published on the 8th of April 2021

Topics: microarchitecture-archeology, mythbusting

In x86 assembly language, a common idiom for setting the value of a register to 0 is to use the exclusive-or instruction with both operands being the same register, such as xor eax, eax. It was originally intended as a size optimization: the obvious mov eax, 0 is encoded as five bytes, of which four are used to store the constant 0, while the exclusive-or solution needs merely two, and is equally as fast, so it quickly became widespread.

By the time the P6 microarchitecture was being designed, the xor zeroing idiom was already nigh-universal in compiler output and hand-written assembly alike, so it was specifically recognized as a zeroing idiom for the purpose of avoiding partial register stalls in code such as this:

x86 assembly
    xor eax, eax
    mov al, [ecx]
    ; use eax

In code tuned for the original Pentium or earlier processors, this was the usual way of zero-extending an 8-bit (or 16-bit with ax instead of al) value into the full 32-bit register, as the movzx instruction was slower. P6, starting from the very first Pentium Pro, recognized that after a xor of a register with itself, the register held 0, which avoided the partial register stall that would otherwise occur when modifying a low part of a register followed by operations on the full 32 bits.

Unfortunately, the Pentium Pro as originally designed was too ambitious to be realized using then-available lithography technology without making the chip too big — and thus too prone to manufacturing defects — so some features had to go. Segment register renaming and beefier secondary decoders were some of the notable victims of that process.

I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.

Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.

Whatever the case may be, there's only one way to make sure: test it!

Read the full article…

Hardware prefetching in Pentium III

Published on the 25th of October 2019

Topics: microarchitecture-archeology

Nowadays pretty much every processor featuring cache hierarchy has some form of automatic memory prefetch mechanism. In small low-power processors it may be very simplistic and only prefetch if the addresses are increasing, in high-performance processors it may try to predict all sorts of regular and semi-regular patterns, but the basic mechanism will be present in some form.

We weren't, however, always so lucky. The first mainstream x86 processor with hardware prefetching logic was the Pentium 4 Willamette, first released in November 2000, which used the highly contentious NetBurst microarchitecture. Due to Willamette's mediocrity, Intel didn't kill off their older P6 microarchitecture, and so the hardware prefetching was later supposedly added to the P6-based Pentium III Tualatin (released in June 2001), where its efficacy was disputed due to much lower CPU-chipset bandwidth compared to NetBurst.

The P6 microarchitecture introduced in Pentium Pro, in 1995, and used with minor changes in Pentium II and III, is a direct ancestor of Intel's current high-performance x86 microarchitectures — at the time of writing, Ice Lake for mobile, Skylake for everything else. There were many improvements in the last 24 years, and most major bottlenecks were fixed, but the family resemblance is still there.

P6's direct successor was Pentium M, created for mobile processors due to NetBurst's high power draw rendering it unsuitable for those applications. The Core microarchitecture powering the famous Core 2 series, that let Intel take the performance crown back from AMD after years of humiliation due to their marketing-driven, physics-ignoring bet on NetBurst, was a direct successor of Pentium M.

But how effective was the hardware prefetching in Tualatin, really? And is it really true that Coppermine, the Pentium III revision preceding Tualatin, didn't feature automatic prefetching? Fortunately, by choosing an appropriate test case and using CPU performance counters, it is possible to perform a gray-box test that answers these questions.

Read the full article…