Hardware prefetching in Pentium III
Published on the
Topics: microarchitecture-archeology
Nowadays pretty much every processor featuring cache hierarchy has some form of automatic memory prefetch mechanism. In small low-power processors it may be very simplistic and only prefetch if the addresses are increasing, in high-performance processors it may try to predict all sorts of regular and semi-regular patterns, but the basic mechanism will be present in some form.
We weren't, however, always so lucky. The first mainstream x86 processor with hardware prefetching logic was the Pentium 4 Willamette, first released in November 2000, which used the highly contentious NetBurst microarchitecture. Due to Willamette's mediocrity, Intel didn't kill off their older P6 microarchitecture, and so the hardware prefetching was later supposedly added to the P6-based Pentium III Tualatin (released in June 2001), where its efficacy was disputed due to much lower CPU-chipset bandwidth compared to NetBurst.
The P6 microarchitecture introduced in Pentium Pro, in 1995, and used with minor changes in Pentium II and III, is a direct ancestor of Intel's current high-performance x86 microarchitectures — at the time of writing, Ice Lake for mobile, Skylake for everything else. There were many improvements in the last 24 years, and most major bottlenecks were fixed, but the family resemblance is still there.
P6's direct successor was Pentium M, created for mobile processors due to NetBurst's high power draw rendering it unsuitable for those applications. The Core microarchitecture powering the famous Core 2 series, that let Intel take the performance crown back from AMD after years of humiliation due to their marketing-driven, physics-ignoring bet on NetBurst, was a direct successor of Pentium M.
But how effective was the hardware prefetching in Tualatin, really? And is it really true that Coppermine, the Pentium III revision preceding Tualatin, didn't feature automatic prefetching? Fortunately, by choosing an appropriate test case and using CPU performance counters, it is possible to perform a gray-box test that answers these questions.
The test
The thing I want to test is whether or not Pentium III's prefetcher, if existent and effective, can handle accessing memory linearly with large variable strides. The first algorithm with such an access pattern that came to my mind was Shell sort. When used with the 3-smooth numbers (i.e. numbers of the form 2p3q, with natural p and q) as the gap sequence, its time complexity is O(nlog2n), which is practical even for large values of n.
As a comparison baseline the obvious choice is then heap sort, due to its infamy for being extremely cache-hostile and already being available in the C++ standard library. In fact, if comparisons and swaps are cheap, heap sort's cache hostility can and does make up for its theoretical complexity advantage, making it not much faster than 3-smooth Shell sort. On my Core i5-4590 (Haswell) the difference between the two algorithms is indeed less than a factor of two:
Shell (interactive) $ g++ -O3 p3-prefetch-shell-sort.cc && time ./a.out # heap sort ./a.out 2,77s user 0,00s system 99% cpu 2,771 total $ g++ -O3 -DUSE_SHELL_SORT p3-prefetch-shell-sort.cc && time ./a.out # Shell sort ./a.out 4,52s user 0,00s system 99% cpu 4,517 total
The implementation of Shell sort has been lifted from Wikipedia and C++-ified with no other significant changes:
C++
void shell_sort(std::uint32_t* array, std::size_t length)
{
if(length <= 1) {
return;
}
auto current_gap = std::find_if(std::begin(gaps), std::end(gaps),
[&](std::size_t gap){return gap >= length;});
do {
--current_gap;
const auto gap = *current_gap;
for(std::size_t i = gap; i < length; ++i) {
auto temp = std::move(array[i]);
std::size_t j = i;
for(; j >= gap && array[j - gap] > temp; j -= gap) {
array[j] = std::move(array[j - gap]);
}
array[j] = std::move(temp);
}
} while(current_gap > std::begin(gaps));
}
The complete source of the benchmark program is available under the CC0 license.
In practice, neither algorithm is a very good choice for general-purpose sorting; instead, use something really cache-friendly and theoretically efficient, like introsort or merge sort, one of these is almost certainly used to implement the sorting routine in the standard library of your language of choice already, so just use that. Shell sort and heap sort were only chosen here for their memory access patterns.
With that code, a Coppermine Pentium III at 1 GHz (thanks to daemon32 for running the test!), a Tualatin Pentium III at 1.13 GHz, a Dothan Pentium M at 1.6 GHz for comparison purposes, GCC 8, Intel's performance event register reference, and the excellent Linux perf tool, let's go measuring.
The results
There is a problem. The Intel performance monitoring event reference doesn't mention anything about prefetching in its P6 chapter! Prefetching events are mentioned in the Pentium M reference, however, and considering P6 family history, it's not unreasonable to believe the same events would be used in Pentium III. These events are, in syntax suitable for perf stat -e
, cpu/event=0xf0,name=upward-prefetches/ and cpu/event=0xf8,name=downward-prefetches/. They count the number of prefetch requests, and therefore the number of prefetched cache lines. This will be important later on.
So first things first: these events don't report anything on Coppermine, always showing zeros. This adds credence to the idea that Coppermine and earlier P6 models didn't have hardware prefetching unit. On Tualatin, however, things are more interesting, so without further ado, let's dive into the performance counter stats!
Coppermine
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 33,276.77 | msec | 0.999 CPUs utilized | |
Context switches | 204 | 0.006 K/sec | ||
Page faults | 14,715 | 0.442 K/sec | ||
Cycles | 32,854,157,356 | 0.987 GHz | 20.00% | |
Stalled cycles, frontend | 25,399,721,497 | 77.31% frontend cycles idle | 20.01% | |
Instructions | 7,109,652,282 | 0.22 instructions per cycle 3.57 stalled cycles per instruction | 20.01% | |
Branches | 1,188,575,659 | 35.718 M/sec | 20.01% | |
Mispredicted branches | 196,353,276 | 16.52% of all branches | 20.00% | |
Elapsed time | 33.307 | seconds |
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 133,899.70 | msec | 0.999 CPUs utilized | |
Context switches | 647 | 0.005 K/sec | ||
Page faults | 14,717 | 0.110 K/sec | ||
Cycles | 132,163,280,671 | 0.987 GHz | 20.00% | |
Stalled cycles, frontend | 96,185,812,551 | 72.78% frontend cycles idle | 20.00% | |
Instructions | 53,356,713,225 | 0.40 instructions per cycle 1.80 stalled cycles per instruction | 20.00% | |
Branches | 8,501,189,300 | 63.489 M/sec | 20.00% | |
Mispredicted branches | 269,043,077 | 3.16% of all branches | 20.00% | |
Elapsed time | 133.983 | seconds |
The first observation: it's slow. But more importantly, it tells us why it is slow: the out-of-order engine is full, but instructions cannot retire because they're still waiting on memory accesses to complete (as there are no other significant sources of stalls in this program), which stalls the front-end too. Heap sort stalls a lot because it's heap sort; Shell sort stalls a lot, albeit a bit less, because with large gap sizes, at least until they get small enough to resemble insertion sort cache-wise, the program performs lots of memory accesses to far away addresses that either haven't had a chance to get into cache yet, or have already been evicted to make room for newer ones, and there's no prefetching to do it ahead-of-time.
Tualatin
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 31,190.69 | msec | 0.993 CPUs utilized | |
Page faults | 1,429 | 0.046 K/sec | ||
Cycles | 33,344,788,971 | 1.069 GHz | 28.53% | |
Stalled cycles, frontend | 25,490,259,991 | 76.44% frontend cycles idle | 28.60% | |
Instructions | 7,084,795,042 | 0.21 instructions per cycle 3.60 stalled cycles per instruction | 28.60% | |
Branches | 1,221,344,046 | 39.157 M/sec | 28.59% | |
Mispredicted branches | 195,036,348 | 15.97% of all branches | 28.59% | |
Upward prefetches | 71,322,504 | 2.287 M/sec | 28.57% | |
Downward prefetches | 0 | 0.000 K/sec | 28.51% | |
Elapsed time | 31.404 | seconds |
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 72,946.84 | msec | 0.991 CPUs utilized | |
Page faults | 1,434 | 0.020 K/sec | ||
Cycles | 78,228,149,161 | 1.072 GHz | 28.58% | |
Stalled cycles, frontend | 43,812,136,162 | 56.01% frontend cycles idle | 28.58% | |
Instructions | 53,121,702,516 | 0.68 instructions per cycle 0.82 stalled cycles per instruction | 28.57% | |
Branches | 8,465,544,250 | 116.051 M/sec | 28.58% | |
Mispredicted branches | 264,197,085 | 3.12% of all branches | 28.56% | |
Upward prefetches | 704,750,223 | 9.661 M/sec | 28.55% | |
Downward prefetches | 0 | 0.000 K/sec | 28.58% | |
Elapsed time | 73.609 | seconds |
Heap sort is still being heap sort, but Shell sort is almost two times as fast, and the only relevant hardware difference was the introduction of hardware prefetching in Tualatin. It's also obvious from these numbers that Tualatin's hardware prefetching only considers ascending address sequences, as the downward prefetch counter is stuck at 0.
The prefetch rate, that is cache line size — the original P6 is a weird member of the P6 family, as its cache line size is 32 bytes, not 64 as in all of its descendants! — times the number of prefetch requests divided by time, is about 295 MB/s. Since the array is 60 million bytes, or slightly over 57 MB, and only gap sizes of 8 and below touch adjacent cache lines, my educated guess is that it's likely the prefetcher does handle variable strides. Sure, it's unlikely to be perfect. Sure, the memory bandwidth is limited, with the 133 MHz front-side bus the theoretical maximum is about a gigabyte per second, less in practice. But the prefetcher is there, handles variable strides, and does it well enough that at least in this case it greatly improves performance.
Dothan
This is not a Pentium III. It's a Pentium M, it's not even the original P6 anymore, it contains many improvements compared to its predecessor, most of which survive to this day in modern P6 descendants. It also has significantly bigger cache and faster processor-chipset, and thus processor-memory, interface. But let's still see how it handles this test, because this is the CPU for which the performance events used are documented, and out of sheer curiosity.
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 16,842.52 | msec | 0.996 CPUs utilized | |
Page faults | 1,433 | 0.085 K/sec | ||
Cycles | 26,310,833,013 | 1.562 GHz | 28.57% | |
Stalled cycles, frontend | 18,575,088,740 | 70.60% frontend cycles idle | 28.58% | |
Instructions | 7,093,179,520 | 0.27 instructions per cycle 2.62 stalled cycles per instruction | 28.56% | |
Branches | 1,217,221,905 | 72.271 M/sec | 28.56% | |
Mispredicted branches | 207,078,239 | 17.01% of all branches | 28.55% | |
Upward prefetches | 21,031,331 | 1.249 M/sec | 28.57% | |
Downward prefetches | 92,661,140 | 5.502 M/sec | 28.60% | |
Elapsed time | 16.903 | seconds |
Event | Value | Unit | Comment | Time active |
---|---|---|---|---|
Task clock | 33,594.70 | msec | 0.996 CPUs utilized | |
Page faults | 1,435 | 0.043 K/sec | ||
Cycles | 52,612,387,270 | 1.566 GHz | 28.59% | |
Stalled cycles, frontend | 23,082,925,885 | 43.87% frontend cycles idle | 28.55% | |
Instructions | 53,098,993,093 | 1.01 instructions per cycle 0.43 stalled cycles per instruction | 28.55% | |
Branches | 8,469,337,848 | 252.103 M/sec | 28.57% | |
Mispredicted branches | 290,400,768 | 3.43% of all branches | 28.61% | |
Upward prefetches | 355,720,357 | 10.589 M/sec | 28.59% | |
Downward prefetches | 12,126 | 0.361 K/sec | 28.57% | |
Elapsed time | 33.716 | seconds |
Both algorithms benefit from the increased cache, but Shell sort more so. This is also the first time we've seen more than 1 instruction per cycle in this post. Not that it's somehow an great result: my Haswell box mentioned earlier executes 2.4 instructions per cycle in that same test, for whoever is wondering, but that's a processor released 10 years later than this Pentium M.
The total number of upward prefetches, when expressed in bytes — Pentium M has now-standard 64-byte cache lines — is almost exactly the same as on Tualatin, confirming that the upward prefetch counter indeed works there.
The conclusion
This test shows that the legends are true. Coppermine doesn't have automatic hardware prefetching, and thus presumably Katmai and older P6 processors don't either. It was introduced in a limited, but functional, form in Tualatin: hardware prefetching only works for ascending addresses, and is limited by slow memory interface, but can handle variable strides, and even the undocumented performance events appear to function correctly.