Hardware prefetching in Pentium III

Published on the 25th of October 2019

Nowadays pretty much every processor featuring cache hierarchy has some form of automatic memory prefetch mechanism. In small low-power processors it may be very simplistic and only prefetch if the addresses are increasing, in high-performance processors it may try to predict all sorts of regular and semi-regular patterns, but the basic mechanism will be present in some form.

We weren't, however, always so lucky. The first mainstream x86 processor with hardware prefetching logic was the Pentium 4 Willamette, first released in November 2000, which used the highly contentious NetBurst microarchitecture. Due to Willamette's mediocrity, Intel didn't kill off their older P6 microarchitecture, and so the hardware prefetching was later supposedly added to the P6-based Pentium III Tualatin (released in June 2001), where its efficacy was disputed due to much lower CPU-chipset bandwidth compared to NetBurst.

The P6 microarchitecture introduced in Pentium Pro, in 1995, and used with minor changes in Pentium II and III, is a direct ancestor of Intel's current high-performance x86 microarchitectures — at the time of writing, Ice Lake for mobile, Skylake for everything else. There were many improvements in the last 24 years, and most major bottlenecks were fixed, but the family resemblance is still there.

P6's direct successor was Pentium M, created for mobile processors due to NetBurst's high power draw rendering it unsuitable for those applications. The Core microarchitecture powering the famous Core 2 series, that let Intel take the performance crown back from AMD after years of humiliation due to their marketing-driven, physics-ignoring bet on NetBurst, was a direct successor of Pentium M.

But how effective was the hardware prefetching in Tualatin, really? And is it really true that Coppermine, the Pentium III revision preceding Tualatin, didn't feature automatic prefetching? Fortunately, by choosing an appropriate test case and using CPU performance counters, it is possible to perform a gray-box test that answers these questions.

The test

The thing I want to test is whether or not Pentium III's prefetcher, if existent and effective, can handle accessing memory linearly with large variable strides. The first algorithm with such an access pattern that came to my mind was Shell sort. When used with the 3-smooth numbers (i.e. numbers of the form 2^p3^q, with natural p and q) as the gap sequence, its time complexity is O(nlog²n), which is practical even for large values of n.

As a comparison baseline the obvious choice is then heap sort, due to its infamy for being extremely cache-hostile and already being available in the C++ standard library. In fact, if comparisons and swaps are cheap, heap sort's cache hostility can and does make up for its theoretical complexity advantage, making it not much faster than 3-smooth Shell sort. On my Core i5-4590 (Haswell) the difference between the two algorithms is indeed less than a factor of two:

Shell (interactive)
$ g++ -O3 p3-prefetch-shell-sort.cc && time ./a.out # heap sort
./a.out  2,77s user 0,00s system 99% cpu 2,771 total
$ g++ -O3 -DUSE_SHELL_SORT p3-prefetch-shell-sort.cc && time ./a.out # Shell sort
./a.out  4,52s user 0,00s system 99% cpu 4,517 total

The implementation of Shell sort has been lifted from Wikipedia and C++-ified with no other significant changes:

C++
void shell_sort(std::uint32_t* array, std::size_t length)
{
    if(length <= 1) {
        return;
    }

    auto current_gap = std::find_if(std::begin(gaps), std::end(gaps),
        [&](std::size_t gap){return gap >= length;});
    do {
        --current_gap;
        const auto gap = *current_gap;
        for(std::size_t i = gap; i < length; ++i) {
            auto temp = std::move(array[i]);
            std::size_t j = i;
            for(; j >= gap && array[j - gap] > temp; j -= gap) {
                array[j] = std::move(array[j - gap]);
            }
            array[j] = std::move(temp);
        }
    } while(current_gap > std::begin(gaps));
}

The complete source of the benchmark program is available under the CC0 license.

In practice, neither algorithm is a very good choice for general-purpose sorting; instead, use something really cache-friendly and theoretically efficient, like introsort or merge sort, one of these is almost certainly used to implement the sorting routine in the standard library of your language of choice already, so just use that. Shell sort and heap sort were only chosen here for their memory access patterns.

With that code, a Coppermine Pentium III at 1 GHz (thanks to daemon32 for running the test!), a Tualatin Pentium III at 1.13 GHz, a Dothan Pentium M at 1.6 GHz for comparison purposes, GCC 8, Intel's performance event register reference, and the excellent Linux perf tool, let's go measuring.

The results

There is a problem. The Intel performance monitoring event reference doesn't mention anything about prefetching in its P6 chapter! Prefetching events are mentioned in the Pentium M reference, however, and considering P6 family history, it's not unreasonable to believe the same events would be used in Pentium III. These events are, in syntax suitable for perf stat -e, cpu/event=0xf0,name=upward-prefetches/ and cpu/event=0xf8,name=downward-prefetches/. They count the number of prefetch requests, and therefore the number of prefetched cache lines. This will be important later on.

So first things first: these events don't report anything on Coppermine, always showing zeros. This adds credence to the idea that Coppermine and earlier P6 models didn't have hardware prefetching unit. On Tualatin, however, things are more interesting, so without further ado, let's dive into the performance counter stats!

Coppermine

Performance event counts for heap sort on Coppermine

Event	Value	Unit	Comment	Time active
Task clock	33,276.77	msec	0.999 CPUs utilized
Context switches	204		0.006 K/sec
Page faults	14,715		0.442 K/sec
Cycles	32,854,157,356		0.987 GHz	20.00%
Stalled cycles, frontend	25,399,721,497		77.31% frontend cycles idle	20.01%
Instructions	7,109,652,282		0.22 instructions per cycle 3.57 stalled cycles per instruction	20.01%
Branches	1,188,575,659		35.718 M/sec	20.01%
Mispredicted branches	196,353,276		16.52% of all branches	20.00%
Elapsed time	33.307	seconds

Performance event counts for shell sort on Coppermine

Event	Value	Unit	Comment	Time active
Task clock	133,899.70	msec	0.999 CPUs utilized
Context switches	647		0.005 K/sec
Page faults	14,717		0.110 K/sec
Cycles	132,163,280,671		0.987 GHz	20.00%
Stalled cycles, frontend	96,185,812,551		72.78% frontend cycles idle	20.00%
Instructions	53,356,713,225		0.40 instructions per cycle 1.80 stalled cycles per instruction	20.00%
Branches	8,501,189,300		63.489 M/sec	20.00%
Mispredicted branches	269,043,077		3.16% of all branches	20.00%
Elapsed time	133.983	seconds

The first observation: it's slow. But more importantly, it tells us why it is slow: the out-of-order engine is full, but instructions cannot retire because they're still waiting on memory accesses to complete (as there are no other significant sources of stalls in this program), which stalls the front-end too. Heap sort stalls a lot because it's heap sort; Shell sort stalls a lot, albeit a bit less, because with large gap sizes, at least until they get small enough to resemble insertion sort cache-wise, the program performs lots of memory accesses to far away addresses that either haven't had a chance to get into cache yet, or have already been evicted to make room for newer ones, and there's no prefetching to do it ahead-of-time.

Tualatin

Performance event counts for heap sort on Tualatin

Event	Value	Unit	Comment	Time active
Task clock	31,190.69	msec	0.993 CPUs utilized
Page faults	1,429		0.046 K/sec
Cycles	33,344,788,971		1.069 GHz	28.53%
Stalled cycles, frontend	25,490,259,991		76.44% frontend cycles idle	28.60%
Instructions	7,084,795,042		0.21 instructions per cycle 3.60 stalled cycles per instruction	28.60%
Branches	1,221,344,046		39.157 M/sec	28.59%
Mispredicted branches	195,036,348		15.97% of all branches	28.59%
Upward prefetches	71,322,504		2.287 M/sec	28.57%
Downward prefetches	0		0.000 K/sec	28.51%
Elapsed time	31.404	seconds

Performance event counts for shell sort on Tualatin

Event	Value	Unit	Comment	Time active
Task clock	72,946.84	msec	0.991 CPUs utilized
Page faults	1,434		0.020 K/sec
Cycles	78,228,149,161		1.072 GHz	28.58%
Stalled cycles, frontend	43,812,136,162		56.01% frontend cycles idle	28.58%
Instructions	53,121,702,516		0.68 instructions per cycle 0.82 stalled cycles per instruction	28.57%
Branches	8,465,544,250		116.051 M/sec	28.58%
Mispredicted branches	264,197,085		3.12% of all branches	28.56%
Upward prefetches	704,750,223		9.661 M/sec	28.55%
Downward prefetches	0		0.000 K/sec	28.58%
Elapsed time	73.609	seconds

Heap sort is still being heap sort, but Shell sort is almost two times as fast, and the only relevant hardware difference was the introduction of hardware prefetching in Tualatin. It's also obvious from these numbers that Tualatin's hardware prefetching only considers ascending address sequences, as the downward prefetch counter is stuck at 0.

The prefetch rate, that is cache line size — the original P6 is a weird member of the P6 family, as its cache line size is 32 bytes, not 64 as in all of its descendants! — times the number of prefetch requests divided by time, is about 295 MB/s. Since the array is 60 million bytes, or slightly over 57 MB, and only gap sizes of 8 and below touch adjacent cache lines, my educated guess is that it's likely the prefetcher does handle variable strides. Sure, it's unlikely to be perfect. Sure, the memory bandwidth is limited, with the 133 MHz front-side bus the theoretical maximum is about a gigabyte per second, less in practice. But the prefetcher is there, handles variable strides, and does it well enough that at least in this case it greatly improves performance.

Dothan

This is not a Pentium III. It's a Pentium M, it's not even the original P6 anymore, it contains many improvements compared to its predecessor, most of which survive to this day in modern P6 descendants. It also has significantly bigger cache and faster processor-chipset, and thus processor-memory, interface. But let's still see how it handles this test, because this is the CPU for which the performance events used are documented, and out of sheer curiosity.

Performance event counts for heap sort on Dothan Pentium M

Event	Value	Unit	Comment	Time active
Task clock	16,842.52	msec	0.996 CPUs utilized
Page faults	1,433		0.085 K/sec
Cycles	26,310,833,013		1.562 GHz	28.57%
Stalled cycles, frontend	18,575,088,740		70.60% frontend cycles idle	28.58%
Instructions	7,093,179,520		0.27 instructions per cycle 2.62 stalled cycles per instruction	28.56%
Branches	1,217,221,905		72.271 M/sec	28.56%
Mispredicted branches	207,078,239		17.01% of all branches	28.55%
Upward prefetches	21,031,331		1.249 M/sec	28.57%
Downward prefetches	92,661,140		5.502 M/sec	28.60%
Elapsed time	16.903	seconds

Performance event counts for shell sort on Dothan Pentium M

Event	Value	Unit	Comment	Time active
Task clock	33,594.70	msec	0.996 CPUs utilized
Page faults	1,435		0.043 K/sec
Cycles	52,612,387,270		1.566 GHz	28.59%
Stalled cycles, frontend	23,082,925,885		43.87% frontend cycles idle	28.55%
Instructions	53,098,993,093		1.01 instructions per cycle 0.43 stalled cycles per instruction	28.55%
Branches	8,469,337,848		252.103 M/sec	28.57%
Mispredicted branches	290,400,768		3.43% of all branches	28.61%
Upward prefetches	355,720,357		10.589 M/sec	28.59%
Downward prefetches	12,126		0.361 K/sec	28.57%
Elapsed time	33.716	seconds

Both algorithms benefit from the increased cache, but Shell sort more so. This is also the first time we've seen more than 1 instruction per cycle in this post. Not that it's somehow an great result: my Haswell box mentioned earlier executes 2.4 instructions per cycle in that same test, for whoever is wondering, but that's a processor released 10 years later than this Pentium M.

The total number of upward prefetches, when expressed in bytes — Pentium M has now-standard 64-byte cache lines — is almost exactly the same as on Tualatin, confirming that the upward prefetch counter indeed works there.

The conclusion

This test shows that the legends are true. Coppermine doesn't have automatic hardware prefetching, and thus presumably Katmai and older P6 processors don't either. It was introduced in a limited, but functional, form in Tualatin: hardware prefetching only works for ascending addresses, and is limited by slow memory interface, but can handle variable strides, and even the undocumented performance events appear to function correctly.

The test §

The results §

Coppermine §

Tualatin §

Dothan §

The conclusion §