Fanael's random ruminations

Hardware prefetching in Pentium III


Topics: microarchitecture-archeology

Nowadays pretty much every processor featuring cache hierarchy has some form of automatic memory prefetch mechanism. In small low-power processors it may be very simplistic and only prefetch if the addresses are increasing, in high-performance processors it may try to predict all sorts of regular and semi-regular patterns, but the basic mechanism will be present in some form.

We weren't, however, always so lucky. The first mainstream x86 processor with hardware prefetching logic was the Pentium 4 Willamette, first released in November 2000, which used the highly contentious NetBurst microarchitecture. Due to Willamette's mediocrity, Intel didn't kill off their older P6 microarchitecture, and hardware prefetching was supposedly added later to the P6-based Pentium III Tualatin (released in June 2001), where its efficacy was disputed because of the much lower CPU-chipset bandwidth compared to NetBurst.

The P6 microarchitecture introduced in Pentium Pro, in 1995, and used with minor changes in Pentium II and III, is a direct ancestor of Intel's current high-performance x86 microarchitectures — at the time of writing, Ice Lake for mobile, Skylake for everything else. There were many improvements in the last 24 years, and most major bottlenecks were fixed, but the family resemblance is still there.

P6's direct successor was Pentium M, created for mobile processors due to NetBurst's high power draw rendering it unsuitable for those applications. The Core microarchitecture powering the famous Core 2 series, that let Intel take the performance crown back from AMD after years of humiliation due to their marketing-driven, physics-ignoring bet on NetBurst, was a direct successor of Pentium M.

But how effective was the hardware prefetching in Tualatin, really? And is it really true that Coppermine, the Pentium III revision preceding Tualatin, didn't feature automatic prefetching? Fortunately, by choosing an appropriate test case and using CPU performance counters, it is possible to perform a gray-box test that answers these questions.

The test

The thing I want to test is whether Pentium III's prefetcher, if present and effective, can handle accessing memory linearly with large, variable strides. The first algorithm with such an access pattern that came to my mind was Shell sort. When used with the 3-smooth numbers (i.e. numbers of the form 2^p·3^q, with natural p and q) as the gap sequence, its time complexity is O(n log² n), which is practical even for large values of n.
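The 3-smooth gap sequence itself is trivial to generate; a minimal sketch (the function name three_smooth_gaps is mine, not something from the benchmark):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Collect all 3-smooth numbers (2^p * 3^q) not exceeding `limit`,
// in ascending order, for use as a Shell sort gap sequence.
std::vector<std::size_t> three_smooth_gaps(std::size_t limit) {
    std::vector<std::size_t> gaps;
    for(std::size_t power_of_two = 1; power_of_two <= limit; power_of_two *= 2) {
        // Multiply each power of two by successive powers of three.
        for(std::size_t gap = power_of_two; gap <= limit; gap *= 3) {
            gaps.push_back(gap);
        }
    }
    std::sort(gaps.begin(), gaps.end());
    return gaps;
}
```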

As a comparison baseline the obvious choice is then heap sort, due to its infamy for being extremely cache-hostile and already being available in the C++ standard library. In fact, if comparisons and swaps are cheap, heap sort's cache hostility can and does make up for its theoretical complexity advantage, making it not much faster than 3-smooth Shell sort. On my Core i5-4590 (Haswell) the difference between the two algorithms is indeed less than a factor of two:
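Heap sort needs no custom code at all: the standard library's heap primitives compose into it directly. A sketch of the obvious baseline (the actual benchmark source may differ in details):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Heap sort via the standard heap primitives: build a max-heap in
// O(n), then repeatedly move the current maximum to the end of the
// shrinking heap, for O(n log n) total.
void heap_sort(std::uint32_t* array, std::size_t length) {
    std::make_heap(array, array + length);
    std::sort_heap(array, array + length);
}
```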

$ g++ -O3 && time ./a.out # heap sort
./a.out  2,77s user 0,00s system 99% cpu 2,771 total
$ g++ -O3 -DUSE_SHELL_SORT && time ./a.out # Shell sort
./a.out  4,52s user 0,00s system 99% cpu 4,517 total

The implementation of Shell sort has been lifted from Wikipedia and C++-ified with no other significant changes:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <utility>

// Ascending 3-smooth gap sequence; the full benchmark extends it past
// the largest array length sorted (truncated here for brevity).
constexpr std::size_t gaps[] = {1, 2, 3, 4, 6, 8, 9, 12, 16, 18, 24,
    27, 32, 36, 48, 54, 64, 72, 81, 96, 108};

void shell_sort(std::uint32_t* array, std::size_t length)
{
    if(length <= 1) {
        return;
    }
    // Start from the largest gap smaller than the array length.
    auto current_gap = std::find_if(std::begin(gaps), std::end(gaps),
        [&](std::size_t gap){return gap >= length;});
    do {
        const auto gap = *--current_gap;
        // Gapped insertion sort: every gap-th element forms a sorted
        // subsequence.
        for(std::size_t i = gap; i < length; ++i) {
            auto temp = std::move(array[i]);
            std::size_t j = i;
            for(; j >= gap && array[j - gap] > temp; j -= gap) {
                array[j] = std::move(array[j - gap]);
            }
            array[j] = std::move(temp);
        }
    } while(current_gap > std::begin(gaps));
}

The complete source of the benchmark program is available under the CC0 license.

In practice, neither algorithm is a very good choice for general-purpose sorting; instead, use something both cache-friendly and theoretically efficient, like introsort or merge sort. One of these is almost certainly used to implement the sorting routine in the standard library of your language of choice already, so just use that. Shell sort and heap sort were chosen here only for their memory access patterns.

With that code, a Coppermine Pentium III at 1 GHz (thanks to daemon32 for running the test!), a Tualatin Pentium III at 1.13 GHz, a Dothan Pentium M at 1.6 GHz for comparison purposes, GCC 8, Intel's performance event register reference, and the excellent Linux perf tool, let's go measuring.

The results

There is a problem. The Intel performance monitoring event reference doesn't mention anything about prefetching in its P6 chapter! Prefetching events are mentioned in the Pentium M reference, however, and considering the P6 family history, it's not unreasonable to believe the same events would be used in Pentium III. These events are, in syntax suitable for perf stat -e, cpu/event=0xf0,name=upward-prefetches/ and cpu/event=0xf8,name=downward-prefetches/. They count the number of prefetch requests, and therefore the number of prefetched cache lines. This will be important later on.
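Put together, a measurement run then looks something like this (standard perf event names plus the two raw events; ./a.out stands in for whatever the benchmark compiled to):

```shell
perf stat \
    -e task-clock,context-switches,page-faults,cycles,stalled-cycles-frontend \
    -e instructions,branches,branch-misses \
    -e cpu/event=0xf0,name=upward-prefetches/ \
    -e cpu/event=0xf8,name=downward-prefetches/ \
    ./a.out
```

With more events requested than the chip has counters, perf time-multiplexes them and scales the counts up, which is where the "time active" percentages in the tables below come from.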

So first things first: these events don't report anything on Coppermine, always showing zeros. This lends credence to the idea that Coppermine and earlier P6 models didn't have a hardware prefetching unit. On Tualatin, however, things are more interesting, so without further ado, let's dive into the performance counter stats!


Performance event counts for heap sort on Coppermine

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 33,276.77       | msec    | 0.999 CPUs utilized                 |
Context switches         | 204             |         | 0.006 K/sec                         |
Page faults              | 14,715          |         | 0.442 K/sec                         |
Cycles                   | 32,854,157,356  |         | 0.987 GHz                           | 20.00%
Stalled cycles, frontend | 25,399,721,497  |         | 77.31% frontend cycles idle         | 20.01%
Instructions             | 7,109,652,282   |         | 0.22 instructions per cycle;        |
                         |                 |         | 3.57 stalled cycles per instruction |
Branches                 | 1,188,575,659   |         | 35.718 M/sec                        | 20.01%
Mispredicted branches    | 196,353,276     |         | 16.52% of all branches              | 20.00%
Elapsed time             | 33.307          | seconds |                                     |
Performance event counts for shell sort on Coppermine

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 133,899.70      | msec    | 0.999 CPUs utilized                 |
Context switches         | 647             |         | 0.005 K/sec                         |
Page faults              | 14,717          |         | 0.110 K/sec                         |
Cycles                   | 132,163,280,671 |         | 0.987 GHz                           | 20.00%
Stalled cycles, frontend | 96,185,812,551  |         | 72.78% frontend cycles idle         | 20.00%
Instructions             | 53,356,713,225  |         | 0.40 instructions per cycle;        |
                         |                 |         | 1.80 stalled cycles per instruction |
Branches                 | 8,501,189,300   |         | 63.489 M/sec                        | 20.00%
Mispredicted branches    | 269,043,077     |         | 3.16% of all branches               | 20.00%
Elapsed time             | 133.983         | seconds |                                     |

The first observation: it's slow. More importantly, the numbers tell us why it is slow: the out-of-order engine is full, but instructions cannot retire because they're still waiting on memory accesses to complete (there are no other significant sources of stalls in this program), which stalls the front-end too. Heap sort stalls a lot because it's heap sort. Shell sort stalls a lot too, albeit a bit less: with large gap sizes (at least until they become small enough to resemble insertion sort cache-wise), the program performs lots of memory accesses to far-away addresses that either haven't had a chance to get into the cache yet, or have already been evicted to make room for newer lines, and there's no prefetching to bring them in ahead of time.


Performance event counts for heap sort on Tualatin

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 31,190.69       | msec    | 0.993 CPUs utilized                 |
Page faults              | 1,429           |         | 0.046 K/sec                         |
Cycles                   | 33,344,788,971  |         | 1.069 GHz                           | 28.53%
Stalled cycles, frontend | 25,490,259,991  |         | 76.44% frontend cycles idle         | 28.60%
Instructions             | 7,084,795,042   |         | 0.21 instructions per cycle;        |
                         |                 |         | 3.60 stalled cycles per instruction |
Branches                 | 1,221,344,046   |         | 39.157 M/sec                        | 28.59%
Mispredicted branches    | 195,036,348     |         | 15.97% of all branches              | 28.59%
Upward prefetches        | 71,322,504      |         | 2.287 M/sec                         | 28.57%
Downward prefetches      | 0               |         | 0.000 K/sec                         | 28.51%
Elapsed time             | 31.404          | seconds |                                     |
Performance event counts for shell sort on Tualatin

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 72,946.84       | msec    | 0.991 CPUs utilized                 |
Page faults              | 1,434           |         | 0.020 K/sec                         |
Cycles                   | 78,228,149,161  |         | 1.072 GHz                           | 28.58%
Stalled cycles, frontend | 43,812,136,162  |         | 56.01% frontend cycles idle         | 28.58%
Instructions             | 53,121,702,516  |         | 0.68 instructions per cycle;        |
                         |                 |         | 0.82 stalled cycles per instruction |
Branches                 | 8,465,544,250   |         | 116.051 M/sec                       | 28.58%
Mispredicted branches    | 264,197,085     |         | 3.12% of all branches               | 28.56%
Upward prefetches        | 704,750,223     |         | 9.661 M/sec                         | 28.55%
Downward prefetches      | 0               |         | 0.000 K/sec                         | 28.58%
Elapsed time             | 73.609          | seconds |                                     |

Heap sort is still being heap sort, but Shell sort is almost twice as fast, and the only relevant hardware difference is the introduction of hardware prefetching in Tualatin. It's also obvious from these numbers that Tualatin's hardware prefetching only considers ascending address sequences, as the downward prefetch counter is stuck at zero.

The prefetch rate, that is, the cache line size (the original P6 is a weird member of the P6 family, as its cache line size is 32 bytes, not 64 as in all of its descendants!) times the number of prefetch requests, divided by time, is about 295 MB/s. Since the array is 60 million bytes, or slightly over 57 MB, and only gap sizes of 8 and below touch adjacent cache lines, my educated guess is that the prefetcher likely does handle variable strides. Sure, it's unlikely to be perfect. Sure, the memory bandwidth is limited: with the 133 MHz front-side bus, the theoretical maximum is about a gigabyte per second, less in practice. But the prefetcher is there, it handles variable strides, and it does so well enough that, at least in this case, it greatly improves performance.


This is not a Pentium III. It's a Pentium M: not even the original P6 anymore, as it contains many improvements over its predecessor, most of which survive to this day in modern P6 descendants. It also has a significantly bigger cache and a faster processor-chipset (and thus processor-memory) interface. But let's still see how it handles this test, both because this is the CPU for which the performance events used here are actually documented, and out of sheer curiosity.

Performance event counts for heap sort on Dothan Pentium M

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 16,842.52       | msec    | 0.996 CPUs utilized                 |
Page faults              | 1,433           |         | 0.085 K/sec                         |
Cycles                   | 26,310,833,013  |         | 1.562 GHz                           | 28.57%
Stalled cycles, frontend | 18,575,088,740  |         | 70.60% frontend cycles idle         | 28.58%
Instructions             | 7,093,179,520   |         | 0.27 instructions per cycle;        |
                         |                 |         | 2.62 stalled cycles per instruction |
Branches                 | 1,217,221,905   |         | 72.271 M/sec                        | 28.56%
Mispredicted branches    | 207,078,239     |         | 17.01% of all branches              | 28.55%
Upward prefetches        | 21,031,331      |         | 1.249 M/sec                         | 28.57%
Downward prefetches      | 92,661,140      |         | 5.502 M/sec                         | 28.60%
Elapsed time             | 16.903          | seconds |                                     |
Performance event counts for shell sort on Dothan Pentium M

Event                    | Value           | Unit    | Comment                             | Time active
Task clock               | 33,594.70       | msec    | 0.996 CPUs utilized                 |
Page faults              | 1,435           |         | 0.043 K/sec                         |
Cycles                   | 52,612,387,270  |         | 1.566 GHz                           | 28.59%
Stalled cycles, frontend | 23,082,925,885  |         | 43.87% frontend cycles idle         | 28.55%
Instructions             | 53,098,993,093  |         | 1.01 instructions per cycle;        |
                         |                 |         | 0.43 stalled cycles per instruction |
Branches                 | 8,469,337,848   |         | 252.103 M/sec                       | 28.57%
Mispredicted branches    | 290,400,768     |         | 3.43% of all branches               | 28.61%
Upward prefetches        | 355,720,357     |         | 10.589 M/sec                        | 28.59%
Downward prefetches      | 12,126          |         | 0.361 K/sec                         | 28.57%
Elapsed time             | 33.716          | seconds |                                     |

Both algorithms benefit from the increased cache size, but Shell sort more so. This is also the first time we've seen more than 1 instruction per cycle in this post. Not that it's a great result: my Haswell box mentioned earlier executes 2.4 instructions per cycle in the same test, for anyone wondering, but that's a processor released 10 years after this Pentium M.

The total number of upward prefetches, when expressed in bytes — Pentium M has now-standard 64-byte cache lines — is almost exactly the same as on Tualatin, confirming that the upward prefetch counter indeed works there.

The conclusion

This test shows that the legends are true. Coppermine doesn't have automatic hardware prefetching, and thus presumably Katmai and older P6 processors don't either. It was introduced in a limited, but functional, form in Tualatin: hardware prefetching only works for ascending addresses, and it is limited by the slow memory interface, but it can handle variable strides, and even the undocumented performance events appear to function correctly.