Just a "quick" update on the development status of my low-pause incremental garbage collector (code name Blue Ridge): it's progressing rather nicely, considering how little motivation my clinically depressed self is able to muster up most days.
No points for guessing where the code name comes from.
The garbage collector as currently implemented, in just above 2 thousand lines of code, is already working, that is, it correctly allocates memory and collects garbage, even under address and undefined behavior sanitizers, though it's limited to stop-the-world operation at the moment. This is merely a pragmatic decision on what order to do things in: I'd rather have heap verifiers implemented and tested before diving into incrementality, as catching any potential heap corruptions or invariant violations right away is much preferable to trying to divine the cause of a crash from a core dump.
That being said, most of the hard work for supporting incremental collections has already been done. Most of the collector's code is written in a way that makes incremental work easy enough, it just happens to not be used in that way yet.
Performance appears to be fine so far. Pure allocation rate of object that die quickly is limited by how fast the CPU is able to write object data to memory, like in any good moving GC, as the hot path of allocation consists of just advancing a pointer. Collection time is dominated by marking, as one could expect. Throughput, measured by taking the wall-clock time of a program that allocates a linked list of 10 million conses then throws it away, repeating 500 times, is not bad either, being within the same order of magnitude as HotSpot's Shenandoah collector in its passive (stop-the-world only) mode, and beating handily its Parallel and G1 collectors with all four collectors limited to a 1 GB heap, which again, is not unexpected: this is not a very good workload+heap size combination for generational collectors, as they're forced to prematurely tenure conses, which then results in a significant number of painful major collections, while Blue Ridge and Shenandoah aren't affected due to their non-generational nature.
Blue Ridge in its current state also features policies, switchable at runtime, that are informed whenever interesting events happen, such as a new region being allocated or the current phase being changed, and that in turn decide whether or not start a cycle, and if so what kind — incremental or stop-the-world — and what regions should belong to the collection set. It is through this mechanism that the aforementioned 1 GB heap limit has been implemented, as Blue Ridge's heap is normally discontinuous, dynamically resizable in either direction and theoretically unlimited.
All things considered, the development of my garbage collector is progressing well, if slowly due to circumstances beyond my control. I hope I'll be able to have a functionally complete version soon, because that's where the real fun begins: playing around with all the knobs the get the lowest latency, lowest footprint and highest throughput possible.
Nowadays pretty much every processor featuring cache hierarchy has some form of automatic memory prefetch mechanism. In small low-power processors it may be very simplistic and only prefetch if the addresses are increasing, in high-performance processors it may try to predict all sorts of regular and semi-regular patterns, but the basic mechanism will be present in some form.
We weren't, however, always so lucky. The first mainstream x86 processor with hardware prefetching logic was the Pentium 4 Willamette, first released in November 2000, which used the highly contentious NetBurst microarchitecture. Due to Willamette's mediocrity, Intel didn't kill off their older P6 microarchitecture, and so the hardware prefetching was later supposedly added to the P6-based Pentium III Tualatin (released in June 2001), where its efficacy was disputed due to much lower CPU-chipset bandwidth compared to NetBurst.
The P6 microarchitecture introduced in Pentium Pro, in 1995, and used with minor changes in Pentium II and III, is a direct ancestor of Intel's current high-performance x86 microarchitectures — at the time of writing, Ice Lake for mobile, Skylake for everything else. There were many improvements in the last 24 years, and most major bottlenecks were fixed, but the family resemblance is still there.
P6's direct successor was Pentium M, created for mobile processors due to NetBurst's high power draw rendering it unsuitable for those applications. The Core microarchitecture powering the famous Core 2 series, that let Intel take the performance crown back from AMD after years of humiliation due to their marketing-driven, physics-ignoring bet on NetBurst, was a direct successor of Pentium M.
But how effective was the hardware prefetching in Tualatin, really? And is it really true that Coppermine, the Pentium III revision preceding Tualatin, didn't feature automatic prefetching? Fortunately, by choosing an appropriate test case and using CPU performance counters, it is possible to perform a gray-box test that answers these questions.
There are too many garbage-collected runtimes using very simplistic designs, such as reference counting or naïve stop the world mark and sweep. These are valid choices in that they're simple and correct, but unfortunately they induce undue costs, such as unpredictable and unbounded pause times.
I decided to have a go at designing and implementing something better as a part of my custom lisp-dialect implementation, and this blog post is an attempt to describe and organize this design. Since my toy lisp is single-threaded, so is its garbage collector. It is possible to extend this design into a parallel one, it just is not one of my goals.
It should be noted that this design is in no way revolutionary, I'm merely following existing techniques. In particular, Red Hat's excellent Shenandoah has been a major inspiration.