…sometimes, at least, when compared to Java under OpenJDK.
I was performing some experiments with various ways of implementing persistent random-access sequences. At one point, after implementing some of them in Java, I had the brilliant idea to implement a few in C# as well and see how the performance compared.
Unfortunately, this comparison was not very favorable towards C#: it consistently performed around four to six times slower than the same data structures in Java. That was certainly… not what I expected. I was expecting the performance to be very close, if not C#-favored. After all, the Common Language Runtime supports value types, whereas the Java Virtual Machine at the moment does not, and some of the data structures I tested can benefit from reduced indirection.
Profiling revealed that the hottest functions were those that created new objects. While this does make sense, as persistent data structures using path copying do create many objects, spending north of 80% of wall-clock time just allocating new objects, rather than on actual logic, is certainly a lot.
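To make the "many objects" part concrete, here is a rough C++ sketch of the path copying involved in updating one element of a persistent binary tree over a sequence. It is illustrative only, not the Java or C# code from the experiment: every node on the root-to-leaf path has to be freshly allocated, while untouched subtrees are shared, so tight update loops spend most of their time in the allocator.

```cpp
#include <cstddef>
#include <memory>

// Persistent sequence node: immutable, so every update must copy.
struct Node {
    std::shared_ptr<const Node> left, right;  // children (null in leaves)
    int value;                                // element payload, used in leaves
    std::size_t size;                         // number of elements in this subtree
};

using Ref = std::shared_ptr<const Node>;

// Returns a tree that differs from `node` only at `index`. Every node on the
// root-to-leaf path is allocated anew; everything else is shared as-is.
Ref update(const Ref& node, std::size_t index, int value) {
    if (node->size == 1)  // leaf
        return std::make_shared<const Node>(Node{nullptr, nullptr, value, 1});
    const std::size_t left_size = node->left->size;
    if (index < left_size)
        return std::make_shared<const Node>(
            Node{update(node->left, index, value), node->right, 0, node->size});
    return std::make_shared<const Node>(
        Node{node->left, update(node->right, index - left_size, value), 0, node->size});
}
```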
The next step I took to understand why those functions take so much time was, naturally, looking at the generated code. The differences were quite stark, and explained the performance gap almost immediately, so why don't we take a look?
While trying to implement concurrent evacuation, I found out that there is a serious problem with my simple incremental low-pause garbage collector design. Implementing the phase ordering exactly as specified would corrupt the root set, resulting in crashes that were not very fun to debug, even with extensive heap verification and lots of assertions everywhere. Why? Let's carefully go through the pertinent assumptions and the way they interact, and find a solution.
Just a "quick" update on the development status of my low-pause incremental garbage collector (code name Blue Ridge): it's progressing rather nicely, considering how little motivation my clinically depressed self is able to muster up most days.
No points for guessing where the code name comes from.
The garbage collector as currently implemented, in just over two thousand lines of code, is already working: it correctly allocates memory and collects garbage, even under the address and undefined behavior sanitizers, though it's limited to stop-the-world operation at the moment. This is merely a pragmatic decision about what order to do things in: I'd rather have heap verifiers implemented and tested before diving into incrementality, as catching any potential heap corruption or invariant violation right away is much preferable to trying to divine the cause of a crash from a core dump.
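Blue Ridge's actual verifiers aren't reproduced here, but the idea is roughly the following: between phases, walk every object in the heap and assert the invariants that should hold at that point. A hypothetical C++ sketch, with a made-up object layout:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Toy object layout for the sketch: a size, a mark bit, outgoing references.
struct Object {
    std::size_t size = 0;
    bool marked = false;
    std::vector<Object*> fields;
};

// Walk every allocated object and assert basic invariants, so that corruption
// trips an assertion close to where it was introduced instead of surfacing
// much later as an inscrutable crash.
void verify_heap(const std::vector<Object*>& all_objects, bool after_marking) {
    std::unordered_set<const Object*> known(all_objects.begin(), all_objects.end());
    for (const Object* obj : all_objects) {
        assert(obj->size > 0 && "object has a bogus size");
        for (const Object* field : obj->fields) {
            if (field == nullptr) continue;
            assert(known.count(field) && "reference points outside the heap");
            if (after_marking && obj->marked)
                assert(field->marked && "marked object references an unmarked one");
        }
    }
}
```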
That being said, most of the hard work for supporting incremental collections has already been done. Most of the collector's code is written in a way that makes incremental work easy enough; it just happens not to be used that way yet.
Performance appears to be fine so far. The pure allocation rate of objects that die quickly is limited by how fast the CPU can write object data to memory, as in any good moving GC, since the hot path of allocation consists of just advancing a pointer. Collection time is dominated by marking, as one could expect. Throughput is not bad either: measured as the wall-clock time of a program that allocates a linked list of 10 million conses and then throws it away, repeating this 500 times, Blue Ridge is within the same order of magnitude as HotSpot's Shenandoah collector in its passive (stop-the-world only) mode, and handily beats its Parallel and G1 collectors, with all four collectors limited to a 1 GB heap. This, again, is not unexpected: this workload and heap size combination is not a good fit for generational collectors, as they're forced to prematurely tenure the conses, which then results in a significant number of painful major collections, while Blue Ridge and Shenandoah are unaffected due to their non-generational nature.
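That "just advancing a pointer" hot path looks roughly like the sketch below. It is hedged and hypothetical: the real allocator also has to handle alignment, object headers and picking a new region, none of which is shown here.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative region state; field names don't match the real code.
struct Region {
    std::uint8_t* top;  // next free byte in the current allocation region
    std::uint8_t* end;  // one past the last usable byte of the region
};

// Slow path stub: the real collector would grab a fresh region here and
// possibly decide to start a collection cycle; the stub just gives up.
void* allocate_slow(Region&, std::size_t) { return nullptr; }

// Hot path: a bounds check and a pointer bump, nothing else.
inline void* allocate(Region& region, std::size_t size) {
    if (size <= static_cast<std::size_t>(region.end - region.top)) {
        void* result = region.top;
        region.top += size;
        return result;
    }
    return allocate_slow(region, size);
}
```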
Blue Ridge in its current state also features policies, switchable at runtime, that are informed whenever interesting events happen, such as a new region being allocated or the current phase changing, and that in turn decide whether or not to start a cycle, and if so what kind (incremental or stop-the-world) and which regions should belong to the collection set. It is through this mechanism that the aforementioned 1 GB heap limit has been implemented, as Blue Ridge's heap is normally discontinuous, dynamically resizable in either direction, and theoretically unlimited.
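As a rough illustration of the shape of this mechanism (every name below is made up, not the actual API), such a policy might look something like this:

```cpp
#include <vector>

struct Region;  // opaque here; stands in for the collector's region type

// Hypothetical names for the decisions and phases described above.
enum class CycleKind { None, Incremental, StopTheWorld };
enum class Phase { Idle, Marking, Evacuating, UpdatingReferences };

// A policy is notified of interesting events and asked to make decisions.
class CollectionPolicy {
public:
    virtual ~CollectionPolicy() = default;

    // Event notifications.
    virtual void on_region_allocated(const Region& region) = 0;
    virtual void on_phase_changed(Phase new_phase) = 0;

    // Decisions: whether to start a cycle, of what kind, and which regions
    // should go into the collection set.
    virtual CycleKind should_start_cycle() = 0;
    virtual std::vector<const Region*>
    choose_collection_set(const std::vector<const Region*>& candidates) = 0;
};
```

A heap-limit policy in this style could, for example, request a stop-the-world cycle as soon as the regions it has been told about add up to more than 1 GB.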
All things considered, the development of my garbage collector is progressing well, if slowly due to circumstances beyond my control. I hope I'll be able to have a functionally complete version soon, because that's where the real fun begins: playing around with all the knobs to get the lowest latency, lowest footprint and highest throughput possible.
There are too many garbage-collected runtimes using very simplistic designs, such as reference counting or naïve stop-the-world mark-and-sweep. These are valid choices in that they're simple and correct, but unfortunately they induce undue costs, such as unpredictable and unbounded pause times.
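To see where those pauses come from, consider what a naïve stop-the-world mark-and-sweep boils down to; this is a deliberately simplistic sketch, not anyone's production collector. The entire live graph is traced and the entire heap swept while the program does nothing, so pause time grows with the size of the heap.

```cpp
#include <vector>

// A toy heap object: a mark bit plus outgoing references.
struct Object {
    bool marked = false;
    std::vector<Object*> fields;
};

// Mark everything reachable from an object, recursively.
void mark(Object* obj) {
    if (obj == nullptr || obj->marked) return;
    obj->marked = true;
    for (Object* field : obj->fields) mark(field);
}

// Stop-the-world collection: the program is paused for the whole of marking
// and sweeping, so the pause is proportional to the amount of live data plus
// the size of the heap.
void collect(const std::vector<Object*>& roots, std::vector<Object*>& heap) {
    for (Object* root : roots) mark(root);
    std::vector<Object*> survivors;
    for (Object* obj : heap) {
        if (obj->marked) { obj->marked = false; survivors.push_back(obj); }
        else delete obj;
    }
    heap = survivors;
}
```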
I decided to have a go at designing and implementing something better as a part of my custom lisp-dialect implementation, and this blog post is an attempt to describe and organize this design. Since my toy lisp is single-threaded, its garbage collector is too. It is possible to extend this design into a parallel one; it just is not one of my goals.
It should be noted that this design is in no way revolutionary; I'm merely following existing techniques. In particular, Red Hat's excellent Shenandoah has been a major inspiration.