.NET object initialization is surprisingly slow
Topics: java, dot-net, garbage-collection
…sometimes, at least. When compared to Java under OpenJDK.
I was performing some experiments with various ways of implementing persistent random-access sequences. At one point, after implementing some of them in Java, I had the brilliant idea to try and implement some of them in C# as well and see how performance compares.
Unfortunately, this comparison was not very favorable towards C#: it consistently performed around 4-6× slower than the same data structure in Java. That was certainly… not what I expected. I was expecting the performance to be very close, if not C#-favored. After all, the Common Language Runtime supports value types, whereas the Java Virtual Machine at the moment does not, and some of the data structures I tested can benefit from reduced indirection.
Profiling revealed that the hottest functions were those that created new objects. While this does make sense, as persistent data structures using path copying do create many objects, spending north of 80% of wall-clock time on just allocating new objects, instead of on actual logic, is certainly a lot.
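To make the allocation pressure concrete, recall how path copying works: an update to a persistent structure copies every node on the path from the root down to the change and shares all untouched subtrees, so every update is a burst of small allocations. Here is a minimal sketch of the idea, using a hypothetical persistent binary tree rather than any of the structures I actually tested:
C#
// Hypothetical persistent binary tree: an update allocates fresh copies
// of the nodes on the path from the root and shares everything else.
sealed record Node(int Value, Node? Left, Node? Right);

static class PathCopying
{
    // Returns a new tree with the leftmost value replaced; only the
    // nodes along the left spine are freshly allocated.
    public static Node SetLeftmost(Node node, int newValue) =>
        node.Left is null
            ? node with { Value = newValue }
            : node with { Left = SetLeftmost(node.Left, newValue) };
}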
The next step I took to understand why those functions take so much time was, naturally, looking at the generated code. The differences were quite stark, and explained the performance gap almost immediately, so why don't we take a look?
The setup
But first, let's come up with a minimal yet representative microbenchmark that successfully shows the speed difference between the two platforms. I'll show the Java version; the C# translation is trivial and therefore left as an exercise for the reader.
Java
public final class Main {
private Main() {
}
public static void main(final String[] args) {
measure(25_000); // warm-up
measure(25_000_000);
}
private static void measure(final int repeats) {
final var startTime = System.nanoTime();
for (int i = 0; i < repeats; i += 1) {
build(8, i);
}
final var endTime = System.nanoTime();
System.out.format("Took %.3f s\n", (endTime - startTime) / 1.0e9);
}
private static Foo build(final int level, final int value) {
if (level < 0) {
return null;
}
final var child = build(level - 1, value);
return new Foo(value, child, child, child, child);
}
private record Foo(int value, Foo one, Foo two, Foo three, Foo four) {
}
}
This code creates a nondescript quaternary tree of height 8 with significant node sharing, immediately throws it away, then repeats that process a bunch of times. Almost no logic at all, just memory allocations.
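For readers who would rather skip the exercise, the C# version is a near-mechanical port. A sketch, with one assumption flagged: the MethodImplOptions.NoInlining attribute stands in here for the -XX:CompileCommand trick discussed below.
C#
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

public static class Program
{
    public static void Main()
    {
        Measure(25_000); // warm-up
        Measure(25_000_000);
    }

    private static void Measure(int repeats)
    {
        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < repeats; i += 1)
        {
            Build(8, i);
        }
        Console.WriteLine($"Took {stopwatch.Elapsed.TotalSeconds:F3} s");
    }

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the allocations observable
    private static Foo? Build(int level, int value)
    {
        if (level < 0)
        {
            return null;
        }
        var child = Build(level - 1, value);
        return new Foo(value, child, child, child, child);
    }

    private sealed record Foo(int Value, Foo? One, Foo? Two, Foo? Three, Foo? Four);
}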
This is not a good way to do microbenchmarking on platforms with optimizing compilers: for example, the compiler could realize that the call to build does nothing apart from allocating a bunch of useless objects, and eliminate it completely. I was careful enough to disable inlining of build using -XX:CompileCommand (e.g., -XX:CompileCommand=dontinline,Main::build), which is enough to inhibit allocation elision in this case on the current version of OpenJDK, and then inspected the generated code to make sure the compiler didn't do anything destructive anyway. If you're not a trained professional who likes living dangerously, you should prefer libraries like the Java Microbenchmark Harness (JMH) instead, which take care of those nasty optimizing compilers. And handle just-in-time compiler warm-up better. And more.
Running this flawed microbenchmark on my Zen 2 desktop yields the following results:
Language | Runtime | Operating system | Average time | 95% confidence interval |
---|---|---|---|---|
Java | OpenJDK 18.0.2 | Debian | 897 ms | [888.5 ms, 905.9 ms] |
C# | .NET 6.0.8 | Windows | 2942 ms | [2575 ms, 3308.4 ms] |
.NET ends up taking over three times as long on average, with a large standard deviation, which is close to my prior observations. The same version of .NET under Linux is even slower, likely because it does not yet have assembly intrinsics for some operations on non-Windows platforms.
Enough waffling about. Let's take a look at the disassembly, shall we?
OpenJDK generated code
x86 assembly
build:
mov [rsp - 0x14000], eax ; probe for stack overflow
push rbp
sub rsp, 32
mov ebp, edx ; save the integer value for later
test esi, esi ; reached base case?
jl .return_null
dec esi
db 0x66, 0x66 ; align the address in the call…
nop ; …for possible patching
call build
mov [rsp], rax ; save the child in case we need to call into runtime
mov rax, [r15 + 0xF0] ; load thread allocation pointer
mov r10, rax
add r10, 32 ; bump the pointer by object size
cmp r10, [r15 + 0x100] ; compare it against thread allocation limit
jae .allocate_through_runtime
mov [r15 + 0xF0], r10 ; store updated allocation pointer
prefetchnta [r10 + 0x100] ; prefetch for future allocations
mov qword [rax], 1 ; store mark word of the newly allocated object
mov dword [rax + 8], 0xC031F0 ; store the Main$Foo class word
.fill_object:
mov [rax + 12], ebp ; store the integer value field
mov r10, [rsp] ; reload the saved child reference
mov r11, r10 ; lol C2, redundant move
mov [rax + 16], r11d ; store field one…
mov [rax + 20], r11d ; …two…
mov [rax + 24], r11d ; …three…
mov [rax + 28], r11d ; …and finally four
.epilogue:
add rsp, 32
pop rbp
.stack_watermark_check:
cmp rsp, [r15 + 0x338] ; poll stack watermark barrier
ja .stack_watermark_safepoint
ret
.return_null:
xor eax, eax
jmp .epilogue
.allocate_through_runtime:
mov rsi, 0x800C031F0 ; Main$Foo class metadata
call _new_instance_Java
jmp .fill_object
.stack_watermark_safepoint:
; stack pointer above high water mark, need to enter safepoint
mov r10, .stack_watermark_check
mov [r15 + 0x350], r10
jmp SafepointBlob
What we see here is pretty sensible. Sure, there are a couple of minor imperfections here and there, but overall C2, the HotSpot virtual machine's optimizing compiler, did a good job. The code is laid out in such a way that the common case has no taken branches. Allocation is dealt with by bumping a pointer. The address of the thread-local context is kept in r15, so there's no need to load it from anywhere.
Note that object references occupy 4 bytes of space in our object, not 8 like native pointers: on 64-bit platforms, HotSpot can, under certain conditions (by default, when the maximum heap size stays under roughly 32 GB), compress the references. The speed penalty caused by decoding compressed references ranges from none to minuscule on common platforms, so this saves memory and improves cache utilization at very little cost.
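Decoding a compressed reference is conceptually just a shift by the object alignment plus the heap base, and the add disappears entirely when the base is zero. A sketch of the arithmetic, with made-up names:
C#
// Conceptual decoding of a HotSpot compressed reference: 'shift' is
// log2 of the object alignment (3 for 8-byte alignment), and
// 'heapBase' is frequently just zero.
static class CompressedReferences
{
    public static ulong Decode(uint compressed, ulong heapBase, int shift) =>
        heapBase + ((ulong)compressed << shift);
}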
Most importantly, though, object initialization is as simple as it gets: after filling the object header, the fields are initialized with simple memory stores. There are no garbage collector write barriers of any kind here: the compiler realized it's storing fields of a freshly-allocated object, which must be in the young generation by definition, so barriers can be safely elided.
To see what the write barrier looks like when it is not elided, we can tell the runtime not to inline the constructor of our record and look at its disassembly, here with the parallel garbage collector, because its intergenerational write barrier is relatively simple:
x86 assembly
constructor_disabled_inlining:
sub rsp, 24
mov [rsp + 16], rbp ; probably should've been a push
mov [rsi + 12], edx ; store the integer field
mov r10, rsi ; copy the address for card table marking
mov r11, rcx
mov [rsi + 16], r11d ; store field one…
shr r10, 9 ; compute card table entry index
mov [rsi + 20], r8d ; …two…
mov r8, r9
mov [rsi + 24], r8d ; …three…
mov r8, rdi
mov [rsi + 28], r8d ; …and finally four
mov r11, 0x7F1ACAE2F000 ; card table base address
mov [r11 + r10], r12b ; mark card table entry (r12 = 0)
add rsp, 16
pop rbp ; see, told you saving rbp should've been a push
cmp rsp, [r15 + 0x338] ; poll stack watermark barrier
ja .stack_watermark_safepoint
ret
We can clearly see a card-marking write barrier at the end. A more naïve compiler would emit a barrier after every write of a reference field, and the baseline compiler C1 does exactly that, but C2 was able to coalesce those barriers into one.
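The mark itself is tiny once coalesced; the win is in emitting it once (or not at all) rather than after every store. A conceptual model of what the card mark above does, with made-up names and the card table modeled as an array (the real barrier works with raw addresses):
C#
// Conceptual card-marking barrier matching the listing above: a card
// covers 512 bytes of heap (hence the shift by 9), and the parallel
// collector marks a card by storing zero into its entry.
static class CardTable
{
    const int CardShift = 9;
    static readonly byte[] Cards = new byte[1 << 20]; // made-up size

    // Conceptually runs after a reference is stored into the object
    // that lives at 'address'.
    public static void PostWriteBarrier(ulong address) =>
        Cards[(int)(address >> CardShift)] = 0;
}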
.NET generated code
x86 assembly
Build:
push rdi
push rsi
push rbx
sub rsp, 32
mov esi, edx ; save the integer value for later
test ecx, ecx ; reached base case?
jge .recursive_call
xor eax, eax
add rsp, 32
pop rbx
pop rsi
pop rdi
ret
.recursive_call:
dec ecx
mov edx, esi
call stub_for_Build ; which is just "jmp Build"
mov rdi, rax ; save the child in callee-saved register
mov rcx, 0x7FFF74422F40 ; Main.Foo class metadata
call CORINFO_HELP_NEWSFAST
mov rbx, rax ; save the new object in callee-saved register
mov [rbx + 40], esi ; store the integer value field
lea rcx, [rbx + 8]
mov rdx, rdi
call CORINFO_HELP_ASSIGN_REF ; store field one…
lea rcx, [rbx + 16]
mov rdx, rdi
call CORINFO_HELP_ASSIGN_REF ; …two…
lea rcx, [rbx + 24]
mov rdx, rdi
call CORINFO_HELP_ASSIGN_REF ; …three…
lea rcx, [rbx + 32]
mov rdx, rdi
call CORINFO_HELP_ASSIGN_REF ; …and finally four
mov rax, rbx
add rsp, 32
pop rbx
pop rsi
pop rdi
ret
Uh-oh. Are those function calls for allocation and reference field assignment? Presumably the reference field assignment function also performs write barrier shenanigans, seeing as the integer field is stored directly? Let's see what those functions look like, then, starting with allocation.
x86 assembly
CORINFO_HELP_NEWSFAST:
mov edx, [rcx + 4] ; load object size
mov r11, [rel _tls_index] ; load the thread-local storage index
mov rax, [gs:0x58] ; load TLS base from thread environment block
mov rax, [rax + r11 * 8] ; load our TLS entry
mov r11d, CurrentThreadInfo - tls_start
mov r11, [rax + r11] ; load the thread-local context pointer
mov r10, [r11 + 0x60] ; load thread allocation limit
mov rax, [r11 + 0x58] ; load thread allocation pointer
add rdx, rax
cmp rdx, r10
ja .allocate_through_runtime
mov [r11 + 0x58], rdx ; store updated allocation pointer
mov [rax], rcx ; store class metadata
ret
.allocate_through_runtime:
jmp JIT_NEW
Allocation turns out to be just pointer bumping after all. Unfortunately, to get to the thread-local context, we need to perform four memory loads first, two of which depend on previous loads, whereas OpenJDK just keeps the context in a register. That, and the fact that it's a function call, certainly introduces some additional cost, but not enough to explain the observed difference.
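Stripped of the thread-local storage dance, the fast path is the same bump-the-pointer scheme OpenJDK uses. A conceptual model, with made-up names and addresses as plain integers:
C#
// Conceptual thread-local bump allocation mirroring the fast path above.
sealed class AllocationContext
{
    ulong allocationPointer; // next free address in this thread's buffer
    ulong allocationLimit;   // end of the buffer

    // Returns the address of the new object, or zero to signal that the
    // slow path (a call into the runtime, possibly a collection) is needed.
    public ulong TryAllocate(ulong size)
    {
        var result = allocationPointer;
        var next = result + size;
        if (next > allocationLimit)
        {
            return 0;
        }
        allocationPointer = next;
        return result;
    }
}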
What about that reference field assignment then?
x86 assembly
CORINFO_HELP_ASSIGN_REF:
mov [rcx], rdx ; perform the assignment
nop dword [rax] ; align the address in mov for patching
mov rax, 0x1B9BA871018 ; start of young generation?
cmp rdx, rax ; in bounds?
jb .exit
nop ; align the address in mov for patching
mov rax, 0x1B9A34FAF60 ; level 0 card table address
shr rcx, 11
cmp byte [rcx + rax], 0xFF ; already marked?
jne .update_card_table_0
rep ret
.update_card_table_0:
mov byte [rcx + rax], 0xFF ; mark the card table entry
shr rcx, 10
xchg ax, ax ; align the address in mov for patching
mov rax, 0x1B9DA81326C ; level 1 card table address
cmp byte [rcx + rax], 0xFF ; already marked?
jne .update_card_table_1
rep ret
.update_card_table_1:
mov byte [rcx + rax], 0xFF ; mark the card table entry
ret
nop dword [rax]
.exit:
rep ret
There it is! This, honestly, explains everything. Note the early-out condition: the barrier skips the card marking work entirely only if the value we're assigning points into the old generation. Even if the referrer and the referent are both young, we still end up touching the card tables.
Note that card table updates are conditional: each entry is written to only if it's not already marked. This is done to avoid some false sharing with multi-threaded mutators on multiprocessor systems. It does introduce some branching, but in our case at least, those branches tend to be very predictable.
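Putting the two observations together, each level of the barrier boils down to a check-then-write. A conceptual model of one level (made-up names; the real helper computes entry addresses from the raw pointer and applies the same logic to a second, coarser table):
C#
// Conceptual check-before-write card marking: skip the store when the
// card is already marked, so cache lines holding hot cards stay clean
// instead of being written to by every thread over and over.
static class ConditionalCardMarking
{
    public static void Mark(byte[] cards, int index)
    {
        if (cards[index] != 0xFF)
        {
            cards[index] = 0xFF;
        }
    }
}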
In the end, all of this work ends up hurting performance, making .NET slower than OpenJDK in this case. Yes, processors try to make any code fast, but out-of-order execution is not magic. A processor can reorder instructions as much as it likes to fill its execution units, but dependency chains restrict that; it still has to run every instruction, and the inefficiencies quickly add up.
Improvements to the just-in-time compiler, such as emitting crucial operations like allocation and write barriers inline, and the ability to coalesce and/or elide those barriers, would certainly be welcome.