Latest articles

Measuring pg_stat_statements overhead

Published on the 12th of May 2023

pg_stat_statements is a fantastically useful PostgreSQL extension that collects execution statistics of all statements executed by the server, to allow the database administrator to monitor and analyze any possible performance issue. For a good reason, it is one of the, if not the most installed extension, and even a cursory internet search will reveal numerous sources extolling its virtues.

Of course, collecting those statistics has some small performance overhead, which is widely reported to be negligible. There's nothing wrong with checking the veracity of those claims ourselves, however: hardware and software differences can matter, after all, and the hardware I'm running this database on is not exactly usual.

Read the full article…

Yes, you can embrace, extend, extinguish free software

Published on the 22nd of September 2022

KHTML

Read the full article…

.NET object initialization is surprisingly slow

Published on the 18th of August 2022

Topics: java, dot-net, garbage-collection

…sometimes, at least. When compared to Java under OpenJDK.

I was performing some experiments with various ways of implementing persistent random-access sequences. At one point, after implementing some of them in Java, I had the brilliant idea to try and implement some of them in C# as well and see how performance compares.

Unfortunately, this comparison was not very favorable towards C#: it consistently performed around 4-6× slower, four to six times, than the same data structure in Java. That was certainly… not what I expected. I was expecting the performance to be very close, if not C#-favored. After all, the Common Language Runtime supports value types, whereas the Java Virtual Machine at the moment does not, and some of the data structures I tested can benefit from reduced indirection.

Profiling revealed that the hottest functions were those that created new objects. While this does make sense, as persistent data structures using path copying do create many objects, spending north of 80% wall clock time on just allocating new objects, instead of on actual logic, is certainly a lot.

The next step I took to understand why those functions take so much time was, naturally, looking at the generated code. The differences were quite stark, and explained the performance gap almost immediately, so why don't we take a look?

Read the full article…

Dependency-breaking zeroing XOR in P6

Published on the 8th of April 2021

Topics: microarchitecture-archeology, mythbusting

In x86 assembly language, a common idiom for setting the value of a register to 0 is to use the exclusive-or instruction with both operands being the same register, such as xor eax, eax. It was originally intended as a size optimization: the obvious mov eax, 0 is encoded as five bytes, of which four are used to store the constant 0, while the exclusive-or solution needs merely two, and is equally as fast, so it quickly became widespread.

By the time the P6 microarchitecture was being designed, the xor zeroing idiom was already nigh-universal in compiler output and hand-written assembly alike, so it was specifically recognized as a zeroing idiom for the purpose of avoiding partial register stalls in code such as this:

x86 assembly
    xor eax, eax
    mov al, [ecx]
    ; use eax

In code tuned for the original Pentium or earlier processors, this was the usual way of zero-extending an 8-bit (or 16-bit with ax instead of al) value into the full 32-bit register, as the movzx instruction was slower. P6, starting from the very first Pentium Pro, recognized that after a xor of a register with itself, the register held 0, which avoided the partial register stall that would otherwise occur when modifying a low part of a register followed by operations on the full 32 bits.

Unfortunately, the Pentium Pro as originally designed was too ambitious to be realized using then-available lithography technology without making the chip too big — and thus too prone to manufacturing defects — so some features had to go. Segment register renaming and beefier secondary decoders were some of the notable victims of that process.

I assume that the ability to recognize that the exclusive-or zeroing idiom doesn't really depend on the previous value of a register, so that it can be dispatched immediately without waiting for the old value — thus breaking the dependency chain — met the same fate; the Pentium Pro shipped without it.

Some of the cut features were introduced in later models: segment register renaming, for example, was added back in the Pentium II. Maybe dependency-breaking zeroing XOR was added in later P6 models too? After all, it seems such a simple yet important thing, and indeed, I remember seeing people claim that's the case in some old forum posts and mailing list messages. On the other hand, some sources, such as Agner Fog's optimization manuals say that not only it was never present in any of the P6 processors, it was also missing in Pentium M.

Whatever the case may be, there's only one way to make sure: test it!

Read the full article…

Blog update

Published on the 1st of April 2021

Topics: meta

In the past few days I've made some changes — some immediately noticeable, some less so — to the blog. Why not have a look at them?

I've been writing another, more substantial and much larger article for quite some time. I initially wanted to publish it in December, then January, then March, but getting it into a shape I'm comfortable with takes much more time and effort than I anticipated. Sorry about that.

The big, immediately noticeable change is that the main page is no longer a copy of the latest article. I've changed it to the tried and true format of using the introductory section of the last several articles, where "several" is defined here as five.

With that change, the need for displaying an article permalink in the header vanished: since pages aren't getting copied anymore, there's no need for an explicit, unambiguous, canonical link to an article. The address the browser displays when you read an article is now the canonical link, putting one in the article header as well would just be superfluous.

While the other changes are minor and not as readily — if at all — noticeable to most readers, they still warrant coverage, in their own subsections.

Read the full article…