Modulo or Bitwise

I had an interesting thing said to me: “Did you know that modulo is much less efficient than bitwise comparison?” As someone who once painstakingly went through all E-series resistor values to find those that would make my voltage divider a power of 2, I have definitely seen that in action. But that got me thinking. While an 8-bit PIC microcontroller doesn’t have a hardware divider, and thus any modulo is torture, what about modern computers? How much slower do they get?

A quick search brought me a few hits and one conclusive StackOverflow answer. Searching a bit more brought me to another answer where they even did measurements. And the difference was six-fold. But I was left with a bit of nagging doubt as both of these were 10+ years old. What difference might you expect on a modern CPU? And, more importantly for me, what are the differences in C#?

Well, I quickly ran some benchmarks and the results are below.
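
For context, here is a minimal sketch of the kind of benchmark I ran, assuming BenchmarkDotNet; the class name, iteration count, and the counter trick are illustrative rather than my exact harness, and the parallel variants simply wrap the same loop body in Parallel.For:

using BenchmarkDotNet.Attributes;

public class ModuloVsBitwise {
    private const int N = 1_000_000;  // illustrative iteration count

    [Benchmark]
    public int Modulo() {
        var count = 0;
        for (var i = 0; i < N; i++) {
            if ((i % 4) == 0) { count++; }  // consume the result so the JIT can't drop it
        }
        return count;
    }

    [Benchmark]
    public int Bitwise() {
        var count = 0;
        for (var i = 0; i < N; i++) {
            if ((i & 0b11) == 0) { count++; }
        }
        return count;
    }
}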

Test            | Parallel   | Mean     | StDev
----------------|------------|----------|--------
(i % 4) == 0    | No         | 202.3 us | 0.24 us
(i & 0b11) == 0 | No         | 201.9 us | 0.12 us
(i % 4) == 0    | CpuCount   | 206.4 us | 7.78 us
(i & 0b11) == 0 | CpuCount   | 196.5 us | 5.63 us
(i % 4) == 0    | CpuCount*2 | 563.9 us | 7.90 us
(i & 0b11) == 0 | CpuCount*2 | 573.9 us | 6.52 us

The results proved my expectations wrong and left me slightly confused too.

As you can see from the table above, I did three tests: single-threaded, a default parallel for, and a parallel for loop with CPU overcommitment. The single-threaded test is where I saw what I expected, but not in the amount I expected. Bitwise was quite consistently winning, but by ridiculously small margins. Unless I was doing something VERY specific, there is no chance I would care about the difference.

If we run the test in Parallel.For, the difference becomes slightly more obvious. And had I stopped at those two, I would have said that the assumption holds for modern CPUs too.

However, once I overcommitted CPU resources, suddenly modulo was actually better. And that is something that’s hard to explain if we take the assumption that modulo just uses divide to be true.

So, I decided to sneak a bit deeper peek - into the .NET CLR. And I discovered that the bitwise operation was fully omitted while the modulo operation was still there. However, the runtime then smartly decided to remove both. Thus, I was testing nothing vs. almost nothing.

Ok, after I placed a few strategic extra instructions to prevent that optimization, I got the results below.

Test            | Parallel   | Mean       | StDev
----------------|------------|------------|----------
(i % 4) == 0    | No         | 203.1 us   | 0.16 us
(i & 0b11) == 0 | No         | 202.9 us   | 0.06 us
(i % 4) == 0    | CpuCount   | 1,848.6 us | 13.13 us
(i & 0b11) == 0 | CpuCount   | 1,843.9 us | 6.76 us
(i % 4) == 0    | CpuCount*2 | 1,202.7 us | 7.32 us
(i & 0b11) == 0 | CpuCount*2 | 1,201.6 us | 6.75 us

And yes, bitwise is indeed faster than modulo, but by a really small margin. The only thing the new test “fixed” was that discrepancy in speed when you have too many threads.

Just to make extra sure that the compiler wasn’t doing “funny stuff”, I disassembled both to IL.

// (i % 4) == 0
ldloc.1
ldc.i4.4
rem
ldc.i4.0
ceq

// (i & 0b11) == 0
ldloc.1
ldc.i4.3
and
ldc.i4.0
ceq

Pretty much exactly the same, the only difference being the use of and for the bitwise check while rem was used for modulo. On modern CPUs these two instructions seem pretty much equivalent. And when I say modern, I use that loosely since I saw the same behavior going back a few CPU generations.

Interestingly, just in case the runtime converted both to the same code, I also checked modulo 10 to confirm. That one was actually faster than modulo 4. That leads me to believe there are some nice optimizations happening here. But I still didn’t know if this was the .NET runtime or really something the CPU does.

As a last resort, I went down to C and compiled it with -O0 -S. Unfortunately, even with -O0, if you use % 4, it will be converted to a bitwise operation. Thus, I checked it against % 5.

The bitwise check compiled down to just 3 instructions (or just one if we exclude the load and the check).

movl	-28(%rbp), %eax    # load i
andl	$3, %eax           # i & 3
testl	%eax, %eax         # compare the result with 0

But modulo went the crazy route.

movl	-28(%rbp), %ecx          # load i
movslq	%ecx, %rax               # sign-extend i to 64 bits
imulq	$1717986919, %rax, %rax  # multiply by the magic constant 0x66666667
shrq	$32, %rax                # keep the upper 32 bits of the product
movl	%eax, %edx
sarl	%edx                     # shift right one more bit -> tentative quotient
movl	%ecx, %eax
sarl	$31, %eax                # 0 for non-negative i, -1 for negative
subl	%eax, %edx               # adjust the quotient for negative values
movl	%edx, %eax
sall	$2, %eax                 # quotient * 4
addl	%edx, %eax               # quotient * 5
subl	%eax, %ecx               # i - quotient * 5 = remainder
movl	%ecx, %edx
testl	%edx, %edx               # compare the remainder with 0

It converted the division into a multiplication by a “magic” constant and gets to the remainder that way. All in all, quite an impressive optimization. And yes, this occupies more memory so there are other consequences to the performance (e.g. it uses more instruction cache).
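
If you want to convince yourself the trick works, here is a small C# check; the constant comes straight from the assembly above, and the shift by 33 combines the shrq $32 with the extra right shift by 1. This sketch covers non-negative values only; the sign adjustment in the assembly handles the rest:

using System;

const long Magic = 1717986919;  // 0x66666667, the constant from the assembly above
for (int i = 0; i <= 1_000_000; i++) {
    long q = ((long)i * Magic) >> 33;  // multiply-and-shift instead of divide
    if (q != i / 5 || i - (int)q * 5 != i % 5) {
        throw new Exception($"mismatch at {i}");
    }
}
Console.WriteLine("multiply-and-shift matches division for all tested values");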

So, if you are really persistent with testing, the difference does exist. It’s not six-fold but it can be noticeable.

In the end, do I care? Not really. Unless I am working on microcontrollers, I won’t stop using modulo where it makes sense. It makes the intent much clearer and that, to me, is worth it. Even better, compilers will just take care of this for you.

So, while modulo is less efficient, stories of its slowness have been exaggerated a bit.

PS: If you want to run my tests on your system, files are available.

Never Gonna BOM You Up

.NET has supported Unicode from its very beginning. Pretty much anything you might need for Unicode manipulation is there. Yes, as early adopters, they made a bet on UTF-16 that didn’t pay off since the rest of the world has moved toward UTF-8 as an (almost) exclusive encoding. However, if we ignore the somewhat higher memory footprint, C# strings made Unicode as easy as it gets.

And, while UTF-8 is not the native encoding for its strings, C# is no slouch and has a convenient Encoding.UTF8 static property allowing for easy conversion. However, if you do use that encoding for writing files - say, with a StreamWriter or File.WriteAllText, which consult its preamble - you will get a bit extra.

That something extra is the byte order mark (BOM). Its intention is noble - to help detect endianness. However, its usage for UTF-8 is of dubious help since a byte-oriented encoding doesn’t really have issues with endianness to start with. The Unicode specification itself does allow for one but doesn’t recommend it. It merely acknowledges it might happen as a side-effect of data conversion from other Unicode encodings that do have endianness.

So, in theory, UTF-8 with a BOM should be perfectly acceptable. In practice, only Microsoft really embraced the UTF-8 BOM. Pretty much everybody else decided on UTF-8 without a BOM as that allowed for full compatibility with 7-bit ASCII.
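
For reference, the UTF-8 BOM is just the three bytes EF BB BF at the start of the data. A hypothetical helper that strips it might look like this:

// Hypothetical helper: drop a leading UTF-8 BOM (EF BB BF) if present
static byte[] StripUtf8Bom(byte[] bytes) {
    if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF) {
        return bytes[3..];
    }
    return bytes;
}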

With time, .NET/C# stopped being Windows-only and has, by today, become a really good multiplatform solution. And now, a helper that ought to simplify things is actually producing output that will annoy many command-line tools that don’t expect it. If you read the documentation, a solution exists - just create your own UTF-8 encoding instance.

private static readonly Encoding Utf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);

Now you can use Utf8 instead of Encoding.UTF8 and you will get the expected result on all platforms, including Windows - no BOM, no problems.
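
As a quick illustration (the file name and content are, of course, just examples):

using System.IO;
using System.Text;

var utf8NoBom = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false);
File.WriteAllText("output.txt", "Hello", utf8NoBom);       // starts with 'H', no BOM
File.WriteAllText("output.txt", "Hello", Encoding.UTF8);   // starts with EF BB BF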

So, one could argue that the Encoding.UTF8 default should be changed to a more appropriate value. I mean, .NET is multiplatform and the current default doesn’t work well everywhere. One could argue, but this default is not changing, ever.

When any project starts, decisions must be made. And you won’t know for a while if those decisions were good. On the other hand, people will start depending on whatever behavior you selected.

In the case of the BOM, it might be that a developer got so used to having those three extra bytes that, instead of checking the file content, they simply use a length of 3 or less as a signal that the file is empty. Or they have a script that takes the output of some C# application and just blindly strips the first three bytes before moving it to a non-BOM-friendly input. Or any other decision somebody made in a project years ago. It doesn’t really matter how bad someone’s code is. What matters is that the code is currently working and a new C# release shouldn’t silently break it.

So, I am reasonably sure that Microsoft won’t ever change this default. And, begrudgingly, I agree with that. Some bad choices are simply meant to stay around.


PS: And don’t let me start talking about GUIDs and their binary format…

CoreCompile into the Ages

For one project of mine I started having a curious issue. After adding a few, admittedly a bit complicated, classes, my compile times under Linux shot to eternity. But that happened only when running with the dotnet command-line tools. In Visual Studio under Windows, all worked just fine.

Under dotnet I would just see the CoreCompile step counting seconds, and then minutes. I tried increasing the log level - nothing. I tried not cleaning stuff, i.e. using cached files - nothing. So, I tried cleaning up my .csproj file - hm… things improved, albeit just a bit.

A bit of triage later and I was reasonably sure that the .NET code analyzers were the culprit. The reason why changes to .csproj reduced the time was that I had AnalysisMode set quite high. The default AnalysisMode simply checks less.
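
For context, the property in question lives in the .csproj; mine was set to something like this (the exact value here is illustrative, not necessarily what I had):

<PropertyGroup>
  <AnalysisMode>All</AnalysisMode>
</PropertyGroup>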

While disabling the .NET analyzers altogether was out of the question, I was quite OK with not running them all the time. So, until the analyzers under Linux get a bit more performant, I simply included EnableNETAnalyzers=false in my build scripts.

  -p:EnableNETAnalyzers=false
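
In other words, the builds that need to be fast run something along these lines, while a separate build (hypothetically, a CI step) keeps the analyzers on:

  dotnet build -p:EnableNETAnalyzers=false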

Another problem solved.

Custom StringBuilder Pool

In my last post I grumbled about ObjectPool being a separate package. That was essentially the single downside to using it. So, how hard is it to implement our own StringBuilder pool?

Well, not that hard. The whole thing can be something like this:

using System.Collections.Concurrent;
using System.Runtime.CompilerServices;
using System.Text;

internal static class StringBuilderPool {

    private static readonly ConcurrentQueue<StringBuilder> Pool = new();

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static StringBuilder Get() {
        return Pool.TryDequeue(out StringBuilder? sb) ? sb : new StringBuilder(4096);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static bool Return(StringBuilder sb) {
        sb.Length = 0;  // clear the content but keep the allocated buffer
        Pool.Enqueue(sb);
        return true;
    }

}

In our Get method we check if we have any stored StringBuilder. If yes, we just return it. If not, we create a new instance.

In the Return method we just clear the returned instance and add it to the queue.

Now, this is not exactly an ObjectPool equivalent. For example, it doesn’t limit the pool size, and it will keep large builders around forever. However, for my case it was good enough and unlikely to cause any problems. If those limitations ever matter, a bounded variant isn’t much more code, as sketched below.
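
A minimal sketch of such a bounded pool, assuming a cap of 16 pooled builders and a 64 KB capacity limit (both numbers are illustrative, and the count check is approximate under races, which is fine for a pool):

using System.Collections.Concurrent;
using System.Text;
using System.Threading;

internal static class BoundedStringBuilderPool {
    private const int MaxPooled = 16;           // illustrative cap on pool size
    private const int MaxCapacity = 64 * 1024;  // drop builders that grew too big

    private static readonly ConcurrentQueue<StringBuilder> Pool = new();
    private static int Count;

    public static StringBuilder Get() {
        if (Pool.TryDequeue(out StringBuilder? sb)) {
            Interlocked.Decrement(ref Count);
            return sb;
        }
        return new StringBuilder(4096);
    }

    public static bool Return(StringBuilder sb) {
        if (sb.Capacity > MaxCapacity) { return false; }  // let the GC take oversized ones
        if (Interlocked.Increment(ref Count) > MaxPooled) {
            Interlocked.Decrement(ref Count);             // pool is full; drop the builder
            return false;
        }
        sb.Length = 0;
        Pool.Enqueue(sb);
        return true;
    }
}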

And performance… Well, performance is promising, to say the least:

Test                   | Mean         | Error      | StdDev     | Gen0   | Gen1   | Allocated
-----------------------|--------------|------------|------------|--------|--------|----------
StringBuilder (small)  | 15.762 ns    | 0.3650 ns  | 0.4057 ns  | 0.0181 | -      | 152 B
ObjectPool (small)     | 17.257 ns    | 0.0616 ns  | 0.0576 ns  | 0.0057 | -      | 48 B
Custom pool (small)    | 16.864 ns    | 0.0192 ns  | 0.0150 ns  | 0.0057 | -      | 48 B
Concatenation (small)  | 9.716 ns     | 0.1634 ns  | 0.1528 ns  | 0.0105 | -      | 88 B
StringBuilder (medium) | 58.125 ns    | 0.6429 ns  | 0.6013 ns  | 0.0526 | -      | 440 B
ObjectPool (medium)    | 23.226 ns    | 0.0517 ns  | 0.0484 ns  | 0.0115 | -      | 96 B
Custom pool (medium)   | 23.660 ns    | 0.2515 ns  | 0.1963 ns  | 0.0115 | -      | 96 B
Concatenation (medium) | 66.353 ns    | 1.3307 ns  | 1.2447 ns  | 0.0793 | -      | 664 B
StringBuilder (large)  | 190.293 ns   | 0.7781 ns  | 0.6498 ns  | 0.2496 | 0.0010 | 2088 B
ObjectPool (large)     | 92.556 ns    | 0.9281 ns  | 0.8228 ns  | 0.0755 | -      | 632 B
Custom pool (large)    | 91.470 ns    | 0.5478 ns  | 0.5124 ns  | 0.0755 | -      | 632 B
Concatenation (large)  | 1,430.599 ns | 11.5971 ns | 10.8479 ns | 4.0169 | 0.0057 | 33600 B

It’s pretty much on par with the ObjectPool implementation. Honestly, the results are close enough to be equivalent for all practical purposes.

So, if you don’t want to pull in the whole Microsoft.Extensions.ObjectPool just for caching a few StringBuilder instances, consider rolling your own.

To Pool or Not to Pool


For a project of mine I “had” to do a lot of string concatenations. The easy solution was just to have a StringBuilder and go wild. But I wondered, does it make sense to use an ObjectPool (found in the Microsoft.Extensions.ObjectPool package)? Thus, I decided to do a few benchmarks.
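
For those who haven’t used it, the ObjectPool setup looks roughly like this; a sketch assuming the Microsoft.Extensions.ObjectPool package, whose CreateStringBuilderPool() helper wires up a policy that clears builders on return:

using System;
using System.Text;
using Microsoft.Extensions.ObjectPool;

ObjectPool<StringBuilder> pool = new DefaultObjectPoolProvider().CreateStringBuilderPool();

var sb = pool.Get();
try {
    sb.Append("Hello").Append(' ').Append("World");
    Console.WriteLine(sb);
} finally {
    pool.Return(sb);  // the pool's policy clears the builder for reuse
}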

For my use case, “small” was just appending 3 items to a StringBuilder. The “medium” one does a total of 21 appends. And finally, “large” does 201 appends. And no, there is no real reason why I used those exact numbers other than the loop ending up nice. :)

After all this, benchmark results (courtesy of BenchmarkDotNet):

Test                        | Mean       | Error     | StdDev    | Gen0   | Gen1   | Allocated
----------------------------|------------|-----------|-----------|--------|--------|----------
StringBuilder (small)       | 16.295 ns  | 0.1240 ns | 0.1160 ns | 0.0181 | -      | 152 B
StringBuilder Pool (small)  | 17.958 ns  | 0.3125 ns | 0.2609 ns | 0.0057 | -      | 48 B
StringBuilder (medium)      | 87.052 ns  | 1.5177 ns | 1.4197 ns | 0.0832 | 0.0001 | 696 B
StringBuilder Pool (medium) | 31.245 ns  | 0.1815 ns | 0.1417 ns | 0.0181 | -      | 152 B
StringBuilder (large)       | 304.724 ns | 1.6736 ns | 1.3975 ns | 0.4520 | 0.0029 | 3784 B
StringBuilder Pool (large)  | 172.615 ns | 1.5325 ns | 1.4335 ns | 0.1471 | -      | 1232 B

As you can see, if you are doing just a few appends, it’s probably not worth messing with ObjectPool. Not that you should use a plain StringBuilder either. If you are adding 4 or fewer strings, you might as well concatenate them - it’s actually more performant.

However, if you are adding 5 or more strings together, the pool is no worse than instantiating a new StringBuilder. So, for pretty much any scenario where you would use a StringBuilder, it pays off to pool it.

Is there a situation where you would avoid the pool? Well, performance-wise, I would say probably not. I ran multiple tests and, on my computer, there was no situation where StringBuilder alone was better than either the pool or concatenation. Yes, StringBuilder is performant at a low number of appends, but string concatenation is better there. As soon as you go over a few appends, ObjectPool actually makes sense.

However, the elephant in the room is ObjectPool’s dependency on an external package. Call me old fashioned, but there is value in not depending on extra packages.

The final decision is, of course, up to you. But, if performance is important, I see no reason not to use ObjectPool. I only wish it weren’t an extra package.


For the curious ones, the code was as follows:

[Benchmark]
public string Large_WithoutPool() {
    var sb = new StringBuilder();
    sb.Append("Hello");
    for (var i = 0; i < 100; i++) {
        sb.Append(' ');
        sb.Append("World");
    }
    return sb.ToString();
}

[Benchmark]
public string Large_WithPool() {
    var sb = StringBuilderPool.Get();
    try {
        sb.Append("Hello");
        for (var i = 0; i < 100; i++) {
            sb.Append(' ');
            sb.Append("World");
        }
        return sb.ToString();
    } finally {
        sb.Length = 0;
        StringBuilderPool.Return(sb);
    }
}

And yes, I also tested just a simple string concatenation (quite optimized for a smaller number of concatenations):

Test                   | Mean         | Error      | StdDev     | Gen0    | Gen1   | Allocated
-----------------------|--------------|------------|------------|---------|--------|----------
Concatenation (small)  | 9.820 ns     | 0.2365 ns  | 0.2429 ns  | 0.0105  | -      | 88 B
Concatenation (medium) | 146.901 ns   | 1.6561 ns  | 1.2930 ns  | 0.2294  | -      | 1920 B
Concatenation (large)  | 4,710.573 ns | 43.5370 ns | 96.4750 ns | 15.2054 | 0.0458 | 127200 B