Optimise C# in a ridiculous way #1924

Merged
merged 1 commit into master from ingvar/csharp-speedup on Oct 31, 2024

Conversation

Contributor

@RReverser RReverser commented Oct 31, 2024

Description of Changes

Clickbait title, but there is no other way to say it: this must be the weirdest way I've optimised anything yet.

I was suspicious that a lot of benchmarks, in particular those that deal with iteration (which we only recently added for C#), take about the same amount of time. After digging in for a bit, I narrowed it down to merely the presence of this destructor. It looks like the way destructors (finalizers) are implemented for the Wasm target deoptimises execution to the point where simply removing this one speeds up the benchmarks by 15x or more.

We don't really need it anyway; it was more of a nice-to-have in case the user does something weird, since regular iterator usage like `foreach` always calls `Dispose()` correctly.
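
To illustrate why the finalizer was only ever a safety net: the C# compiler lowers `foreach` over an enumerator into a `try`/`finally` that always calls `Dispose()`. A rough sketch (the generic names below are illustrative, not the actual iterator types in our bindings):

```csharp
using System.Collections.Generic;

static void Consume<T>(IEnumerable<T> source)
{
    // `foreach (var item in source) { ... }` lowers to roughly this:
    IEnumerator<T> e = source.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            T item = e.Current;
            // ... use item ...
        }
    }
    finally
    {
        // Dispose() always runs here, even if the loop body throws,
        // so the finalizer never has to clean up after normal usage.
        e.Dispose();
    }
}
```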

Further investigation and an upstream issue for .NET are coming up, but meanwhile here are the benchmark differences before/after:

| Benchmark                                                                     | Speedup | C# time (before) | C# rate (before) | C# time (after) | C# rate (after) |
| ----------------------------------------------------------------------------- | ------- | ---------------- | ---------------- | --------------- | --------------- |
| special/db_game/circles/load=10                                               | 14.58   | 3.3±0.24ms       | ? ?/sec          | 224.1±21.57µs   | ? ?/sec         |
| special/db_game/circles/load=100                                              | 14.93   | 3.3±0.20ms       | ? ?/sec          | 223.4±16.61µs   | ? ?/sec         |
| stdb_module/mem/filter/string/index/load=2048/count=256                       | 15.79   | 3.1±0.11ms       | 321 Elem/sec     | 197.0±2.97µs    | 5.0 KElem/sec   |
| stdb_module/mem/filter/u64/index/load=2048/count=256                          | 19.15   | 3.2±0.11ms       | 315 Elem/sec     | 165.3±2.86µs    | 5.9 KElem/sec   |
| stdb_module/mem/insert_bulk/u32_u64_u64/btree_each_column/load=2048/count=256 | 2.00    | 3.2±0.11ms       | 314 Elem/sec     | 1589.3±45.07µs  | 629 Elem/sec    |
| stdb_module/mem/insert_bulk/u32_u64_u64/unique_0/load=2048/count=256          | 16.71   | 3.2±0.10ms       | 312 Elem/sec     | 191.7±2.98µs    | 5.1 KElem/sec   |
| stdb_module/mem/iterate/u32_u64_str/unique_0/count=256                        | 31.72   | 3.1±0.11ms       | 325 Elem/sec     | 97.0±2.47µs     | 10.1 KElem/sec  |
| stdb_module/mem/iterate/u32_u64_u64/unique_0/count=256                        | 34.63   | 3.4±0.11ms       | 296 Elem/sec     | 97.5±2.40µs     | 10.0 KElem/sec  |
| stdb_module/mem/update_bulk/u32_u64_str/unique_0/load=2048/count=256          | 14.31   | 3.1±0.12ms       | 318 Elem/sec     | 219.2±7.24µs    | 4.5 KElem/sec   |
| stdb_module/mem/update_bulk/u32_u64_u64/unique_0/load=2048/count=256          | 15.02   | 3.2±0.06ms       | 315 Elem/sec     | 210.9±2.46µs    | 4.6 KElem/sec   |

API and ABI breaking changes

If this is an API or ABI breaking change, please apply the
corresponding GitHub label.

Expected complexity level and risk

How complicated do you think these changes are? Grade on a scale from 1 to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex change.

This complexity rating applies not only to the complexity apparent in the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning ways.

Testing

Describe any testing you've done, and any testing you'd like your reviewers to do,
so that you're confident that all the changes work as expected!

  • Write a test you've completed here.
  • Write a test you want a reviewer to do here, so they can check it off when they're satisfied.

Contributor

@gefjon gefjon left a comment

How did you even find this? And to be clear, is it removing the finalizer, or adding the `sealed`, which has this effect?

@RReverser
Contributor Author

> How did you even find this?

It started with digging into why iterators specifically are so slow, making some optimisations, being weirded out by not seeing any improvement, and eventually ended with a "scientific guess": if it all takes a roughly constant amount of time, the GC must be involved somehow. And voilà, indeed it was.

> And to be clear, is it removing the finalizer, or adding the `sealed`, which has this effect?

It's removing the finalizer.

`sealed` was added after the fact: once the finalizer was removed, I also removed the now-unnecessary `GC.SuppressFinalize(this);`, and then the linter would warn about subclasses potentially adding their own finalizers. So I made the class `sealed` just to tell the linter it's okay and there will be no subclasses.
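
For illustration, a minimal before/after sketch of that change (the type and field names here are hypothetical, not the actual generated bindings):

```csharp
using System;

// Before: Dispose pattern with a finalizer as a safety net.
class RowIterWithFinalizer : IDisposable
{
    private uint handle; // stand-in for some unmanaged resource

    ~RowIterWithFinalizer() => Dispose(); // the finalizer that was removed

    public void Dispose()
    {
        // ... release `handle` ...
        GC.SuppressFinalize(this); // only needed because the finalizer exists
    }
}

// After: no finalizer, no GC.SuppressFinalize(); `sealed` reassures the
// analyzer that no subclass can reintroduce a finalizer.
sealed class RowIter : IDisposable
{
    private uint handle;

    public void Dispose()
    {
        // ... release `handle` ...
    }
}
```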

@RReverser RReverser added this pull request to the merge queue Oct 31, 2024
Merged via the queue into master with commit 8448484 Oct 31, 2024
11 checks passed
@RReverser RReverser deleted the ingvar/csharp-speedup branch October 31, 2024 16:58
mamcx pushed a commit that referenced this pull request Nov 5, 2024