Optimise C# in a ridiculous way #1924

Merged
merged 1 commit into master from ingvar/csharp-speedup on Oct 31, 2024

Conversation

Contributor

@RReverser RReverser commented Oct 31, 2024

Description of Changes

Clickbait title, but there is no other way to say it: this must be the weirdest way I've optimised anything yet.

I was suspicious that a lot of benchmarks, in particular those that deal with iteration (which we only recently added for C#), take about the same amount of time. After digging in for a bit, I narrowed it down to merely the presence of this destructor. It looks like the way destructors (finalizers) are implemented for the Wasm target deoptimises execution to the point where simply removing this one speeds up the benchmarks by 15x or more.

We don't really need it anyway; it was more of a nice-to-have in case the user does something weird, since regular iterator usage like `foreach` always calls `Dispose()` correctly.
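
To illustrate why the finalizer was only ever a safety net: the C# compiler lowers `foreach` over an enumerator into a `try`/`finally` that always calls `Dispose()`. A rough sketch (the generic names below are illustrative, not the actual iterator types in our bindings):

```csharp
using System.Collections.Generic;

static void Consume<T>(IEnumerable<T> source)
{
    // `foreach (var item in source) { ... }` lowers to roughly this:
    IEnumerator<T> e = source.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            T item = e.Current;
            // ... use item ...
        }
    }
    finally
    {
        // Dispose() always runs here, even if the loop body throws,
        // so the finalizer never has to clean up after normal usage.
        e.Dispose();
    }
}
```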

Further investigation and an upstream issue for .NET are coming up, but meanwhile here are the benchmark differences before/after:

| Benchmark                                                                     | Speedup | C# time (before) | C# rate (before) | C# time (after) | C# rate (after) |
| ----------------------------------------------------------------------------- | ------- | ---------------- | ---------------- | --------------- | --------------- |
| special/db_game/circles/load=10                                               | 14.58   | 3.3±0.24ms       | ? ?/sec          | 224.1±21.57µs   | ? ?/sec         |
| special/db_game/circles/load=100                                              | 14.93   | 3.3±0.20ms       | ? ?/sec          | 223.4±16.61µs   | ? ?/sec         |
| stdb_module/mem/filter/string/index/load=2048/count=256                       | 15.79   | 3.1±0.11ms       | 321 Elem/sec     | 197.0±2.97µs    | 5.0 KElem/sec   |
| stdb_module/mem/filter/u64/index/load=2048/count=256                          | 19.15   | 3.2±0.11ms       | 315 Elem/sec     | 165.3±2.86µs    | 5.9 KElem/sec   |
| stdb_module/mem/insert_bulk/u32_u64_u64/btree_each_column/load=2048/count=256 | 2.00    | 3.2±0.11ms       | 314 Elem/sec     | 1589.3±45.07µs  | 629 Elem/sec    |
| stdb_module/mem/insert_bulk/u32_u64_u64/unique_0/load=2048/count=256          | 16.71   | 3.2±0.10ms       | 312 Elem/sec     | 191.7±2.98µs    | 5.1 KElem/sec   |
| stdb_module/mem/iterate/u32_u64_str/unique_0/count=256                        | 31.72   | 3.1±0.11ms       | 325 Elem/sec     | 97.0±2.47µs     | 10.1 KElem/sec  |
| stdb_module/mem/iterate/u32_u64_u64/unique_0/count=256                        | 34.63   | 3.4±0.11ms       | 296 Elem/sec     | 97.5±2.40µs     | 10.0 KElem/sec  |
| stdb_module/mem/update_bulk/u32_u64_str/unique_0/load=2048/count=256          | 14.31   | 3.1±0.12ms       | 318 Elem/sec     | 219.2±7.24µs    | 4.5 KElem/sec   |
| stdb_module/mem/update_bulk/u32_u64_u64/unique_0/load=2048/count=256          | 15.02   | 3.2±0.06ms       | 315 Elem/sec     | 210.9±2.46µs    | 4.6 KElem/sec   |

API and ABI breaking changes

If this is an API or ABI breaking change, please apply the
corresponding GitHub label.

Expected complexity level and risk

How complicated do you think these changes are? Grade on a scale from 1 to 5,
where 1 is a trivial change, and 5 is a deep-reaching and complex change.

This complexity rating applies not only to the complexity apparent in the diff,
but also to its interactions with existing and future code.

If you answered more than a 2, explain what is complex about the PR,
and what other components it interacts with in potentially concerning ways.

Testing

Describe any testing you've done, and any testing you'd like your reviewers to do,
so that you're confident that all the changes work as expected!

  • Write a test you've completed here.
  • Write a test you want a reviewer to do here, so they can check it off when they're satisfied.

Contributor

@gefjon gefjon left a comment

How did you even find this? And to be clear, is it removing the finalizer, or adding the `sealed`, which has this effect?

@RReverser
Contributor Author

> How did you even find this?

It started with digging into why iterators specifically are so slow, making some optimisations, being weirded out by not seeing any improvement, and eventually ended with a "scientific guess": if it all takes a roughly constant amount of time, the GC must be involved somehow. And voilà, indeed it was.

> And to be clear, is it removing the finalizer, or adding the `sealed`, which has this effect?

It's removing the finalizer.

`sealed` was added after the fact: once the finalizer was removed, I also removed the now-unnecessary `GC.SuppressFinalize(this);`, and then the linter would warn about subclasses potentially adding their own finalizers. So I made the class `sealed` just to tell the linter it's okay and there will be no subclasses.
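
For illustration, a minimal before/after sketch of that change (the type and field names here are hypothetical, not the actual generated bindings):

```csharp
using System;

// Before: Dispose pattern with a finalizer as a safety net.
class RowIterWithFinalizer : IDisposable
{
    private uint handle; // stand-in for some unmanaged resource

    ~RowIterWithFinalizer() => Dispose(); // the finalizer that was removed

    public void Dispose()
    {
        // ... release `handle` ...
        GC.SuppressFinalize(this); // only needed because the finalizer exists
    }
}

// After: no finalizer, no GC.SuppressFinalize(); `sealed` reassures the
// analyzer that no subclass can reintroduce a finalizer.
sealed class RowIter : IDisposable
{
    private uint handle;

    public void Dispose()
    {
        // ... release `handle` ...
    }
}
```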

@RReverser RReverser added this pull request to the merge queue Oct 31, 2024
Merged via the queue into master with commit 8448484 Oct 31, 2024
11 checks passed
@RReverser RReverser deleted the ingvar/csharp-speedup branch October 31, 2024 16:58
mamcx pushed a commit that referenced this pull request Nov 5, 2024