Performance Improvements in .NET 5

Date: 2020-09-27

For previous releases of .NET Core, the significant performance improvements found in each release were described in blog posts, from .NET Core 2.0 to .NET Core 2.1 to .NET Core 3.0, with more and more to talk about each time. Interestingly, though, after each one I wondered whether the next release would bring enough meaningful improvements to warrant another post. .NET 5 has already seen a wealth of performance improvements, and even though the final release isn't planned until this fall, and there will very likely be many more improvements by then, the improvements already available are worth highlighting now. In this post, we'll look at roughly 250 PRs that have contributed to a multitude of performance improvements across .NET 5 as a whole.

Setup

BenchmarkDotNet is now the canonical tool for measuring the performance of .NET code, making it easy to analyze the throughput and allocations of code snippets. As such, the majority of examples in this post are measured using microbenchmarks written with that tool. To get started, I created a directory and scaffolded it with the dotnet tool:

mkdir Benchmarks
cd Benchmarks
dotnet new console

and then expanded the contents of the generated Benchmarks.csproj to look like the following:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
    <ServerGarbageCollection>true</ServerGarbageCollection>
    <TargetFrameworks>net5.0;netcoreapp3.1;net48</TargetFrameworks>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.12.1" />
  </ItemGroup>

</Project>

This enables benchmarking against .NET Framework 4.8, .NET Core 3.1, and .NET 5 (I currently have a nightly build installed for Preview 8). The .csproj also references the BenchmarkDotNet NuGet package (version 0.12.1 is its latest release) in order to use its features, and then references several other libraries and packages, in particular to support the ability to run the tests on .NET Framework 4.8.
Then I updated the Program.cs file in the same folder to look like this:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Diagnosers;
using BenchmarkDotNet.Running;
using System;
using System.Buffers.Text;
using System.Collections;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Security;
using System.Net.Sockets;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    // BENCHMARKS GO HERE
}

For each benchmark, the code shown with it can then be copied and pasted where "// BENCHMARKS GO HERE" appears.
To run the benchmarks, run the following:

dotnet run -c Release -f net48 --runtimes net48 netcoreapp31 netcoreapp50 --filter ** --join

This tells BenchmarkDotNet to:

  • Build the benchmarks targeting .NET Framework 4.8.
  • Run the benchmarks against .NET Framework 4.8, .NET Core 3.1, and .NET 5, respectively.
  • Include all benchmarks in the assembly (do not filter out any benchmarks).
  • The output of all benchmarks is combined and displayed at the end of the run (rather than throughout the process).

In cases where an API being benchmarked doesn't exist on a particular target, I simply omit that part of the command line.

Finally, please note the following:

  • .NET Core 3.1 shipped only a few months ago, and from a runtime and core libraries perspective it didn't change dramatically from its predecessor. However, in some cases .NET 5 improvements have already been back-ported to .NET Core 3.1, where the changes were deemed impactful enough to warrant adding to the long-term support (LTS) release. As a result, all of the comparisons here are against the latest .NET Core 3.1 servicing release (3.1.5), rather than against .NET Core 3.0.
  • Since the comparisons are between .NET 5 and .NET Core 3.1, and .NET Core 3.1 did not include the mono runtime, improvements to mono are not covered, nor is Blazor specifically. So when I say "the runtime," I'm referring to coreclr, even though as of .NET 5 it contains multiple runtimes, all of which have been improved.
  • Most of the examples were run on Windows, because I also wanted to be able to compare against .NET Framework 4.8. However, unless otherwise noted, all of the examples shown apply equally to Windows, Linux, and macOS.
  • Note that all of the measurements here were taken on my desktop machine, and your results may vary. Microbenchmarks are very sensitive to a variety of factors, including processor count, processor architecture, memory and cache speeds, and so on. In general, though, I've focused on improvements and included examples that should generally withstand such differences.

Let's get started.

GC

For anyone interested in .NET and performance, garbage collection is often top of mind. A lot of effort goes into reducing allocations, not because the act of allocating is itself particularly expensive, but because of the follow-on costs of cleaning up after those allocations via the garbage collector (GC). No matter how much work goes into reducing allocations, however, the vast majority of workloads will still incur them, so it's important to continually push the boundaries of what the GC can accomplish, and how quickly.

A lot of work went into improving the GC in this release. For example, a dotnet/coreclr change implements a form of work stealing for the GC's "mark" phase. The .NET GC is a "tracing" collector, meaning that (at a very high level) when it runs it starts from a set of "roots" (known locations that are inherently reachable, such as static fields) and traverses from object to object, "marking" each as reachable; after all of the traversals, any objects not marked are unreachable and can be collected. This marking represents the majority of the time spent performing collections, and this PR improves marking performance by better balancing the work performed by each thread involved in the collection. When running with the server GC, a thread per core participates in collections, and as threads finish their allotted portions of the marking work, they are now able to "steal" undone work from other threads in order to help the overall collection complete more quickly.

Another example: a dotnet/runtime change optimized decommitting for the "ephemeral" segment (gen0 and gen1 are referred to as "ephemeral" because they contain objects expected to last only a short time). Decommitting is the act of giving pages of memory back to the operating system at the end of a segment, after the last live object on that segment. The question for the GC then becomes when such decommits should happen and how much should be decommitted at any point, given that it may need to allocate additional pages for additional allocations in the near future.

Or take the dotnet/runtime change that improves GC scalability on machines with higher core counts by reducing lock contention involved in the GC's scanning of statics. Or the dotnet/runtime change that avoids costly memory resets (essentially telling the operating system the relevant memory is no longer of interest) unless the GC sees it's in a low-memory situation. Or the dotnet/runtime change that (although not yet merged, is expected to be for .NET 5) builds on the work of @damageboy to vectorize the sorting employed in the GC. Or the dotnet/coreclr change that reduces the time it takes for the GC to suspend threads, something that's necessary in order for it to get a stable view so it can accurately determine which objects are in use.

That's just a sampling of the changes made to improve the GC itself, and the last point brings me to a topic I find particularly fascinating, as it speaks to a lot of the work we've done in .NET in recent years. In this release, we've continued, and even accelerated, porting native implementations in the coreclr runtime from C/C++ to instead be normal C# managed code in System.Private.CoreLib. Such moves have a plethora of benefits, including making it much easier for us to share a single implementation across multiple runtimes (like coreclr and mono), and even making it easier to evolve API surface area, such as by reusing the same logic to handle arrays and spans. But, to some people's surprise, these benefits also include performance, in multiple ways. One such way harks back to one of the original motivations for using a managed runtime: safety. By default, code written in C# is "safe", in that the runtime ensures all memory accesses are bounds-checked, and only through explicit action visible in the code (e.g. using the unsafe keyword, the Marshal class, the Unsafe class, etc.) can a developer remove such validation. As a result, as maintainers of an open source project, our job of shipping a secure system is made significantly easier when contributions come in the form of managed code: while such code can of course contain bugs that might slip through code reviews and automated testing, we can sleep better at night knowing that the chances such bugs introduce security problems are drastically reduced. That in turn means we're more likely to accept improvements to managed code, and at a faster rate, with it being faster for contributors to provide them and faster for us to help validate them. We've also found that more contributors are interested in exploring performance improvements when it's C# rather than C, and that more people experiment, more quickly, to achieve better performance.

However, we also see more direct performance benefits from porting. The amount of overhead required for managed code to call into the runtime is relatively small, but when calls are made with high frequency, that overhead adds up. Consider dotnet/coreclr#27700, which moved the implementation of sorting arrays of primitive types out of native code in coreclr and up into C# in Corelib. Beyond then powering new public APIs for sorting spans, this change also made it cheaper to sort smaller arrays, where the cost of doing so had previously been dominated by the transition from managed code to native code. We can see this with a small benchmark that just uses Array.Sort to sort int[], double[], and string[] arrays of 10 elements:

public class DoubleSorting : Sorting<double> { protected override double GetNext() => _random.Next(); }
public class Int32Sorting : Sorting<int> { protected override int GetNext() => _random.Next(); }
public class StringSorting : Sorting<string>
{
    protected override string GetNext()
    {
        var dest = new char[_random.Next(1, 5)];
        for (int i = 0; i < dest.Length; i++) dest[i] = (char)('a' + _random.Next(26));
        return new string(dest);
    }
}

public abstract class Sorting<T>
{
    protected Random _random;
    private T[] _orig, _array;

    [Params(10)]
    public int Size { get; set; }

    protected abstract T GetNext();

    [GlobalSetup]
    public void Setup()
    {
        _random = new Random(42);
        _orig = Enumerable.Range(0, Size).Select(_ => GetNext()).ToArray();
        _array = (T[])_orig.Clone();
        Array.Sort(_array);
    }

    [Benchmark]
    public void Random()
    {
        _orig.AsSpan().CopyTo(_array);
        Array.Sort(_array);
    }
}
Type Runtime Mean Ratio
DoubleSorting .NET FW 4.8 88.88 ns 1.00
DoubleSorting .NET Core 3.1 73.29 ns 0.83
DoubleSorting .NET 5.0 35.83 ns 0.40
Int32Sorting .NET FW 4.8 66.34 ns 1.00
Int32Sorting .NET Core 3.1 48.47 ns 0.73
Int32Sorting .NET 5.0 31.07 ns 0.47
StringSorting .NET FW 4.8 2,193.86 ns 1.00
StringSorting .NET Core 3.1 1,713.11 ns 0.78
StringSorting .NET 5.0 1,400.96 ns 0.64
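
As an aside, the span-based sorting APIs referenced above are exposed as extension methods on Span<T> in .NET 5; here's a minimal usage sketch:

using System;

class SpanSortExample
{
    static void Main()
    {
        Span<int> values = stackalloc int[] { 3, 1, 2 };
        values.Sort(); // new in .NET 5: MemoryExtensions.Sort
        Console.WriteLine(string.Join(",", values.ToArray())); // prints 1,2,3
    }
}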

This is in itself a nice benefit of the move. So, too, is the fact that in .NET 5, via dotnet/runtime, we added System.Half, a new primitive 16-bit floating-point type, and, being in managed code, this sorting implementation's optimizations almost immediately applied to it, whereas the previous native implementation would have required significant additional work, there being no standard C++ type for half. But there's arguably an even more impactful performance benefit here, and it brings us back to where I started this discussion: GC.

One of the interesting metrics for the GC is "pause time", which effectively means how long the GC must pause the runtime in order to perform its work. Longer pause times have a direct impact on latency, which can be a crucial metric for all manner of workloads. As alluded to earlier, the GC may need to suspend threads in order to get a consistent view of the world and to ensure it can move objects around safely, but if a thread is currently executing C/C++ code in the runtime, the GC may need to wait until that call completes before it's able to suspend the thread. Thus, the more work we can do in managed code instead of native code, the better off we are for GC pause times. We can use the same Array.Sort example to see this. Consider this program:

using System;
using System.Diagnostics;
using System.Threading;

class Program
{
    public static void Main()
    {
        new Thread(() =>
        {
            var a = new int[20];
            while (true) Array.Sort(a);
        }) { IsBackground = true }.Start();

        var sw = new Stopwatch();
        while (true)
        {
            sw.Restart();
            for (int i = 0; i < 10; i++)
            {
                GC.Collect();
                Thread.Sleep(15);
            }
            Console.WriteLine(sw.Elapsed.TotalSeconds);
        }
    }
}

This spawns a thread that just sits in a tight loop sorting a small array over and over, while on the main thread it induces 10 GCs, each approximately 15 milliseconds apart. So, we'd expect that loop to take a little more than 150 milliseconds. But when I run this on .NET Core 3.1, I get numbers of seconds like this:

6.6419048
5.5663149
5.7430339
6.032052
7.8892468

Here, the GC has difficulty interrupting the thread performing the sort, causing GC pauses to be much longer than expected. Thankfully, when I run this on .NET 5, I get numbers like this instead:

0.159311
0.159453
0.1594669
0.1593328
0.1586566

This is exactly what we predicted. By moving the Array.Sort implementation into managed code, the runtime can more easily suspend the implementation when it wants to, and the GC is able to do its job much better.

Of course, this isn't limited to Array.Sort. A bunch of other PRs performed similar ports: for example, dotnet/runtime#32722 moved the stelemref and ldelemaref JIT helpers to C#, dotnet/runtime#32353 moved portions of the unbox helper to C# (and instrumented the rest with appropriate GC polling locations), dotnet/coreclr#27603 / dotnet/coreclr#27634 / dotnet/coreclr#27123 and other dotnet/coreclr changes moved more array processing, like Array.Clear and Array.Copy, to C#, another dotnet/coreclr change moved more of Buffer to C#, and yet another moved Enum.CompareTo to C#. Some of these changes then enabled subsequent gains, such as dotnet/runtime#32342 and a follow-on change, which employed the improvements in Buffer.Memmove to achieve additional gains in various string and array methods.

One final thought on this set of changes: it's interesting to note how micro-optimizations made in one release can be based on assumptions that are later invalidated, and that when employing such micro-optimizations, one needs to be ready and willing to adapt. In my .NET Core 3.0 blog post, I called out "peanut butter"-style changes like dotnet/coreclr#21756, which switched many call sites from using Array.Copy(source, destination, length) to instead use Array.Copy(source, sourceOffset, destination, destinationOffset, length), because the overhead involved in the former having to retrieve the lower bounds of the source and destination arrays was measurable. However, with the aforementioned set of changes that moved array-processing code to C#, the simpler overload's overhead disappeared, making it both the simpler and the faster choice for these operations. And so the .NET 5 PRs dotnet/coreclr#27641 and dotnet/corefx#42343 switched all of these call sites back to the simpler overload. Another dotnet/runtime change is a further example of undoing a previous optimization because changes rendered it unnecessary or even harmful. You've always been able to pass a single char to String.Split, e.g. version.Split('.'). The problem, however, was that the only overload such a call could bind to was Split(params char[] separator), which meant every such call caused the C# compiler to generate a char[] allocation. To work around that, previous releases added caches, pre-allocating arrays and storing them into statics that could then be used by Split calls to avoid the per-call char[]. Now that .NET has a Split(char separator, StringSplitOptions options = StringSplitOptions.None) overload, we no longer need the arrays at all.
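
To make the Split point concrete, here's a small sketch of the difference (the behavior is identical; only the allocations differ):

string version = "5.0.0-preview.8";

// Binds to Split(params char[]): the compiler creates a char[1] on every call.
string[] viaArray = version.Split(new[] { '.' });

// Binds to Split(char, StringSplitOptions): no array allocation needed.
string[] viaChar = version.Split('.');
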
As one final example, I showed earlier how moving code out of the runtime and into managed code can help with GC pauses, but there are of course other ways the code remaining in the runtime can help with that, too. A dotnet/runtime change reduced GC pauses caused by exception handling by ensuring the runtime was in preemptive mode around code such as getting the "Watson" bucket parameters (essentially, a set of data that uniquely identifies this particular exception and call stack for reporting purposes).

JIT

.NET 5 is an exciting release for the just-in-time (JIT) compiler, too, with many improvements of all kinds finding their way in. As with any compiler, improvements to the JIT can have wide-reaching effects. Often, individual changes have a small impact on any individual piece of code, but such changes are then magnified by the sheer number of places they apply.
There's an almost unbounded number of optimizations that could be added to the JIT, and given unlimited time to run them, the JIT could create the most optimal code for any given scenario. But the JIT's time is not unbounded. Its "just-in-time" nature means it performs compilation as the application runs: when a method that hasn't yet been compiled is invoked, the JIT needs to provide the assembly code for it on demand. That means the thread can't make forward progress until the compilation completes, which in turn means the JIT needs to be strategic about which optimizations it applies and how it chooses to spend its limited time budget. Various techniques are used to give the JIT more time, such as using ahead-of-time (AOT) compilation to do as much of the compilation work as possible before the application runs (for example, the core libraries are all AOT-compiled using a technology called "ReadyToRun"; you may hear it referred to as "R2R", or even as "crossgen", the tool that produces these images), or by using "tiered compilation", which allows the JIT to initially compile a method with few-to-no optimizations applied, and thus very quickly, and only spend more time recompiling it with many more optimizations applied once it's deemed valuable (i.e. once the method is shown to be used repeatedly). More generally, though, the developers contributing to the JIT simply choose to use the allotted time budget on optimizations that prove valuable given the code developers are writing and the code patterns they're employing. That means that as .NET evolves and gains new capabilities, new language features, and new library features, the JIT also evolves, with optimizations suited to the newer styles of code being written.
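
As an aside on the tiered compilation just mentioned, individual methods can opt out of tiering and be fully optimized at their first compilation via MethodImplOptions.AggressiveOptimization (available since .NET Core 3.0); a minimal sketch:

using System.Runtime.CompilerServices;

class HotPath
{
    // Bypasses tier-0: compiled with full optimization on first invocation,
    // trading a bit of startup time for immediate steady-state speed.
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public static long Sum(int[] values)
    {
        long sum = 0;
        foreach (int v in values) sum += v;
        return sum;
    }
}
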
A good example of the JIT evolving alongside new coding patterns is a dotnet/runtime change from @benaadams. Span<T> has been permeating all layers of the .NET stack, as developers working on the runtime, on the core libraries, on ASP.NET Core, and beyond recognize its power when writing safe and efficient code that also unifies handling of strings, managed arrays, natively allocated memory, and other forms of data. Similarly, value types (structs) are being used much more pervasively as a way to avoid object allocation overhead via stack allocation. But this heavy reliance on such types also introduces new headaches for the runtime. The coreclr runtime uses a "precise" garbage collector, which means the GC is able to track with 100% accuracy which values refer to managed objects and which don't; that has benefits, but it also has costs (in contrast, the mono runtime uses a "conservative" garbage collector, which has some performance benefits, but also means it may interpret an arbitrary value on the stack that happens to match the address of a managed object as a live reference to that object). One such cost is that the JIT needs to help the GC by guaranteeing that any local that could be interpreted as an object reference is zeroed out before the GC pays attention to it; otherwise, the GC could end up seeing a garbage value in a local that hadn't yet been set and assume it referred to a valid object, at which point "bad things" can happen. The more reference locals there are, the more clearing needs to be done. If you're only clearing a few locals, it's unlikely to be noticeable. But as the number increases, the time spent clearing those locals can add up, especially in small methods used on very hot code paths. This situation has become much more common with spans and structs, where coding patterns often result in many more references needing to be zeroed (Span<T> contains a reference). The aforementioned PR addressed this by updating the code the JIT generates for the prolog blocks that perform this zeroing, using xmm registers rather than the rep stosd instruction. Effectively, it vectorized the zeroing. You can see the impact with the following benchmark:

[Benchmark]
public int Zeroing()
{
    ReadOnlySpan<char> s1 = "hello world";
    ReadOnlySpan<char> s2 = Nop(s1);
    ReadOnlySpan<char> s3 = Nop(s2);
    ReadOnlySpan<char> s4 = Nop(s3);
    ReadOnlySpan<char> s5 = Nop(s4);
    ReadOnlySpan<char> s6 = Nop(s5);
    ReadOnlySpan<char> s7 = Nop(s6);
    ReadOnlySpan<char> s8 = Nop(s7);
    ReadOnlySpan<char> s9 = Nop(s8);
    ReadOnlySpan<char> s10 = Nop(s9);
    return s1.Length + s2.Length + s3.Length + s4.Length + s5.Length + s6.Length + s7.Length + s8.Length + s9.Length + s10.Length;
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static ReadOnlySpan<char> Nop(ReadOnlySpan<char> span) => default;

On my machine, I get the following results:

Method Runtime Mean Ratio
Zeroing .NET FW 4.8 22.85 ns 1.00
Zeroing .NET Core 3.1 18.60 ns 0.81
Zeroing .NET 5.0 15.07 ns 0.66

Note that this zeroing is actually needed in more situations than those I've alluded to. In particular, the C# specification requires, by default, that all local variables be initialized to their default values before the developer's code is executed. You can see this with an example:

using System;
using System.Runtime.CompilerServices;
using System.Threading;

unsafe class Program
{
    static void Main()
    {
        while (true)
        {
            Example();
            Thread.Sleep(1);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Example()
    {
        Guid g;
        Console.WriteLine(*&g);
    }
}

Run it, and you should only see the guids for all the output of 0. This is because the C ා compiler issues a. Locals init flag in the IL of the compiled sample method, and. Locals init tells the JIT that it needs to zero all local variables, not just those that contain references. However, in. Net 5, there is a new property in the runtime (dotnet / runtime Ţ):

namespace System.Runtime.CompilerServices
{
    [AttributeUsage(AttributeTargets.Module | AttributeTargets.Class | AttributeTargets.Struct | AttributeTargets.Constructor | AttributeTargets.Method | AttributeTargets.Property | AttributeTargets.Event | AttributeTargets.Interface, Inherited = false)]
    public sealed class SkipLocalsInitAttribute : Attribute { }
}

The C# compiler recognizes this attribute and uses it to determine not to emit .locals init where it otherwise would have. If we make a small tweak to the previous example, adding the attribute to the whole module:

using System;
using System.Runtime.CompilerServices;
using System.Threading;

[module: SkipLocalsInit]

unsafe class Program
{
    static void Main()
    {
        while (true)
        {
            Example();
            Thread.Sleep(1);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Example()
    {
        Guid g;
        Console.WriteLine(*&g);
    }
}

Now you should see different results; in particular, you should see non-zero Guids. As of another dotnet/runtime change, the core libraries in .NET 5 now use this attribute to disable .locals init (in previous releases, .locals init was stripped out by a post-compilation step used when building the core libraries). Note that the C# compiler only allows SkipLocalsInit to be used in unsafe contexts, because it can easily result in code accessing memory that hasn't been properly initialized (so think twice if/when you apply it).
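
To illustrate where opting out of .locals init pays off, here's a minimal sketch (a hypothetical helper; safe only because the stackalloc'd buffer is fully written before it's read, and assuming the destination has room for 8 chars):

using System;
using System.Runtime.CompilerServices;

static class HexFormat
{
    [SkipLocalsInit]
    public static unsafe void FormatHex(uint value, Span<char> destination)
    {
        // Without SkipLocalsInit, these 16 bytes would be zeroed on every call.
        char* digits = stackalloc char[8];
        for (int i = 7; i >= 0; i--, value >>= 4)
            digits[i] = "0123456789ABCDEF"[(int)(value & 0xF)];
        new ReadOnlySpan<char>(digits, 8).CopyTo(destination);
    }
}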

In addition to making zeroing faster, there have also been changes to eliminate the zeroing entirely. For example, several dotnet/runtime PRs all helped eliminate zeroing when the JIT could prove it to be duplicative.
Such zeroing is an example of a "tax" managed code pays in return for the guarantees the runtime and the language provide. Another such tax is bounds checking. One of the great advantages of using managed code is that a whole class of potential security vulnerabilities is made irrelevant by default. The runtime ensures that indexing into arrays, strings, and spans is bounds-checked, meaning the runtime injects checks to ensure the requested index is within the bounds of the data being indexed (i.e. greater than or equal to zero and less than the length of the data). Here's a simple example:

public static char Get(string s, int i) => s[i];

To ensure this code is safe, the runtime needs to generate a check that i falls within the bounds of string s, which the JIT does via assembly like the following:

; Program.Get(System.String, Int32)
       sub       rsp,28
       cmp       edx,[rcx+8]
       jae       short M01_L00
       movsxd    rax,edx
       movzx     eax,word ptr [rcx+rax*2+0C]
       add       rsp,28
       ret
M01_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 28

This assembly was generated via a handy feature of BenchmarkDotNet: add [DisassemblyDiagnoser] to the class containing the benchmarks, and it spits out the disassembled assembly code. We can see that the assembly takes the string (passed via the rcx register) and loads the string's length (which is stored 8 bytes into the object, hence the [rcx+8]), comparing it with i, passed in the edx register; if, using an unsigned comparison (unsigned so that any negative values wrap around to be larger than the length), i is greater than or equal to the length, it jumps to a helper, CORINFO_HELP_RNGCHKFAIL, that throws an exception. Just a few instructions, but certain kinds of code can spend a lot of cycles indexing, so it's helpful when the JIT can eliminate as many of these bounds checks as it can prove to be unnecessary.
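
For reference, enabling that output is just a matter of adding the attribute alongside [MemoryDiagnoser] on the benchmarks class from the setup section:

using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]
[DisassemblyDiagnoser] // adds a "Code Size" column and dumps the generated assembly
public class Program
{
    // benchmarks...
}
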
The JIT has already been capable of removing bounds checks in a variety of situations. For example, when you write the loop:

int[] arr = ...;
for (int i = 0; i < arr.Length; i++)
    Use(arr[i]);

the JIT can prove that i will never be outside the bounds of the array, so it can elide the bounds checks it would otherwise generate. In .NET 5, it can remove bounds checks in more places. For example, consider this function, which writes an integer's nibbles as hexadecimal characters to a span:

private static bool TryToHex(int value, Span<char> span)
{
    if ((uint)span.Length <= 7)
        return false;

    ReadOnlySpan<byte> map = new byte[] { (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F' };
    span[0] = (char)map[(value >> 28) & 0xF];
    span[1] = (char)map[(value >> 24) & 0xF];
    span[2] = (char)map[(value >> 20) & 0xF];
    span[3] = (char)map[(value >> 16) & 0xF];
    span[4] = (char)map[(value >> 12) & 0xF];
    span[5] = (char)map[(value >> 8) & 0xF];
    span[6] = (char)map[(value >> 4) & 0xF];
    span[7] = (char)map[value & 0xF];
    return true;
}

private char[] _buffer = new char[100];

[Benchmark]
public bool BoundsChecking() => TryToHex(int.MaxValue, _buffer);

First, it's worth noting in this example that we're relying on an optimization in the C# compiler. Note this line:

ReadOnlySpan<byte> map = new byte[] { (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F' };

That looks awfully expensive, as if we're allocating a byte array on every call to TryToHex. In fact, we're not, and it's actually better than if we had written:

private static readonly byte[] s_map = new byte[] { (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7', (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F' };
...
ReadOnlySpan<byte> map = s_map;

The C# compiler recognizes the pattern of a new byte array being assigned directly to a ReadOnlySpan<byte> (it also recognizes sbyte and bool, but nothing larger than a byte, because of endianness concerns). Because the array nature is then completely hidden by the span, the C# compiler emits the bytes by actually storing them in the data section of the assembly, and the span is created simply by wrapping a pointer to that static data together with the length:

IL_000c: ldsflda valuetype '<PrivateImplementationDetails>'/'__StaticArrayInitTypeSize=16' '<PrivateImplementationDetails>'::'2125B2C332B1113AAE9BFC5E9F7E3B4C91D828CB942C2DF1EEB02502ECCAE9E9'
IL_0011: ldc.i4.s 16
IL_0013: newobj instance void valuetype [System.Runtime]System.ReadOnlySpan`1<uint8>::.ctor(void*, int32)

This matters for this JIT discussion because of that ldc.i4.s 16 above. That's the IL loading the length 16 to use to create the span, and the JIT can see that. It knows the span has a length of 16, which means that if it can prove an access is always to a value greater than or equal to 0 and less than 16, it needn't bounds check that access. dotnet/runtime#1644 did exactly that, recognizing patterns like array[index % const] and eliding the bounds check when the const is less than or equal to the length. In the previous TryToHex example, the JIT can see that the map span has a length of 16, and it can see that all of the indexing into it is done with & 0xF, meaning all values will end up in range, so it can eliminate all of the bounds checks on map. Combine that with the fact that it could already see that no bounds checks were needed on the writes into the span (because it could see that the length check earlier in the method guarded all of the indexing into the span), and this whole method is bounds-check-free in .NET 5. On my machine, this benchmark yields results like the following:

Method Runtime Mean Ratio Code Size
BoundsChecking .NET FW 4.8 14.466 ns 1.00 830 B
BoundsChecking .NET Core 3.1 4.264 ns 0.29 320 B
BoundsChecking .NET 5.0 3.641 ns 0.25 249 B

Note that .NET 5 is not only ~15% faster than .NET Core 3.1; we can also see that its assembly code size is ~22% smaller (the extra "Code Size" column comes from having added [DisassemblyDiagnoser] to the benchmark class).
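
Distilled down, the masking pattern the JIT now recognizes looks something like the following sketch (the compiler-recognized span-over-literal-array form from above keeps the length visible to the JIT):

static class Lookup
{
    // Length-16 span whose data lives in the assembly's data section.
    private static ReadOnlySpan<byte> Map => new byte[]
    {
        (byte)'0', (byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7',
        (byte)'8', (byte)'9', (byte)'A', (byte)'B', (byte)'C', (byte)'D', (byte)'E', (byte)'F'
    };

    // i & 0xF is provably in [0, 15], so no bounds check is needed on the access.
    public static byte HexDigit(int i) => Map[i & 0xF];
}
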
Another nice bounds-checking removal comes from @nathan-moore in dotnet/runtime. As I mentioned, the JIT was already able to remove bounds checks for the very common pattern of iterating from 0 to the length of an array, string, or span, but there are variations on this that, while also relatively common, weren't previously recognized. For example, consider this microbenchmark, which calls a method that detects whether a span of integers is sorted:

private int[] _array = Enumerable.Range(0, 1000).ToArray();

[Benchmark]
public bool IsSorted() => IsSorted(_array);

private static bool IsSorted(ReadOnlySpan<int> span)
{
    for (int i = 0; i < span.Length - 1; i++)
        if (span[i] > span[i + 1])
            return false;

    return true;
}

This slight variation from the previously recognized pattern was enough to prevent the JIT from eliding the bounds checks. Now it isn't: .NET 5 runs this ~20% faster on my machine:

Method Runtime Mean Ratio Code Size
IsSorted .NET FW 4.8 1,083.8 ns 1.00 236 B
IsSorted .NET Core 3.1 581.2 ns 0.54 136 B
IsSorted .NET 5.0 463.0 ns 0.43 105 B

Another category of checks the JIT helps reduce is null checks. The JIT performs these in coordination with the runtime, with the JIT ensuring appropriate instructions are in place to incur hardware exceptions and with the runtime then translating such faults into .NET exceptions. But sometimes instructions are needed only for null checks rather than also accomplishing other necessary functionality, and as long as the required null check happens due to some instruction, the unnecessary duplicative instructions can be removed. Consider this code:

private (int i, int j) _value;

[Benchmark]
public int NullCheck() => _value.j++;

As a runnable benchmark, this does too little work to be accurately measured with BenchmarkDotNet, but it's an easy way to see what assembly code is generated. In .NET Core 3.1, this method results in the following assembly:

; Program.NullCheck()
       nop       dword ptr [rax+rax]
       cmp       [rcx],ecx
       add       rcx,8
       add       rcx,4
       mov       eax,[rcx]
       lea       edx,[rax+1]
       mov       [rcx],edx
       ret
; Total bytes of code 23

That cmp [rcx],ecx instruction is performing a null check as part of calculating the address of j, and the mov eax,[rcx] instruction is then performing another null check as part of dereferencing j's location. That first null check is thus not actually necessary, as the instruction provides no other benefit. So, thanks to PRs like dotnet/runtime#1735 and a follow-on dotnet/runtime change, such duplication is recognized by the JIT much more than before, and for .NET 5 we now end up with:

; Program.NullCheck()
       add       rcx,0C
       mov       eax,[rcx]
       lea       edx,[rax+1]
       mov       [rcx],edx
       ret
; Total bytes of code 12

Covariance is another case where the JIT needs to inject checks in order to ensure a developer can't accidentally break type or memory safety. Consider this code:

class A { }
class B { }
object[] arr = ...;
arr[0] = new A();

Will this code work? It depends. Arrays in .NET are covariant, meaning that an array DerivedType[] can be passed around as a BaseType[] when DerivedType derives from BaseType. That means that in this example, arr could have been constructed as new A[1] or new object[1] or new B[1]. This code should run fine with the first two, but if arr is actually a B[], attempting to store an A instance into it must fail; otherwise, code using the array as a B[] could try to use b[0] as a B, and things could go badly quickly. So, the runtime needs to protect against this with a covariance check, which really means that when a reference-type instance is stored into an array, the runtime needs to check that the assigned type is actually compatible with the concrete type of the array. With dotnet/runtime#189, the JIT is now able to eliminate more covariance checks, specifically in the case where the element type of the array is sealed, as string is. As a result, a microbenchmark like this now runs faster:

private string[] _array = new string[1000];

[Benchmark]
public void CovariantChecking()
{
    string[] array = _array;
    for (int i = 0; i < array.Length; i++)
        array[i] = "default";
}
Method Runtime Mean Ratio Code Size
CovariantChecking .NET FW 4.8 2.121 us 1.00 57 B
CovariantChecking .NET Core 3.1 2.122 us 1.00 57 B
CovariantChecking .NET 5.0 1.666 us 0.79 52 B

Related to this are type checks. I mentioned Span<T> earlier: its introduction solved a variety of problems, but, in support of keeping its usage safe, it also introduced new patterns that then drove improvements in other areas of the system; the implementation of Span<T> itself is no exception. Span<T>'s constructor does a covariance check, requiring that a T[] actually be a T[] and not a U[] where U derives from T. For example, this program:

using System;

class Program
{
    static void Main() => new Span<A>(new B[42]);
}

class A { }
class B : A { }

will result in an exception:

System.ArrayTypeMismatchException: Attempted to access an element as a type incompatible with the array

That exception stems from this check in Span<T>'s constructor:

if (!typeof(T).IsValueType && array.GetType() != typeof(T[]))
    ThrowHelper.ThrowArrayTypeMismatchException();

A dotnet/runtime PR optimized just such an array.GetType() != typeof(T[]) check when T is sealed, while dotnet/runtime#1157 recognized the typeof(T).IsValueType pattern and replaced it with a constant value (with PR dotnet/runtime#1195 doing the same for typeof(T1).IsAssignableFrom(typeof(T2))). The net result is a sizeable improvement on microbenchmarks like this:

class A { }
sealed class B : A { }

private B[] _array = new B[42];

[Benchmark]
public int Ctor() => new Span<B>(_array).Length;

My results are as follows:

Method Runtime Mean Ratio Code Size
Ctor .NET FW 4.8 48.8670 ns 1.00 66 B
Ctor .NET Core 3.1 7.6695 ns 0.16 66 B
Ctor .NET 5.0 0.4959 ns 0.01 17 B

The explanation for the difference is obvious when looking at the generated assembly, even if you're not entirely versed in assembly code. Here's what [DisassemblyDiagnoser] showed was generated on .NET Core 3.1:

; Program.Ctor()
       push      rdi
       push      rsi
       sub       rsp,28
       mov       rsi,[rcx+8]
       test      rsi,rsi
       jne       short M00_L00
       xor       eax,eax
       jmp       short M00_L01
M00_L00:
       mov       rcx,rsi
       call      System.Object.GetType()
       mov       rdi,rax
       mov       rcx,7FFE4B2D18AA
       call      CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE
       cmp       rdi,rax
       jne       short M00_L02
       mov       eax,[rsi+8]
M00_L01:
       add       rsp,28
       pop       rsi
       pop       rdi
       ret
M00_L02:
       call      System.ThrowHelper.ThrowArrayTypeMismatchException()
       int       3
; Total bytes of code 66

And here's what it showed for .NET 5:

; Program.Ctor()
       mov       rax,[rcx+8]
       test      rax,rax
       jne       short M00_L00
       xor       eax,eax
       jmp       short M00_L01
M00_L00:
       mov       eax,[rax+8]
M00_L01:
       ret
; Total bytes of code 17

As another example, in the earlier GC discussion I mentioned a bunch of benefits we've seen from porting native runtime code to be managed C# code. One that I didn't mention then, but will now, is that this work has led to other improvements in the system: changes that solved key blockers to such porting, but that then also served to improve many other cases. A good example is one such dotnet/runtime PR. When we first moved the native array-sorting implementation up into managed code, we inadvertently incurred a regression for floating-point values, a regression that was helpfully spotted by @nietras and subsequently fixed in a follow-up dotnet/runtime change. The regression was due to the native implementation employing a special optimization that was missing from the managed port (for floating-point arrays, moving all of the NaN values to the beginning of the array, such that subsequent comparison operations could ignore the possibility of NaNs), which we successfully brought over. The problem, however, was expressing this in a way that didn't result in tons of code duplication: the native implementation used templates, and the managed implementation used generics, but limitations around inlining with generics meant that the helpers introduced to avoid lots of code duplication were incurring non-inlineable method calls on every comparison used in the sort. The aforementioned PR addressed that by enabling the JIT to inline shared generic code within the same type. Consider this microbenchmark:

private C c1 = new C() { Value = 1 }, c2 = new C() { Value = 2 }, c3 = new C() { Value = 3 };

[Benchmark]
public int Compare() => Comparer<C>.Smallest(c1, c2, c3);

class Comparer<T> where T : IComparable<T>
{
    public static int Smallest(T t1, T t2, T t3) =>
        Compare(t1, t2) <= 0 ?
            (Compare(t1, t3) <= 0 ? 0 : 2) :
            (Compare(t2, t3) <= 0 ? 1 : 2);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static int Compare(T t1, T t2) => t1.CompareTo(t2);
}

class C : IComparable<C>
{
    public int Value;
    public int CompareTo(C other) => other is null ? 1 : Value.CompareTo(other.Value);
}

The Smallest method compares the three values supplied and returns the index of the smallest. It's a method on a generic type, and it calls another method on that same type, which in turn calls methods on instances of the generic type parameter. Because the benchmark uses C as the generic type, and because C is a reference type, the JIT won't specialize the code for this method specifically for C, and will instead use the "shared" implementation it generates for all reference types. In order for the Compare method to then call out to the correct interface implementation of CompareTo, that shared generic implementation employs a dictionary that maps from the generic type to the right target. In previous versions of .NET, methods containing those generic dictionary lookups were not inlineable, which meant that this Smallest method couldn't inline the three calls it makes to Compare, even though Compare is attributed with MethodImplOptions.AggressiveInlining. The aforementioned PR removed that limitation, resulting in a very measurable speedup on this example (and making the array-sorting regression fix feasible):

Method Runtime Mean Ratio
Compare .NET FW 4.8 8.632 ns 1.00
Compare .NET Core 3.1 9.259 ns 1.07
Compare .NET 5.0 5.282 ns 0.61

Most of the improvements called out here focus on throughput, with the JIT producing code that executes more quickly, and that faster code is often (though not always) smaller. Folks working on the JIT actually pay a lot of attention to code size, in many cases using it as a primary metric for whether a change is beneficial. Smaller code is not always faster code (instructions can be the same size but have very different cost profiles), but at a high level it's a reasonable proxy, and smaller code does have direct benefits, such as less impact on instruction caches, less code to load, and so on. In some cases, changes are focused entirely on reducing code size, such as in cases where unnecessary duplication arises. Consider this simple benchmark:

private int _offset = 0;

[Benchmark]
public int ThrowHelpers()
{
    var arr = new int[10];
    var s0 = new Span<int>(arr, _offset, 1);
    var s1 = new Span<int>(arr, _offset + 1, 1);
    var s2 = new Span<int>(arr, _offset + 2, 1);
    var s3 = new Span<int>(arr, _offset + 3, 1);
    var s4 = new Span<int>(arr, _offset + 4, 1);
    var s5 = new Span<int>(arr, _offset + 5, 1);
    return s0[0] + s1[0] + s2[0] + s3[0] + s4[0] + s5[0];
}

The Span<T> constructor performs argument validation which, when T is a value type, results in two call sites to methods on a ThrowHelper class: one that throws for a failed null check on the input array, and one that throws when offset and count are out of range (ThrowHelper contains non-inlineable methods like ThrowArgumentNullException, which contain the actual throw; this avoids the associated code size at every call site, and since the JIT isn't currently capable of "outlining", the opposite of "inlining", it needs to be done manually where it matters). In the example above, we're creating six spans, which means six calls to the Span<int> constructor, all of which will be inlined. The JIT can see that the array is non-null, so it can eliminate the null check and the ThrowArgumentNullException from the inlined code, but it doesn't know whether the offset and count are in range, so it needs to retain the range checks and the call sites for the ThrowHelper.ThrowArgumentOutOfRangeException method. In .NET Core 3.1, that resulted in code like the following being generated for this ThrowHelpers method:

M00_L00:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
M00_L01:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
M00_L02:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
M00_L03:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
M00_L04:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3
M00_L05:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3

In .NET 5, thanks to a dotnet/coreclr change, the JIT is able to recognize this duplication, and instead of all six call sites, it consolidates them into just one:

M00_L00:
       call      System.ThrowHelper.ThrowArgumentOutOfRangeException()
       int       3

All failed checks jump to this shared location rather than each having its own copy:

Method Runtime Code Size
ThrowHelpers .NET FW 4.8 424 B
ThrowHelpers .NET Core 3.1 252 B
ThrowHelpers .NET 5.0 222 B

These are just some of the many improvements that have gone into the JIT in .NET 5. There are many more. dotnet/runtime#32368 causes the JIT to see an array's length as unsigned, which enables it to use better instructions for some mathematical operations performed on lengths (e.g. division). dotnet/runtime#25458 enables the JIT to use faster 0-based comparisons for some unsigned integer operations, e.g. emitting the equivalent of != 0 when the developer actually wrote >= 1. dotnet/runtime#1378 allows the JIT to recognize "constantString".Length as a constant value. Another dotnet/runtime change reduces the size of ReadyToRun images by removing nop padding. Yet another uses addition rather than multiplication when generating the code for x * 2 when x is a float or double. Another improves the code generated for the Math.FusedMultiplyAdd intrinsic. One more makes volatile operations cheaper on ARM64 by using better fence instructions than were previously employed, and another performs peephole optimizations on ARM64 to remove a large number of redundant mov instructions. And on and on.
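
A few of those patterns, distilled into a sketch (exactly which of these fold depends on the JIT version and target):

static class JitPatterns
{
    // Array lengths are never negative; treating Length as unsigned lets
    // division by a power of two compile down to a simple shift.
    static int Half(int[] arr) => arr.Length / 2;

    // For unsigned values, n >= 1 is equivalent to the cheaper n != 0 test.
    static bool AtLeastOne(uint n) => n >= 1;

    // The length of a string literal can be folded to the constant 5.
    static int Five() => "hello".Length;
}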

There are also some significant changes in the JIT that are disabled by default, with the goal of getting real-world feedback on them and being able to enable them by default after .NET 5. For example, a dotnet/runtime change provides an initial implementation of "on-stack replacement" (OSR). I mentioned tiered compilation earlier, which enables the JIT to first generate minimally-optimized code for a method, and then recompile the method with much more optimization applied once the method is shown to be important. This enables faster startup, with code getting going more quickly and only impactful methods being upgraded once things are running. However, tiered compilation relies on being able to replace an implementation, with the next invocation of the method picking up the new one. But what about long-running methods? Tiered compilation is disabled by default for methods that contain loops (or, more specifically, backward branches), because they could end up running for a long time without the replacement being used in a timely manner. OSR enables methods to be updated while their code is executing, while they're "on the stack"; lots of great details are in the design document included in that PR (also related to tiered compilation, dotnet/runtime#1457 improves the call-counting mechanism by which tiered compilation decides which methods should be recompiled, and when). You can experiment with OSR by setting both the COMPlus_TC_QuickJitForLoops and COMPlus_TC_OnStackReplacement environment variables to 1. As another example, a dotnet/runtime change improves the quality of the code generated inside try blocks, enabling the JIT to keep values in registers where it previously couldn't. You can experiment with this by setting the COMPlus_EnableEHWriteThru environment variable to 1.

And there's a bunch of pending pull requests that haven't yet been merged but that may well be before .NET 5 is released (in addition, I expect, to many more that haven't even been put up yet). For example, one dotnet/runtime PR enables the JIT to replace some branching comparisons, like a == 42 ? 3 : 2, with branchless implementations, which can help with performance when the hardware can't correctly predict which branch would be taken. Or another dotnet/runtime PR, which enables the JIT to take a pattern like "hello"[0] and replace it with just the constant 'h'; while a developer generally doesn't write such code directly, this can help when inlining is involved, with a constant string passed into a method that gets inlined and that indexes into a constant location (often after a length check, which, thanks to dotnet/runtime#1378, can also become a constant). Or the dotnet/runtime PR that improves the code generated for the Bmi2.MultiplyNoFlags intrinsic. Or the one that converts BitOperations.PopCount into a JIT intrinsic, enabling the JIT to recognize when it's called with a constant argument and replace the whole operation with a precomputed constant. Or the one that removes null checks emitted when working with const strings. Or dotnet/runtime#32000 from @damageboy, which optimizes double negations.

Intrinsics

In .NET Core 3.0, over 1,000 new hardware intrinsics methods were added and recognized by the JIT to enable C# code to directly target instruction sets like SSE4 and AVX2 (see the docs). These were then used to great benefit in a set of APIs in the core libraries. However, those intrinsics were limited to x86/x64 architectures. In .NET 5, a huge amount of effort has gone into adding thousands more, specific to ARM64, thanks to multiple contributors, and in particular @TamarChristinaArm from Arm Holdings. And as with their x86/x64 counterparts, these intrinsics have been put to good use inside core library functionality. For example, the BitOperations.PopCount() method was previously optimized to use the x86 POPCNT intrinsic, and for .NET 5, a dotnet/runtime change enhances it to also be able to use the ARM VCNT or the equivalent ARM64 CNT. Similarly, another dotnet/runtime change modifies BitOperations.LeadingZeroCount, TrailingZeroCount, and Log2 to use the corresponding intrinsics. And at a higher level, a dotnet/runtime change from @Gnbrkm41 augments a bunch of methods in BitArray to use ARM64 intrinsics, to go along with the previously added support for SSE2 and AVX2. Lots of work has gone into ensuring that the Vector APIs execute well on ARM64, too, via several dotnet/runtime PRs.
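
For example, a call like the following can compile down to a single instruction where the hardware supports it (POPCNT on x86/x64, CNT on ARM64, per the text above):

using System;
using System.Numerics;

class PopCountExample
{
    static void Main()
    {
        uint value = 0b_1011_0010;
        // Counts the set bits; lowered to a hardware instruction when available.
        Console.WriteLine(BitOperations.PopCount(value)); // prints 4
    }
}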

Beyond ARM64, additional work has been done to vectorize more operations. For example, @Gnbrkm41 also submitted a dotnet/runtime change that improves the new Vector.Ceiling and Vector.Floor methods by using ROUNDPS/ROUNDPD on x64 and FRINTP/FRINTM on ARM64. And BitOperations, a relatively low-level type implemented for most operations as a 1:1 wrapper around the most appropriate hardware intrinsics, was not only improved in a dotnet/runtime change from @saucecontrol, but its use in Corelib was also made more efficient.
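
A quick sketch of the new vector rounding methods in use (they operate lane-wise on Vector<float>/Vector<double>):

using System;
using System.Numerics;

class VectorRounding
{
    static void Main()
    {
        var v = new Vector<float>(1.5f); // every lane is 1.5
        Console.WriteLine(Vector.Ceiling(v)); // every lane becomes 2
        Console.WriteLine(Vector.Floor(v));   // every lane becomes 1
    }
}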

Finally, a ton of changes went into the JIT itself to better handle hardware intrinsics and vectorization in general, spread across a number of dotnet/runtime PRs.

Runtime helpers

The GC and the JIT represent large portions of the runtime, but there is still quite a lot of functionality in the runtime outside of these components, and that functionality has similarly seen improvements.
It's interesting to note that the JIT doesn't generate code from scratch for everything. There are many places where the JIT calls into pre-existing helper functions supplied by the runtime, and improvements to those helpers can have a meaningful impact on programs. A dotnet/runtime change is a good example. In libraries like System.Linq, we've shied away from adding additional type checks for covariant interfaces because of their significantly higher cost compared with checks for normal interfaces. In essence, that change (later tweaked by a follow-up dotnet/runtime PR) adds a cache, such that the cost of these casts is amortized and they end up being much faster overall. This is evident from a simple microbenchmark:

private List<string> _list = new List<string>();

// IReadOnlyCollection<out T> is covariant
[Benchmark] public bool IsIReadOnlyCollection() => IsIReadOnlyCollection(_list);
[MethodImpl(MethodImplOptions.NoInlining)]  private static bool IsIReadOnlyCollection(object o) => o is IReadOnlyCollection<object>;
Method Runtime Mean Ratio Code Size
IsIReadOnlyCollection .NET FW 4.8 105.460 ns 1.00 53 B
IsIReadOnlyCollection .NET Core 3.1 56.252 ns 0.53 59 B
IsIReadOnlyCollection .NET 5.0 3.383 ns 0.03 45 B

Another set of impactful changes came in one dotnet/runtime PR (with JIT support in another). Historically, generic methods maintained only a limited number of dedicated dictionary slots that could be used for fast lookup of the types associated with the generic method; once those slots were exhausted, lookups fell back to a slower table. The need for this limitation no longer exists, and these changes enabled fast lookup slots to be used for all generic lookups:

[Benchmark]
public void GenericDictionaries()
{
    for (int i = 0; i < 14; i++)
        GenericMethod<string>(i);
}

[MethodImpl(MethodImplOptions.NoInlining)]
private static object GenericMethod<T>(int level)
{
    switch (level)
    {
        case 0: return typeof(T);
        case 1: return typeof(List<T>);
        case 2: return typeof(List<List<T>>);
        case 3: return typeof(List<List<List<T>>>);
        case 4: return typeof(List<List<List<List<T>>>>);
        case 5: return typeof(List<List<List<List<List<T>>>>>);
        case 6: return typeof(List<List<List<List<List<List<T>>>>>>);
        case 7: return typeof(List<List<List<List<List<List<List<T>>>>>>>);
        case 8: return typeof(List<List<List<List<List<List<List<List<T>>>>>>>>);
        case 9: return typeof(List<List<List<List<List<List<List<List<List<T>>>>>>>>>);
        case 10: return typeof(List<List<List<List<List<List<List<List<List<List<T>>>>>>>>>>);
        case 11: return typeof(List<List<List<List<List<List<List<List<List<List<List<T>>>>>>>>>>>);
        case 12: return typeof(List<List<List<List<List<List<List<List<List<List<List<List<T>>>>>>>>>>>>);
        default: return typeof(List<List<List<List<List<List<List<List<List<List<List<List<List<T>>>>>>>>>>>>>);
    }
}
}
Method Runtime Mean Ratio
GenericDictionaries .NET FW 4.8 104.33 ns 1.00
GenericDictionaries .NET Core 3.1 76.71 ns 0.74
GenericDictionaries .NET 5.0 51.53 ns 0.49

Text Processing

Text-based processing is the bread and butter of many applications, and a lot of effort in every release goes into improving the fundamental building blocks on top of which everything else is built. Such changes extend from micro-optimizations in helpers all the way up to overhauls of entire text-processing libraries.
System.Char received some nice improvements in .NET 5. For example, a dotnet/coreclr change improved the performance of char.IsWhiteSpace by tweaking the implementation to require fewer instructions and less branching. Improvements to char.IsWhiteSpace then manifest in a bunch of other methods that rely on it, like string.IsNullOrWhiteSpace and Trim:

[Benchmark]
public int Trim() => " test ".AsSpan().Trim().Length;
Method Runtime Mean Ratio Code Size
Trim .NET FW 4.8 21.694 ns 1.00 569 B
Trim .NET Core 3.1 8.079 ns 0.37 377 B
Trim .NET 5.0 6.556 ns 0.30 365 B

Another nice example: a dotnet/runtime change improved the performance of char.ToUpperInvariant and char.ToLowerInvariant by improving the inlining of various methods, streamlining the call paths from the public APIs down to the core functionality, and further tweaking the implementation to ensure the JIT generates the best code:

[Benchmark]
[Arguments("It's exciting to see great performance!")]
public int ToUpperInvariant(string s)
{
    int sum = 0;

    for (int i = 0; i < s.Length; i++)
        sum += char.ToUpperInvariant(s[i]);

    return sum;
}
Method Runtime Mean Ratio Code Size
ToUpperInvariant .NET FW 4.8 208.34 ns 1.00 171 B
ToUpperInvariant .NET Core 3.1 166.10 ns 0.80 164 B
ToUpperInvariant .NET 5.0 69.15 ns 0.33 105 B

Beyond individual chars, in practically every release of .NET Core we've worked to speed up the existing formatting APIs. This release is no different. And even though previous releases saw significant wins, this one raises the bar further.
Int32.ToString() is an extremely common operation, and it's important that it be fast. A dotnet/runtime change from @ts2do makes it even faster, by adding inlineable fast paths for the key formatting routines used by the method and by streamlining the paths taken by various public APIs to reach those routines. Other primitive ToString operations were also improved; for example, another dotnet/runtime change streamlines some code paths to cut out redundancy between the public APIs and the point where the bits are actually written to memory.

[Benchmark] public string ToString12345() => 12345.ToString();
[Benchmark] public string ToString123() => ((byte)123).ToString();

Method Runtime Mean Ratio Allocated
ToString12345 .NET FW 4.8 45.737 ns 1.00 40 B
ToString12345 .NET Core 3.1 20.006 ns 0.44 32 B
ToString12345 .NET 5.0 10.742 ns 0.23 32 B
ToString123 .NET FW 4.8 42.791 ns 1.00 32 B
ToString123 .NET Core 3.1 18.014 ns 0.42 32 B
ToString123 .NET 5.0 7.801 ns 0.18 32 B

Similarly, in previous releases we did a fair amount of optimization of DateTime and DateTimeOffset, but those improvements were mostly focused on how quickly the day/month/year/etc. data could be converted into the right characters or bytes and written to the destination. In dotnet/runtime#1944, @ts2do focused on the step before that, optimizing the extraction of the day/month/year/etc. from the raw tick count a DateTime{Offset} stores. That proved very fruitful, resulting in being able to output formats like "o" (the "round-trip date/time pattern") 30% faster than before (the change also applied the same decomposition optimization in other places in the codebase where these components were needed from a DateTime, but the improvement is most easily shown in a standard format):

private byte[] _bytes = new byte[100];
private char[] _chars = new char[100];
private DateTime _dt = DateTime.Now;

[Benchmark] public bool FormatChars() => _dt.TryFormat(_chars, out _, "o");
[Benchmark] public bool FormatBytes() => Utf8Formatter.TryFormat(_dt, _bytes, out _, 'O');
Method Runtime Mean Ratio
FormatChars .NET Core 3.1 242.4 ns 1.00
FormatChars .NET 5.0 176.4 ns 0.73
FormatBytes .NET Core 3.1 235.6 ns 1.00
FormatBytes .NET 5.0 176.1 ns 0.75

There have also been many improvements to string operations in general, such as two dotnet/coreclr changes that in some cases significantly improved the performance of culture-aware StartsWith and EndsWith operations on Linux.
Of course, low-level processing is all well and good, but today's applications spend a lot of time performing higher-level operations, such as encoding data in a particular format. Previous .NET Core versions already optimized Encoding.UTF8, but it improves further in .NET 5: dotnet/runtime optimizes it, especially for smaller inputs, by taking better advantage of stack allocation and of improvements in JIT devirtualization.

[Benchmark]
public string Roundtrip()
{
    byte[] bytes = Encoding.UTF8.GetBytes("this is a test");
    return Encoding.UTF8.GetString(bytes);
}
Method Runtime Mean Ratio Allocated
Roundtrip .NET FW 4.8 113.69 ns 1.00 96 B
Roundtrip .NET Core 3.1 49.76 ns 0.44 96 B
Roundtrip .NET 5.0 36.70 ns 0.32 96 B

Just as important as UTF-8 is the "ISO-8859-1" encoding, also known as "Latin1" (and now publicly exposed as Encoding.Latin1 via dotnet/runtime), in particular for network protocols such as HTTP. dotnet/runtime vectorized its implementation, based largely on optimizations previously done for Encoding.ASCII. This yields a very nice performance boost, which can measurably affect higher-level usage in clients such as HttpClient and servers such as Kestrel.

private static readonly Encoding s_latin1 = Encoding.GetEncoding("iso-8859-1");

[Benchmark]
public string Roundtrip()
{
    byte[] bytes = s_latin1.GetBytes("this is a test. this is only a test. did it work?");
    return s_latin1.GetString(bytes);
}
Method Runtime Mean Allocated
Roundtrip .NET FW 4.8 221.85 ns 209 B
Roundtrip .NET Core 3.1 193.20 ns 200 B
Roundtrip .NET 5.0 41.76 ns 200 B

Encoding performance improvements also extended to the encoders in System.Text.Encodings.Web, where PRs dotnet/corefx #42073 and dotnet/runtime #284 from @gfoidl improved the various TextEncoder types. This included using SSSE3 instructions to vectorize FindFirstCharacterToEncodeUtf8, as well as FindFirstCharToEncode in the JavaScriptEncoder.Default implementation.

private char[] _dest = new char[1000];

[Benchmark]
public void Encode() => JavaScriptEncoder.Default.Encode("This is a test to see how fast we can encode something that does not actually need encoding", _dest, out _, out _);

Regular Expressions

A very specialized but very common form of parsing is regular expressions. Back in early April, I shared a detailed blog post about the many performance improvements in .NET 5's System.Text.RegularExpressions. I'm not going to repeat all of that here, but if you haven't read it, I encourage you to, as it represents a significant advancement for the library. However, I also noted in that post that we would continue to improve regular expressions, in particular by adding more support for specialized but common cases.
One such improvement is in the handling of RegexOptions.Multiline, which changes the meaning of the ^ and $ anchors to match at the beginning and end of any line, not just the beginning and end of the whole input string. We previously didn't do any special handling of beginning-of-line anchors (^ when Multiline is specified), which meant that as part of the FindFirstChar operation (see the aforementioned blog post for what that refers to) we wouldn't skip ahead as much as we otherwise could. dotnet/runtime taught FindFirstChar how to use a vectorized IndexOf to jump ahead to the next relevant location. The impact is highlighted in this benchmark, which processes the text of "Romeo and Juliet" downloaded from Project Gutenberg:

private readonly string _input = new HttpClient().GetStringAsync("http://www.gutenberg.org/cache/epub/1112/pg1112.txt").Result;
private Regex _regex;

[Params(false, true)]
public bool Compiled { get; set; }

[GlobalSetup]
public void Setup() => _regex = new Regex(@"^.*\blove\b.*$", RegexOptions.Multiline | (Compiled ? RegexOptions.Compiled : RegexOptions.None));

[Benchmark]
public int Count() => _regex.Matches(_input).Count;
Method Runtime Compiled Mean Ratio
Count .NET FW 4.8 False 26.207 ms 1.00
Count .NET Core 3.1 False 21.106 ms 0.80
Count .NET 5.0 False 4.065 ms 0.16
Count .NET FW 4.8 True 16.944 ms 1.00
Count .NET Core 3.1 True 15.287 ms 0.90
Count .NET 5.0 True 2.172 ms 0.13

Another improvement is in the handling of RegexOptions.IgnoreCase. The IgnoreCase implementation uses char.ToLower{Invariant} to get the relevant characters to compare, but that has overhead due to culture-specific mappings. dotnet/runtime enables those overheads to be avoided when the only character that could possibly lowercase to the character being compared is that character itself.

private readonly Regex _regex = new Regex("hello.*world", RegexOptions.Compiled | RegexOptions.IgnoreCase);
private readonly string _input = "abcdHELLO" + new string('a', 128) + "WORLD123";

[Benchmark] public bool IsMatch() => _regex.IsMatch(_input);
Method Runtime Mean Ratio
IsMatch .NET FW 4.8 2,558.1 ns 1.00
IsMatch .NET Core 3.1 789.3 ns 0.31
IsMatch .NET 5.0 129.0 ns 0.05

A related dotnet/runtime improvement, also in service of RegexOptions.IgnoreCase, reduces the number of virtual calls the implementation makes to CultureInfo.TextInfo, caching the TextInfo rather than the CultureInfo it came from.

private readonly Regex _regex = new Regex("Hello, \w+.", RegexOptions.Compiled | RegexOptions.IgnoreCase);
private readonly string _input = "This is a test to see how well this does.  Hello, world.";

[Benchmark] public bool IsMatch() => _regex.IsMatch(_input);
Method Runtime Mean Ratio
IsMatch .NET FW 4.8 712.9 ns 1.00
IsMatch .NET Core 3.1 343.5 ns 0.48
IsMatch .NET 5.0 100.9 ns 0.14

One of my favorite recent optimizations is a dotnet/runtime change (which was then augmented further in a follow-up dotnet/runtime PR). The change recognizes that, for a regex beginning with an atomic loop (one explicitly written, or more commonly one upgraded to atomic by automatic analysis of the expression), we can update the next starting position in the scan loop (again, see the blog post for details) based on where the loop ended rather than where it started. For many inputs, this greatly reduces overhead. Using the benchmark and data from https://github.com/mariomka/regex-benchmark:

private Regex _email = new Regex(@"[\w\.+-][email protected][\w\.-]+\.[\w\.-]+", RegexOptions.Compiled);
private Regex _uri = new Regex(@"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled);
private Regex _ip = new Regex(@"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])", RegexOptions.Compiled);

private string _input = new HttpClient().GetStringAsync("https://raw.githubusercontent.com/mariomka/regex-benchmark/652d55810691ad88e1c2292a2646d301d3928903/input-text.txt").Result;

[Benchmark] public int Email() => _email.Matches(_input).Count;
[Benchmark] public int Uri() => _uri.Matches(_input).Count;
[Benchmark] public int IP() => _ip.Matches(_input).Count;
Method Runtime Mean Ratio
Email .NET FW 4.8 1,036.729 ms 1.00
Email .NET Core 3.1 930.238 ms 0.90
Email .NET 5.0 50.911 ms 0.05
Uri .NET FW 4.8 870.114 ms 1.00
Uri .NET Core 3.1 759.079 ms 0.87
Uri .NET 5.0 50.022 ms 0.06
IP .NET FW 4.8 75.718 ms 1.00
IP .NET Core 3.1 61.818 ms 0.82
IP .NET 5.0 6.837 ms 0.09

Finally, not all the focus has been on the raw throughput of actually executing regular expressions. One way developers get the best throughput from Regex is by specifying RegexOptions.Compiled, which uses reflection emit to generate IL at run time, IL which in turn needs to be JIT-compiled. Depending on the expression, Regex may emit a lot of IL, which then requires a lot of JIT processing to churn into assembly code. dotnet/runtime improved the JIT itself to help with this case, fixing some potentially quadratic-execution-time code paths that regex-generated IL was triggering. And dotnet/runtime adjusted the IL operations the Regex engine uses so that the patterns employed are much closer to what the C# compiler emits, which matters because the JIT is more tuned to optimize those patterns well. On some real workloads featuring several hundred complex regular expressions, these combined to reduce the time spent JIT'ing the expressions by upwards of 20%.

Threading and Async

One of the bigger changes around asynchrony in .NET 5 is actually not enabled by default; it is another experiment intended to gather feedback. In essence, dotnet/coreclr introduces the ability for async ValueTask and async ValueTask&lt;T&gt; methods to implicitly cache and reuse the object created to represent an asynchronously completing operation, making the overhead of such methods amortized-allocation-free. The optimization is currently opt-in, meaning you need to set the DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS environment variable to 1 to enable it. One of the difficulties with enabling this is for code that may do something more complex than just await SomeValueTaskReturningMethod(), because ValueTasks have more constraints than Tasks about how they may be used. To help with that, a new UseValueTasksCorrectly analyzer was released that flags most such misuse.

[Benchmark]
public async Task ValueTaskCost()
{
    for (int i = 0; i < 1_000; i++)
        await YieldOnce();
}

private static async ValueTask YieldOnce() => await Task.Yield();
Method Runtime Mean Ratio Allocated
ValueTaskCost .NET FW 4.8 1,635.6 us 1.00 294010 B
ValueTaskCost .NET Core 3.1 842.7 us 0.51 120184 B
ValueTaskCost .NET 5.0 812.3 us 0.50 186 B

Some changes in the C# compiler bring additional benefits to async methods in .NET 5 (the core libraries in .NET 5 are compiled with the newer compiler). Every async method has a "builder" responsible for creating and completing the returned task, with the C# compiler generating code as part of an async method to use it. A dotnet/roslyn change from @benaadams avoids a struct copy made as part of that generated code, which helps reduce overhead, in particular for async ValueTask&lt;T&gt; methods where the builder is relatively large (and grows as T grows). Another dotnet/roslyn change, also from @benaadams, adjusts the same generated code to play better with the JIT's zeroing improvements discussed earlier.
There are also improvements to specific APIs. dotnet/runtime was born out of a particular usage of Task.ContinueWith, where a continuation is used purely to log an exception in the "antecedent" task being continued from. The common case is that the task doesn't fault, and this PR does a better job optimizing for that case.

const int Iters = 1_000_000;

private AsyncTaskMethodBuilder[] tasks = new AsyncTaskMethodBuilder[Iters];

[IterationSetup]
public void Setup()
{
    Array.Clear(tasks, 0, tasks.Length);
    for (int i = 0; i < tasks.Length; i++)
        _ = tasks[i].Task;
}

[Benchmark(OperationsPerInvoke = Iters)]
public void Cancel()
{
    for (int i = 0; i < tasks.Length; i++)
    {
        tasks[i].Task.ContinueWith(_ => { }, CancellationToken.None, TaskContinuationOptions.OnlyOnFaulted | TaskContinuationOptions.ExecuteSynchronously, TaskScheduler.Default);
        tasks[i].SetResult();
    }
}
Method Runtime Mean Ratio Allocated
Cancel .NET FW 4.8 239.2 ns 1.00 193 B
Cancel .NET Core 3.1 140.3 ns 0.59 192 B
Cancel .NET 5.0 106.4 ns 0.44 112 B

There are also tweaks to help with specific architectures. Because of the strong memory model employed by the x86/x64 architectures, volatile essentially evaporates at JIT time when targeting x86/x64. That is not the case for ARM/ARM64, which have weaker memory models and where volatile causes the JIT to emit fences. dotnet/runtime removes several volatile accesses per work item queued to the thread pool, making the thread pool faster on ARM. dotnet/runtime hoists a volatile access in ConcurrentDictionary out of a loop, which in turn improves the throughput of some members of ConcurrentDictionary on ARM by up to 30%. And dotnet/runtime removes volatile entirely from another ConcurrentDictionary field.
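
To illustrate the hoisting idea, here is a minimal sketch under assumed names (Buckets, CountItems; not the actual ConcurrentDictionary code), where a single volatile read of the field replaces one per iteration:

using System.Threading;

class Buckets
{
    private sealed class Node { public int Value; public Node Next; }

    private Node[] _buckets = new Node[64];

    public int CountItems()
    {
        // One volatile read, hoisted out of the loop: on weaker memory
        // models like ARM, a volatile read of the field on every iteration
        // can emit a fence each time.
        Node[] buckets = Volatile.Read(ref _buckets);

        int count = 0;
        for (int i = 0; i < buckets.Length; i++)
            for (Node n = buckets[i]; n != null; n = n.Next)
                count++;
        return count;
    }
}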

Collections

Over the years, C# has gained a lot of valuable features. Many of these are aimed at letting developers write code more concisely, with the language/compiler responsible for the boilerplate, such as with records in C# 9. But a few features are focused more on performance than on productivity, and those are a huge boon to the core libraries, which can use them to make everyone's programs more efficient. The dotnet/runtime change from @benaadams is a good example: the PR improves Dictionary&lt;TKey, TValue&gt; using the ref returns and ref locals introduced in C# 7. Dictionary&lt;TKey, TValue&gt;'s implementation is backed by an array of entries, and the dictionary has a core routine for looking up a key's index in its entries array; that routine is then used from multiple functions, such as the indexer, TryGetValue, ContainsKey, and so on. However, that sharing comes at a cost: by handing back the index and leaving it up to the caller to fetch the data from that slot as needed, the caller needs to re-index into the array, incurring a second bounds check. With ref returns, the shared routine can instead hand back a ref to the slot rather than the raw index, letting the caller avoid the second bounds check as well as avoid copying the entire entry. The PR also included some low-level tuning of the generated assembly, reorganizing fields and the operations used to update them so that the JIT can generate better code.
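
To make the technique concrete, here is a minimal sketch of the ref-return pattern; Entry, MiniMap, and FindEntry are illustrative names, not the real Dictionary&lt;TKey, TValue&gt; internals:

using System.Runtime.CompilerServices;

struct Entry { public int Key; public long Value; }

class MiniMap
{
    private Entry[] _entries = new Entry[16];

    // The shared lookup routine hands back a ref to the matching slot rather
    // than an index, so callers neither re-index into the array (a second
    // bounds check) nor copy the whole Entry struct.
    private ref Entry FindEntry(int key)
    {
        Entry[] entries = _entries;
        for (int i = 0; i < entries.Length; i++)
        {
            if (entries[i].Key == key)
                return ref entries[i];
        }
        return ref Unsafe.NullRef<Entry>(); // null ref as "not found" sentinel
    }

    public bool TryGetValue(int key, out long value)
    {
        ref Entry entry = ref FindEntry(key);
        if (Unsafe.IsNullRef(ref entry))
        {
            value = 0;
            return false;
        }
        value = entry.Value; // reads through the ref; no second bounds check
        return true;
    }
}
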
Dictionary&lt;TKey, TValue&gt;'s performance is further improved by several more PRs. Like many hash tables, Dictionary&lt;TKey, TValue&gt; is partitioned into "buckets", each of which is essentially a linked list of entries (stored in an array, rather than with a separate node object per item). For a given key, a hash function (TKey's GetHashCode, or the supplied comparer's GetHashCode) is used to compute a hash code for the provided key, and that hash code is then mapped to a bucket; once the bucket is found, the implementation walks the chain of entries in that bucket looking for the target key. The implementation tries to keep the number of entries in each bucket small, growing and rebalancing as necessary to maintain that condition. As such, a significant portion of the cost of a lookup is computing the mapping from hash code to bucket. To help provide a good distribution across the buckets, especially when a less-than-ideal hash code generator is supplied by the TKey or comparer, the dictionary uses a prime number of buckets, and the bucket mapping is done by hashcode % numBuckets. But at the speeds that matter here, the division performed by the % operator is relatively expensive. Building on Daniel Lemire's work, dotnet/coreclr (from @benaadams) and then dotnet/runtime #406 changed the use of % in 64-bit processes to instead use a pair of multiplications and shifts that achieve the same result, but faster.
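
Here is a sketch of that multiply/shift trick; GetFastModMultiplier and FastMod are illustrative names patterned after the idea, with the multiplier computed once when the (prime) bucket count is chosen and then reused on every lookup:

// Computed once, when the (prime) bucket count is chosen.
static ulong GetFastModMultiplier(uint divisor) => ulong.MaxValue / divisor + 1;

// Returns value % divisor without a division instruction; valid for
// divisors up to int.MaxValue, which covers any array length.
static uint FastMod(uint value, uint divisor, ulong multiplier) =>
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);

// Usage: uint bucket = FastMod(hashCode, (uint)buckets.Length, multiplier);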

private Dictionary<int, int> _dictionary = Enumerable.Range(0, 10_000).ToDictionary(i => i);

[Benchmark]
public int Sum()
{
    Dictionary<int, int> dictionary = _dictionary;
    int sum = 0;

    for (int i = 0; i < 10_000; i++)
        if (dictionary.TryGetValue(i, out int value))
            sum += value;

    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 77.45 us 1.00
Sum .NET Core 3.1 67.35 us 0.87
Sum .NET 5.0 44.10 us 0.57

HashSet&lt;T&gt; is very similar to Dictionary&lt;TKey, TValue&gt;. While it exposes a different set of operations (no pun intended), other than storing just a key rather than a key and a value, its data structure is fundamentally the same... or at least it used to be. Over the years, given how much more Dictionary&lt;TKey, TValue&gt; is used than HashSet&lt;T&gt;, more effort went into optimizing Dictionary&lt;TKey, TValue&gt;'s implementation, and the two implementations drifted apart. dotnet/corefx #40106 from @JeffreyZhao ported some of the dictionary improvements to the hash set, and then dotnet/runtime #37180 effectively rewrote HashSet&lt;T&gt;'s implementation by re-syncing it with dictionary's (along with moving it lower in the stack so that some places where a dictionary was being used for a set could be properly replaced). The net result is that HashSet&lt;T&gt; ends up with similar gains (even more so, in fact, because it started from a worse place).

private HashSet<int> _set = Enumerable.Range(0, 10_000).ToHashSet();

[Benchmark]
public int Sum()
{
    HashSet<int> set = _set;
    int sum = 0;

    for (int i = 0; i < 10_000; i++)
        if (set.Contains(i))
            sum += i;

    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 76.29 us 1.00
Sum .NET Core 3.1 79.23 us 1.04
Sum .NET 5.0 42.63 us 0.56

Similarly, dotnet/runtime ported similar improvements from Dictionary&lt;TKey, TValue&gt; to ConcurrentDictionary&lt;TKey, TValue&gt;.

private ConcurrentDictionary<int, int> _dictionary = new ConcurrentDictionary<int, int>(Enumerable.Range(0, 10_000).Select(i => new KeyValuePair<int, int>(i, i)));

[Benchmark]
public int Sum()
{
    ConcurrentDictionary<int, int> dictionary = _dictionary;
    int sum = 0;

    for (int i = 0; i < 10_000; i++)
        if (dictionary.TryGetValue(i, out int value))
            sum += value;

    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 115.25 us 1.00
Sum .NET Core 3.1 84.30 us 0.73
Sum .NET 5.0 49.52 us 0.43

System.Collections.Immutable has also seen improvements. dotnet/runtime #1183 is a one-line but impactful change from @hnrqbaggio that improves the performance of foreach'ing over an ImmutableArray&lt;T&gt; by adding [MethodImpl(MethodImplOptions.AggressiveInlining)] to ImmutableArray&lt;T&gt;'s GetEnumerator method. We are usually very cautious about sprinkling AggressiveInlining around: it can make microbenchmarks look great, since it ends up eliminating the overhead of calling the relevant methods, but it can also significantly increase code size, which can then negatively affect a bunch of things, such as making the instruction cache less effective. In this case, however, it not only improves throughput but actually reduces code size as well. Inlining is a powerful optimization, not just because it eliminates the overhead of a call, but because it exposes the contents of the callee to the caller. The JIT generally doesn't do interprocedural analysis, due to its limited time budget for optimizations, but inlining overcomes that by merging the caller and the callee, at which point the JIT's optimizations of the caller factor in the callee. Imagine a method public static int GetValue() => 42; and a caller that does if (GetValue() * 2 > 100) { /* lots of code */ }. If GetValue() isn't inlined, the comparison and the "lots of code" get JIT'd; but if GetValue() is inlined, the JIT sees this as if (84 > 100) { /* lots of code */ }, and the whole block is dropped. Thankfully such a simple method will almost always be inlined automatically, but ImmutableArray&lt;T&gt;'s GetEnumerator is just large enough that the JIT doesn't automatically recognize how beneficial inlining it will be. In practice, when GetEnumerator is inlined, the JIT is better able to recognize that the foreach is iterating over an array (a minimal sketch of the attribute pattern appears after the benchmark results below), so that instead of the code generated for Sum on .NET Core 3.1 being:

; Program.Sum()
       push      rsi
       sub       rsp,30
       xor       eax,eax
       mov       [rsp+20],rax
       mov       [rsp+28],rax
       xor       esi,esi
       cmp       [rcx],ecx
       add       rcx,8
       lea       rdx,[rsp+20]
       call      System.Collections.Immutable.ImmutableArray'1[[System.Int32, System.Private.CoreLib]].GetEnumerator()
       jmp       short M00_L01
M00_L00:
       cmp       [rsp+28],edx
       jae       short M00_L02
       mov       rax,[rsp+20]
       mov       edx,[rsp+28]
       movsxd    rdx,edx
       mov       eax,[rax+rdx*4+10]
       add       esi,eax
M00_L01:
       mov       eax,[rsp+28]
       inc       eax
       mov       [rsp+28],eax
       mov       rdx,[rsp+20]
       mov       edx,[rdx+8]
       cmp       edx,eax
       jg        short M00_L00
       mov       eax,esi
       add       rsp,30
       pop       rsi
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 97

the code generated on .NET 5 is now:

; Program.Sum()
       sub       rsp,28
       xor       eax,eax
       add       rcx,8
       mov       rdx,[rcx]
       mov       ecx,[rdx+8]
       mov       r8d,0FFFFFFFF
       jmp       short M00_L01
M00_L00:
       cmp       r8d,ecx
       jae       short M00_L02
       movsxd    r9,r8d
       mov       r9d,[rdx+r9*4+10]
       add       eax,r9d
M00_L01:
       inc       r8d
       cmp       ecx,r8d
       jg        short M00_L00
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 59

As a result, smaller code and faster execution:

private ImmutableArray<int> _array = ImmutableArray.Create(Enumerable.Range(0, 100_000).ToArray());

[Benchmark]
public int Sum()
{
    int sum = 0;

    foreach (int i in _array)
        sum += i;

    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 187.60 us 1.00
Sum .NET Core 3.1 187.32 us 1.00
Sum .NET 5.0 46.59 us 0.25
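
For reference, here is a minimal sketch of the attribute pattern described above; IntArrayView is an illustrative stand-in, not ImmutableArray&lt;T&gt; itself:

using System.Runtime.CompilerServices;

// A struct wrapper over an array whose GetEnumerator is annotated so the JIT
// inlines it, letting the JIT see that foreach is just walking an array.
public readonly struct IntArrayView
{
    private readonly int[] _array;
    public IntArrayView(int[] array) => _array = array;

    // Asks the JIT to inline even though the method body is larger than its
    // default inlining heuristics would normally accept.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public Enumerator GetEnumerator() => new Enumerator(_array);

    public struct Enumerator
    {
        private readonly int[] _array;
        private int _index;
        internal Enumerator(int[] array) { _array = array; _index = -1; }
        public int Current => _array[_index];
        public bool MoveNext() => ++_index < _array.Length;
    }
}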

ImmutableList&lt;T&gt;.Contains also saw significant improvements, thanks to a dotnet/corefx change from @shortspider. Contains was implemented using ImmutableList&lt;T&gt;'s IndexOf method, which is in turn implemented on top of its enumerator. Under the covers, ImmutableList&lt;T&gt; is implemented today as an AVL tree, a form of self-balancing binary search tree, and walking such a tree in order requires maintaining a non-trivial amount of state; ImmutableList&lt;T&gt;'s enumerator goes to great pains to avoid allocating per enumeration in order to store that state. That results in non-trivial overhead. However, Contains doesn't care about the exact index of an element in the list (nor which of potentially multiple copies is found), only that it exists, so it can use a simple recursive tree search. (And because the tree is balanced, we aren't concerned about stack overflow.)
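
A minimal sketch of the idea, assuming an illustrative Node type rather than the actual AVL implementation:

sealed class Node
{
    public int Value;
    public Node Left, Right;
}

// Existence check via direct recursion over the tree: no enumerator state to
// maintain. Recursion depth stays O(log n) because the tree is height-balanced,
// so stack overflow isn't a concern.
static bool Contains(Node node, int value) =>
    node != null &&
    (node.Value == value || Contains(node.Left, value) || Contains(node.Right, value));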

private ImmutableList<int> _list = ImmutableList.Create(Enumerable.Range(0, 1_000).ToArray());

[Benchmark]
public int Sum()
{
    int sum = 0;

    for (int i = 0; i < 1_000; i++)
        if (_list.Contains(i))
            sum += i;

    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 22.259 ms 1.00
Sum .NET Core 3.1 22.872 ms 1.03
Sum .NET 5.0 2.066 ms 0.09

The collection improvements highlighted above are all for general-purpose collections, meant to store whatever data a developer needs. But not all collection types are like that: some are much more specialized to a particular data type, and such collections see performance improvements in .NET 5 as well. BitArray is one example, with several PRs this release making significant improvements to its performance. In particular, dotnet/corefx #41896 from @Gnbrkm41 uses AVX2 and SSE2 intrinsics to vectorize many of the operations on BitArray (a later dotnet/runtime change added ARM64 intrinsics as well):

private bool[] _array;

[GlobalSetup]
public void Setup()
{
    var r = new Random(42);
    _array = Enumerable.Range(0, 1000).Select(_ => r.Next(0, 2) == 0).ToArray();
}

[Benchmark]
public BitArray Create() => new BitArray(_array);
Method Runtime Mean Ratio
Create .NET FW 4.8 1,140.91 ns 1.00
Create .NET Core 3.1 861.97 ns 0.76
Create .NET 5.0 49.08 ns 0.04

LINQ

In previous .NET Core releases, a lot of change went into the System.Linq codebase, especially to improve performance. That pace has slowed, but .NET 5 still sees LINQ performance improvements.
One notable improvement is in OrderBy. As discussed earlier, there were several motivations for moving coreclr's native sorting implementation up into managed code, one of which was being able to easily reuse it as part of span-based sorting methods. Such APIs are now public, and with dotnet/runtime #1888, we were able to use that span-based sorting in System.Linq. This was beneficial in particular because it enabled using the Comparison&lt;T&gt;-based sorting routines, which in turn avoids multiple layers of indirection on every comparison operation.

[GlobalSetup]
public void Setup()
{
    var r = new Random(42);
    _array = Enumerable.Range(0, 1_000).Select(_ => r.Next()).ToArray();
}

private int[] _array;

[Benchmark]
public void Sort()
{
    foreach (int i in _array.OrderBy(i => i)) { }
}
Method Runtime Mean Ratio
Sort .NET FW 4.8 100.78 us 1.00
Sort .NET Core 3.1 101.03 us 1.00
Sort .NET 5.0 85.46 us 0.85

Not bad for what is essentially a one-line change.
Another improvement comes from @timandy's dotnet/corefx PR. It augments Enumerable.SkipLast to special-case IList&lt;T&gt; as well as the internal IPartition&lt;T&gt; interface (which is how various operators communicate with each other for optimization purposes), in order to re-express SkipLast as a Take operation when the length of the source can be determined cheaply.
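
A minimal sketch of that special-casing, as a hypothetical helper rather than the actual LINQ source (it relies on the System.Linq and System.Collections.Generic usings already in the benchmark Program.cs):

static IEnumerable<T> SkipLastFast<T>(IEnumerable<T> source, int count) =>
    source is IList<T> list
        ? source.Take(list.Count - count)   // length known cheaply: just a Take
        : source.SkipLast(count);           // general path must buffer look-ahead
// Note: Take clamps a negative argument to an empty sequence, matching
// SkipLast's behavior when count >= the source length.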

private IEnumerable<int> data = Enumerable.Range(0, 100).ToList();

[Benchmark]
public int SkipLast() => data.SkipLast(5).Sum();
Method Runtime Mean Ratio Allocated
SkipLast .NET Core 3.1 1,641.0 ns 1.00 248 B
SkipLast .NET 5.0 684.8 ns 0.42 48 B

Finally, dotnet/corefx #40377 was arguably a long time coming. This is an interesting case to me. For a while now, I've seen developers assume that Enumerable.Any() is more efficient than Enumerable.Count() != 0; after all, Any() only needs to determine whether there is anything in the source, while Count() needs to determine how much is in the source. Thus, for any reasonable collection, Any() should at worst be O(1), while Count() may at worst be O(N). So isn't Any() always preferable? There are even Roslyn analyzers that recommend this conversion. Unfortunately, it's not always the case. Prior to .NET 5, Any() was implemented essentially as follows:

using (IEnumerator<T> e = source.GetEnumerator())
    return e.MoveNext();

That means that in the typical case, even if the source has an O(1) count, Any() will allocate an enumerator object and incur two interface dispatches. In contrast, since LINQ's initial release in .NET Framework 3.5, Count() has had optimized code paths that special-case ICollection&lt;T&gt; to use its Count property, in which case it is generally O(1) and allocation-free, with only one interface dispatch. As a result, for very common cases (such as the source being a List&lt;T&gt;), it was actually more efficient to use Count() != 0 than Any(). While adding an interface check brings some overhead, it was worth adding it to make the Any() implementation predictable and consistent with Count(), making it easier to reason about and making the prevailing wisdom about their costs correct.
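
A sketch of the shape of the fix (not the exact library code):

static bool AnyFast<T>(IEnumerable<T> source)
{
    if (source is ICollection<T> collection)
        return collection.Count != 0;         // O(1), allocation-free, one dispatch

    using IEnumerator<T> e = source.GetEnumerator();
    return e.MoveNext();                      // general fallback
}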

Networking

Networking is a critical component of almost all applications today, and great network performance is essential. As such, every release of .NET invests heavily in improving networking performance, and .NET 5 is no exception.
Let's start by looking at some primitives and work our way up. System.Uri is used by most applications to represent URLs, and it's important that it be fast. Many PRs have gone into making Uri faster in .NET 5. Arguably the most important operation on a Uri is constructing one, and dotnet/runtime made construction faster for all Uris, primarily by paying attention to overheads and avoiding unnecessary costs:

[Benchmark]
public Uri Ctor() => new Uri("https://github.com/dotnet/runtime/pull/36915");
Method Runtime Mean Ratio Allocated
Ctor .NET FW 4.8 443.2 ns 1.00 225 B
Ctor .NET Core 3.1 192.3 ns 0.43 72 B
Ctor .NET 5.0 129.9 ns 0.29 56 B

Once constructed, applications frequently access the various components of the Uri, and that has improved as well. In particular, a type like HttpClient commonly has a single Uri used repeatedly for requests. The SocketsHttpHandler implementation accesses the Uri.PathAndQuery property in order to send it as part of the HTTP request (for example, GET /dotnet/runtime HTTP/1.1), and in the past that meant recreating that portion of the Uri's string for every request. Thanks to dotnet/runtime, it is now cached (as is the IdnHost):

private Uri _uri = new Uri("http://github.com/dotnet/runtime");

[Benchmark]
public string PathAndQuery() => _uri.PathAndQuery;
Method Runtime Mean Ratio Allocated
PathAndQuery .NET FW 4.8 17.936 ns 1.00 56 B
PathAndQuery .NET Core 3.1 30.891 ns 1.72 56 B
PathAndQuery .NET 5.0 2.854 ns 0.16

Beyond that, there are many ways code interacts with Uris, and many of them have been improved. For example, dotnet/corefx #41772 improved Uri.EscapeDataString and Uri.EscapeUriString, which escape a string according to RFC 3986 and RFC 3987. Both methods relied on a shared helper that used unsafe code, juggled a char[], and had a lot of complexity around Unicode handling. This PR rewrote that helper to use newer features of .NET, such as spans and runes, making the escape operations both safe and fast. For some inputs the gains are modest, but for inputs involving Unicode, or even for long ASCII inputs, the gains are significant:

[Params(false, true)]
public bool ASCII { get; set; }

[GlobalSetup]
public void Setup()
{
    _input = ASCII ?
        new string('s', 20_000) :
        string.Concat(Enumerable.Repeat("\xD83D\xDE00", 10_000));
}

private string _input;

[Benchmark] public string Escape() => Uri.EscapeDataString(_input);
Method Runtime ASCII Mean Ratio Allocated
Escape .NET FW 4.8 False 6,162.59 us 1.00 60616272 B
Escape .NET Core 3.1 False 6,483.85 us 1.06 60612025 B
Escape .NET 5.0 False 243.09 us 0.04 240045 B
Escape .NET FW 4.8 True 86.93 us 1.00
Escape .NET Core 3.1 True 122.06 us 1.40
Escape .NET 5.0 True 14.04 us 0.16

The corresponding improvement was made for Uri.UnescapeDataString. The change included using the already-vectorized IndexOf rather than a manual pointer-based loop to determine the first position of a character that needs to be unescaped, then avoiding some unnecessary code and using stack allocation instead of heap allocation where feasible. While it made all operations faster, the biggest gain is for strings with nothing to unescape, where the UnescapeDataString operation has nothing to do and just returns its input (this case was further helped by another dotnet/corefx change, which enables the original string to be returned when no changes are required):
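
A sketch of that no-op fast path, as an illustrative helper rather than the actual Uri source:

static string UnescapeSketch(string value) =>
    value.AsSpan().IndexOf('%') < 0   // vectorized scan for the escape marker
        ? value                       // nothing to unescape: return the input, zero allocation
        : Uri.UnescapeDataString(value); // general path handles actual unescaping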

private string _value = string.Concat(Enumerable.Repeat("abcdefghijklmnopqrstuvwxyz", 20));

[Benchmark]
public string Unescape() => Uri.UnescapeDataString(_value);
Method Runtime Mean Ratio
Unescape .NET FW 4.8 847.44 ns 1.00
Unescape .NET Core 3.1 846.84 ns 1.00
Unescape .NET 5.0 21.84 ns 0.03

Two dotnet/runtime changes also made it faster to compare Uris and to perform related operations, such as putting them into dictionaries, especially for relative Uris.

private Uri[] _uris = Enumerable.Range(0, 1000).Select(i => new Uri($"/some/relative/path?ID={i}", UriKind.Relative)).ToArray();

[Benchmark]
public int Sum()
{
    int sum = 0;

    foreach (Uri uri in _uris)
        sum += uri.GetHashCode();
        
    return sum;
}
Method Runtime Mean Ratio
Sum .NET FW 4.8 330.25 us 1.00
Sum .NET Core 3.1 47.64 us 0.14
Sum .NET 5.0 18.87 us 0.06

Moving up the stack a bit, let's look at System.Net.Sockets. Since the inception of .NET Core, the TechEmpower benchmarks have been used as one way of gauging progress. Previously we focused primarily on the "Plaintext" benchmark, which exercises a particular set of very low-level performance characteristics, but for this release we wanted to focus on improving two other benchmarks, "JSON Serialization" and "Fortunes" (the latter involves database access, and despite its name, the costs of the former are largely about networking speed, given the very small JSON payload involved). Our efforts here were primarily on Linux. And when I say "ours", I don't just mean people on the .NET team; we worked effectively via a working group that spanned beyond the core team, with great ideas and contributions from, for example, @tmds from Red Hat and @benaadams from Illyriad Games.

On Linux, the Sockets implementation is based on epoll. To achieve the huge scale required by many services, we can't just dedicate a thread per socket, which is where we would end up if blocking I/O were used for all operations on a socket. Instead, non-blocking I/O is used, and when the operating system isn't ready to fulfill a request (for example, when ReadAsync is used on a socket but there is no data available to read, or when SendAsync is used on a socket but there is no space available in the kernel's send buffer), epoll is used to notify the socket implementation of a change in the socket's status so that the operation can be tried again. epoll is a way of using one thread to block efficiently while waiting for changes on any number of sockets, so the implementation maintains a dedicated thread waiting for changes on all sockets registered with its epoll. The implementation maintained multiple epoll threads, generally a number equal to half the number of cores in the system. With multiple sockets all multiplexed onto the same epoll and epoll thread, the implementation needs to be very careful not to run arbitrary work in response to a socket notification; doing so would happen on the epoll thread itself, and the epoll thread would then be unable to process further notifications until that work completed. Worse, if that work blocked waiting for another notification on any of the sockets associated with the same epoll, the system would deadlock. As such, the thread processing an epoll tries to do as little work as possible in response to a socket notification, extracting just enough information to queue the actual processing to the thread pool.

It turns out there was an interesting feedback loop between these epoll threads and the thread pool. The overhead of queueing work items from the epoll threads was just enough that multiple epoll threads were warranted, but multiple epoll threads caused some contention on the queues, such that every additional thread added more than its fair share of overhead. On top of that, the rate of queueing was just low enough that the thread pool would have trouble keeping all of its threads saturated in the case where only a small amount of work happens per socket operation (which is the case with the JSON serialization benchmark); this would in turn cause the thread pool to spend more time sequestering and releasing threads, making it slower, creating a feedback loop. Long story short, less-than-ideal queueing led to slower processing and more epoll threads than were truly needed. This was rectified with two PRs, dotnet/runtime #35330 and a follow-up. #35330 changed the queueing model from the epoll threads so that, rather than queueing one work item per event (when the epoll wakes up in response to a notification, there may actually be multiple notifications across all the sockets registered with it, and it provides all of those notifications in a batch), it queues one work item for the whole batch. The pool thread processing it then employs a model very much like how Parallel.For/ForEach have worked for years: the queued work item can reserve a single item for itself and then queue a replica of itself to help process the remainder. This changes the calculus such that, on most reasonably sized machines, it actually becomes beneficial to have fewer epoll threads rather than more (and, not coincidentally, we want there to be fewer), so the number of epoll threads used was changed such that there typically ends up being just one (on machines with much larger core counts, there may be more). We also made the epoll count configurable via the DOTNET_SYSTEM_NET_SOCKETS_THREAD_COUNT environment variable, which can be set to the desired count to override the system's defaults, if a developer wants to experiment with other counts for a given workload and provide feedback on the results.

As an experiment, in a dotnet/runtime PR from @tmds, we also added an experimental mode (triggered by setting the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable to 1 on Linux) in which we avoid queueing work to the thread pool at all, and instead just run all socket continuations (for example, the Work() in await socket.ReadAsync(); Work();) on the epoll threads. Here be dragons! If a socket continuation stalls, no other work associated with that epoll thread will be processed. Worse, if that continuation actually blocks waiting for other work associated with the same epoll, the system will deadlock. However, a well-crafted program may achieve better performance in this mode, as the locality of processing can be better and the overhead of queueing to the thread pool can be avoided. Since all socket work then runs on the epoll threads, it no longer makes sense to default to one; the thread count instead defaults to the number of processors. Again, this is an experiment, and we welcome feedback on positive or negative results.

Those improvements all focus on socket performance on Linux at scale, which makes them difficult to demonstrate in a microbenchmark on a single machine. There are other improvements, however, that are easier to see. dotnet/runtime removed several allocations from Socket.Connect, Socket.Bind, and a few other operations, where unnecessary copies were being made of some state in support of old Code Access Security (CAS) checks that are no longer relevant: the CAS checks were removed long ago, but the clones remained, so this just cleans them up. dotnet/runtime also removed an allocation from the Windows implementation of SafeSocketHandle. dotnet/runtime refactored Socket.ConnectAsync so that it can share the same internal SocketAsyncEventArgs instance that is subsequently used to perform ReceiveAsync operations, avoiding extra allocations for the connect. dotnet/runtime #34175 uses the new Pinned Object Heap introduced in .NET 5 to use pre-pinned buffers in parts of the SocketAsyncEventArgs implementation on Windows, instead of a GCHandle (on Linux, pinning isn't required and so isn't used). And in a dotnet/runtime PR, @tmds reduced allocations in the epoll-based SendAsync/ReceiveAsync implementation by using stack allocation where appropriate.

private Socket _listener, _client, _server;
private byte[] _buffer = new byte[8];
private List<ArraySegment<byte>> _buffers = new List<ArraySegment<byte>>();

[GlobalSetup]
public void Setup()
{
    _listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    _listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    _listener.Listen(1);

    _client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    _client.ConnectAsync(_listener.LocalEndPoint).GetAwaiter().GetResult();

    _server = _listener.AcceptAsync().GetAwaiter().GetResult();

    for (int i = 0; i < _buffer.Length; i++)
        _buffers.Add(new ArraySegment<byte>(_buffer, i, 1));
}

[Benchmark]
public async Task SendReceive()
{
    await _client.SendAsync(_buffers, SocketFlags.None);
    int total = 0;
    while (total < _buffer.Length)
        total += await _server.ReceiveAsync(_buffers, SocketFlags.None);
}
Method Runtime Mean Ratio Allocated
SendReceive .NET Core 3.1 5.924 us 1.00 624 B
SendReceive .NET 5.0 5.230 us 0.88 144 B

Moving up the stack again, we come to System.Net.Http. SocketsHttpHandler saw a bunch of improvements, in two areas in particular. The first is the processing of headers, which represents a significant portion of the allocations and processing associated with the type. dotnet/corefx kicked things off by making HttpHeaders.TryAddWithoutValidation live up to its name: due to how SocketsHttpHandler enumerated request headers to write them to the wire, it ended up validating the headers even though the developer specified "WithoutValidation", and the PR fixed that. Multiple PRs, including dotnet/runtime #35003 and several others, improved lookups in SocketsHttpHandler's list of known headers (which helps avoid allocations when such headers are present) and augmented that list to be more comprehensive. dotnet/runtime updated the internal collection type used by the strongly-typed header collections to incur less allocation, and another dotnet/runtime change made some of the allocations associated with headers pay-for-play only when they are actually accessed (and also special-cased the Date and Server response headers to avoid allocating for them in the most common cases). The net result is a small improvement in throughput, but a significant improvement in allocation:

private static readonly Socket s_listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
private static readonly HttpClient s_client = new HttpClient();
private static Uri s_uri;

[Benchmark]
public async Task HttpGet()
{
    var m = new HttpRequestMessage(HttpMethod.Get, s_uri);
    m.Headers.TryAddWithoutValidation("Authorization", "ANYTHING SOMEKEY");
    m.Headers.TryAddWithoutValidation("Referer", "http://someuri.com");
    m.Headers.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36");
    m.Headers.TryAddWithoutValidation("Host", "www.somehost.com");
    using (HttpResponseMessage r = await s_client.SendAsync(m, HttpCompletionOption.ResponseHeadersRead))
    using (Stream s = await r.Content.ReadAsStreamAsync())
        await s.CopyToAsync(Stream.Null);
}

[GlobalSetup]
public void CreateSocketServer()
{
    s_listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    s_listener.Listen(int.MaxValue);
    var ep = (IPEndPoint)s_listener.LocalEndPoint;
    s_uri = new Uri($"http://{ep.Address}:{ep.Port}/");
    byte[] response = Encoding.UTF8.GetBytes("HTTP/1.1 200 OK\r\nDate: Sun, 05 Jul 2020 12:00:00 GMT \r\nServer: Example\r\nContent-Length: 5\r\n\r\nHello");
    byte[] endSequence = new byte[] { (byte)'\r', (byte)'\n', (byte)'\r', (byte)'\n' };

    Task.Run(async () =>
    {
        while (true)
        {
            Socket s = await s_listener.AcceptAsync();
            _ = Task.Run(() =>
            {
                using (var ns = new NetworkStream(s, true))
                {
                    byte[] buffer = new byte[1024];
                    int totalRead = 0;
                    while (true)
                    {
                        int read =  ns.Read(buffer, totalRead, buffer.Length - totalRead);
                        if (read == 0) return;
                        totalRead += read;
                        if (buffer.AsSpan(0, totalRead).IndexOf(endSequence) == -1)
                        {
                            if (totalRead == buffer.Length) Array.Resize(ref buffer, buffer.Length * 2);
                            continue;
                        }

                        ns.Write(response, 0, response.Length);

                        totalRead = 0;
                    }
                }
            });
        }
    });
}
Method Runtime Mean Ratio Allocated
HttpGet .NET FW 4.8 123.67 us 1.00 98.48 KB
HttpGet .NET Core 3.1 68.57 us 0.55 6.07 KB
HttpGet .NET 5.0 66.80 us 0.54 2.86 KB

Other header-related PRs were more specialized. For example, dotnet/runtime improved the parsing of the Date header just by being more thoughtful about the approach. The previous implementation used DateTime.TryParseExact with a long list of viable formats, which knocks the implementation off its fast path and makes it much slower to parse even inputs that match the first format in the list. The vast majority of Date headers today follow the format outlined in RFC 1123, aka "r". Thanks to improvements in previous releases, DateTime's parsing of the "r" format is very fast, so we can just try that single format directly with TryParseExact first, and only fall back to the TryParseExact with the full list if it fails.
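
A sketch of the approach (illustrative names; the real fallback list is longer, and RFC 850 below is just one example of a rarer legacy format):

using System;
using System.Globalization;

static class DateHeaderParser
{
    private static readonly string[] s_fallbackFormats =
    {
        "dddd, dd-MMM-yy HH:mm:ss 'GMT'", // RFC 850, e.g. "Sunday, 06-Nov-94 08:49:37 GMT"
    };

    public static bool TryParse(string input, out DateTime result) =>
        // The overwhelmingly common case: RFC 1123, aka the "r" format,
        // which has a dedicated fast path in DateTime parsing.
        DateTime.TryParseExact(input, "r", CultureInfo.InvariantCulture,
            DateTimeStyles.None, out result) ||
        // Only pay for the slow multi-format path when the fast one fails.
        DateTime.TryParseExact(input, s_fallbackFormats, CultureInfo.InvariantCulture,
            DateTimeStyles.AllowWhiteSpaces, out result);
}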

[Benchmark]
public DateTimeOffset? DatePreferred()
{
    var m = new HttpResponseMessage();
    m.Headers.TryAddWithoutValidation("Date", "Sun, 06 Nov 1994 08:49:37 GMT");
    return m.Headers.Date;
}
Method Runtime Mean Ratio Allocated
DatePreferred .NET FW 4.8 2,177.9 ns 1.00 674 B
DatePreferred .NET Core 3.1 1,510.8 ns 0.69 544 B
DatePreferred .NET 5.0 267.2 ns 0.12 520 B

The biggest improvements, though, come in HTTP/2 in general. In .NET Core 3.1, the HTTP/2 implementation was functional but not particularly tuned, so some effort in .NET 5 went into making the HTTP/2 implementation better, and in particular more scalable. Two dotnet/runtime changes significantly reduced the allocations involved in HTTP/2 GET requests by using a custom CopyToAsync override on the response stream used for HTTP/2 responses, by being more careful about how request headers are accessed as part of writing out the request (so as to avoid forcing lazily initialized state into existence when it's not necessary), and by removing async-related allocations. Another dotnet/runtime change reduced allocations in HTTP/2 by handling cancellation better and reducing the allocations associated with asynchronous operations. On top of those, dotnet/runtime included a bunch of HTTP/2-related changes, including reducing the number of locks involved (HTTP/2 involves more synchronization in the C# implementation than HTTP/1.1, because in HTTP/2 multiple requests are multiplexed onto the same socket connection), reducing the amount of work done while holding locks, in one key case changing the kind of locking mechanism used, adding more headers to the known-headers optimization, and a few other tweaks to reduce overhead. As a follow-up, dotnet/runtime removed some allocations due to cancellation and trailing headers, which are common in gRPC traffic. To demonstrate this, I created a simple ASP.NET Core localhost server (using the Empty template and removing a small amount of code not needed for this example):

using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Hosting;

public class Program
{
    public static void Main(string[] args) =>
        Host.CreateDefaultBuilder(args).ConfigureWebHostDefaults(b => b.UseStartup<Startup>()).Build().Run();
}

public class Startup
{
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            endpoints.MapGet("/", context => context.Response.WriteAsync("Hello"));
            endpoints.MapPost("/", context => context.Response.WriteAsync("Hello"));
        });
    }
}

Then I use this client benchmark:

private HttpMessageInvoker _client = new HttpMessageInvoker(new SocketsHttpHandler() { UseCookies = false, UseProxy = false, AllowAutoRedirect = false });
private HttpRequestMessage _get = new HttpRequestMessage(HttpMethod.Get, new Uri("https://localhost:5001/")) { Version = HttpVersion.Version20 };
private HttpRequestMessage _post = new HttpRequestMessage(HttpMethod.Post, new Uri("https://localhost:5001/")) { Version = HttpVersion.Version20, Content = new ByteArrayContent(Encoding.UTF8.GetBytes("Hello")) };

[Benchmark] public Task Get() => MakeRequest(_get);

[Benchmark] public Task Post() => MakeRequest(_post);

private Task MakeRequest(HttpRequestMessage request) => Task.WhenAll(Enumerable.Range(0, 100).Select(async _ =>
{
    for (int i = 0; i < 500; i++)
    {
        using (HttpResponseMessage r = await _client.SendAsync(request, default))
        using (Stream s = await r.Content.ReadAsStreamAsync())
            await s.CopyToAsync(Stream.Null);
    }
}));
Method Runtime Mean Ratio Allocated
Get .NET Core 3.1 1,267.4 ms 1.00 122.76 MB
Get .NET 5.0 681.7 ms 0.54 74.01 MB
Post .NET Core 3.1 1,464.7 ms 1.00 280.51 MB
Post .NET 5.0 735.6 ms 0.50 132.52 MB

Note, too, that there is still more to be done here for .NET 5. dotnet/runtime changes how writes are handled in the HTTP/2 implementation, and it is expected to bring substantial scalability gains on top of the improvements that have already gone in, in particular for gRPC-based workloads.
Other networking components have also seen significant improvement. For example, the XxAsync APIs on the Dns type were implemented on top of the corresponding Begin/EndXx methods. For .NET 5, in dotnet/corefx, that was inverted, so that the Begin/EndXx methods are implemented on top of the XxAsync ones; that makes the code simpler and a bit faster, while also having a nice impact on allocation (note that the .NET Framework 4.8 result is slightly faster because it isn't actually using asynchronous I/O, but rather just a work item queued to the thread pool that performs synchronous I/O; that results in a bit less overhead, but also less scalability):

private string _hostname = Dns.GetHostName();

[Benchmark] public Task Lookup() => Dns.GetHostAddressesAsync(_hostname);
Method Runtime Mean Ratio Allocated
Lookup .NET FW 4.8 178.6 us 1.00 4146 B
Lookup .NET Core 3.1 211.5 us 1.18 1664 B
Lookup .NET 5.0 209.7 us 1.17 984 B

The rarely used NegotiateStream (though it is used by WCF) was also updated: one dotnet/runtime change re-implemented all of its XxAsync methods with async/await, and another then reused buffers rather than creating new ones for each operation. The net result is a significant reduction in allocation in typical read/write usage:

private byte[] _buffer = new byte[1];
private NegotiateStream _client, _server;

[GlobalSetup]
public void Setup()
{
    using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    listener.Listen(1);

    var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    client.ConnectAsync(listener.LocalEndPoint).GetAwaiter().GetResult();

    Socket server = listener.AcceptAsync().GetAwaiter().GetResult();

    _client = new NegotiateStream(new NetworkStream(client, true));
    _server = new NegotiateStream(new NetworkStream(server, true));

    Task.WaitAll(
        _client.AuthenticateAsClientAsync(),
        _server.AuthenticateAsServerAsync());
}

[Benchmark]
public async Task WriteRead()
{
    for (int i = 0; i < 100; i++)
    {
        await _client.WriteAsync(_buffer);
        await _server.ReadAsync(_buffer);
    }
}

[Benchmark]
public async Task ReadWrite()
{
    for (int i = 0; i < 100; i++)
    {
        var r = _server.ReadAsync(_buffer);
        await _client.WriteAsync(_buffer);
        await r;
    }
}
Method Runtime Mean Ratio Allocated
WriteRead .NET Core 3.1 1.510 ms 1.00 61600 B
WriteRead .NET 5.0 1.294 ms 0.86
ReadWrite .NET Core 3.1 3.502 ms 1.00 76224 B
ReadWrite .NET 5.0 3.301 ms 0.94 226 B

JSON

There are significant improvements to the JSON library in .NET 5, in particular for JsonSerializer, but many of those improvements were actually ported back to .NET Core 3.1 and released as part of servicing fixes (see dotnet/corefx). Even so, there are some nice improvements that show up in .NET 5 beyond those.
A dotnet/runtime change refactored the model for how converters in JsonSerializer handle collections, resulting in measurable improvements, in particular for larger collections:

private MemoryStream _stream = new MemoryStream();
private DateTime[] _array = Enumerable.Range(0, 1000).Select(_ => DateTime.UtcNow).ToArray();

[Benchmark]
public Task LargeArray()
{
    _stream.Position = 0;
    return JsonSerializer.SerializeAsync(_stream, _array);
}
Method Runtime Mean Ratio Allocated
LargeArray .NET FW 4.8 262.06 us 1.00 24256 B
LargeArray .NET Core 3.1 191.34 us 0.73 24184 B
LargeArray .NET 5.0 69.40 us 0.26 152 B

but even for smaller ones, for example:

private MemoryStream _stream = new MemoryStream();
private JsonSerializerOptions _options = new JsonSerializerOptions();
private Dictionary<string, int> _instance = new Dictionary<string, int>()
{
    { "One", 1 }, { "Two", 2 }, { "Three", 3 }, { "Four", 4 }, { "Five", 5 },
    { "Six", 6 }, { "Seven", 7 }, { "Eight", 8 }, { "Nine", 9 }, { "Ten", 10 },
};

[Benchmark]
public async Task Dictionary()
{
    _stream.Position = 0;
    await JsonSerializer.SerializeAsync(_stream, _instance, _options);
}
Method Runtime Mean Ratio Allocated
Dictionary .NET FW 4.8 2,141.7 ns 1.00 209 B
Dictionary .NET Core 3.1 1,376.6 ns 0.64 208 B
Dictionary .NET 5.0 726.1 ns 0.34 152 B

A dotnet/runtime change also helped improve the performance of small types by adding a layer of caching to help retrieve the metadata used internally for the types being serialized and deserialized.

private MemoryStream _stream = new MemoryStream();
private MyAwesomeType _instance = new MyAwesomeType() { SomeString = "Hello", SomeInt = 42, SomeByte = 1, SomeDouble = 1.234 };

[Benchmark]
public Task SimpleType()
{
    _stream.Position = 0;
    return JsonSerializer.SerializeAsync(_stream, _instance);
}

public struct MyAwesomeType
{
    public string SomeString { get; set; }
    public int SomeInt { get; set; }
    public double SomeDouble { get; set; }
    public byte SomeByte { get; set; }
}
Method Runtime Mean Ratio Allocated
SimpleType .NET FW 4.8 1,204.3 ns 1.00 265 B
SimpleType .NET Core 3.1 617.2 ns 0.51 192 B
SimpleType .NET 5.0 504.2 ns 0.42 192 B

Trimming

Up until .NET Core 3.0, .NET Core focused primarily on server workloads, with ASP.NET Core the preeminent application model on the stack. With .NET Core 3.0, Windows Forms and Windows Presentation Foundation (WPF) were added, bringing .NET Core to desktop applications. With .NET Core 3.2, Blazor released support for browser applications, but based on mono and the libraries from the mono stack. In .NET 5, Blazor uses the .NET 5 mono runtime and the same .NET 5 libraries shared by every other application model. This brings an important twist to performance: size. While code size has always been an important issue (and it matters for .NET Native applications), the scale required for a successful browser-based deployment really brings it to the forefront, as we need to be concerned about download size in a way we haven't focused on with .NET Core in the past.
To help with application size, the .NET SDK includes a linker that can trim away unused portions of an application, not only at the assembly level but also at the member level, doing static analysis to determine what code is and isn't used and discarding the parts that aren't. This brings an interesting set of challenges: some coding patterns employed for convenience or simplified API consumption are difficult for the linker to analyze in a way that would allow it to throw away much of anything. As a result, one of the big performance-related efforts in .NET 5 is improving the trimmability of the libraries.

There are two facets to this:

  • Don't remove too much (correctness). We need to make sure the libraries can actually be trimmed safely. In particular, reflection (even reflection over only public surface area) makes it difficult for the linker to find all members that may actually be used: for example, code in one place in an application might use typeof to get a Type instance and pass it to another part of the application, which uses GetMethod to retrieve a MethodInfo for a public method on that type, and passes that MethodInfo to yet another part of the application to invoke it. To address that, the linker employs heuristics to minimize false positives on APIs that can be removed, but to help it further, a bunch of attributes have been added in .NET 5 that let developers make such implicit dependencies explicit, suppress warnings from the linker on things it might deem unsafe but actually aren't, and force warnings onto consumers saying that certain portions of the surface area aren't amenable to trimming (an annotation sketch follows this list). See dotnet/runtime.
  • Remove as much as possible (performance). We need to minimize the reasons pieces of code need to be kept around. This can manifest as refactoring implementations to change calling patterns, as using conditions the linker can recognize to trim away whole swaths of code, and as employing finer-grained controls over exactly what needs to be kept and why.
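
As a small example of the first facet, here is a sketch of making an implicit reflection dependency explicit for the trimmer, using the DynamicallyAccessedMembers attribute added in .NET 5 (ReflectionHelper and InvokeParameterless are illustrative names):

using System;
using System.Diagnostics.CodeAnalysis;

static class ReflectionHelper
{
    // The attribute tells the trimmer that public methods of whatever Type is
    // passed here must be preserved, rather than relying on heuristics.
    public static object InvokeParameterless(
        [DynamicallyAccessedMembers(DynamicallyAccessedMemberTypes.PublicMethods)] Type type,
        string methodName)
    {
        // Assumes a public static, parameterless method, for simplicity.
        return type.GetMethod(methodName)?.Invoke(null, null);
    }
}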

There are many examples of the second facet, so I'll highlight a few to show the variety of techniques employed:

  • Remove unnecessary code, such as dotnet / corefx ꃙ. Here, we find a number of obsolete tracesource / switch usages that are only used to enable some debugging only trace and assertion, but no one has actually used them, which causes the linker to see some of these types, even in the release version.
  • Remove obsolete code that was useful but no longer useful, such as dotnet / coreclr. This type used to be important for improving NGEN (the predecessor of crossgen), but it’s no longer needed. Or, as in dotnet / coreclr, some code is actually no longer used, but still causes the type to remain.
  • Remove duplicate code, such as dotnet / corefx ා 41165, dotnet / corefx, and dotnet / coreclr ฅ. Some libraries use their own hash code to help private copies of routines, resulting in each library having its own IL copy to implement this function. They can be updated to use shared hashcode types, which not only helps with IL sizing and sizing, but also helps to avoid the extra code that needs to be maintained, and better modernizes the code base to take advantage of the features we recommend others to use as well.
  • Use a different API, such as dotnet / corefx? 41143. The code uses the extension helper method, leading to the introduction of additional types, but the help provided actually saves little code. A possible better example is dotnet / corefx, which starts from System.Xml In the implementation, the use of non generic queue and stack types is removed, and only the general implementation is used (dotnet / coreclr did similar things with WeakReference). Or dotnet / corefx? 41111, which changes some of the code in the XML library to use httpclient instead of webrequest, which allows deletion of the entire System.Net 。 Dependent requests. Or avoid it System.Net Dotnet / corefx ා 41110. HTTP needs to use System.Text 。 Regular expressions: This is unnecessary complexity and can be replaced with a small amount of code specific to the use case. Another example is dotnet / coreclr, some of which are used unnecessarily string.ToLower (), replacing its use is not only more efficient, but also helps to reduce overloading by default. Dotnet / coreclr is similar.
  • Reroute logic to avoid rooting large swaths of unneeded code, such as dotnet/corefx#41075. If code uses just the new Regex(string) constructor, that constructor previously delegated internally to the longer Regex(string, RegexOptions) constructor, which needs to be able to use the internal Regex compiler in case RegexOptions.Compiled is used. By tweaking the code paths so that the Regex(string) constructor is independent of the Regex(string, RegexOptions) constructor, it becomes easy for the linker to remove the entire Regex compiler code path (and its dependency on reflection) when only Regex(string) is used. This is then made more effective by ensuring call sites use the shortest overload that suffices. It’s a fairly common pattern for avoiding this kind of unnecessary rooting. Consider Environment.GetEnvironmentVariable(string). It used to call into the Environment.GetEnvironmentVariable(string, EnvironmentVariableTarget) overload, passing the default EnvironmentVariableTarget.Process. Instead, the dependency was inverted: the Environment.GetEnvironmentVariable(string) overload contains only the logic for handling the Process case, and the longer overload contains if (target == EnvironmentVariableTarget.Process) return GetEnvironmentVariable(name); (see the sketch after this list). That way, the most common use of the simple overload doesn’t pull in all of the code paths needed to handle the other, much less common, targets. Another example is dotnet/corefx#0944: it enables more of Console to be linked away for applications that only write to the console and never read from it.
  • Use lazy initialization, especially for static fields, such as dotnet/runtime. If a type is used and any of its static methods are called, its static constructor needs to be kept, along with any fields the static constructor initializes. If such fields are instead lazily initialized on first use, the fields only need to be kept if the code that performs the lazy initialization is itself reachable.
  • Use feature switches, such as dotnet/runtime (further benefited by dotnet/runtime). In many cases, an application may not need a whole feature set, such as logging or debugging support, but from the linker’s perspective, it sees the code being used and is forced to keep it. However, the linker can be told what replacement value it should use for a known property; for example, you can tell the linker that when it sees a get-only static bool property on some class, it should replace it with the constant false, which in turn enables it to remove any code guarded by that property.
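
To make the overload-inversion pattern concrete, here is a minimal C# sketch modeled on the Environment.GetEnvironmentVariable example above. The shape follows what the text describes; the Core helper names and bodies are illustrative placeholders of mine, not the actual runtime implementation.

using System;

public static class EnvironmentSketch
{
    // The simple overload contains only the logic for the common Process case,
    // so an app that calls just this overload never roots the code for other targets.
    public static string GetEnvironmentVariable(string name) =>
        GetProcessVariableCore(name);

    // The longer overload delegates to the simple one for the common case, instead
    // of the simple one delegating to it (the dependency is inverted).
    public static string GetEnvironmentVariable(string name, EnvironmentVariableTarget target) =>
        target == EnvironmentVariableTarget.Process ?
            GetEnvironmentVariable(name) :
            GetNonProcessVariableCore(name, target); // only kept by the linker if this overload is used

    // Illustrative placeholders standing in for the real lookups.
    private static string GetProcessVariableCore(string name) =>
        Environment.GetEnvironmentVariable(name);
    private static string GetNonProcessVariableCore(string name, EnvironmentVariableTarget target) =>
        Environment.GetEnvironmentVariable(name, target);
}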

Peanut Butter

In the .NET Core 3.0 performance post, I talked about “peanut butter”: lots of small improvements that individually don’t necessarily make a huge difference, but that address costs smeared across the whole codebase; fixing whole groups of them can make a measurable difference. As with previous releases, there are many such welcome improvements in .NET 5. Here are a few:

  • Faster assembly loading. For historical reasons, .NET Core had a lot of tiny implementation assemblies, with the original reasons for the splits no longer meaningful. However, every additional assembly that needs to be loaded adds overhead. dotnet/runtime#2189 and dotnet/runtime#31991 merged a bunch of small assemblies together to reduce the number that need to be loaded.
  • Faster math. Improved NaN checking makes the code for double.IsNaN and float.IsNaN both smaller and faster. dotnet/runtime from @john-h-k is a nice example of using SSE and AMD64 intrinsics to measurably speed up Math.CopySign and MathF.CopySign. And dotnet/runtime from @Marusyk improved hash code generation for Matrix3x2 and Matrix4x4.
  • Faster crypto. dotnet/runtime from @vcsjones used the optimized BinaryPrimitives in various places across System.Security, replacing equivalent open-coded implementations. And dotnet/corefx from @VladimirKhvostov optimized the out-of-favor-but-still-in-use CryptoConfig.CreateFromName method to be upwards of 10x faster.
  • Faster interop. dotnet/runtime reduced entry-point probing (where the runtime tries to find the exact native function to use for a P/Invoke) by avoiding the Windows-specific “exact spelling” checks on Linux and by setting ExactSpelling to true on Windows. dotnet/runtime from @NextTurn used sizeof(T) instead of Marshal.SizeOf(Type) / Marshal.SizeOf<T>() in a bunch of places, as the former has much less overhead than the latter. And dotnet/runtime, dotnet/runtime#35098, and dotnet/runtime#39059 reduced interop and marshaling costs in several libraries by using more blittable types, using spans and ref locals, using sizeof, and so on.
  • Faster reflection emit. Reflection emit enables developers to write out IL at run time, and if you can emit the same instructions in a way that takes up less space, you can save on the managed allocations needed to store the sequence. Various IL opcodes have shorter variants for more common cases: for example, Ldc_I4 can load any int value as a constant, but Ldc_I4_S is shorter and can load any sbyte, while Ldc_I4_1 is shorter still and loads the value 1. Some libraries take advantage of this, maintaining their own mapping tables as part of their emit code so as to use the shortest relevant opcode; others don’t. dotnet/runtime simply moved such a mapping into ILGenerator itself, enabling us to delete all of the custom implementations in the dotnet/runtime libraries and automatically gain the benefits of the mapping in all of these and other libraries (a sketch of such a mapping follows this list).
  • Faster I/O. dotnet/runtime from @bbartels improved BinaryWriter.Write(string), giving it a fast path for various common inputs. And dotnet/runtime improved how relationships are managed internally in System.IO.Packaging, using O(1) instead of O(n) lookups.
  • Lots of small allocations removed all over the place. For example, dotnet/runtime#35005 removed a MemoryStream allocation in ByteArrayContent, dotnet/runtime removed a List<T> and underlying T[] allocation in System.Reflection, a char[] allocation was removed in XmlConverter, a char[] allocation was removed in HttpUtility, several char[] allocations were removed in ModuleBuilder, dotnet/runtime#32301 removed some char[] allocations from string.Split usage, dotnet/runtime#32422 removed a char[] allocation in AsnFormatter, dotnet/runtime removed several string allocations in System.IO.FileSystem, dotnet/corefx#41363 removed a char[] allocation in JsonCamelCaseNamingPolicy, dotnet/coreclr removed a string allocation from MethodBase.ToString(), dotnet/corefx#41274 removed some unnecessary string allocations from CertificatePal.AppendPrivateKeyInfo, dotnet/runtime#1155 from @Wraith2 removed temporary arrays from SqlDecimal, dotnet/coreclr removed boxing that occurred when calling methods like GetHashCode on some tuples, dotnet/coreclr removed several allocations from reflecting over custom attributes, dotnet/coreclr#27013 removed some string allocations by replacing some inputs to concatenation with constants, and dotnet/runtime removed some temporary char[] allocations from string.Normalize.
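
As a concrete illustration of the shortest-opcode mapping described in the faster-reflection item above, here is a sketch of the kind of helper libraries used to hand-roll (and that now effectively lives inside ILGenerator itself). The helper and its name are mine; the opcode choices follow the standard IL encodings.

using System.Reflection.Emit;

static class ILHelpers
{
    // Emit the shortest opcode sequence that loads the given Int32 constant.
    public static void EmitInt32Constant(ILGenerator il, int value)
    {
        switch (value)
        {
            case -1:
                il.Emit(OpCodes.Ldc_I4_M1); // single-byte opcode
                break;
            case >= 0 and <= 8:
                // Ldc_I4_0 through Ldc_I4_8 are all single-byte opcodes.
                il.Emit(value switch
                {
                    0 => OpCodes.Ldc_I4_0, 1 => OpCodes.Ldc_I4_1, 2 => OpCodes.Ldc_I4_2,
                    3 => OpCodes.Ldc_I4_3, 4 => OpCodes.Ldc_I4_4, 5 => OpCodes.Ldc_I4_5,
                    6 => OpCodes.Ldc_I4_6, 7 => OpCodes.Ldc_I4_7, _ => OpCodes.Ldc_I4_8,
                });
                break;
            case >= sbyte.MinValue and <= sbyte.MaxValue:
                il.Emit(OpCodes.Ldc_I4_S, (sbyte)value); // opcode plus a 1-byte operand
                break;
            default:
                il.Emit(OpCodes.Ldc_I4, value); // opcode plus a 4-byte operand
                break;
        }
    }
}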

New Performance-focused APIs

This article has highlighted a large number of existing APIs that simply get better when running on .NET 5. In addition, there are many new APIs in .NET 5, some of which are focused on helping developers write faster code, and others more about enabling developers to do the same things with less code, or about enabling capabilities that were previously difficult. Here are some highlights, including, in some cases, where the APIs are already being used internally by the rest of the libraries to lower the costs of existing APIs:

  • Decimal(ReadOnlySpan<int>) / Decimal.TryGetBits / Decimal.GetBits (dotnet/runtime#32155): previous releases added lots of span-based methods for efficiently interacting with primitives, and decimal did get span-based TryFormat and TryParse methods, but these new methods in .NET 5 enable efficiently constructing a decimal from a span as well as extracting the bits from a decimal into a span. You can see this support already being used in SqlDecimal, BigInteger, System.Linq, and System.Reflection.Metadata.
  • MemoryExtensions.Sort (dotnet/coreclr#27700). I talked about this one earlier: the new Sort<T> and Sort<TKey, TValue> extension methods enable sorting arbitrary spans of data. These new public methods are already used by Array itself (dotnet/coreclr) and by System.Linq (dotnet/runtime#1888).
  • GC.AllocateArray<T> and GC.AllocateUninitializedArray<T> (dotnet/runtime#33526). These new APIs are just like using new T[length], except with two specializable behaviors: the Uninitialized variant allows the GC to hand back arrays without forcefully clearing them (unless they contain references, in which case it must at least clear those), and the bool pinned parameter allows arrays to be returned from the new Pinned Object Heap (POH), guaranteeing they will never be moved in memory so that they can be passed to external code without needing to be pinned (i.e. without using fixed or GCHandle). StringBuilder uses the Uninitialized support to reduce the cost of expanding its internal storage (dotnet/coreclr), as does the new TranscodingStream (dotnet/runtime), and even the new support for importing X.509 certificates and collections from Privacy-Enhanced Mail (PEM) files (dotnet/runtime). You can also see the pinned support being put to good use in the Windows implementation of SocketAsyncEventArgs (dotnet/runtime), where pinned buffers need to be allocated for operations like ReceiveMessageFrom. (Both this API and StringSplitOptions.TrimEntries in the next item appear in a short usage sketch after the benchmark table at the end of this list.)
  • StringSplitOptions.TrimEntries (dotnet/runtime). string.Split has overloads that accept a StringSplitOptions enum, which enables Split to optionally remove empty entries from the resulting array. The new TrimEntries enum value works with or without that option to first trim each entry. Regardless of whether RemoveEmptyEntries is used, this enables Split to avoid allocating strings for entries that would become empty once trimmed (or to allocate smaller strings), and with RemoveEmptyEntries it also makes the resulting array smaller in such cases. Additionally, it’s common for consumers of Split to subsequently call Trim() on each string, so doing the trimming as part of the Split call eliminates those extra string allocations for the caller. This is used in a handful of types and methods in dotnet/runtime, such as by DataTable, HttpListener, and SocketsHttpHandler.
  • BinaryPrimitives.{Try}{Read/Write}{Double/Single}{Big/Little}Endian methods (dotnet/runtime#6864). You can see these APIs being used, for example, in the new Concise Binary Object Representation (CBOR) support added in .NET 5 (dotnet/runtime).
  • MailAddress.TryCreate (dotnet/runtime#1052 from @MarcoRossignoli) and PhysicalAddress.{Try}Parse (dotnet/runtime). The new Try overloads enable parsing without exceptions, and the span-based overloads enable parsing an address from within a larger context without incurring allocations for substrings.
  • SocketAsyncEventArgs(bool unsafeSuppressExecutionContextFlow) (dotnet/runtime#706 from @MarcoRossignoli). By default, ExecutionContext “flows” across asynchronous points, meaning the continuing code runs with the ambient context that was current when the operation began; this is how AsyncLocal<T> values propagate through asynchronous operations. Such flowing is generally cheap, but it does still carry a small overhead. Since socket operations can be performance-critical, this new constructor on SocketAsyncEventArgs can be used when the developer knows the context won’t be needed in the callbacks raised by the instance. You can see this used, for example, in SocketsHttpHandler’s internal connect helper (dotnet/runtime#1381).
  • Unsafe.SkipInit (dotnet/corefx#41995). The C# compiler’s definite-assignment rules require that parameters and locals be assigned in a variety of situations. In very specific cases, that can require an extra assignment that isn’t actually needed, which can be undesirable when every write and instruction counts in performance-sensitive code. This method effectively enables code to pretend it wrote to a parameter or local without actually having done so. It’s used in various operations on Decimal (dotnet/runtime), in some of the new APIs on IntPtr and UIntPtr (dotnet/runtime from @john-h-k), in Matrix4x4 (dotnet/runtime from @eanova), in Utf8Parser (dotnet/runtime), and in UTF8Encoding (dotnet/runtime).
  • SuppressGCTransitionAttribute (dotnet/coreclr#26458). This is an advanced attribute for use with P/Invokes that enables the runtime to suppress the cooperative-to-preemptive mode transition it would otherwise incur, just as it does when making internal “FCalls” into the runtime itself. The attribute needs to be used with great care (see the detailed comments in the attribute’s description). Even so, you can see it used by a few methods in corelib (dotnet/runtime), and there are pending changes for the JIT that will make it even better (dotnet/runtime).
  • CollectionsMarshal.AsSpan (dotnet/coreclr#26867). This method gives callers span-based access to the backing store of a List<T>.
  • MemoryMarshal.GetArrayDataReference (dotnet/runtime#1036). This method returns a reference to the first element of an array (or to where it would be if the array weren’t empty). No validation is performed, so it’s both dangerous and very fast. It’s used in a bunch of places in corelib, all for very low-level optimizations. For example, it’s used as part of the cast helpers implemented in C# (dotnet/runtime) discussed earlier, and as part of using Buffer.Memmove in various places (dotnet/runtime).
  • SslStreamCertificateContext (dotnet/runtime#38364). When SslStream.AuthenticateAsServer{Async} is provided with the certificate to use, it tries to build the complete X.509 chain, an operation that can have varying amounts of associated cost and that can even perform I/O if additional certificate information needs to be downloaded. In some scenarios, that could happen for the same certificate used to create any number of SslStream instances, resulting in duplicated expense. SslStreamCertificateContext serves as a cache for the results of such a computation: the work can be performed once in advance, and the context then passed to SslStream for any amount of reuse. This helps avoid the duplicated effort, and it also gives callers more predictability and control over any failures.
  • HttpClient.Send (dotnet/runtime). It may be strange to some readers to see a synchronous API called out here. While HttpClient was designed for asynchronous usage, we have found situations where developers are unable to take advantage of asynchrony, such as when implementing an interface method that is synchronous-only, or when being invoked from a native operation that requires a synchronous response, and yet the need to download data is ubiquitous. In these cases, forcing the developer to do “sync over async” (performing the asynchronous operation and then blocking waiting for it to complete) performs and scales worse than using a synchronous operation in the first place. As such, .NET 5 sees limited new synchronous surface area added to HttpClient and its supporting types. dotnet/runtime itself uses this in a few places. For example, on Linux, when the X509Certificates support needs to download a certificate as part of chain building, it is generally on a code path that needs to be synchronous all the way back to an OpenSSL callback; previously this would use HttpClient.GetByteArrayAsync and then block waiting for it to complete, but that was shown to cause noticeable scalability problems for some users, and dotnet/runtime changed it to use the new synchronous API instead. Similarly, the older HttpWebRequest type is built on top of HttpClient, and in previous releases of .NET Core, its synchronous GetResponse() method was actually doing sync-over-async; as of dotnet/runtime, it now uses the synchronous HttpClient.Send method.
  • HttpContent.ReadAsStream (dotnet/runtime#37494). This is logically part of the HttpClient.Send effort mentioned above, but I’m calling it out separately because it’s useful on its own. The existing ReadAsStreamAsync method is a bit odd: it was originally exposed as asynchronous just in case a custom HttpContent-derived type required it, but it’s rare to find any overrides of HttpContent.ReadAsStreamAsync that aren’t synchronous, and the implementations returned from requests made on HttpClient are all synchronous. As a result, callers end up paying for the Task<Stream> wrapper object around the returned stream, even though that stream is always immediately available. The new ReadAsStream method thus avoids that extra Task<Stream> allocation in such cases. You can see it being used in that manner in various places in dotnet/runtime, such as in the ClientWebSocket implementation.
  • Non-generic TaskCompletionSource (dotnet/runtime). Ever since Task and Task<T> were introduced, TaskCompletionSource<T> has been a way of constructing tasks that the caller completes manually via its {Try}Set methods. And since Task<T> derives from Task, the single generic type could be used for both generic Task<T> and non-generic Task needs. However, this wasn’t always obvious, leading to confusion about the right solution for the non-generic case, exacerbated by the ambiguity of which T to use when any would do. .NET 5 adds a non-generic TaskCompletionSource, which not only eliminates the confusion but also helps a bit with performance, since it avoids the task needing to carry around a useless T.
  • Task.WhenAny(Task, Task) (dotnet/runtime and dotnet/runtime). Previously, any number of tasks could be passed to Task.WhenAny via its overload that accepts a params Task[] tasks. However, in analyzing usage of this method, it was found that the vast majority of call sites always pass two tasks. The new public overload is optimized for that case, and a neat thing about it is that merely recompiling those call sites will cause the compiler to bind to the new, faster overload instead of the old one, so no code changes are needed to benefit from it.
private Task _incomplete = new TaskCompletionSource<bool>().Task; // generic TCS so this also compiles on .NET Framework / Core 3.1

[Benchmark]
public Task OneAlreadyCompleted() => Task.WhenAny(Task.CompletedTask, _incomplete);

[Benchmark]
public Task AsyncCompletion()
{
    AsyncTaskMethodBuilder atmb = default;
    Task result = Task.WhenAny(atmb.Task, _incomplete);
    atmb.SetResult();
    return result;
}
Method Runtime Mean Ratio Allocated
OneAlreadyCompleted .NET FW 4.8 125.387 ns 1.00 217 B
OneAlreadyCompleted .NET Core 3.1 89.040 ns 0.71 200 B
OneAlreadyCompleted .NET 5.0 8.391 ns 0.07 72 B
AsyncCompletion .NET FW 4.8 289.042 ns 1.00 257 B
AsyncCompletion .NET Core 3.1 195.879 ns 0.68 240 B
AsyncCompletion .NET 5.0 150.523 ns 0.52 160 B
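
To make a couple of the new APIs above concrete, here is a small usage sketch of my own (not from the original post) combining StringSplitOptions.TrimEntries and GC.AllocateUninitializedArray<T>:

using System;

class NewApiSketch
{
    static void Main()
    {
        // StringSplitOptions.TrimEntries trims whitespace as part of the split,
        // avoiding the per-entry Trim() calls (and string allocations) that
        // callers previously layered on top.
        string csv = " a ,  b ,, c ";
        string[] parts = csv.Split(',', StringSplitOptions.TrimEntries | StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(string.Join("|", parts)); // prints: a|b|c

        // GC.AllocateUninitializedArray<T> skips the zeroing that new byte[...]
        // performs, which is safe here because the buffer is overwritten immediately.
        byte[] buffer = GC.AllocateUninitializedArray<byte>(4096);
        new Random().NextBytes(buffer);
        Console.WriteLine(buffer.Length);
    }
}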

And there are far too many new System.Runtime.Intrinsics methods to even begin to mention!

New Performance-focused Analyzers

The C# “Roslyn” compiler has a very useful extension point called “analyzers”, or “Roslyn analyzers”. Analyzers plug into the compiler and are given full read access to all of the source the compiler is operating over, as well as to the compiler’s parsing and modeling of that code, which enables developers to plug their own custom analyses into a compilation. On top of that, analyzers can run not just as part of builds but also in the IDE as the developer writes code, enabling analyzers to present suggestions, warnings, and errors on how a developer may improve their code. Analyzer developers can also author “fixers” that can be invoked in the IDE to automatically replace the flagged code with a “fixed” alternative. And all of these components can be distributed via NuGet packages, making it easy for developers to consume arbitrary analyses written by others.
The Roslyn analyzers repo contains a bunch of custom analyzers, including ports of the old FxCop rules. It also contains new analyzers, and for .NET 5, the .NET SDK will automatically include a large number of these analyzers, including brand new ones written for this release. Many of these rules are either focused on performance or at least partially related to it. Here are a few examples:
Detect accidental allocations as part of range indexing. C# 8 introduced ranges, which make it easy to slice collections, such as someCollection[1..3]. Such an expression translates into using either the collection’s indexer that takes a Range, such as public MyCollection this[Range r] { get; }, or a Slice(int start, int length) method if no such indexer exists. By convention and design guidelines, such indexers and slice methods should return the same type they’re defined on, so, for example, slicing a T[] produces another T[], and slicing a Span<T> produces a Span<T>. However, this can lead to implicit casts hiding unexpected allocations. For example, T[] can be implicitly cast to Span<T>, which also means the result of slicing a T[] can be implicitly cast to Span<T>, so code like Span<char> span = _array[1..3]; will compile and run fine, except that it will incur an array allocation for the array slice produced by the _array[1..3] range indexing. A more efficient way to write this would be Span<char> span = _array.AsSpan()[1..3];. This analyzer will detect several such cases and offer fixes to eliminate the allocation.

[Benchmark(Baseline = true)]
public ReadOnlySpan<char> Slice1()
{
    ReadOnlySpan<char> span = "hello world"[1..3]; // the range indexer on string allocates a substring
    return span;
}

[Benchmark]
public ReadOnlySpan<char> Slice2()
{
    ReadOnlySpan<char> span = "hello world".AsSpan()[1..3]; // slicing the span allocates nothing
    return span;
}
Method Mean Ratio Allocated
Slice1 8.3337 ns 1.00 32 B
Slice2 0.4332 ns 0.05

Prefer the Memory-based overloads on Stream. .NET Core 2.1 added new overloads of Stream.ReadAsync and Stream.WriteAsync that operate on Memory<byte> and ReadOnlyMemory<byte>, respectively. This enables those methods to process data from sources other than byte[], and it also enables optimizations like avoiding pinning if the {ReadOnly}Memory was created in a manner that specified it represented already-pinned or otherwise immovable data. However, the introduction of the new overloads also provided an opportunity to choose new return types for these methods: we chose ValueTask<int> and ValueTask rather than Task<int> and Task, respectively. The benefit is enabling more synchronously-completing calls to be allocation-free, and even more asynchronously-completing calls to be allocation-free (though with more effort on the part of the developer overriding them). As a result, it’s frequently beneficial to prefer the newer overloads over the older ones. This analyzer will detect use of the older overloads and offer a fix to automatically switch to the newer ones. dotnet/runtime has some examples of cases this found and fixed.

private NetworkStream _client, _server;
private byte[] _buffer = new byte[10];

[GlobalSetup]
public void Setup()
{
    using Socket listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    listener.Listen();
    client.Connect(listener.LocalEndPoint);
    _client = new NetworkStream(client);
    _server = new NetworkStream(listener.Accept());
}

[Benchmark(Baseline = true)]
public async Task ReadWrite1()
{
    byte[] buffer = _buffer;
    for (int i = 0; i < 1000; i++)
    {
        await _client.WriteAsync(buffer, 0, buffer.Length);
        await _server.ReadAsync(buffer, 0, buffer.Length); // may not read everything; just for demo purposes
    }
}

[Benchmark]
public async Task ReadWrite2()
{
    byte[] buffer = _buffer;
    for (int i = 0; i < 1000; i++)
    {
        await _client.WriteAsync(buffer);
        await _server.ReadAsync(buffer); // may not read everything; just for demo purposes
    }
}
Method Mean Ratio Allocated
ReadWrite1 7.604 ms 1.00 72001 B
ReadWrite2 7.549 ms 0.99

Prefer the typed overloads on StringBuilder. StringBuilder.Append and StringBuilder.Insert have many overloads, for appending not just strings or objects but also various primitive types, like Int32. Even so, it’s common to see code like stringBuilder.Append(intValue.ToString()). The StringBuilder.Append(Int32) overload is more efficient, not requiring the string allocation, and should be preferred. This analyzer comes with a fixer to detect such cases and automatically switch to using the more appropriate overload.

private StringBuilder _builder = new StringBuilder(); // field assumed by the benchmarks below

[Benchmark(Baseline = true)]
public void Append1()
{
    _builder.Clear();
    for (int i = 0; i < 1000; i++)
        _builder.Append(i.ToString());
}

[Benchmark]
public void Append2()
{
    _builder.Clear();
    for (int i = 0; i < 1000; i++)
        _builder.Append(i);
}
Method Mean Ratio Allocated
Append1 13.546 us 1.00 31680 B
Append2 9.841 us 0.73

Prefer StringBuilder.Append(char) over StringBuilder.Append(string). Appending a single char to a StringBuilder is a bit more efficient than appending a string of length 1, yet code like private const string Separator = ":"; is still quite common; it would be better to change the const to private const char Separator = ':';. The analyzer will flag many such cases and help to fix them. Some examples of the analyzer driving fixes in dotnet/runtime are in dotnet/runtime#36097.

private StringBuilder _builder = new StringBuilder(); // field assumed by the benchmarks below

[Benchmark(Baseline = true)]
public void Append1()
{
    _builder.Clear();
    for (int i = 0; i < 1000; i++)
        _builder.Append(":");
}

[Benchmark]
public void Append2()
{
    _builder.Clear();
    for (int i = 0; i < 1000; i++)
        _builder.Append(':');
}
Method Mean Ratio
Append1 2.621 us 1.00
Append2 1.968 us 0.75

Prefer IsEmpty over Count. Similar to the earlier LINQ Any() vs Count() discussion, some collection types expose both an IsEmpty property and a Count property. In some cases, such as with a concurrent collection like ConcurrentQueue<T>, it can be much more expensive to determine an exact count of the number of items in the collection than to determine merely whether there are any items. In such cases, code written as if (collection.Count != 0) would be more efficient as if (!collection.IsEmpty). This analyzer helps find and fix such cases.

private ConcurrentQueue<int> _queue = new ConcurrentQueue<int>(); // field assumed by the benchmarks below

[Benchmark(Baseline = true)]
public bool IsEmpty1() => _queue.Count == 0;

[Benchmark]
public bool IsEmpty2() => _queue.IsEmpty;
Method Mean Ratio
IsEmpty1 21.621 ns 1.00
IsEmpty2 4.041 ns 0.19

Prefer Environment.ProcessId. dotnet/runtime added the new static property Environment.ProcessId, which returns the current process’s ID. It’s common to see code that previously tried to do the same thing with Process.GetCurrentProcess().Id. The latter, however, is significantly less efficient: it can’t easily cache the value internally, so it allocates a finalizable object and makes a system call on every invocation. The new analyzer helps find and automatically replace such usage.

[Benchmark(Baseline = true)]
public int PGCPI() => Process.GetCurrentProcess().Id;

[Benchmark]
public int EPI() => Environment.ProcessId;
Method Mean Ratio Allocated
PGCPI 67.856 ns 1.00 280 B
EPI 3.191 ns 0.05

Avoid stackallocs in loops. This analyzer isn’t so much about making your code faster as it is about keeping your code correct when you’ve employed a solution for making it faster. Specifically, it flags cases where stackalloc is used to allocate memory from the stack, but where it’s used inside a loop. The memory allocated from the stack by a stackalloc may not be released until the method returns, so if stackalloc is used in a loop, it can result in allocating much more stack memory than the developer intended, which can eventually lead to a stack overflow that crashes the process. You can see a few examples of fixes in dotnet/runtime.
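
Here is a hedged sketch of my own showing the pattern this analyzer flags and the typical fix of hoisting the stackalloc out of the loop; the names and buffer size are illustrative.

using System;

static class StackallocSketch
{
    // Flagged: each iteration reserves new stack space that isn't released
    // until the method returns, so a long enough loop can overflow the stack.
    static void ProcessAll_Bad(string[] inputs)
    {
        foreach (string input in inputs)
        {
            Span<char> buffer = stackalloc char[256];
            input.AsSpan(0, Math.Min(input.Length, buffer.Length)).CopyTo(buffer);
            // ... use buffer ...
        }
    }

    // The fix: allocate once, outside the loop, and reuse the buffer.
    static void ProcessAll_Fixed(string[] inputs)
    {
        Span<char> buffer = stackalloc char[256];
        foreach (string input in inputs)
        {
            input.AsSpan(0, Math.Min(input.Length, buffer.Length)).CopyTo(buffer);
            // ... use buffer ...
        }
    }
}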

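Since these analyzers ship with the .NET 5 SDK, you can also dial individual rules up or down with standard .editorconfig severity entries. A small sketch; substitute the actual rule ID (documented with each analyzer) for the CAxxxx placeholder:

# .editorconfig snippet: escalate a shipped analyzer rule to a build warning
dotnet_diagnostic.CAxxxx.severity = warning
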
What’s Next?

According to the .NET roadmap, .NET 5 is scheduled to be released in November 2020, still a few months away. And while this post has demonstrated a huge number of performance advancements already in, I expect we’ll see plenty of additional improvements find their way into .NET 5, if for no other reason than there are currently a bunch of pending PRs (beyond the ones already mentioned), such as dotnet/runtime and dotnet/runtime which further improve Uri, dotnet/runtime#402 which vectorizes string.Compare, dotnet/runtime which improves the performance of Dictionary lookups with OrdinalIgnoreCase by extending the existing non-randomization optimization to the case-insensitive case, dotnet/runtime which provides an asynchronous implementation of DNS resolution on Linux, dotnet/runtime which can significantly reduce the overhead of Activator.CreateInstance(), dotnet/runtime#32843 which makes Utf8Parser.TryParse faster for Int32 values, dotnet/runtime which improves the performance of Guid equality checks, dotnet/runtime which reduces the costs incurred when an EventListener processes EventSource events, and dotnet/runtime from @bond-009 which special-cases more inputs to Task.WhenAny.

Finally, even though we try really hard to avoid performance regressions, any release will inevitably have some, and we’ll be spending time investigating the ones we find. One known class of regression worth calling out involves a feature new to .NET 5: ICU. .NET Framework and previous releases of .NET Core used National Language Support (NLS) APIs for globalization on Windows, while .NET Core used International Components for Unicode (ICU) on Unix. .NET 5 switches to using ICU by default on all operating systems where it’s available (Windows 10 includes it as of the May 2019 update), making behavior much more consistent across operating systems. However, because the two technologies have different performance profiles, some operations (culture-aware string operations in particular) may end up slower in some cases. While we hope to mitigate most of these (which should also help improve performance on Linux and macOS), and while any changes that remain may be irrelevant to your application, you can opt to continue using NLS if the switch does have a negative impact on your particular application.
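
For reference, opting back into NLS is done with a runtime configuration switch. To the best of my knowledge, the .NET 5 switch is named System.Globalization.UseNls, settable in runtimeconfig.json (or via the DOTNET_SYSTEM_GLOBALIZATION_USENLS environment variable); treat the exact spelling as something to verify against the documentation:

{
  "runtimeOptions": {
    "configProperties": {
      "System.Globalization.UseNls": true
    }
  }
}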

With .NET 5 previews and nightly builds available, I’d encourage you to download the latest bits and give them a try with your applications. And if you find things you think can and should be improved, we’d welcome your PRs to dotnet/runtime!
Happy coding!

Translator’s note: this was a long article, and the translation took quite a while; the machine translation was corrected by hand in a number of places. It’s finally done. I hope it helps you. Thanks!


Reference: https://devblogs.microsoft.com/dotnet/performance-improvements-in-net-5/