-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Switch managed type system and compilers to utf-8 #119385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Personally big fan of this direction as it would also make switching ILC/R2R name mangling to UTF-8 more straightforward later on (I have attempt at this PaulusParssinen#3) |
Yep, we'll definitely want to do that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR switches the managed type system and compilers from UTF-16 to UTF-8 string representation for better runtime performance and working set improvements. The primary motivation is to eliminate conversions between metadata format (UTF-8) and System.String format (UTF-16) during reflection operations and hashcode computation.
- Updates type system APIs to use
ReadOnlySpan<byte>
for names and namespaces instead ofstring
- Modifies hashcode algorithms to work directly with UTF-8 bytes
- Adds helper methods and extensions to support UTF-8 string comparisons and conversions
Reviewed Changes
Copilot reviewed 258 out of 258 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
src/libraries/Common/src/Internal/VersionResilientHashCode.cs | Core hashcode algorithm implementation for UTF-8 strings |
src/coreclr/tools/Common/TypeSystem/Canon/CanonTypes.Metadata.cs | Updates canon type method signatures to use UTF-8 spans |
src/coreclr/tools/Common/TypeSystem/Canon/CanonTypes.Diagnostic.cs | Modifies diagnostic name properties to call GetName() methods |
src/coreclr/tools/Common/JitInterface/*.cs | Updates JIT interface formatters and instruction set lookups to use UTF-8 |
src/coreclr/tools/Common/Internal/Runtime/EETypeBuilderHelpers.cs | Converts type name checks to use UTF-8 byte comparisons |
src/coreclr/tools/Common/Internal/Metadata/NativeFormat/*.cs | Updates metadata readers and hashcode algorithms for UTF-8 |
src/coreclr/tools/Common/Compiler/*.cs | Updates compiler helpers, name manglers, and layout algorithms |
src/coreclr/nativeaot/System.Private.TypeLoader/src/**/*.cs | Updates TypeLoader components for UTF-8 string handling |
Comments suppressed due to low confidence (1)
Any numbers to support this? |
Yep, the example in #66620 is seeing 5+% improvement: Before:
After:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can use crossgen2 and naot outer loop runs before merging,
/azp run runtime-nativeaot-outerloop, runtime-coreclr crossgen2 outerloop |
Azure Pipelines successfully started running 2 pipeline(s). |
Thank you for reviewing, this one was a bit longer! |
/azp run runtime-nativeaot-outerloop, runtime-coreclr crossgen2 outerloop |
Azure Pipelines successfully started running 2 pipeline(s). |
/ba-g failures are unrelated and fixed by #119574 |
Restore missing Append call that was removed during a past refactoring in #119385
My original motivation for this was just to stop computing our stable hashcodes from UTF-16 and use UTF-8 instead. We compute hashcodes from names both at compile time and at runtime. I.e. the motivation was not so much compiler perf but run time perf. But I didn't want to regress compiler perf. So this got a bit out of hand.
We should see improvements in reflection performance since now we can hash strings in metadata directly. It should also help working set at runtime because we no longer need to convert strings in metadata format (UTF-8) to strings in
System.String
format (UTF-16).The compiler perf improvements are a mixed bag and mostly a wash. We still end up converting pretty much everything to
string
because of name mangling being string based, and dataflow analysis (shared with ILLinker) being string based.I think we can address both of those and also get a compiler throughput win. But not in this PR - I did just enough here so that we don't have regression in ILC or crossgen2. All the places that call
GetName()
/GetNamespace()
are optimization opportunities.Cc @dotnet/ilc-contrib