encoding/protowire: micro-optimize SizeVarint (-20% on Intel)
SizeVarint is of strategic importance for Protobuf encoding,
but I want to be clear: this change, on its own, does not
measurably improve real-world Protobuf usage in my testing.
It does, however, improve performance within the context of
another, larger project. Rather than tie this optimization to that
larger project, I would prefer to test and submit it in isolation.
As the detailed comment in the source code explains,
this implementation follows C++ Protobuf’s approach.
For your convenience, here is a Compiler Explorer (godbolt) link
that shows what the Go compiler makes of the old and new versions:
https://godbolt.org/z/4erW1EY4r
When compiling with GOAMD64=v1 (the default), the new version
is roughly performance-neutral (a little faster on some CPU
architectures, a little slower on others, probably within the noise floor):
.fullname: SizeVarint-4
                     │    head     │                micro                │
                     │   sec/op    │    sec/op     vs base               │
conan-altra            2.174µ ± 0%    2.156µ ± 0%  -0.83% (p=0.000 n=10)
arcadia-rome           3.519µ ± 2%    3.558µ ± 0%       ~ (p=0.060 n=10)
indus-skylake          2.143µ ± 3%    2.192µ ± 7%       ~ (p=0.448 n=10)
izumi-sapphirerapids   974.9n ± 0%   1020.0n ± 0%  +4.63% (p=0.000 n=10)
geomean                1.999µ         2.035µ       +1.78%
By setting GOAMD64=v3, we allow the compiler to use newer instructions
such as LZCNT. If we build the old version with GOAMD64=v3, we already
see a speed-up of roughly 50% on AMD Zen 2 CPUs (due to switching from
the slow BSRQ to the fast LZCNTQ):
.fullname: SizeVarint-4
                     │    head     │           head-goamd64v3            │
                     │   sec/op    │    sec/op     vs base               │
conan-altra            2.174µ ± 0%    2.174µ ± 0%        ~ (p=1.000 n=10)
arcadia-rome           3.519µ ± 2%    1.789µ ± 0%  -49.15% (p=0.000 n=10)
indus-skylake          2.143µ ± 3%    2.165µ ± 9%        ~ (p=0.739 n=10)
izumi-sapphirerapids   974.9n ± 0%    980.5n ± 3%   +0.58% (p=0.007 n=10)
geomean                1.999µ         1.695µ       -15.22%
And if we benchmark the new version with GOAMD64=v3, we see a further speed-up
on ARM and Intel, close to 20% (-19.05%) on Skylake!
.fullname: SizeVarint-4
                     │ head-goamd64v3 │           micro-goamd64v3           │
                     │     sec/op     │    sec/op     vs base               │
conan-altra              2.174µ ± 0%     2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome             1.789µ ± 0%     1.836µ ± 1%   +2.63% (p=0.000 n=10)
indus-skylake            2.165µ ± 9%     1.753µ ± 7%  -19.05% (p=0.000 n=10)
izumi-sapphirerapids     980.5n ± 3%     959.1n ± 0%   -2.19% (p=0.000 n=10)
geomean                  1.695µ          1.606µ        -5.25%
In summary, I believe this version of SizeVarint is currently the fastest
on the relevant CPUs, and leaves the path open to squeeze out a little more
performance by changing the Go compiler.
Change-Id: Ibc2629f8dcf9f2f4eb0a09fe37f923829ee3165b
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/683955
Reviewed-by: Nicolas Hillegeer <[email protected]>
Auto-Submit: Nicolas Hillegeer <[email protected]>
Reviewed-by: Christian Höppner <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Commit-Queue: Nicolas Hillegeer <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>