Commit 8e8926e

stapelberg authored and gopherbot committed
encoding/protowire: micro-optimize SizeVarint (-20% on Intel)
SizeVarint is of strategic importance for Protobuf encoding, but I want
to be clear: this change, on its own, does not measurably improve
real-world Protobuf usages in my testing. It does, however, improve
performance within the context of another, larger project. I don't want
to sequence this optimization on the bigger project, but would rather
test and submit it in isolation.

As the detailed comment in the source code explains, this implementation
follows C++ Protobuf's approach. For convenience, here is a godbolt
Compiler Explorer link that shows what the Go compiler makes of the old
and the new version: https://godbolt.org/z/4erW1EY4r

When compiling with GOAMD64=v1 (the default), the new version is roughly
performance-neutral: a little faster on some CPU architectures, a little
slower on others, probably within the noise floor:

.fullname: SizeVarint-4
                     │    head     │              micro
                     │   sec/op    │   sec/op     vs base
conan-altra            2.174µ ± 0%   2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome           3.519µ ± 2%   3.558µ ± 0%        ~ (p=0.060 n=10)
indus-skylake          2.143µ ± 3%   2.192µ ± 7%        ~ (p=0.448 n=10)
izumi-sapphirerapids   974.9n ± 0%   1020.0n ± 0%  +4.63% (p=0.000 n=10)
geomean                1.999µ        2.035µ        +1.78%

By setting GOAMD64=v3, we unlock the full feature set of our CPUs. If we
build the old version with GOAMD64=v3, we already see a -50% speed-up on
AMD Zen 2 CPUs (due to switching from the slow BSRQ to the fast LZCNTQ):

.fullname: SizeVarint-4
                     │    head     │          head-goamd64v3
                     │   sec/op    │   sec/op      vs base
conan-altra            2.174µ ± 0%   2.174µ ± 0%        ~ (p=1.000 n=10)
arcadia-rome           3.519µ ± 2%   1.789µ ± 0%  -49.15% (p=0.000 n=10)
indus-skylake          2.143µ ± 3%   2.165µ ± 9%        ~ (p=0.739 n=10)
izumi-sapphirerapids   974.9n ± 0%   980.5n ± 3%   +0.58% (p=0.007 n=10)
geomean                1.999µ        1.695µ       -15.22%

And if we benchmark the new version with GOAMD64=v3, we see a further
speed-up on ARM and Intel, as high as 20% on Skylake!

.fullname: SizeVarint-4
                     │ head-goamd64v3 │        micro-goamd64v3
                     │     sec/op     │   sec/op      vs base
conan-altra               2.174µ ± 0%   2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome              1.789µ ± 0%   1.836µ ± 1%   +2.63% (p=0.000 n=10)
indus-skylake             2.165µ ± 9%   1.753µ ± 7%  -19.05% (p=0.000 n=10)
izumi-sapphirerapids      980.5n ± 3%   959.1n ± 0%   -2.19% (p=0.000 n=10)
geomean                   1.695µ        1.606µ        -5.25%

In summary, I believe this version of SizeVarint is currently the
fastest on the relevant CPUs, and leaves the path open to squeeze out a
little more performance by changing the Go compiler.

Change-Id: Ibc2629f8dcf9f2f4eb0a09fe37f923829ee3165b
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/683955
Reviewed-by: Nicolas Hillegeer <[email protected]>
Auto-Submit: Nicolas Hillegeer <[email protected]>
Reviewed-by: Christian Höppner <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Commit-Queue: Nicolas Hillegeer <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
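As a quick sanity check that the optimized formula really encodes 1 + (bits.Len64(v)-1)/7, the standalone program below compares it against a naive one-byte-per-7-bits counting loop at every bit-length boundary. It is not part of the commit; the helper names `referenceSize` and `newSize` are mine.

```go
package main

import (
	"fmt"
	"math/bits"
)

// referenceSize counts varint bytes the slow way: one byte per 7 bits.
func referenceSize(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// newSize mirrors the formula from the commit.
func newSize(v uint64) int {
	v |= 1 // sidestep the undefined bits.LeadingZeros64(0) case
	log2value := uint32(bits.LeadingZeros64(v)) ^ 63
	return int((log2value*9 + (64 + 9)) / 64)
}

func main() {
	// Check every bit-length boundary: 2^k-1 and 2^k for k in [0,63].
	for k := 0; k < 64; k++ {
		for _, v := range []uint64{1<<k - 1, 1 << k} {
			if got, want := newSize(v), referenceSize(v); got != want {
				fmt.Printf("v=%d: got %d want %d\n", v, got, want)
				return
			}
		}
	}
	fmt.Println("formulas agree on all boundaries")
}
```

Since both formulas agree on 2^k-1 and 2^k for every k, and the result depends only on bits.Len64(v), they agree everywhere.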
1 parent 32018e9 commit 8e8926e

File tree

2 files changed: +53 −1 lines

encoding/protowire/wire.go

Lines changed: 25 additions & 1 deletion

@@ -371,7 +371,31 @@ func ConsumeVarint(b []byte) (v uint64, n int) {
 func SizeVarint(v uint64) int {
 	// This computes 1 + (bits.Len64(v)-1)/7.
 	// 9/64 is a good enough approximation of 1/7
-	return int(9*uint32(bits.Len64(v))+64) / 64
+	//
+	// The Go compiler can translate the bits.LeadingZeros64 call into the LZCNT
+	// instruction, which is very fast on CPUs from the last few years. The
+	// specific way of expressing the calculation matches C++ Protobuf, see
+	// https://godbolt.org/z/4P3h53oM4 for the C++ code and how gcc/clang
+	// optimize that function for GOAMD64=v1 and GOAMD64=v3 (-march=haswell).
+
+	// By OR'ing v with 1, we guarantee that v is never 0, without changing the
+	// result of SizeVarint. LZCNT is not defined for 0, meaning the compiler
+	// needs to add extra instructions to handle that case.
+	//
+	// The Go compiler currently (go1.24.4) does not make use of this knowledge.
+	// This opportunity (removing the XOR instruction, which handles the 0 case)
+	// results in a small (1%) performance win across CPU architectures.
+	//
+	// Independently of avoiding the 0 case, we need the v |= 1 line because
+	// it allows the Go compiler to eliminate an extra XCHGL barrier.
+	v |= 1
+
+	// It would be clearer to write log2value := 63 - uint32(...), but
+	// writing uint32(...) ^ 63 is much more efficient (-14% ARM, -20% Intel).
+	// Proof of identity for our value range [0..63]:
+	// https://go.dev/play/p/Pdn9hEWYakX
+	log2value := uint32(bits.LeadingZeros64(v)) ^ 63
+	return int((log2value*9 + (64 + 9)) / 64)
 }

 // AppendFixed32 appends v to b as a little-endian uint32.
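The `^ 63` identity the comment cites can be verified exhaustively in a few lines. This standalone check is mine, not part of the commit: for x in [0,63], subtracting x from 0b111111 never borrows, so the subtraction and the XOR touch the same six bits identically.

```go
package main

import "fmt"

func main() {
	// For every x in [0,63], 63-x and 63^x must agree, because 63 is
	// all-ones in the low six bits and the subtraction never borrows.
	for x := uint32(0); x < 64; x++ {
		if 63-x != x^63 {
			fmt.Println("mismatch at", x)
			return
		}
	}
	fmt.Println("63-x == x^63 for all x in [0,63]")
}
```

bits.LeadingZeros64 of a nonzero value is always in [0,63], so the identity covers the full input range of the function.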

encoding/protowire/wire_test.go

Lines changed: 28 additions & 0 deletions

@@ -678,3 +678,31 @@ func TestZigZag(t *testing.T) {
 		}
 	}
 }
+
+// TODO(go1.23): use slices.Repeat
+var testvals = func() []uint64 {
+	// These values are representative for the values that we observe when
+	// running benchmarks extracted from Google production workloads.
+	vals := []uint64{
+		1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+		55, 66, 77, 88, 99, 100,
+		123456789, 98765432,
+	}
+	newslice := make([]uint64, 100*len(vals))
+	n := copy(newslice, vals)
+	for n < len(newslice) {
+		n += copy(newslice[n:], newslice[:n])
+	}
+	return newslice
+}()
+
+func BenchmarkSizeVarint(b *testing.B) {
+	var total int
+	for range b.N {
+		for _, val := range testvals {
+			total += SizeVarint(val)
+		}
+	}
+	// Prevent the Go compiler from optimizing out the SizeVarint call:
+	b.Logf("total: %d", total)
+}
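The benchmark fills its input slice by copy-doubling, the stand-in the TODO mentions for slices.Repeat (which requires go1.23): each `copy` pass doubles the initialized prefix, so the fill finishes in O(log n) passes. A standalone sketch of the same trick; the `repeatSlice` helper name is mine, not from the CL.

```go
package main

import "fmt"

// repeatSlice returns vals repeated `times` times, filling the output by
// repeatedly copying the already-initialized prefix onto the tail, the
// same doubling trick the new benchmark uses to build testvals.
func repeatSlice(vals []uint64, times int) []uint64 {
	out := make([]uint64, times*len(vals))
	n := copy(out, vals)
	for n < len(out) {
		n += copy(out[n:], out[:n]) // doubles the filled prefix each pass
	}
	return out
}

func main() {
	got := repeatSlice([]uint64{1, 2, 3}, 4)
	fmt.Println(len(got), got)
}
```

Note also the `b.Logf("total: %d", total)` at the end of the benchmark: accumulating into `total` and logging it keeps the SizeVarint calls observable, so the compiler cannot dead-code-eliminate the loop body.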
