Commit 8e8926e

stapelberg authored and gopherbot committed
encoding/protowire: micro-optimize SizeVarint (-20% on Intel)
SizeVarint is of strategic importance for Protobuf encoding, but I want
to be clear: this change, on its own, does not measurably improve
real-world Protobuf usages in my testing. It does, however, improve
performance within the context of another, larger project. I don't want
to sequence this optimization on the bigger project, but would rather
test and submit it in isolation.

As the detailed comment in the source code explains, this implementation
follows C++ Protobuf's approach. For convenience, here is a godbolt
Compiler Explorer link that shows what the Go compiler makes of the old
and the new version: https://godbolt.org/z/4erW1EY4r

When compiling with GOAMD64=v1 (the default), the new version is roughly
performance-neutral: a little faster on some CPU architectures, a little
slower on others, probably within the noise floor:

.fullname: SizeVarint-4
                     │    head     │              micro
                     │   sec/op    │   sec/op     vs base
conan-altra            2.174µ ± 0%   2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome           3.519µ ± 2%   3.558µ ± 0%        ~ (p=0.060 n=10)
indus-skylake          2.143µ ± 3%   2.192µ ± 7%        ~ (p=0.448 n=10)
izumi-sapphirerapids   974.9n ± 0%   1020.0n ± 0%  +4.63% (p=0.000 n=10)
geomean                1.999µ        2.035µ        +1.78%

By setting GOAMD64=v3, we unlock the full feature set of our CPUs. If we
build the old version with GOAMD64=v3, we already see a -50% speed-up on
AMD Zen 2 CPUs (due to switching from the slow BSRQ to the fast LZCNTQ):

.fullname: SizeVarint-4
                     │    head     │          head-goamd64v3
                     │   sec/op    │   sec/op      vs base
conan-altra            2.174µ ± 0%   2.174µ ± 0%        ~ (p=1.000 n=10)
arcadia-rome           3.519µ ± 2%   1.789µ ± 0%  -49.15% (p=0.000 n=10)
indus-skylake          2.143µ ± 3%   2.165µ ± 9%        ~ (p=0.739 n=10)
izumi-sapphirerapids   974.9n ± 0%   980.5n ± 3%   +0.58% (p=0.007 n=10)
geomean                1.999µ        1.695µ       -15.22%

And if we benchmark the new version with GOAMD64=v3, we see a further
speed-up on ARM and Intel, as high as 20% on Skylake!

.fullname: SizeVarint-4
                     │ head-goamd64v3 │        micro-goamd64v3
                     │     sec/op     │   sec/op      vs base
conan-altra               2.174µ ± 0%   2.156µ ± 0%   -0.83% (p=0.000 n=10)
arcadia-rome              1.789µ ± 0%   1.836µ ± 1%   +2.63% (p=0.000 n=10)
indus-skylake             2.165µ ± 9%   1.753µ ± 7%  -19.05% (p=0.000 n=10)
izumi-sapphirerapids      980.5n ± 3%   959.1n ± 0%   -2.19% (p=0.000 n=10)
geomean                   1.695µ        1.606µ        -5.25%

In summary, I believe this version of SizeVarint is currently the
fastest on the relevant CPUs, and leaves the path open to squeeze out a
little more performance by changing the Go compiler.

Change-Id: Ibc2629f8dcf9f2f4eb0a09fe37f923829ee3165b
Reviewed-on: https://go-review.googlesource.com/c/protobuf/+/683955
Reviewed-by: Nicolas Hillegeer <[email protected]>
Auto-Submit: Nicolas Hillegeer <[email protected]>
Reviewed-by: Christian Höppner <[email protected]>
Reviewed-by: Damien Neil <[email protected]>
Commit-Queue: Nicolas Hillegeer <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
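As a quick sanity check that the optimized formula really encodes 1 + (bits.Len64(v)-1)/7, the standalone program below compares it against a naive one-byte-per-7-bits counting loop at every bit-length boundary. It is not part of the commit; the helper names `referenceSize` and `newSize` are mine.

```go
package main

import (
	"fmt"
	"math/bits"
)

// referenceSize counts varint bytes the slow way: one byte per 7 bits.
func referenceSize(v uint64) int {
	n := 1
	for v >= 0x80 {
		v >>= 7
		n++
	}
	return n
}

// newSize mirrors the formula from the commit.
func newSize(v uint64) int {
	v |= 1 // sidestep the undefined bits.LeadingZeros64(0) case
	log2value := uint32(bits.LeadingZeros64(v)) ^ 63
	return int((log2value*9 + (64 + 9)) / 64)
}

func main() {
	// Check every bit-length boundary: 2^k-1 and 2^k for k in [0,63].
	for k := 0; k < 64; k++ {
		for _, v := range []uint64{1<<k - 1, 1 << k} {
			if got, want := newSize(v), referenceSize(v); got != want {
				fmt.Printf("v=%d: got %d want %d\n", v, got, want)
				return
			}
		}
	}
	fmt.Println("formulas agree on all boundaries")
}
```

Since both formulas agree on 2^k-1 and 2^k for every k, and the result depends only on bits.Len64(v), they agree everywhere.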
1 parent 32018e9 commit 8e8926e

File tree

2 files changed: +53 −1 lines

encoding/protowire/wire.go

Lines changed: 25 additions & 1 deletion

@@ -371,7 +371,31 @@ func ConsumeVarint(b []byte) (v uint64, n int) {
 func SizeVarint(v uint64) int {
 	// This computes 1 + (bits.Len64(v)-1)/7.
 	// 9/64 is a good enough approximation of 1/7
-	return int(9*uint32(bits.Len64(v))+64) / 64
+	//
+	// The Go compiler can translate the bits.LeadingZeros64 call into the LZCNT
+	// instruction, which is very fast on CPUs from the last few years. The
+	// specific way of expressing the calculation matches C++ Protobuf, see
+	// https://godbolt.org/z/4P3h53oM4 for the C++ code and how gcc/clang
+	// optimize that function for GOAMD64=v1 and GOAMD64=v3 (-march=haswell).
+
+	// By OR'ing v with 1, we guarantee that v is never 0, without changing the
+	// result of SizeVarint. LZCNT is not defined for 0, meaning the compiler
+	// needs to add extra instructions to handle that case.
+	//
+	// The Go compiler currently (go1.24.4) does not make use of this knowledge.
+	// This opportunity (removing the XOR instruction, which handles the 0 case)
+	// results in a small (1%) performance win across CPU architectures.
+	//
+	// Independently of avoiding the 0 case, we need the v |= 1 line because
+	// it allows the Go compiler to eliminate an extra XCHGL barrier.
+	v |= 1
+
+	// It would be clearer to write log2value := 63 - uint32(...), but
+	// writing uint32(...) ^ 63 is much more efficient (-14% ARM, -20% Intel).
+	// Proof of identity for our value range [0..63]:
+	// https://go.dev/play/p/Pdn9hEWYakX
+	log2value := uint32(bits.LeadingZeros64(v)) ^ 63
+	return int((log2value*9 + (64 + 9)) / 64)
 }

 // AppendFixed32 appends v to b as a little-endian uint32.
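The `^ 63` identity the comment cites can be verified exhaustively in a few lines. This standalone check is mine, not part of the commit: for x in [0,63], subtracting x from 0b111111 never borrows, so the subtraction and the XOR touch the same six bits identically.

```go
package main

import "fmt"

func main() {
	// For every x in [0,63], 63-x and 63^x must agree, because 63 is
	// all-ones in the low six bits and the subtraction never borrows.
	for x := uint32(0); x < 64; x++ {
		if 63-x != x^63 {
			fmt.Println("mismatch at", x)
			return
		}
	}
	fmt.Println("63-x == x^63 for all x in [0,63]")
}
```

bits.LeadingZeros64 of a nonzero value is always in [0,63], so the identity covers the full input range of the function.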

encoding/protowire/wire_test.go

Lines changed: 28 additions & 0 deletions

@@ -678,3 +678,31 @@ func TestZigZag(t *testing.T) {
 		}
 	}
 }
+
+// TODO(go1.23): use slices.Repeat
+var testvals = func() []uint64 {
+	// These values are representative for the values that we observe when
+	// running benchmarks extracted from Google production workloads.
+	vals := []uint64{
+		1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
+		55, 66, 77, 88, 99, 100,
+		123456789, 98765432,
+	}
+	newslice := make([]uint64, 100*len(vals))
+	n := copy(newslice, vals)
+	for n < len(newslice) {
+		n += copy(newslice[n:], newslice[:n])
+	}
+	return newslice
+}()
+
+func BenchmarkSizeVarint(b *testing.B) {
+	var total int
+	for range b.N {
+		for _, val := range testvals {
+			total += SizeVarint(val)
+		}
+	}
+	// Prevent the Go compiler from optimizing out the SizeVarint call:
+	b.Logf("total: %d", total)
+}
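The benchmark fills its input slice by copy-doubling, the stand-in the TODO mentions for slices.Repeat (which requires go1.23): each `copy` pass doubles the initialized prefix, so the fill finishes in O(log n) passes. A standalone sketch of the same trick; the `repeatSlice` helper name is mine, not from the CL.

```go
package main

import "fmt"

// repeatSlice returns vals repeated `times` times, filling the output by
// repeatedly copying the already-initialized prefix onto the tail, the
// same doubling trick the new benchmark uses to build testvals.
func repeatSlice(vals []uint64, times int) []uint64 {
	out := make([]uint64, times*len(vals))
	n := copy(out, vals)
	for n < len(out) {
		n += copy(out[n:], out[:n]) // doubles the filled prefix each pass
	}
	return out
}

func main() {
	got := repeatSlice([]uint64{1, 2, 3}, 4)
	fmt.Println(len(got), got)
}
```

Note also the `b.Logf("total: %d", total)` at the end of the benchmark: accumulating into `total` and logging it keeps the SizeVarint calls observable, so the compiler cannot dead-code-eliminate the loop body.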
