vmm: optimize vioapic_write() #1838
Running Windows 11 / Server 2025 inside bhyve causes this function to be called extremely frequently. As a result, vm_smp_rendezvous() is called very often, which forces all but one of the cores available to the VM to synchronize. In our testing, these cores spent roughly 70% of their time inside vm_handle_rendezvous(), causing Windows to slow to a crawl.

This is the same code path as mentioned in https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268794

We ran a simple test program that counts the prime numbers below a given threshold, a simple O(n^2) single-thread performance benchmark.

On Windows Server 2022:
> number of primes less than 100000: 9592
> wall time: 1.69249 secs
> user time: 1.6875 secs

Windows Server 2025 *without* this patch:
> number of primes less than 100000: 9592
> wall time: 3.21974 secs
> user time: 2.89063 secs

Windows Server 2025 *WITH* this patch:
> number of primes less than 100000: 9592
> wall time: 1.72742 secs
> user time: 1.65625 secs

Given that the purpose of the routine passed into vm_smp_rendezvous() is to "Reset the vlapic's trigger-mode register to reflect the ioapic pin configuration", this change seems reasonable.

Signed-off-by: Jack Bendtsen <[email protected]>
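For readers who want to reproduce a comparable workload, here is a minimal sketch of a quadratic prime-counting routine. It is not the program used for the numbers above; the function names `is_prime_naive` and `count_primes_below` are hypothetical, and the trial-division strategy is an assumption based only on the "O(n^2)" description.

```c
/* Hypothetical helper: trial division by every smaller candidate,
 * stopping at the first divisor found. Deliberately naive. */
static int is_prime_naive(unsigned n)
{
    if (n < 2)
        return 0;
    for (unsigned d = 2; d < n; d++) {
        if (n % d == 0)
            return 0;
    }
    return 1;
}

/* Count the primes strictly below `limit`. */
static unsigned count_primes_below(unsigned limit)
{
    unsigned count = 0;

    for (unsigned n = 2; n < limit; n++)
        count += (unsigned)is_prime_naive(n);
    return count;
}
```

Calling `count_primes_below(100000)` from a `main()` and timing it gives a single-threaded, CPU-bound run of the same flavour; absolute timings will of course differ from the figures quoted above.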
 	 */
 	changed = last ^ vioapic->rtbl[pin].reg;
-	if (changed & ~(IOART_INTMASK | IOART_INTPOL)) {
+	if (changed & IOART_TRGRMOD) {
Unfortunately, other fields can also change in ways that affect the trigger mode: the delivery mode (probably rarely changed), the local APIC the interrupt is sent to (the old LAPIC needs to disable the bit in its TMR and the new LAPIC needs to enable it), or the IDT vector for the interrupt (you need to clear the old TMR bit and set the new one).
Do you know which bits are actually changing in this case?
Here's the data. Note that vioapic_write() is being called from vm_handle_inst_emul(), i.e. by the Windows 11 kernel.
TL;DR Windows is selecting the APIC ID to receive an interrupt. No other changes are being made. All other fields are 0 except the interrupt vector.
I ended up (ab)using printf to extract out the information from just before this if statement. I attempted to buffer the information and write it to a file in one go, in order to combat the delay caused by printf, but opening the file seemed to result in EFAULT for some reason.
csa% dmesg | tail
vio_debug: secs = 1757380084, nanos = 338361009, last = 7000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380084, nanos = 349262377, last = 7000000000000D1, changed = 100000000000000, addr = 41, data = 6000000
vio_debug: secs = 1757380084, nanos = 364429405, last = 6000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380084, nanos = 375337184, last = 6000000000000D1, changed = 300000000000000, addr = 41, data = 5000000
vio_debug: secs = 1757380084, nanos = 388042990, last = 5000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380084, nanos = 403239995, last = 5000000000000D1, changed = 700000000000000, addr = 41, data = 2000000
vio_debug: secs = 1757380084, nanos = 415955752, last = 2000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380084, nanos = 426865540, last = 2000000000000D1, changed = 600000000000000, addr = 41, data = 4000000
vio_debug: secs = 1757380084, nanos = 443928158, last = 4000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380084, nanos = 454848257, last = 4000000000000D1, changed = 700000000000000, addr = 41, data = 3000000
csa% dmesg | tail
vio_debug: secs = 1757380086, nanos = 826479003, last = 2000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380086, nanos = 837419782, last = 2000000000000D1, changed = 600000000000000, addr = 41, data = 4000000
vio_debug: secs = 1757380086, nanos = 852087814, last = 4000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380086, nanos = 862996137, last = 4000000000000D1, changed = 700000000000000, addr = 41, data = 3000000
vio_debug: secs = 1757380086, nanos = 875703865, last = 3000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380086, nanos = 890709419, last = 3000000000000D1, changed = 200000000000000, addr = 41, data = 1000000
vio_debug: secs = 1757380086, nanos = 903415561, last = 1000000000000D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380086, nanos = 914343627, last = 1000000000000D1, changed = 100000000000000, addr = 41, data = 0
vio_debug: secs = 1757380086, nanos = 926467863, last = D1, changed = 0, addr = 40, data = D1
vio_debug: secs = 1757380086, nanos = 937923418, last = D1, changed = 700000000000000, addr = 41, data = 7000000
From https://wiki.osdev.org/IOAPIC ...
Destination: bits 56 - 63: This field is interpreted according to the Destination Format bit. If Physical destination is choosen, then this field is limited to bits 56 - 59 (only 16 CPUs addressable). You put here the APIC ID of the CPU that you want to receive the interrupt. TODO: Logical destination format...
Note that in this example, the VM has been given 8 cores. Sure enough, only bits 56, 57, and 58 are changing.
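As a cross-check of the log, the `last` and `changed` values can be decoded with a few masks. This is only a sketch: the `RTE_*` names and helper functions are hypothetical (they are not the FreeBSD `IOART_*` macros), and the bit positions assume the conventional I/O APIC redirection entry layout described on the OSDev page quoted above.

```c
#include <stdint.h>

/* Hypothetical field masks for a 64-bit redirection table entry. */
#define RTE_VECTOR  0x00000000000000ffULL /* bits 0-7: interrupt vector */
#define RTE_TRIGGER 0x0000000000008000ULL /* bit 15: 0 = edge, 1 = level */
#define RTE_DEST    0xff00000000000000ULL /* bits 56-63: destination */

/* Destination field (APIC ID in physical mode). */
static unsigned rte_dest(uint64_t rte)
{
    return (unsigned)((rte & RTE_DEST) >> 56);
}

/* Interrupt vector field. */
static unsigned rte_vector(uint64_t rte)
{
    return (unsigned)(rte & RTE_VECTOR);
}

/* True when the XOR mask is non-zero and confined to the destination field. */
static int only_dest_changed(uint64_t changed)
{
    return changed != 0 && (changed & ~RTE_DEST) == 0;
}
```

Applied to the first two log lines, `rte_dest(0x07000000000000D1)` is 7, `rte_vector` of the same value is 0xD1, the trigger bit is clear (edge), and `only_dest_changed(0x0100000000000000)` is true.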
I don't know what Windows is trying to achieve, but if I had to guess, it might be some kind of task dispatcher that distributes across all available cores.
In any case, the patch as is has been tested and works just fine, which suggests that Windows 11 isn't interested in constantly reconfiguring the LAPIC. Perhaps a more accurate fix can be determined.
How about
if ((changed & ~(IOART_INTMASK | IOART_INTPOL)) && (changed & ~(0xffUL << 56)) != 0) {
instead? i.e. only if something changed that wasn't just the physical CPU ID of the interrupt target.
EDIT: I just realized that with this implementation, there's no way of knowing whether the other fields changed once you've observed a change in the destination APIC, since the high and low words are written separately. The solution would therefore need to store the previous value for the low word, and use that to determine whether the destination change requires any work to be done.
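To make the split-write behaviour concrete, here is a toy model of a single 64-bit entry updated through two 32-bit halves. It is a sketch only: `struct toy_rte` and `toy_rte_write` are hypothetical names, not bhyve's data structures, and the mapping of the odd register index to the high half is an assumption read off the log above (where `addr = 41` carries the destination).

```c
#include <stdint.h>

/* Hypothetical toy model of one redirection entry. */
struct toy_rte {
    uint64_t reg;
};

/* Apply a 32-bit write to the low half (high == 0) or the high half
 * (high != 0) and return the XOR of the old and new 64-bit values. */
static uint64_t toy_rte_write(struct toy_rte *e, int high, uint32_t data)
{
    uint64_t last = e->reg;

    if (high)
        e->reg = (e->reg & 0x00000000ffffffffULL) | ((uint64_t)data << 32);
    else
        e->reg = (e->reg & 0xffffffff00000000ULL) | data;
    return last ^ e->reg;
}
```

Starting from `0x07000000000000D1`, writing `0xD1` to the low half returns a `changed` mask of 0, and writing `0x06000000` to the high half returns `0x0100000000000000`, which matches the first two log lines. Each individual write therefore only ever reports changes in its own half.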
It's worth noting that vioapic_update_tmr doesn't do any meaningful work if the previous TMR bit was edge and the new TMR bit is also edge.
It's also worth emphasizing that fixing this issue solves the problem of Windows 11 not being viable on bhyve.