Contributed by Peter N. M. Hansteen from the no fault of UVM dept.
In a recent message to tech@, Martin Pieuchot (mpi@) wrote about analysis of kernel lock contention.
We reproduce the message(s) here, reformatted with his permission.
Unlocking UVM [virtual memory - Ed.] faults makes build time decrease a lot and improves the overall latency of mixed userland workloads. In other words, it gives a smoother feeling for "desktop usage": it is now possible to do 'make -j17' and watch an HD video at the same time.
So what next? The 4 Flamegraphs below were captured with patrick@ during the weekend. We used his desktop 16-core arm64 machine with amdgpu(4). They all include the UVM unlocking diff and one also includes the poll(2)/select(2) diff + unlocked sowakeup(). Web browsing was performed with iridium.
- make-j17_arm64.svg
Building a kernel with 17 jobs is hard and only 30% of CPU time is spent in userland.
- Overall spinning time is ~40% (18% on KERNEL_LOCK(), 10% on SCHED_LOCK(), 12% on UVM's pageqlock)
- The UVM unlocking diff made the contention shift from the KERNEL_LOCK() to the global pageqlock and the per-amap rwlock. Due to the high contention on shared amaps in this workload, many threads go to sleep at the same time, which makes some contention appear on the SCHED_LOCK().
- The SCHED_LOCK() is not *yet* a problem. What is happening here shows that our rwlock implementation, which relies on a global sleep queue, is suboptimal. However, in UVM's `vmobjlock' case we should hopefully be able to turn many of the existing write locks into read locks. NetBSD is already doing that, and this should be good enough to prevent some threads from going to sleep, thus avoiding SCHED_LOCK() (or any global lock for the sleep queue) contention.
- contention on the pageqlock could be reduced by revisiting/adding per UVM page locking
- 10% of CPU time is spent idle. It is hard to say how much of this is because of the scheduler and/or its interaction with high spinning time. However, it is worth investigating.
- Syscalls that need the KERNEL_LOCK() for this workload fall into 2 categories:
- UVM ones that could be unlocked as part of a UVM next step: execve(2), fork(2), kbind(2), mmap(2), munmap(2), mprotect(2)
- FS ones where the KERNEL_LOCK() could be pushed down to the VFS layer, similarly to what has already been done for read(2) & write(2): dofsstatat(2), doopenat(2), __realpath(2), ioctl(2)
- 2ytHD+make-j17_arm64.svg
The goal of this test was to generate enough workload to not have idle CPUs and to expose where the contention is with "desktop" usage. Almost the same amount of CPU time is spent in userland, ~30-35%, which gives us an indication that the OpenBSD kernel isn't yet scaling to 16 CPUs for such a use case.
- Overall spinning time is also ~40% but with a different distribution (30% on KERNEL_LOCK(), 2% on SCHED_LOCK(), 8% on UVM's pageqlock).
- syscalls that need the KERNEL_LOCK() for this workload are the same as above (for obvious reasons) but the following are, IMHO, the most important ones:
- The kernel lock spinning time in futex(2) is there because sleeping with PCATCH still requires it.
- pipe, unix and network sockets all use selwakeup() and spin there because poll(2) & select(2) still need it.
- With the kqpoll diff (2ytHD+make-j17+kqpoll_unlocked_arm64.svg) the contention in sowakeup() disappears; the one in pipeselwakeup() could receive the same treatment.
- 2ytHD+make-j17+kqpoll_unlocked_arm64.svg
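To make the two spin-time breakdowns quoted above easier to compare, here is a small sketch that sums them and computes how much of the spinning is attributable to the KERNEL_LOCK(). The percentages are the approximate figures from the message; the struct and function names are hypothetical and exist only for this illustration.

```c
#include <stdio.h>

/* Spin time per lock, in percent of CPU time, as quoted in the message. */
struct spin_profile {
	const char *flamegraph;
	int kernel_lock;
	int sched_lock;
	int pageqlock;
};

/* Approximate figures taken from the two lists above. */
const struct spin_profile make_j17 = { "make-j17_arm64.svg", 18, 10, 12 };
const struct spin_profile yt_make_j17 = { "2ytHD+make-j17_arm64.svg", 30, 2, 8 };

/* Total spin time, in percent of CPU time. */
int
spin_total(const struct spin_profile *p)
{
	return p->kernel_lock + p->sched_lock + p->pageqlock;
}

/* Part of the spinning that is on the KERNEL_LOCK(), in percent. */
int
kernel_lock_share(const struct spin_profile *p)
{
	return 100 * p->kernel_lock / spin_total(p);
}

/* Print one summary line for a profile. */
void
print_profile(const struct spin_profile *p)
{
	printf("%s: %d%% spinning, %d%% of it on KERNEL_LOCK()\n",
	    p->flamegraph, spin_total(p), kernel_lock_share(p));
}
```

Both profiles add up to 40% spinning, but the KERNEL_LOCK() accounts for 45% of it in the pure build and 75% of it once the desktop workload is added, which matches the observation that the syscall paths matter most for "desktop" usage.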
- 2ytHD+googlemap_arm64.svg
The intent of this test is to expose where the contention is for a heavily multi-threaded process workload. We didn't care much about idle time; it is much more about low latency, that is, how "smoothly" desktop apps can run, or in other words what happens in the kernel.
- UVM fault unlocking is "good enough" for such a workload and all the contention is due to syscalls
- If we look at time spent in kernel, 37% is spent spinning on the KERNEL_LOCK() and 12% on the SCHED_LOCK(). So almost half of %sys time is spinning.
- futex(2) for FUTEX_WAIT exposes most of it. It spins on the KERNEL_LOCK() because sleeping with PCATCH requires it, then it spins on the SCHED_LOCK() to put itself on the sleep queue.
- kevent(2), poll(2), and DRM ioctl(2) are responsible for a lot of KERNEL_LOCK() contention in this workload
- NET_LOCK() contention in poll(2) and kqueue(2) generates a lot of sleeps which, together with a lot of futex(2) calls, makes the SCHED_LOCK() contention bad.
Conclusion
Unlocking UVM fault is the obvious next step and we are not finished with that yet.
Making poll(2) & select(2) work on top of the kqueue subsystem will allow us to unlock selwakeup() & friends. This will also help for workloads with network traffic going to userland (server, proxy, etc).
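For readers less familiar with the userland side of this, the following is a minimal sketch of the pattern whose kernel wakeup path is being discussed: one process sleeps in poll(2) on a pipe and another wakes it by writing. It uses only standard POSIX calls; the function name `wait_for_pipe_writer` and the 50 ms delay are assumptions made for this illustration, and nothing here shows kernel internals.

```c
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a writer, then sleep in poll(2) on the read end of a pipe until
 * the writer makes it readable. Returns 1 if poll(2) reported POLLIN,
 * 0 if it timed out or saw no data, and -1 on error.
 */
int
wait_for_pipe_writer(int timeout_ms)
{
	int fds[2];
	struct pollfd pfd;
	pid_t pid;
	int n;

	if (pipe(fds) == -1)
		return -1;

	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: the writer. Wait a little so the parent is
		 * likely to be asleep in poll(2) already, then write. */
		close(fds[0]);
		poll(NULL, 0, 50);
		if (write(fds[1], "x", 1) != 1)
			_exit(1);
		_exit(0);
	}

	/* Parent: the sleeper. */
	close(fds[1]);
	pfd.fd = fds[0];
	pfd.events = POLLIN;
	pfd.revents = 0;
	n = poll(&pfd, 1, timeout_ms);

	close(fds[0]);
	waitpid(pid, NULL, 0);

	if (n == -1)
		return -1;
	return (n == 1 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A call such as `wait_for_pipe_writer(2000)` is expected to return 1. According to the message, it is the wakeup performed on the writer's side (selwakeup() and friends) that currently spins on the KERNEL_LOCK() because poll(2) and select(2) still need it.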
Completely unlocking poll(2), select(2) and kqueue(2) will require making rwsleep(9) w/ PCATCH work without KERNEL_LOCK(). This implies making signals work w/o KERNEL_LOCK(). This will also reduce the contention in futex(2).
Unlocking UVM fault will make it easier to unlock many UVM related syscalls. This will help for workloads that fork a lot.
Pushing the KERNEL_LOCK() at the VFS border in all other syscalls that matter can already be done and should already help, so I see no reason to wait.
Questions?
All in all, quite some scope for improvements. Read the entire thing (including any followups) from your favorite archive site or local mailbox.
This promises good things on the horizon.
(Comments are closed)
By Tristan (tristan) tristan@etheria.eu on
Very interesting and really want to check the userland improvements on my gnome desktop. Any idea of timeline for hitting current?