Contributed by jcr from the lord-voldebert-is-out-to-lunch dept.
OpenBSD developer Ted Unangst (tedu@) recently wrote a blog post on Shared Named Semaphores and with permission it's reposted below:
Support for shared named semaphores, à la sem_open(3), recently arrived in OpenBSD. OpenBSD already supported single-process, thread-shared semaphores, à la sem_init(3), and the old school SysV semaphores, à la semget(2). There are still a few tweaks being made, but the internal design hasn't changed in 24 hours so I figure it's safe to discuss the implementation.
The basic building block of rthreads synchronization is a pair of syscall(2) routines, __thrsleep(3) and __thrwakeup(3). They expose to userland the two basic operations of the kernel scheduler. Sleeping and waking are controlled by passing the address of the object you are waiting for, such as the semaphore or mutex. However, they only permit operating on threads within a process (mainly for performance, not security). When a thread calls __thrwakeup(3), we don't want to wake up every thread in every process (thundering herd), but we also don't want to spend too much time finding the threads to wake up. Instead, only the list of threads for the current process is searched to find eligible targets.
Different processes can have different addresses for the same shared semaphore, making them incompatible with the current __thrsleep(3) design. Making this work across processes is very simple if we cheat with a giant hack. Userland specifies a magic address, (void *)-1, that won't ordinarily be used, and the kernel now knows to place all these sleeping threads on a global list, such that any call to __thrwakeup(3) will revive all of them regardless of process.
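To make the lookup rule concrete, here is a toy userland model of it. This is only a sketch under loud assumptions: the names toy_sleep, toy_wakeup and SHARED_IDENT are invented for illustration, a flat table stands in for the kernel's per-process and global lists, and nothing here actually blocks or schedules threads.

```c
#include <stddef.h>

#define SHARED_IDENT ((void *)-1)   /* the magic address from the text */
#define MAX_SLEEPERS 16

/* Hypothetical record of one sleeping thread. */
struct sleeper {
    int   pid;      /* owning process (toy stand-in) */
    void *ident;    /* address the thread is sleeping on */
    int   asleep;   /* 1 while sleeping, 0 once woken or unused */
};

static struct sleeper sleepers[MAX_SLEEPERS];

/* Record a sleeping thread; returns its slot, or -1 if the table is full. */
int toy_sleep(int pid, void *ident)
{
    for (int i = 0; i < MAX_SLEEPERS; i++) {
        if (!sleepers[i].asleep) {
            sleepers[i] = (struct sleeper){ pid, ident, 1 };
            return i;
        }
    }
    return -1;
}

/* Wake sleepers matching ident; returns how many were woken. */
int toy_wakeup(int pid, void *ident)
{
    int woken = 0;
    for (int i = 0; i < MAX_SLEEPERS; i++) {
        if (!sleepers[i].asleep || sleepers[i].ident != ident)
            continue;
        /* Ordinary addresses only match within the caller's process;
         * the magic address matches sleepers in every process. */
        if (ident != SHARED_IDENT && sleepers[i].pid != pid)
            continue;
        sleepers[i].asleep = 0;
        woken++;
    }
    return woken;
}
```

In this model a wakeup on an ordinary address from process 1 leaves a sleeper in process 2 alone even if the address is numerically equal, while a wakeup on SHARED_IDENT revives every shared sleeper at once, which is exactly the inefficiency the hack accepts.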
It's not especially efficient, but it lets us add the API and make it work, which an increasing number of programs in the ports tree depend on. Working and slow is better than not working. If the expected number of system wide shared semaphores is small, it doesn't hurt at all.
As mentioned, we already have thread-shared semaphores. They have all the internals we need for process-shared semaphores: value, waitcount, spinlock. We just need the same semaphore to appear in different processes. And we need to be able to identify it by name. Sounds like a job for the filesystem.
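For readers who haven't looked inside a semaphore, the three internals could be pictured roughly like this. It is a sketch, not the OpenBSD structure: the struct layout and the toy_ names are hypothetical, C11 atomic_flag stands in for the per-architecture spinlock assembly, and the blocking path that would bump waitcount and go to sleep is left out.

```c
#include <stdatomic.h>

/* Hypothetical layout; field names follow the list in the text. */
struct toy_sem {
    atomic_flag lock;       /* spinlock guarding the fields below */
    int         value;      /* current semaphore count */
    int         waitcount;  /* threads that would be sleeping on it */
};

static void toy_lock(struct toy_sem *s)
{
    while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_sem *s)
{
    atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Take one unit if available; returns 0 on success, -1 if it would block. */
int toy_trywait(struct toy_sem *s)
{
    int rc = -1;
    toy_lock(s);
    if (s->value > 0) {
        s->value--;
        rc = 0;
    }
    toy_unlock(s);
    return rc;
}

/* Add one unit; returns 1 if a sleeper should be woken, else 0. */
int toy_post(struct toy_sem *s)
{
    toy_lock(s);
    s->value++;
    int wake = s->waitcount > 0;
    toy_unlock(s);
    return wake;
}
```

Since the struct holds no pointers, nothing in it cares which address it is mapped at, which is why placing it in memory visible to several processes is most of the work.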
The name passed to sem_open(3) gets mangled up via SHA256 just like shm_open(3). Then we open said file in /tmp, do some permission checks (with the same sharing restrictions as shm_open), mmap(2) it, and done. Very few changes needed to be made to the existing implementation other than a check for a shared semaphore and use of the special wait address.
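The open, mmap(2), close sequence might look something like the following. This is a minimal sketch with assumptions spelled out: map_shared_file is a made-up helper name, the caller supplies the path directly (no SHA256 mangling is attempted), and the permission checks mentioned above are omitted.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map `size` bytes of the file at `path` as shared memory, creating the
 * file if needed. Returns NULL on failure. `path` is a hypothetical
 * stand-in for the hashed name.
 */
void *map_shared_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1)
        return NULL;

    /* Make sure the file is large enough to back the mapping. */
    if (ftruncate(fd, (off_t)size) == -1) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    /* The descriptor is only needed for mmap(2); the mapping outlives it. */
    close(fd);

    return p == MAP_FAILED ? NULL : p;
}
```

Calling the helper twice on the same path yields two different addresses backed by the same bytes, so a store through one pointer is visible through the other. The same holds when the two calls happen in different processes.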
In contrast to the shm_open(3) API, which I think is entirely useless (just roll your own in one line of code), the sem_open(3) API is a good addition. You don't want everybody implementing their own semaphores. As an example, the OpenBSD implementation requires two system-specific syscalls and has spinlock assembly implementations for every CPU architecture. The interface is considerably easier to work with than the old SysV semget(2)/semop(2) combo.
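For a feel of the interface from the caller's side, here is a small sketch using only the standard POSIX calls. The helper name named_sem_roundtrip and whatever semaphore name the caller passes are hypothetical, and everything happens in one process for brevity; a second process calling sem_open(3) with the same name would get the same semaphore.

```c
#include <fcntl.h>
#include <semaphore.h>
#include <sys/stat.h>

/*
 * Open (or create) a named semaphore, post once, then consume that post.
 * Returns 0 on success, -1 on any failure.
 */
int named_sem_roundtrip(const char *name)
{
    sem_t *s = sem_open(name, O_CREAT, 0600, 0);   /* initial value 0 */
    if (s == SEM_FAILED)
        return -1;

    int rc = -1;
    if (sem_post(s) == 0 && sem_trywait(s) == 0)
        rc = 0;

    sem_close(s);
    sem_unlink(name);   /* remove the name so the sketch leaves nothing behind */
    return rc;
}
```

On some systems this needs to be linked with the threads library (for example with -pthread).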
It's not perfect. POSIX (need I say more?) requires that two successive calls to sem_open(3) with the same name return the same pointer. Not the same semaphore (duh), but the same pointer. This doesn't make any sense to me. Why should you care? The same semaphore mmapped at two addresses is going to behave exactly the same. The only explanation I can think of is you need to know if two semaphores are the same, but it seems an easier way to solve that problem is don't forget what semaphores you've opened.
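"Don't forget what semaphores you've opened" can be as little as a name-to-pointer table kept by the application. The sketch below is one hypothetical way to do that, not anything libc provides: lookup_opened and remember_opened are invented names, the table sizes are arbitrary, and it is not thread-safe.

```c
#include <stddef.h>
#include <string.h>

#define MAX_OPEN 32
#define MAX_NAME 64

/* One remembered name and the pointer that was returned for it. */
struct open_entry {
    char  name[MAX_NAME];
    void *ptr;
};

static struct open_entry open_table[MAX_OPEN];
static size_t open_count;

/* Return the pointer remembered for name, or NULL if it was never recorded. */
void *lookup_opened(const char *name)
{
    for (size_t i = 0; i < open_count; i++)
        if (strcmp(open_table[i].name, name) == 0)
            return open_table[i].ptr;
    return NULL;
}

/* Remember ptr under name; returns 0 on success, -1 if it does not fit. */
int remember_opened(const char *name, void *ptr)
{
    if (open_count == MAX_OPEN || strlen(name) >= MAX_NAME)
        return -1;
    strcpy(open_table[open_count].name, name);
    open_table[open_count].ptr = ptr;
    open_count++;
    return 0;
}
```

A caller would check lookup_opened() before calling sem_open(3) and record the result with remember_opened() afterwards, so the same name always hands back the same pointer within that program.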
In any case, fixing this is wicked hard. sem_open(3) has no idea what semaphores are already open. The file descriptor is only used to call mmap(2); after that the file is closed, leaving only the memory segment, so there is no open file to find (as if scanning through all open files would be a good idea). There's no way to ask the kernel if some file is mapped into your address space. The kernel can't even answer; the memory segment points to the file and the file doesn't have pointers to all of its mappings, only a ref count.
Hopefully nobody depends on this behavior. I guess we'll find out.