Making openat(2) and friends more useful in practice

Contributed by Peter N. M. Hansteen on 2025-05-28 from the best laid plans of pufferfish and ... dept.

Reining in file system access is hard to get right, even for OpenBSD developers.

In a message to tech@ titled "openat(2) is mostly useless, sadly", Theo de Raadt (deraadt@) describes how the openat(2) family of system calls has failed to live up to expectations in practice, and he proposes changes that may improve the situation.

Theo writes,

List:       openbsd-tech
Subject:    openat(2) is mostly useless, sadly
From:       "Theo de Raadt" <deraadt () openbsd ! org>
Date:       2025-05-28 14:03:29

The family of system calls related to openat(2) are mostly useless in
practice, rarely used. When they are used it is often ineffectively or
even with performance-reducing results.

     int
     openat(int fd, const char *path, int flags, ...);

These are the others:

    sys_fstatat sys_utimensat sys_chflagsat sys_pathconfat sys_faccessat
    sys_fchmodat sys_fchownat sys_linkat sys_mkdirat sys_mkfifoat
    sys_mknodat sys_readlinkat sys_renameat sys_symlinkat sys_unlinkat

The idea is that you can open a directory as fd, typically using O_DIRECTORY,
and then do relative accesses.  This will reduce lookups, and corresponding
locking operations in the kernel.  In practice two things get in the way, as
POSIX specs say:

    The openat() function shall be equivalent to the open() function except
    in the case where path specifies a relative path.

1) What if it is not a relative path, meaning /etc/passwd?
   openat(herefd, "/etc/passwd", O_RDONLY) will open that file and completely
   ignore herefd.

2) What if the relative path is upwards, meaning "../../../../something"?
   It walks up the path, and opens it.

To keep it simple, these calls were not designed to assist any security
model.
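
The two cases above can be observed on any system with the POSIX *at(2) calls. The following is a minimal sketch, not part of the original message; can_open_via() is a hypothetical helper name chosen for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if openat() relative to dirfd
 * manages to open `path` read-only, 0 otherwise.
 */
int
can_open_via(int dirfd, const char *path)
{
	int fd;

	assert(path != NULL);
	fd = openat(dirfd, path, O_RDONLY);
	if (fd == -1)
		return 0;
	close(fd);
	return 1;
}

/* Open a directory for use as the fd argument of the *at(2) calls. */
int
open_dir(const char *path)
{
	return open(path, O_RDONLY | O_DIRECTORY);
}
```

With dirfd = open_dir("/tmp"), both can_open_via(dirfd, "/etc/passwd") and can_open_via(dirfd, "../etc/passwd") would be expected to return 1 on a typical system: the first ignores dirfd entirely, the second walks up out of it.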

Both FreeBSD and Linux have designed variations which do this.  Since all
the *at(2) functions have a flags parameter, their strategy was to add an
additional flag which didn't allow upwards traversal.  I think that misses
the point, and have a different proposal.

Let's create directory fd's which cannot traverse upwards.  Mark the object,
instead of requiring a programmer to put a flag on every system call acting
upon the object.  Two operational flags are added, O_BELOW and F_BELOW.

Creating such a locked directory fd is done with either

     dirfd = open("path", O_DIRECTORY | O_BELOW);

or you can lock a pre-existing dirfd:

     fcntl(dirfd, F_BELOW);

This dirfd has two characteristics.  Absolute accesses always fail with ENOENT.
Relative accesses that attempt to traverse upwards fail with ENOENT.
You can openat(dirfd, ".") but you cannot openat(dirfd, "..").
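
For comparison, the rules above (absolute paths fail, upward traversal fails, "." is fine) can be roughly approximated in userland on an unpatched system. The sketch below is not the kernel mechanism from the draft. check_below() and openat_below() are hypothetical names, the check is purely lexical, it rejects every ".." component (which may be stricter than the draft), and it cannot see symlinks that lead out of the directory.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>

/*
 * Hypothetical lexical check: returns 0 if `path` is relative and
 * contains no ".." component, otherwise -1 with errno set to ENOENT.
 */
int
check_below(const char *path)
{
	const char *p = path;

	assert(path != NULL);
	if (*p == '/') {
		errno = ENOENT;		/* absolute path */
		return -1;
	}
	while (*p != '\0') {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.') {
			errno = ENOENT;		/* upward traversal */
			return -1;
		}
		p += len;
		while (*p == '/')
			p++;
	}
	return 0;
}

/*
 * Hypothetical wrapper: openat() that refuses paths failing the check.
 * It takes no mode argument, so it is not suitable for O_CREAT.
 */
int
openat_below(int dirfd, const char *path, int flags)
{
	if (check_below(path) == -1)
		return -1;
	return openat(dirfd, path, flags);
}
```

The weakness of this approach is the one the message points at: every call site has to remember to use the wrapper, whereas marking the fd moves the check into the kernel.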

Code using readdir() or similar must be careful because they will be provided
with "." and ".." but operations on ".." will now fail.
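
A directory walker that feeds every readdir() name back into an *at(2) call is the kind of code this warning applies to. The following is a minimal sketch using only existing POSIX calls; count_statable() is a hypothetical name, and the parameter is called dfd to avoid clashing with dirfd(3) from <dirent.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the entries of the directory behind dfd
 * that fstatat() can see.  Takes ownership of dfd (it is closed by
 * closedir()).  Returns -1 if the directory stream cannot be created.
 */
int
count_statable(int dfd)
{
	DIR *dp;
	struct dirent *de;
	struct stat sb;
	int n = 0;

	assert(dfd >= 0);
	if ((dp = fdopendir(dfd)) == NULL)
		return -1;
	while ((de = readdir(dp)) != NULL) {
		/* ".." is still returned, but would fail on a locked fd. */
		if (strcmp(de->d_name, "..") == 0)
			continue;
		/* "." is the directory itself, not an entry in it. */
		if (strcmp(de->d_name, ".") == 0)
			continue;
		if (fstatat(dfd, de->d_name, &sb, AT_SYMLINK_NOFOLLOW) == 0)
			n++;
	}
	closedir(dp);
	return n;
}
```

Skipping ".." explicitly keeps the loop's behaviour the same whether or not the fd has been locked, instead of treating the resulting ENOENT as an unexpected error.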

---
An interesting use case shows up that this is a tiny bit like a chroot()
system call allowed for non-root users.  You can
       
     dirfd = open("path", O_DIRECTORY | O_BELOW);
     fchdir(dirfd);

Your process is now contained inside that directory.  This does not
have the classic risks that prevented providing chroot() to regular
processes (meaning, the opening of absolute paths inside the chroot
could confuse library functions because they are now inspecting the
user-created files, and the consequences of this were considered too
grave).  Absolute path accesses with open() start at the process
current directory, and now fail.  I have not explored this regular
user chroot-like thing extensively yet.  Some semantic changes may
be desired.  There's a chance that this becomes something we want
to use in many daemons instead of chroot().
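
A daemon might wrap the two calls above roughly as follows. This is only a sketch under stated assumptions: O_BELOW exists only in the draft diff, so the code is guarded with #ifdef and reports ENOSYS elsewhere. confine_to() is a hypothetical name, and keeping the fd open after fchdir() is an assumption, since the message does not say whether the restriction outlives the descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: confine the process to `path` using the draft
 * O_BELOW flag.  Returns the locked directory fd on success, or -1
 * with errno set (ENOSYS when O_BELOW is not available).
 */
int
confine_to(const char *path)
{
	assert(path != NULL);
#ifdef O_BELOW
	int dfd, saved;

	dfd = open(path, O_RDONLY | O_DIRECTORY | O_BELOW);
	if (dfd == -1)
		return -1;
	if (fchdir(dfd) == -1) {
		saved = errno;
		close(dfd);
		errno = saved;
		return -1;
	}
	return dfd;	/* assumption: caller keeps the fd open */
#else
	(void)path;
	errno = ENOSYS;	/* draft flag not present on this system */
	return -1;
#endif
}
```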

This is just a draft.  The main idea comes out of reviewing one program
which uses openat() strangely, and wondering if we can do pathname
containment better in the kernel.  This can work nicely alongside unveil(),
but it is cheaper because the kernel doesn't need to hold references to
vnodes like unveil() does.

Index: […]

and the rest of the message is the diff (against -current) that implements the draft proposal.

What do you think? As a developer, what would this mean for the code you write and maintain? Testing and feedback are welcome, as always.

