OpenBSD Journal

2Q Buffer Cache in OpenBSD

Contributed by tbert on from the i will always 2q dept.

Ted Unangst (tedu@) wrote a blog post about his replacement of the simple LRU buffer cache algorithm with a 2Q-ish one:

Since the dawn of time, the OpenBSD buffer cache replacement algorithm has been LRU. It’s not always ideal, but it often comes close enough and it’s simple enough to implement that it’s remained the tried and true classic for a long time. I just changed the algorithm to one modelled somewhat after the 2Q algorithm by Johnson and Shasha. (PDF)
LRU

LRU is simple enough it doesn’t require much explanation. Keep a list of all buffers. Whenever you use one, put it on the front of the list. Whenever you need a new (recycled) buffer, take it from the end of the list. Those are the oldest, least recently used buffers. In high level terms, the current working set is at the front of the list and the previous working set is fading away off the end. It’s responsive to changes in the working set, very quickly replacing old unused buffers with the latest. In other words, it has a short history; it’s not “sticky”.

This strength is also a weakness. LRU is particularly vulnerable to erroneous replacement during scanning; i.e. it’s not scan resistant. After weeks of uptime, your system is running great and everything you need from disk is in cache. Then you download a few Linux ISOs to see what the fuss is about. Suddenly, even simple commands like ps need to be read in from disk. Seek, seek, thrash, seek, grind. Should have bought an SSD. What happened? Those ISOs were so large that they pushed everything out of the disk cache. (The obvious optimization, to only cache reads and not writes, doesn’t help here because you’re equally obviously going to run sha256 and checksum those ISOs, leading to the same result.)

This problem has been known for long time, but thanks again to the short history rarely causes permanent damage. And caches with history are often harder to implement due to the additional state tracking required, and they can even suffer from too much history. After weeks of uptime, the exec pages for a command like ls are going to be pegged in memory. What if you catch color fever and switch to using colorls? Sorry, those plain ls buffers aren’t going to be recycled. Eldritch buffers are eldritch.

At the hackathon I decided to look into the problem of scan resistance a little more. A hundred CS papers have proposed better algorithms; I settled on 2Q. Or at least something inspired by it, as the final result is more or less different depending on your squint.

2Q

Reading the paper is the best way to understand exactly what they’re doing. I’ll just summarize the highlights that I thought were applicable. The main insight is that in addition to LRU’s current working set and previous working set, there’s a third working set: the long term working set. We start by assuming that a buffer is in the current working set, then quickly move it to the previous working set (or sets; they keep quite a long history). If we need that buffer again, that’s a hint that it’s really in the long term working set. To conserve memory, 2Q doesn’t actually keep the previous working set data cached in memory. Only the addresses are retained which is enough to determine when a cache miss is new data vs previously seen data.

The 2Q algorithm is scan resistant. The memory reserved for current working set is kept quite small; most of it is dedicated to long term. This keeps transient data out. At the same time, the long term list is itself managed as LRU, so it’s responsive to changes. All good.

bufcache

Earlier, I refactored the buffer cache interface so that the line between the buf cache and buf midlayer was better defined. The bufcache family of functions does all the work now, allowing us to make algorithm changes in one place. At the time I was experimenting with a different LFU style algorithm, but it was sufficient to prove the interface would work. 2Q represents the first real world test.

not quite 2Q

The code in OpenBSD resembles 2Q, but contains some significant differences as well. In large part, this is because I had another priority: ease of implementation. 2Q keeps a history of past buffer addresses to know when a miss was previously in the cache; bufcache doesn’t provide that capability in its current form. There were some other obstacles as well. The result is that I tried to implement the simplest variation I could, while also considering future plans. There are comments in the code as well, but I repeat the explanation here.

We retain the key three working set distinction. In the OpenBSD code, they are named hot, cold, and warm, and each is an LRU queue. New buffers start hot. They stay that way as long as they remain on the hot queue. Eventually, a buffer will slip from the end of the hot queue onto the front of the cold queue. (We preserve the data, not just the address.) When a new buffer is needed, we recycle one from the tail of the cold queue. The oldest and coldest. If, on the other hand, we have a cache hit on a cold buffer, it turns into a warm buffer and goes to the front of the warm queue. Then as the warm queue lengthens, buffers start slipping from the end onto the cold queue. Both the hot and warm queues are capped at one third of memory each to ensure balance.

Scan resistance is achieved by means of the warm queue. Transient data will pass from hot queue to cold queue and be recycled. Responsiveness is maintained by making the warm queue LRU so that expired long term set buffers fade away.

results

In some ad hoc md5 some large files testing, I confirmed the contents of the bin directory remained in cache when they otherwise would not. Other interactive responsiveness testing went well, too. The primary goal had been to achieve some degree of scan resistance: mission accomplished.

The 2Q paper results are based on replaying traces in a simulator, which is the kind of thing you have to do to get papers published, but which also means they didn’t have to integrate with an existing system which anybody used. I’m not trying to publish a paper, so my results are lacking, but I did have to weld this thing into the existing kernel’s buf layer, which is why it didn’t come out quite the same. I’m all about rigor, as long somebody else does the work.

future

The numbers and ratios for the various queue lengths may need tuning. The current design and numbers were selected in part to minimize regressions, not maximize performance. One hopeful change is to finally add support for highmem on amd64 systems. The same algorithm that works to balance between hot and warm queues will work to balance between low and high mem. Now at least we have a launch point for future efforts to build upon.

A name? TU-Q?

(Comments are closed)


Comments
  1. By Josh Grosse (129.9.75.252) on

    You need a name? How about tedq?

    Comments
    1. By Renaud Allard (renaud) on

      > You need a name? How about tedq?

      TUQ seems better as it could be pronounced quite like 2Q and it's Ted Unangst Queue.

      Comments
      1. By Josh Grosse (129.9.75.252) on

        > TUQ seems better as it could be pronounced quite like 2Q and it's Ted Unangst Queue.

        Oooh, Better than my suggestion!

        Comments
        1. By rjc (rjc) on

          > Oooh, Better than my suggestion!

          Well, it's not *better* than your suggestion - it's simply *there* in the quoted blog post ;^)

      2. By Anonymous Coward (63.231.28.161) on

        > > You need a name? How about tedq?
        >
        > TUQ seems better as it could be pronounced quite like 2Q and it's Ted Unangst Queue.

        Sorry, but I must ask:
        Why is overlapping pronunciation a benefit?

        It's kind of cool that you've managed to accomplish this feat, sort of like how solving a puzzle is cool. But I would prefer that things be less confusing.

        I had an aunt and uncle who had 13 kids and all of them had a name that started with the letter K. In some cases, the name of one child was pronounced the same as the name of an opposite-sexed sibling, but had a different spelling. I can applaud the creativity in managing to come up with such a list of names. The fact that they implemented that within a single household was not always very convenient.

        Multiple names for the same thing (Es Cue El, Sequel; Slash Ee Tee See, Etsy) can be confusing enough. Multiple things with the same name ("root" referring to / is standard, but then /tmp is often referred to as "temp" and /var is often referred to as /var, and then we have /root so now we have two "root"s) is even worse.

        As long as nobody mentions TUQ in speeches, the confusion will be minimal. But if TUQ is discussed in a seminar, I don't want someone to refer to it as "too cue" because I will be thinking it is 2Q. Which might not be a problem as long as they haven't diverged, but at the point they get to be a bit different, then having a unique name is nice.

        Comments
        1. By Anonymous Coward (67.225.32.71) on

          > > > You need a name? How about tedq?
          > >
          > > TUQ seems better as it could be pronounced quite like 2Q and it's Ted Unangst Queue.
          >
          > Sorry, but I must ask:
          > Why is overlapping pronunciation a benefit?
          >
          > It's kind of cool that you've managed to accomplish this feat, sort of like how solving a puzzle is cool. But I would prefer that things be less confusing.
          >
          > I had an aunt and uncle who had 13 kids and all of them had a name that started with the letter K. In some cases, the name of one child was pronounced the same as the name of an opposite-sexed sibling, but had a different spelling. I can applaud the creativity in managing to come up with such a list of names. The fact that they implemented that within a single household was not always very convenient.
          >
          > Multiple names for the same thing (Es Cue El, Sequel; Slash Ee Tee See, Etsy) can be confusing enough. Multiple things with the same name ("root" referring to / is standard, but then /tmp is often referred to as "temp" and /var is often referred to as /var, and then we have /root so now we have two "root"s) is even worse.
          >
          > As long as nobody mentions TUQ in speeches, the confusion will be minimal. But if TUQ is discussed in a seminar, I don't want someone to refer to it as "too cue" because I will be thinking it is 2Q. Which might not be a problem as long as they haven't diverged, but at the point they get to be a bit different, then having a unique name is nice.

          It also reminds me of the Canadian word tuque, which is fitting for a project headquartered out of Alberta. I hope the name sticks.

  2. By Peter J. Philipp (pjp) pjp@solarscale.de on http://centroid.eu

    Perhaps somewhat unrelated but the PDF kept crashing my mozilla firefox. A recent thread on -misc I was on someone revealed how to increase the limits in login.conf:

    datasizemax === 3G
    datasize-cur === 2G

    This does the job for me without crashes. So thanks Amit from -misc for that helpful hint!

    -peter

  3. By Anonymous Coward (185.36.100.145) on

    Can I tune off bufcache on some mountpoint?

    Comments
    1. By Anonymous Coward (79.238.5.210) on

      > Can I tune off bufcache on some mountpoint?

      Why would you want to?

      Comments
      1. By Anonymous Coward (193.191.170.2) on

        > > Can I tune off bufcache on some mountpoint?
        >
        > Why would you want to?

        For database files; for which the DBMS is already doing the caching ?

        Comments
        1. By Anonymous Coward (85.21.236.247) on

          > > > Can I tune off bufcache on some mountpoint?
          > >
          > > Why would you want to?
          >
          > For database files; for which the DBMS is already doing the caching ?

          I'd rather like to see Direct I/O support in OpenBSD than that.

        2. By Janne Johansson (jj) on http://www.inet6.se

          > > > Can I tune off bufcache on some mountpoint?
          > >
          > > Why would you want to?
          >
          > For database files; for which the DBMS is already doing the caching ?

          In the old days, you'd give your DB a raw device instead of a file in a partition just so that it would bypass the buffering. I guess that became unfashionable.

    2. By Anonymous Coward (193.191.170.2) on

      > Can I tune off bufcache on some mountpoint?

      For database files; for which the DBMS is already doing the caching ?

Latest Articles

Credits

Copyright © - Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]