OpenBSD Journal
From the trenches: espie@ reports on recent experiments in package building
Contributed by tbert on Fri Mar 7 12:52:45 2014 (GMT)
from the puff my package! dept.

In a recent post to the ports mailing list titled "dpb fun", Marc Espie (espie@) reported on tests running the OpenBSD distributed ports builder on larger than usual hardware and improvements that sprang from the test:

So, I got access to a bunch of fast machines through Yandex. Big kudos to them. It allowed me to continue working on dpb optimizations for fast clusters, after the tantalizing glimpse into big clusters I got a few months ago thanks to an experiment led by Florian Obser.

The rest of the post follows after the fold; it looks like exciting times are ahead.

First remark is that we don't really scale all that well on a lot of cpus on the same machine (duh). It's probably faster to build things on a cluster with 5~8 machines with 4 cores each than on 3 machines with 16 cores each...

It looks like you really really want to disable hyper-threading. In my test, disabling it amounts to a 20% performance increase.

Having memory helps... I guess the turning point is somewhere around 20GB-50GB per box... around there, you can build all ports in memory, and also have other interesting parts in tmpfs as well, such as /usr/local. It doesn't help as much as I would have guessed.

I've experimented with storing packages locally. Doesn't help all that much. Building things directly on the NFS server doesn't hurt.

One thing that does hurt, though, is computing dependencies while building packages. I have a set of patches to enable a "global depends cache" that seems to shave an extra minute or so per package (to divide by 12 or so, since this is wall-clock time on a single cpu). I've already committed the src/ part (tweaks to PkgCreate) and the rest will be in when ports unlock. Specifically, it just requires creating cache entries in tmp files and renaming them to their final destination, so that several packages may be built simultaneously without stepping on each other's toes.
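The "write to a tmp file, then rename" pattern described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the actual PkgCreate code (which is Perl); the function names, the cache directory and the entry contents are all hypothetical.

```python
import os
import tempfile


def write_cache_entry(cache_dir, key, data):
    """Atomically publish a cache entry named `key` inside `cache_dir`."""
    os.makedirs(cache_dir, exist_ok=True)
    # Create the temporary file in the same directory, so the final
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
        # On POSIX systems the rename is atomic: concurrent readers see
        # either no entry (or the old one) or the complete new one.
        os.replace(tmp_path, os.path.join(cache_dir, key))
    except BaseException:
        # Do not leave a stray temporary file behind on failure.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_cache_entry(cache_dir, key):
    """Return the cached text for `key`, or None if there is no entry yet."""
    try:
        with open(os.path.join(cache_dir, key)) as entry:
            return entry.read()
    except FileNotFoundError:
        return None
```

Two builders that compute the same entry at the same time simply overwrite each other with complete files, so neither ever reads a half-written cache entry.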

I also discovered that thanks to some buggy code reorganization of mine, I've inadvertently nullified an important optimization (BUILD_ONCE) I implemented a few years ago... which was fairly easy to restore, fortunately.

Another interesting improvement was smarter scheduling of available cores. The initial dpb code is very crude, and just grabs the first core available to run things. If all machines fire up simultaneously, this ends up in a "thundering herd" stampede, as 12 jobs are started on the first host, then 12 on the next one, etc... most specifically, the LISTING job starts up slow, as it competes with another 11 jobs almost right away... and dpb tends to empty its whole queue right away when you've got over 40 cpus to play with... faster LISTING == bigger queue == greater chance of full cpu utilisation.
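To make the "thundering herd" effect concrete, here is a toy comparison of two core-selection policies. It is a sketch under assumptions: the post does not say which policy dpb adopted, so `least_loaded` is just one plausible way to spread jobs, and the host names and function names are made up for the example.

```python
def first_free(hosts, running):
    """Crude policy: take the first host that still has an idle core."""
    for name, cores in hosts:
        if running[name] < cores:
            return name
    return None


def least_loaded(hosts, running):
    """Spread policy: take the host with the lowest fraction of busy cores."""
    candidates = [(running[name] / cores, index, name)
                  for index, (name, cores) in enumerate(hosts)
                  if running[name] < cores]
    return min(candidates)[2] if candidates else None


def assign(hosts, njobs, policy):
    """Start `njobs` jobs one by one and return the per-host job counts."""
    running = {name: 0 for name, _ in hosts}
    for _ in range(njobs):
        host = policy(hosts, running)
        if host is None:
            break  # every core is busy
        running[host] += 1
    return running


hosts = [("host1", 12), ("host2", 12), ("host3", 12)]
print(assign(hosts, 12, first_free))    # {'host1': 12, 'host2': 0, 'host3': 0}
print(assign(hosts, 12, least_loaded))  # {'host1': 4, 'host2': 4, 'host3': 4}
```

With the crude policy the first twelve jobs all land on one host, so an early job such as LISTING competes with eleven neighbours; with the spread policy it shares its host with only three.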

One thing I haven't solved yet is that apparently, ssh in master mode tends to refuse shared connections when you go to 16 jobs per machine... I haven't investigated, not sure whether this is a limitation of ssh, or some machine limit I haven't found.

(important note: dpb with lots of cores gobbles resources like crazy... you want to seriously crank up process#, fd#, memory usage).
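As a quick way to see what a build account is currently allowed, the following sketch prints the soft and hard limits for open files, processes and data size using Python's standard `resource` module. It only inspects limits; on OpenBSD they are normally raised per login class in login.conf(5). The selection of limits and the helper names are assumptions made for illustration, not something the post specifies.

```python
import resource

# Limits a many-core build host is likely to run into first (an assumption).
LIMITS = {
    "open files": resource.RLIMIT_NOFILE,
    "processes": resource.RLIMIT_NPROC,
    "data segment": resource.RLIMIT_DATA,
}


def describe(value):
    """Render a limit value, spelling out the 'no limit' marker."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={describe(soft)} hard={describe(hard)}")
```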

With all this, the biggest problem left is that dpb will lose a lot of time in "waiting-for-lock" states... I've done a quick patch that helps a common case: when a job is run in parallel, it would release lots of cores (maxcores/2) at the end of packaging, and hence... lots of waiting-for-lock.

The job can actually regurgitate those cores at the end of fake... this still leads to loads of waiting-for-locks, but those will hopefully be solved by the end of packaging.

I'm currently experimenting with further tricky patches to make jobs in depends aware of other jobs waiting for the same lock on the same machine. This entails trying to wake other jobs in order, trying to solve depends for several jobs at once, and preventing junk from removing depends for stuff that's already been analyzed but not solved yet. These patches are somewhat necessary: the code is rather complicated, but it yields some impressive performance benefits: without it, the end of a normal bulk wastes over 2 hours with most cores being in "waiting-for-lock" state.

(colors for dpb -DCOLOR mode will also be adjusted after unlock... turns out yellow-over-red is very hard to read (duh) when you've got to shrink the font to fit it all on a single screen, and it pays to distinguish waiting-for-lock states from other frozen states).

I hope to have all this in decent enough shape by the time we unlock ports.

All in all, after a few weeks of tweaks and fun, I've come up with impressive speed-ups... That setup went down from 21 hours to 14 hours and a half for a full bulk.
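For a sense of scale, the two bulk times quoted above work out as follows. The calculation uses only the 21-hour and 14.5-hour figures from the post; the variable names are just for the example.

```python
before_hours = 21.0   # full bulk before the tweaks
after_hours = 14.5    # full bulk after the tweaks

saved = before_hours - after_hours      # wall-clock hours saved per bulk
reduction = saved / before_hours        # fraction of the original time removed
speedup = before_hours / after_hours    # how many times faster the bulk runs

print(f"saved {saved:.1f} h, {reduction:.0%} less time, {speedup:.2f}x faster")
# prints: saved 6.5 h, 31% less time, 1.45x faster
```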

(as for particulars, we're talking three Xeon E5 machines with 128G of ram each, 16 cpus per machine, hyper-threading disabled... 12 cpus actually useful... running -j16 didn't help... parallel set to 4 seems to yield better results than the default 6 in that case...)

Of course, the form-factor of the cluster is very important. A lot of these problems don't even show up on smaller clusters, so it would have been impossible to achieve without the donation. So thanks again, especially to Anton Karpov.






  Re: From the trenches: espie@ reports on recent experiments in package building (mod 0/4)
by Laurence Rochfort (80.0.78.224) on Fri Mar 7 20:14:04 2014 (GMT)
  My understanding of Intel hyper-threading is that the OS has to support it. Does OpenBSD support it?

Is Marc's observation that one "really really" wants to disable it a general observation or specific to dpb?

For instance, I run a Core i5 as my day-to-day desktop. Should hyper-threading be disabled in this context? Not that I'd notice the difference I'm sure.

  Re: From the trenches: espie@ reports on recent experiments in package building (mod 2/4)
by Anonymous Coward (2001:8b0:112f:2:5e51:4fff:fe15:af4) on Sat Mar 8 16:21:34 2014 (GMT)
  "One thing I haven't solved yet is that apparently, ssh in master mode tends to refuse shared connections when you go to 16 jobs per machine... I haven't investigated, not sure whether this is a limitation of ssh, or some machine limit I haven't found."

The default for MaxSessions is 10.

  Re: From the trenches: espie@ reports on recent experiments in package building (mod 2/2)
by Sebastian Rother (srother) (srother@mercenary-security.com) on Sun Mar 9 12:37:09 2014 (GMT)
https://www.mercenary-security.com
  Dear Marc,

Thanks a lot for your effort to optimize and speed up the package building (thus -current profits from this, maybe some day even -stable to provide faster updates for sec. related issues).

While reading your story I thought about password cracking and that those applications face similar issues related to clustering their tasks. Would it be beneficial if you contacted, for example, Solar Designer and talked with him about clustering concepts?

Of course the needs related to the supported CPU optimizations (SSE2 and co) are not relevant for ports building, but the concepts of scheduling tasks are. Password crackers basically face the same issues (waiting for other tasks/dependencies_for_a_specific_task and co), thus some exchange could maybe be inspiring to improve the mechanism in dpb?

Apart from this: would it maybe be beneficial to dedicate one CPU core to compressing the built packages? Or maybe compress all packages once everything is built (just tar'ed until then). CPUs nowadays optimize their instructions, and if one core just compresses it might speed up the process a little bit further (imho the instructions would stay in the cache). So that all cores just do compression at the end. But this maybe needs to get tested and I am unsure about the improvements, but I'd consider 1-4% as maybe realistic.

  Re: From the trenches: espie@ reports on recent experiments in package building (mod 0/2)
by Marc Espie (espie) (espie@nerim.net) on Sun Mar 9 21:05:30 2014 (GMT)
  Ports tree unlocked! a large part of that is now in the tree...

the dubious parts are still being worked on.


Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. Some icons from slashdot.org used with permission from Kathleen. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. Search engine is ht://Dig. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]