Contributed by tbert on from the puff my package! dept.
In a recent post to the ports mailing list titled "dpb fun", Marc Espie (espie@) reported on tests running the OpenBSD distributed ports builder on larger than usual hardware and improvements that sprang from the test:
So, I got access to a bunch of fast machines through Yandex. Big kudos to them. It allowed me to continue working on dpb optimizations for fast clusters, after some tantalizing glimpses into big clusters I got a few months ago thanks to some experiment led by Florian Obser.
The rest of the post follows after the fold; it looks like exciting times are ahead.
First remark is that we don't really scale all that well on a lot of cpus on the same machine (duh). It's probably faster to build things on a cluster with 5~8 machines with 4 cores each than on 3 machines with 16 cores each... It looks like you really really want to disable hyper-threading. In my tests, disabling it amounts to a +20% performance increase.
Having memory helps... I guess the turning point is somewhere around 20GB-50GB per box... around there, you can build all ports in memory, and also have other interesting parts in tmpfs as well, such as /usr/local. It doesn't help as much as I would have guessed.
I've experimented with storing packages locally. Doesn't help all that much. Building things directly on the NFS server doesn't hurt.
One thing that does hurt, though, is computing dependencies while building packages. I have a set of patches to enable a "global depends cache" that seems to shave an extra minute or so per package (to divide by 12 or so, since this is wall-clock time on a single cpu). I've already committed the src/ part (tweaks to PkgCreate) and the rest will be in when ports unlock. Specifically, it just requires creating cache entries in tmp files and renaming them to their final destination, so that several packages may be built simultaneously without stepping on each other's toes.
I also discovered that thanks to some buggy code reorganization of mine, I've inadvertently nullified an important optimization (BUILD_ONCE) I implemented a few years ago.... which was fairly easy to restore, fortunately.
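The write-to-a-temp-file-then-rename trick mentioned above is a general pattern, and a minimal sketch of it may help. This is not the actual PkgCreate or dpb code (those are Perl); the function name, the cache directory and the entry name are made up for illustration. The only assumption is a POSIX-like filesystem where a rename within one directory is atomic.

```python
import os
import tempfile


def write_cache_entry(cache_dir, name, data):
    """Publish cache_dir/name atomically (hypothetical helper, not dpb code).

    The entry is written to a uniquely named temp file first, then renamed
    to its final name, so a concurrent reader sees either the old complete
    entry or the new complete entry, never a half-written one.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, name)
    # The temp file lives in the same directory as the final entry, so the
    # rename never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, prefix=name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        # os.replace overwrites an existing entry in a single step.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Do not leave stray temp files behind if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    return final_path
```

Two builders that compute the same entry at the same time each write their own temp file; whichever rename lands last wins, and both results are complete files.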
Another interesting improvement was smarter scheduling of available cores. The initial dpb code is very crude, and just grabs the first core available to run things. If all machines fire up simultaneously, this ends up in a "thundering herd" stampede, as 12 jobs are started on the first host, then 12 on the next one, etc... most specifically, the LISTING job starts up slow, as it competes with another 11 jobs almost right away... and dpb tends to empty its whole queue right away when you've got over 40 cpus to play with... faster LISTING == bigger queue == greater chance of full cpu utilisation.
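To make the difference concrete, here is a toy sketch of the two placement strategies. It is not dpb's scheduler (dpb is written in Perl and the post does not show its code); the host names and function names are invented, and "spread one job per host in turn" is only one plausible reading of "smarter scheduling".

```python
from collections import Counter


def fill_first_host(hosts, njobs):
    """Crude placement: take free cores in order, host by host."""
    slots = [(host, core) for host, cores in hosts.items() for core in range(cores)]
    return slots[:njobs]


def spread_across_hosts(hosts, njobs):
    """Spread placement: take one core from each host in turn."""
    slots = []
    most_cores = max(hosts.values(), default=0)
    for core in range(most_cores):
        for host, cores in hosts.items():
            if core < cores:
                slots.append((host, core))
    return slots[:njobs]


def jobs_per_host(placement):
    """Count how many jobs landed on each host."""
    return dict(Counter(host for host, _ in placement))


if __name__ == "__main__":
    # Hypothetical cluster: three hosts with 12 usable cores each.
    cluster = {"host1": 12, "host2": 12, "host3": 12}
    print(jobs_per_host(fill_first_host(cluster, 12)))
    # {'host1': 12}
    print(jobs_per_host(spread_across_hosts(cluster, 12)))
    # {'host1': 4, 'host2': 4, 'host3': 4}
```

With the crude strategy the first twelve jobs, LISTING included, all fight over one machine; with the spread strategy each early job has far fewer neighbours competing with it.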
One thing I haven't solved yet is that apparently, ssh in master mode tends to refuse shared connections when you go to 16 jobs per machine... I haven't investigated, not sure whether this is a limitation of ssh, or some machine limit I haven't found.
(important note: dpb with lots of cores gobbles resources like crazy... you want to seriously crank up process#, fd#, memory usage).
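Before cranking those limits up, it helps to know what they currently are. The snippet below is a small sketch using Python's standard resource module to show the soft and hard limits for processes, open files and data size as seen by the current process; it has nothing to do with dpb itself, and the labels are arbitrary. On OpenBSD the limits themselves are typically raised per login class in login.conf(5).

```python
import resource

# Labels are arbitrary; the constants are standard Unix resource limits.
LIMITS = {
    "processes": resource.RLIMIT_NPROC,
    "open files": resource.RLIMIT_NOFILE,
    "data size": resource.RLIMIT_DATA,
}


def current_limits():
    """Return {label: (soft, hard)} for the limits listed above."""
    return {label: resource.getrlimit(which) for label, which in LIMITS.items()}


def show(value):
    """Render a limit value, spelling out the 'no limit' marker."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={show(soft)} hard={show(hard)}")
```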
With all this, the biggest contender left is that dpb will lose a lot of time in "waiting-for-lock" states... I've done a quick patch that helps a common case: when a job is run in parallel, it would release lots of cores (maxcores/2) at the end of packaging, and hence... lots of waiting-for-lock.
The job can actually regurgitate those cores at the end of fake... this still leads to loads of waiting-for-locks, but those will hopefully be solved by the end of packaging.
I'm currently experimenting with further tricky patches to make jobs in depends aware of other jobs waiting for the same lock on the same machine. This entails trying to wake other jobs in order, trying to solve depends for several jobs at once, and preventing junk from removing depends for stuff that's already been analyzed but not yet solved. These patches are somewhat necessary: the code is rather complicated, but it yields some impressive performance benefits: without it, the end of a normal bulk wastes over 2 hours with most cores being in "waiting-for-lock" state.
(colors for dpb -DCOLOR mode will also be adjusted after unlock... turns out yellow-over-red is very hard to read (duh) when you've got to shrink the font to fit it all on a single screen, and it pays to distinguish waiting-for-lock states from other frozen states).
I hope to have all this in decent enough shape by the time we unlock ports.
All in all, after a few weeks of tweaks and fun, I've come up with impressive speed-ups... That setup went down from 21 hours to 14 hours and a half for a full bulk.
(as for particulars, we're talking three Xeon E5 machines with 128G of ram each, 16 cpus per machine, hyper-threading disabled... 12 cpus actually useful... running -j16 didn't help... parallel set to 4 seems to yield better results than the default 6 in that case...)
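For a sense of scale, the two bulk times quoted above (21 hours before, 14 and a half hours after) can be turned into a speed-up factor and a fraction of time saved. The helper name below is made up; the only inputs are the two figures from the post.

```python
def bulk_speedup(before_hours, after_hours):
    """Return (speed-up factor, fraction of wall-clock time saved)."""
    speedup = before_hours / after_hours
    saved = (before_hours - after_hours) / before_hours
    return speedup, saved


if __name__ == "__main__":
    speedup, saved = bulk_speedup(21.0, 14.5)
    print(f"{speedup:.2f}x faster, {saved:.0%} less wall-clock time")
    # 1.45x faster, 31% less wall-clock time
```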
Of course, the form-factor of the cluster is very important. A lot of these problems don't even show up on smaller clusters, so it would have been impossible to achieve without the donation. So thanks again, especially to Anton Karpov.
(Comments are closed)
By Laurence Rochfort (80.0.78.224) on
Is Marc's observation that one "really really" wants to disable it a general observation or specific to dpb?
For instance, I run a Core i5 as my day-to-day desktop. Should hyper-threading be disabled in this context? Not that I'd notice the difference I'm sure.
Comments
By Marc Espie (espie) on
>
> Is Marc's observation that one "really really" wants to disable it a general observation or specific to dpb?
>
> For instance, I run a Core i5 as my day-to-day desktop. Should hyper-threading be disabled in this context? Not that I'd notice the difference I'm sure.
Well, why don't you try it and see if that helps or changes anything?
I know that it's not just dpb. I timed a make build at 32mn vs 39mn depending on HT.
As far as I know, the "OS support" mostly has to do with not burning out your cpus, as they tend to run hotter with HT.
Performance will depend on the workload. But basically, twice as many cpu==half as much cache per cpu... not an issue with some tasks which mainly run the same code on all cpus (e.g., video encoding/decoding). Not that useful with generic parallelism, like compiling stuff.
In any case, our SMP does not scale all that great beyond 4~8 cpus right now... with HT, those boxes have 32 cpus... this is way beyond what we can do.
From an architectural standpoint, it's not even really clear *if* those machines can run any OS efficiently for generic computations with 32 cpus... memory bandwidth, disk io, are very likely to become bottlenecks before you can run 32 jobs at full speed.
Comments
By Marc Espie (espie) on
So, the 2 virtual cpus bundled as "hyperthreaded" have to wait on each other in any case => they're a performance loss on everything but contrived testcases.
You want to turn hyper-threading OFF.
(also, if I understand things right, we have non-existent topology support at the moment, so the scheduler will happily move things around from one cpu to another, even though distances between "cpu" are definitely not the same on those modern machines, so HT is an even greater loss for us).
Comments
By Amit Kulkarni (amitkulz) on
>
>
IMHO this needs to be worked on in almost all BSDs. Even Solaris 10 lacks this. I remember pinning to a single CPU and the job finished in 4-6 hours instead of a week in Solaris 10. And this was supposed to be a big iron OS.
If long lasting processes are pinned to a CPU, this will boost throughput the fastest, right?
Comments
By Marc Espie (espie) on
> >
> >
>
>
> IMHO this needs to be worked on in almost all BSDs. Even Solaris 10 lacks this. I remember pinning to a single CPU and the job finished in 4-6 hours instead of a week in Solaris 10. And this was supposed to be a big iron OS.
>
> If long lasting processes are pinned to a CPU, this will boost throughput the fastest, right?
No, we have basic affinity support. It's not a question of pinning processes to a cpu, it's a question of not moving them too far.
By Laurence Rochfort (193.9.13.136) on
>
> So, the 2 virtual cpus bundled as "hyperthreaded" have to wait on each other in any case => they're a performance loss on everything but contrived testcases.
>
> You want to turn hyper-threading OFF.
>
> (also, if I understand things right, we have non-existent topology support at the moment, so the scheduler will happily move things around from one cpu to another, even though distances between "cpu" are definitely not the same on those modern machines, so HT is an even greater loss for us).
>
>
This is extremely interesting, particularly re SSE or FP.
Some proprietary computation we perform, which is a shedload of FP operations, runs around 6 to 10 minutes per hour faster.
Some DVD ripping shows approximately the same gains.
I'll try to find time to track core temperatures and report back. I'm sure our data centre guys would be extremely interested in that component.
Comments
By Anonymous Coward (193.9.13.136) on
> >
> > So, the 2 virtual cpus bundled as "hyperthreaded" have to wait on each other in any case => they're a performance loss on everything but contrived testcases.
> >
> > You want to turn hyper-threading OFF.
> >
> > (also, if I understand things right, we have non-existent topology support at the moment, so the scheduler will happily move things around from one cpu to another, even though distances between "cpu" are definitely not the same on those modern machines, so HT is an even greater loss for us).
> >
> >
>
> This is extremely interesting, particularly re SSE or FP.
>
> Some proprietary computation we perform, which is a shedload of FP operations, runs around 6 to 10 minutes per hour faster.
>
> Some DVD ripping shows approximately the same gains.
>
> I'll try to find time to track core temperatures and report back. I'm sure our data centre guys would be extremely interested in that component.
As alluded to by Marc, video en/decode doesn't benefit from turning off HT. In fact, on my system almost all in-browser streaming takes a significant performance hit with it off. There is a lot of audio stutter and pauses in video playback.
By Anonymous Coward (2001:8b0:112f:2:5e51:4fff:fe15:af4) on
The default for MaxSessions is 10.
Comments
By Marc Espie (espie) on
>
> The default for MaxSessions is 10.
Yep... well, it was slightly trickier than that, since by moving down to 12, I got rid of the warnings...
... on the console !
somehow, turns out the redirection goo inside dpb hides quite a few of those warnings... and it turns out that, if you run with -j16, you see a lot of them on the console, and if you run with -j12, you see none of them on the console.
so that one just got fixed this afternoon (thanks to Stuart Henderson for that).
By Sebastian Rother (srother) srother@mercenary-security.com on https://www.mercenary-security.com
Thanks a lot for your effort to optimize and speed up the package building (thus -current profits from this, maybe some day even -stable to provide faster updates for sec. related issues).
While reading your story I thought about password cracking, and that those applications face similar issues when clustering their tasks. Would it be beneficial to contact, for example, Solar Designer and talk with him about clustering concepts?
Of course the needs related to supported CPU optimizations (SSE2 and co) are not relevant for ports building, but the concepts of scheduling tasks are. Password crackers basically face the same issues (waiting for other tasks, dependencies of a specific task, and so on), so maybe some exchange could inspire improvements to the mechanism in dpb?
Apart from this: would it maybe be beneficial to dedicate one CPU core to compressing the built packages? Or maybe compress all packages once everything is built (just tar'ed until then). CPUs nowadays optimize their instructions, and if one core just compresses, it might speed up the process a little bit further (imho the instructions would stay in the cache). So all cores would just do compression at the end. This would need to be tested and I am unsure about the improvements, but I'd consider 1-4% realistic.
Comments
By Marc Espie (espie) on
> Apart from this: would it maybe be beneficial to dedicate one CPU core to compressing the built packages? Or maybe compress all packages once everything is built (just tar'ed until then). CPUs nowadays optimize their instructions, and if one core just compresses, it might speed up the process a little bit further (imho the instructions would stay in the cache). So all cores would just do compression at the end. This would need to be tested and I am unsure about the improvements, but I'd consider 1-4% realistic.
This is an interesting idea. I was thinking that compression would not be such a big issue, but actually, turns out gzip is rather slow (as we found out in pkg_sign). I'm not sure getting the code in the core will help, I'm not sure the IO isn't a big issue, nor the network for NFS.
So, if something is possible, it's definitely complicated.
As for the analogy with password cracking, well, that falls down very early... One of the major difficulties package building encounters is that it's a lot of generic computations. It's heavy on cpu, heavy on IO, heavy on memory... the only way to improve things is, generally, to have more caches in an interesting location. Or to try to make sure your cores are not all doing the same thing at the same time... which is a bit hard.
There are some huge huge locality issues with respect to dependencies with large clusters, but we don't even have information about that...
I guess that at some point, it's diminishing returns...
as far as I'm concerned, the ball is very much in the kernel hackers' camp. dpb(1) won't go much faster until our SMP gets better... :)
Comments
By Sebastian Rother (srother) on https://www.mercenary-security.com
> > Apart from this: would it maybe be beneficial to dedicate one CPU core to compressing the built packages? Or maybe compress all packages once everything is built (just tar'ed until then). CPUs nowadays optimize their instructions, and if one core just compresses, it might speed up the process a little bit further (imho the instructions would stay in the cache). So all cores would just do compression at the end. This would need to be tested and I am unsure about the improvements, but I'd consider 1-4% realistic.
>
> This is an interesting idea. I was thinking that compression would not be such a big issue, but actually, turns out gzip is rather slow (as we found out in pkg_sign). I'm not sure getting the code in the core will help, I'm not sure the IO isn't a big issue, nor the network for NFS.
>
> So, if something is possible, it's definitely complicated.
>
> As for the analogy with password cracking, well, that falls down very early... One of the major difficulties package building encounters is that it's a lot of generic computations. It's heavy on cpu, heavy on IO, heavy on memory... the only way to improve things is, generally, to have more caches in an interesting location. Or to try to make sure your cores are not all doing the same thing at the same time... which is a bit hard.
>
> There are some huge huge locality issues with respect to dependencies with large clusters, but we don't even have information about that...
>
> I guess that at some point, it's diminishing returns...
>
> as far as I'm concerned, the ball is very much in the kernel hackers' camp. dpb(1) won't go much faster until our SMP gets better... :)
>
>
Could you maybe test if it helps if the compression (just compiling+tar'ing) is applied at the end? I'd assume it should be noticeable because the instructions stay in the CPU cache.
By Marc Espie (espie) espie@nerim.net on
the dubious parts are still being worked on.