More C2K7 - Faster TLB shootdown

Contributed by merdely on 2007-06-04 from the shootdown-at-the-tlb-corral dept.

The Translation Lookaside Buffer (TLB) is a cache in the CPU that maps virtual page addresses to physical page addresses. On a hit, it saves the CPU from having to go all the way out to the page table in memory.

A TLB shootdown occurs when a process restricts access to a page in shared memory and must interrupt the other processors running processes that use that memory, so that they flush the stale entries from their TLBs.
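
To make the two paragraphs above concrete, here is a toy model in Python. It is only a sketch: the names `ToyCPU`, `lookup`, `invalidate` and `shootdown` are made up for illustration, the 4 KiB page size is an assumption, and real hardware delivers the invalidation through an inter-processor interrupt rather than a direct function call.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


class ToyCPU:
    """Hypothetical CPU with a private TLB in front of a shared page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # shared: vpn -> (frame, writable)
        self.tlb = {}                 # private cache of page table entries
        self.walks = 0                # number of page table lookups (TLB misses)

    def lookup(self, vaddr):
        vpn = vaddr // PAGE_SIZE
        if vpn not in self.tlb:       # miss: go out to the page table
            self.walks += 1
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn]          # hit: answer comes from the cache

    def invalidate(self, vpn):
        self.tlb.pop(vpn, None)       # drop one cached entry, if present


def shootdown(cpus, vpn):
    """Stand-in for the IPI: make every CPU drop its entry for vpn."""
    for cpu in cpus:
        cpu.invalidate(vpn)


page_table = {0: (7, True)}           # page 0 -> frame 7, writable
cpu0, cpu1 = ToyCPU(page_table), ToyCPU(page_table)

cpu1.lookup(0x10)                     # cpu1 caches the writable entry
page_table[0] = (7, False)            # access to the page is restricted
stale = cpu1.lookup(0x10)             # cpu1 still sees the old entry
shootdown([cpu0, cpu1], 0)
fresh = cpu1.lookup(0x10)             # refetched from the page table
```

Until `shootdown` runs, `cpu1` keeps answering from its cached entry and would still treat the page as writable, which is exactly the situation the shootdown exists to prevent.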

During the hackathon, art@ committed a simplified, faster version of the shootdown code, which gives a 15% reduction in system time during a kernel compile on his dual-core laptop.

The commit message is below.

CVSROOT:	/cvs
Module name:	src
Changes by:	art a t cvs openbsd org	2007/05/25 09:55:27

Modified files:
	sys/arch/i386/i386: apicvec.s ipifuncs.c lapic.c lock_machdep.c 
	                    machdep.c pmap.c vm_machdep.c 
	sys/arch/i386/include: atomic.h i82489var.h intr.h pmap.h 

Log message:
Replace the overdesigned and overcomplicated tlb shootdown code with
very simple and dumb fast tlb IPI handlers that have in the order of
the same amount of instructions as the old code had function calls.

All TLB shootdowns are reorganized so that we always shoot the[m],
without looking at PG_U and when we're shooting a range (primarily in
pmap_remove), we shoot the range when there are 32 or less pages in
it, otherwise we just nuke the whole TLB (this might need tweaking if
someone is interested in micro-optimization). The IPIs are not handled
through the normal interrupt vectoring code, they are not blockable
and they only shoot one page or a range of pages or the whole tlb.

This gives a 15% reduction in system time on my dual-core laptop
during a kernel compile and an 18% reduction in real time on a quad
machine doing bulk ports build.

Tested by many, in snaps for a week, no slowdowns reported (although not
everyone is seeing such huge wins).
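
The range rule described in the log message (shoot the range when it has 32 or fewer pages, otherwise flush the whole TLB) can be sketched as follows. This is an illustration in Python, not the code from pmap.c: the function name `plan_shootdown`, the tuple return values, the 4 KiB page size and the page-aligned, end-exclusive arguments are all assumptions made for the example.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB pages
RANGE_LIMIT = 32   # from the log message: 32 pages or fewer


def plan_shootdown(start, end):
    """Pick which kind of shootdown to send for [start, end).

    Hypothetical helper: start and end are assumed to be page-aligned
    virtual addresses with end exclusive and end > start.
    """
    npages = (end - start) // PAGE_SIZE
    if npages == 1:
        return ("page", start)        # shoot a single page
    if npages <= RANGE_LIMIT:
        return ("range", start, end)  # shoot each page in the range
    return ("all",)                   # too many pages: flush the whole TLB


one = plan_shootdown(0x1000, 0x2000)
small = plan_shootdown(0, 32 * PAGE_SIZE)
large = plan_shootdown(0, 33 * PAGE_SIZE)
```

The trade-off behind the threshold is that invalidating pages one at a time costs more as the range grows, while a full flush costs the same regardless of size but forces every later access to miss once. The log message itself notes that the value 32 might need tweaking.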

The massive speed improvements we've seen in different parts of OpenBSD during the hackathon will certainly make OpenBSD 4.2 an interesting upgrade.

Comments

By Anonymous Coward (156.34.75.11) on 2007-06-04 11:46

Thanks for the concise explanation of TLB shootdown, so 'the rest of us' can (roughly) understand why this is a benefit. From the explanation, I'm guessing these changes do not affect single processor systems, but are an improvement in SMP support (?).
Comments
1. By Noryungi (noryungi) on 2007-06-04 12:21
  
  >> Thanks for the concise explanation of TLB shootdown, so 'the rest of us' can (roughly) understand why this is a benefit. From the explanation, I'm guessing these changes do not affect single processor systems, but are an improvement in SMP support (?).<<
  
  Well, yes, I do believe it's mostly SMP, since the 15% improvement has been measured on a dual-core laptop...
  Comments
  1. By scot bontrager (216.62.11.163) scot@indievisible.org on 2007-06-04 15:42
    
    > >> Thanks for the concise explanation of TLB shootdown, so 'the rest of us' can (roughly) understand why this is a benefit. From the explanation, I'm guessing these changes do not affect single processor systems, but are an improvement in SMP support (?).<<
    >
    > Well, yes, I do believe it's mostly SMP, since the 15% improvement has been measured on a dual-core laptop...
    >
    >
    
    with this change, make build (wall-clock time) went from 1:12:00 to 1:02:00 on my 2x amd64 (1.6Ghz) system. I was happy back when the lockmgr/simplelock changes started and build time came down from 1:19:00. When they can shave another 3 minutes off and I can do a make build in less than an hour, I'll be very happy.
    
    Between this change and the other work done at the hackathon, CPU usage is hovering right at 0.1% on this system; before, it was 3-4%. If only my CPUs (Opteron 242s) supported PowerNow!, I would throttle them back so I could save on my electricity bill!
    
    Good going all!
    
    FFS2 was giving me fits a few weeks back, but it seems much better now. I've only been using it for /usr/obj, but the last few builds have been solid. I was hoping FFS2 would be faster than FFS, but I don't see any measurable improvement there (I know there isn't supposed to be any, either). Once the last few userland bits get finished I'll switch /usr/src over as well.
    
    Comments
    
    By Anonymous Coward (63.237.125.20) on 2007-06-04 21:42
    
    > with this change, make build (wall-clock time) went from 1:12:00 to 1:02:00 on my 2x amd64 (1.6Ghz) system. I was happy back when the lockmgr/simplelock changes started and build time came down from 1:19:00. When they can shave another 3 minutes off and I can do a make build in less than an hour, I'll be very happy.
    
    Have you ever timed make build on this system in single processor mode? Just wondering how much performance the SMP adds.
    
    Comments
    
    By scot bontrager (216.62.11.163) on 2007-06-06 03:06
    
    > > with this change, make build (wall-clock time) went from 1:12:00 to 1:02:00 on my 2x amd64 (1.6Ghz) system. I was happy back when the lockmgr/simplelock changes started and build time came down from 1:19:00. When they can shave another 3 minutes off and I can do a make build in less than an hour, I'll be very happy.
    >
    > Have you ever timed make build on this system in single processor mode? Just wondering how much performance the SMP adds.
    
    2547.021u 672.352s 59:13.29 90.6% 0+0k 88260+245879io 159172pf+0w
    
    59 minutes using a non-SMP kernel! It took 3 minutes LONGER using the SMP! That seems odd. "make build" is a mostly linear process, so I can understand why SMP doesn't gain much, but why does it cost so much more? (make -j 2 deadlocks in a hurry so I've not even tried that in years).
    
    I'll newfs /usr/obj and rerun this test just to make sure.
    
    Comments
    
    By scot bontrager (216.62.11.163) on 2007-06-06 11:27
    
    > > > with this change, make build (wall-clock time) went from 1:12:00 to 1:02:00 on my 2x amd64 (1.6Ghz) system. I was happy back when the lockmgr/simplelock changes started and build time came down from 1:19:00. When they can shave another 3 minutes off and I can do a make build in less than an hour, I'll be very happy.
    > >
    > > Have you ever timed make build on this system in single processor mode? Just wondering how much performance the SMP adds.
    >
    > 2547.021u 672.352s 59:13.29 90.6% 0+0k 88260+245879io 159172pf+0w
    >
    > 59 minutes using a non-SMP kernel! It took 3 minutes LONGER using the SMP! That seems odd. "make build" is a mostly linear process, so I can understand why SMP doesn't gain much, but why does it cost so much more? (make -j 2 deadlocks in a hurry so I've not even tried that in years).
    >
    > I'll newfs /usr/obj and rerun this test just to make sure.
    >
    >
    
    try 2, clean /usr/obj and rm /tmp/ac.cache
    
    2537.256u 679.384s 59:01.10 90.8% 0+0k 78968+242154io 148936pf+0w
    
    By tedu (69.12.168.115) on 2007-06-06 17:15
    
    locks don't come for free.
    
    By sthen (85.158.44.149) on 2007-06-04 21:50
    
    > If only my CPUs (Opteron 242s) supported PowerNow!, I would throttle them back so I could save on my electricity bill!
    They should do; have you seen this?
    
    Comments
    
    By sthen (85.158.44.149) on 2007-06-04 22:00
    
    > If only my CPUs (Opteron 242s) supported PowerNow!, I would throttle them back so I could save on my electricity bill!
    >
    > They should do; have you seen
    ...ah, sorry... I meant this but it seems Opteron [128]4[02] don't have a lower p-state listed in the AMD documentation.
2. By Bret Lambert (tbert) on 2007-06-04 12:31
  
  > Thanks for the concise explanation of TLB shootdown, so 'the rest of us' can (roughly) understand why this is a benefit. From the explanation, I'm guessing these changes do not affect single processor systems, but are an improvement in SMP support (?).
  
  I'm not a hardware guy, but the TLB exists in UP systems as well. There may not be as much of a performance enhancement if you're not shuttling processes between CPUs, but you'll still get some benefit from the faster code.
By Anonymous Coward (70.67.139.183) on 2007-06-04 20:52

this is pretty amazing. I love hackathons :)
Comments
1. By art (213.0.113.90) on 2007-06-05 13:41
  
  > this is pretty amazing. I love hackathons :)
  
  Actually, the code was written way before the hackathon. I had the first prototype out for testing several months ago, it just didn't work correctly at first (because of other bugs, I might add).

Credits

Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]