UVM change, testers required

Contributed by phessler on 2005-02-19 from the do-you-want-more-VM-sure-we-all-do dept.

CVSROOT: /cvs
Module name: src
Changes by: henning@cvs.openbsd.org 2005/02/19 10:58:03

Modified files:
sys/uvm : uvm_map.h

Log message:
double default MAX_KMAPENT to 2000, theo ok
everybody please update your trees and test this, we need to find out
whether there are bad side-effects from the doubling. If this does not get
enough testing by our user community we will play safe and revert this for
the 3.7 release, so please test.
it needs testing on all architectures, and especially on machines that
- now sometimes crash with the panic("uvm_mapent_alloc: out of static map entries, "
- have little RAM

There will be snapshots up with this change soon - this is of course
the preferred way of testing.
Applying the diff manually is useless, especially it is absolutely
useless to test a 3.6-stable or something like that with this diff
applied, there were more changes in that area. Don't even bother, ok?

this is very important, so test test test!

To reiterate what Henning said, either run -current, or the new snapshots. Anything else won't help.


Comments

By Anthony (68.145.111.152) on 2005-02-19 22:21

What does this do, and why does it matter?
Comments
1. By Anonymous Coward (69.197.92.181) on 2005-02-19 23:09
  
  It finally bumps it to a sane default so busy servers don't constantly crash. The real question is why did this take years to do?
  Comments
  1. By Anthony (68.145.111.152) on 2005-02-20 00:01
    
    Right, but what is a static map entry?
    
    There's a ton of different things that cause problems if there aren't enough of them. I have to increase kern.maxfiles for example.
    
    Comments
    
    By jose (68.40.238.70) on 2005-02-20 03:31 http://monkey.org/~jose/
    
    in a nutshell, the kernel keeps a memory map via a list, which requires list node allocations. this list describes regions by range, permissions, ownership, etc. if you use a lot of memory, especially in small chunks, you'll eat these up.
    this change will delay that panic, it won't alleviate it. other people have been dealing with this for some time. check the netbsd archives for a further discussion, by the way, of the root cause and fixes.
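
To make "static map entry" concrete, here is a minimal sketch in C of a fixed pool of map entries handed out from a free list. It is not the UVM code: every name prefixed with `sketch_` or `SKETCH_` is hypothetical, the pool is deliberately tiny, and a NULL return stands in for the kernel panic.

```c
#include <assert.h>
#include <stddef.h>

/* Tiny on purpose; the commit above raises the real default to 2000. */
#define SKETCH_MAX_KMAPENT 4

/* Hypothetical, simplified map entry: one contiguous region of address space. */
struct sketch_map_entry {
    unsigned long start;            /* first address of the region */
    unsigned long end;              /* one past the last address */
    int prot;                       /* permission bits */
    struct sketch_map_entry *next;  /* link used while on the free list */
};

/* The static pool: its size is fixed when the kernel is compiled. */
static struct sketch_map_entry sketch_pool[SKETCH_MAX_KMAPENT];
static struct sketch_map_entry *sketch_free_list;
static int sketch_pool_ready;

/* Thread every static entry onto the free list. */
void sketch_pool_init(void)
{
    sketch_free_list = NULL;
    for (size_t i = 0; i < SKETCH_MAX_KMAPENT; i++) {
        sketch_pool[i].next = sketch_free_list;
        sketch_free_list = &sketch_pool[i];
    }
    sketch_pool_ready = 1;
}

/* Take one entry from the pool. NULL means the pool is exhausted,
 * which is the situation where the real kernel panics. */
struct sketch_map_entry *sketch_mapent_alloc(unsigned long start,
                                             unsigned long end, int prot)
{
    if (!sketch_pool_ready)
        sketch_pool_init();

    struct sketch_map_entry *e = sketch_free_list;
    if (e == NULL)
        return NULL;

    sketch_free_list = e->next;
    e->start = start;
    e->end = end;
    e->prot = prot;
    e->next = NULL;
    return e;
}

/* Return an entry to the pool so it can be reused. */
void sketch_mapent_free(struct sketch_map_entry *e)
{
    assert(e != NULL);
    e->next = sketch_free_list;
    sketch_free_list = e;
}
```

Each mapped region costs one entry, so many small regions drain the pool faster than a few large ones covering the same amount of memory. That is the fragmentation effect described above.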
    
    Comments
    
    By Anonymous Coward (69.197.92.181) on 2005-02-20 14:53
    
    You can't say this won't alleviate the panic, it will for some people. If they are only using memory that requires 1500 entries, then this will solve their problem. It's people whose systems slowly use more and more who will keep running into problems anyways.
    
    Comments
    
    By Anthony (68.145.111.152) on 2005-02-20 16:17
    
    Well if it's a static amount there's always going to be someone that it's not enough for. I assume a dynamic solution is difficult, but just saying "X many should be enough for anyone." usually turns out to be wrong.
    
    Comments
    
    By Anonymous Coward (69.197.92.181) on 2005-02-20 20:13
    
    I didn't say it was enough for everyone. I said it would solve the problem for *some people*. Jose is trying to claim it will only delay the panic, which is true for some people, but it actually does solve it for others.
    
    Comments
    
    By Brad (204.101.180.70) brad at comstyle dot com on 2005-02-21 00:39
    
    I don't think your systems have been running long enough to prove that they will not crash from this eventually. Anecdotal evidence seems to show that any system that can reproduce this panic WILL crash no matter what and it's only a matter of time.
    
    Comments
    
    By Anonymous Coward (69.197.92.181) on 2005-02-21 02:15
    
    I don't know what long enough is for you, but 6 months is as long as my machines are up, and depending on the server, having it between 2000 and 4000 was enough to keep them from crashing for 6 months.
    
    Comments
    
    By roo (195.137.43.11) on 2005-02-28 11:16
    
    I guess it's nice to have uptime bragging rights, but this fragmentation issue is not unique to OpenBSD at all. AFAIK MVS suffered from it (as did many other mini & mainframe OSes), as have other UNIXen.

    Furthermore, at just about every place I've worked, the SunOS, Solaris, AIX and Digital UNIX boxes were rebooted regularly (often weekly) for good measure, partly to "solve" problems like that and partly to verify that the machine will come up. You may be surprised to find out how many times silent hardware failures and untested "upgrades" were caught that way.

    FWIW I think it's a good idea to reboot boxes weekly anyway, just to see if they'll come up again... Of course you'll have a hot-standby just in case, right? ;)

    Cheers, Roo
    
    By Nonesuch (24.148.72.216) on 2005-02-21 15:38
    
    How long is "enough" time to exhibit this bug?
    I have many systems with uptime in excess of six months, and a few which have been up for nearly a year (only going down for the upgrade to 3.5 and then for an unplanned UPS "event" in May).
    I have one system with an uptime of 200 days which processes several gigabytes/day of log events, and has not shown any signs of this UVM problem.
    What type of system activity triggers the undesirable behavior?
    
    Comments
    
    By Anonymous Coward (81.64.227.144) on 2005-02-21 21:33
    
    What type of system activity triggers the undesirable behavior?
    Yes, that would be great to know! I have some spare machines following -current (so, with the patch applied) that I want to stress, if that could improve 3.7 quality.
    
    Comments
    
    By jose (68.40.238.70) --@---.-- on 2005-02-21 22:25 http://monkey.org/~jose/
    
    i've been able to trigger it reliably in the past with applications that use a lot of shared memory. postgres, for example, has always assisted me in causing a panic.
    
    Comments
    
    By Daniel (66.63.12.120) daniel@presscom.net on 2005-02-22 23:54
    
    One way to trigger this at will in my case was to raise "concurrencyremote" above 50 in my qmail when sending to huge lists. I could crash it at will. There is email in the archive about it as well. After a patch in -current a few weeks ago, the crash was not as bad, and I could bring "concurrencyremote" up to 100 before it acted the same way. That patch wasn't this increase from 1K to 2K, however, but fixed other things in the kernel. So far both combined work even better than before. So, like Henning said before, this is not the only change in the upcoming 3.7 to address this. It may not be fixed for good, but it is surely a very good start, so test... test... test...
    
    By Anonymous Coward (83.147.128.114) on 2005-02-23 13:48 squid
    
    Squid is a great one for a bit of the old memory fragmentation.
    
    Amavisd-new, too.
    
    By jose (68.40.238.70) on 2005-02-20 17:23 http://monkey.org/~jose/
    
    actually, i can say that with a lot of confidence. been there, worked around it already. extensive testing showed it simply pushed the panic condition out a bit, but never far enough.
    
    Comments
    
    By Anonymous Coward (69.197.92.181) on 2005-02-20 20:12
    
    That doesn't make any sense. Of course if you need 3000 then 2000 is just going to make it take longer to panic. But if you need 1500 then it solves your problem. I had a couple machines where 2000 solved the problem, and they ran that way for 6 months at a time without panics. Other machines had to be raised all the way to 4000. But it's just silly to pretend that you will gradually use up an infinite number of entries no matter how high you set it.
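
The two cases in this argument can be spelled out with a small C sketch. The demand figures below are invented for illustration (only the 1000, 2000 and 4000 limits and the 1500 and 3000 peaks come from the thread), and `first_exhaustion` is a hypothetical helper, not anything in the kernel.

```c
#include <stddef.h>

/* Return the index of the first sample whose demand exceeds `limit`,
 * or -1 if the fixed limit is never exceeded. */
long first_exhaustion(const int *demand, size_t n, int limit)
{
    for (size_t i = 0; i < n; i++) {
        if (demand[i] > limit)
            return (long)i;
    }
    return -1;
}

/* Hypothetical workload A: demand fluctuates but stays bounded (peak 1500). */
const int bounded_demand[] = { 900, 1200, 1500, 1100, 1400, 1500, 1000, 1300 };

/* Hypothetical workload B: demand keeps climbing (peak 3000). */
const int growing_demand[] = { 900, 1100, 1600, 1400, 2100, 1900, 2600, 3000 };
```

With a limit of 1000 both workloads run out at sample 1. With 2000, workload A never runs out, while workload B lasts until sample 4, so the higher limit solves one case and only delays the other. Which of the two shapes real machines follow is what the thread is arguing about.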
    
    Comments
    
    By Anthony (68.145.111.152) on 2005-02-20 21:11
    
    If you're increasing your use monotonically you're obviously screwed no matter what, but one default that can only be adjusted by making custom kernels is a pretty awkward solution. We all know what big fans the developers are of custom kernels.
    
    By jose (68.40.238.70) on 2005-02-20 21:38 http://monkey.org/~jose/
    
    the problem is that the list size grows over time, and the size of it spikes from time to time. but the base list size grows over time, so those spikes will start to take you over the max. that's why it only delays the panic, it doesn't alleviate it. it's simply not a static value like you seem to be assuming, it's a dynamic value that grows over time.
    this bug is aggravated by shared memory usage, ie in database systems like postgres. the kmap list grows as the memory map fragments (ie small chunks allocated), and list entries are consumed. after a while ... blam, panic.
    
    Comments
    
    By henning (80.86.183.226) on 2005-02-21 09:01
    
    no jose, this description is something between inaccurate and wrong.
    
    Comments
    
    By Anonymous Coward (217.43.158.188) on 2005-02-21 12:02
    
    If this is inaccurate, is there some place to get an accurate account of the problem?
    
    By jose (68.40.238.70) --@---.-- on 2005-02-21 17:52 http://monkey.org/~jose/
    
    while my answers may be wrong in light of post-3.6-rel changes, they're accurate to the best of my knowledge through 3.3 and maybe even 3.6. i haven't looked since then, but i do know the problem exists through the 3.5 and 3.6 releases. i simply haven't looked at openbsd-current in quite a while, since pre-3.6 even.
    notice that unlike you i'm trying to help by sharing what information i do have.
    
    By henning (80.86.183.226) on 2005-02-21 09:00
    
    sorry, this is plain bullshit.
    first, there was a very important change in that UVM area after 3.6 release, so all older results don't count any more anyway.
    And claiming that increasing MAX_KMAPENTs just moves the point of panic is plain bullshit.
    of course, if you run something that makes the kernel (want to) consume 10k entries it will crash with 1k or 2k, yeah...
    
    By Anonymous Coward (83.147.128.114) on 2005-02-20 23:46
    
    I've been checking the NetBSD archives, haven't found much. Wouldn't happen to have a direct link to more info?
    
    This bug is very serious.
    
    Comments
    
    By Brad (204.101.180.70) brad at comstyle dot com on 2005-02-21 00:41
    
    Of course this is serious, though a fix for the issue is far from trivial.
    
    Comments
    
    By jose (68.40.238.70) --@---.-- on 2005-02-21 01:11 http://monkey.org/~jose/
    
    look in the NetBSD tech-kern archives recently (ie dec 04) for the following two threads:
    
    kernel_map very fragmented?
    kernel map entry merging and PR 24039
    link: http://mail-index.netbsd.org/tech-kern/2004/12/.
    you'll see a few things in there that are directly related, help illuminate the information about this problem and steps that can be taken to resolve it. it's probably the best info i can point you at offhand.
    
    By rene (138.217.52.28) on 2005-02-23 03:11
    
    "Do not let serious problems sit unsolved." - goals.html
    lighten up dudes.
By Anonymous Coward (81.64.227.144) on 2005-02-19 22:22

What are the benefits of this change (besides "bad side-effects" ;) ?
Comments
1. By Brian (205.161.1.46) on 2005-02-19 22:59
  
  Not having "uvm_mapent_alloc: out of static map entries" every few weeks on my mail server hopefully.

By Anonymous Coward (217.120.147.78) on 2005-02-20 04:06

I just noticed that there is an MD5 sum mismatch with the latest snapshot. So hold your horses and wait with installing this snapshot:

% md5 base36.tgz comp36.tgz; egrep '(comp|base)' MD5
MD5 (base36.tgz) = 4c2ce670c72304e9caa3d968fc6fea7a
MD5 (comp36.tgz) = c5dfe3f0f5691b55bef2e1b342033b61
MD5 (base36.tgz) = 2a9c56ac8c54e050ade1fccb36c353c3
MD5 (comp36.tgz) = 875743e6e88ab786bdcf99e362c9817c
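
In the output above, the first two lines come from running md5 on the downloaded files and the last two from the MD5 file shipped with the snapshot. The comparison being done by eye can be sketched in C as follows. `md5_lines_match` is a hypothetical helper that only compares two lines in this format; it does not compute any digest.

```c
#include <stdio.h>
#include <string.h>

/*
 * Compare two lines in the "MD5 (file) = digest" format shown above.
 * Returns 1 if both name the same file with the same digest,
 * 0 if they differ, and -1 if either line cannot be parsed.
 */
int md5_lines_match(const char *computed, const char *listed)
{
    char name_a[64], name_b[64];
    char sum_a[33], sum_b[33];

    /* Pull out the file name and the hex digest from each line. */
    if (sscanf(computed, "MD5 (%63[^)]) = %32[0-9a-f]", name_a, sum_a) != 2 ||
        sscanf(listed, "MD5 (%63[^)]) = %32[0-9a-f]", name_b, sum_b) != 2)
        return -1;

    /* An MD5 digest is always 32 hex characters. */
    if (strlen(sum_a) != 32 || strlen(sum_b) != 32)
        return -1;

    return strcmp(name_a, name_b) == 0 && strcmp(sum_a, sum_b) == 0;
}
```

Fed the two base36.tgz lines from the output above, the function returns 0, which is the mismatch reported here.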

Comments

By Martin (62.178.75.222) martin@ on 2005-02-20 16:56

Just a matter of wrong MD5 files. New and correct files seem to be up on ftp.openbsd.org now.
Comments
1. By Han (217.120.147.78) han@mijncomputer.nl on 2005-02-21 12:04
  
  Yeah, I just waited for the next snapshot to come along.

By ferywu (202.149.79.18) ferywu@corebsd.or.id on 2005-03-03 05:11

After updating from CVS and rebuilding the kernel and userland, the effect is great: the sys-cache hit ratio is always high, and every process feels faster and lighter, even though I didn't use any timer to measure it.

This is especially noticeable on a machine running squid+clamav+spamassassin+amavisd-new+mysql.


Credits

Copyright © 2004-2008 Daniel Hartmeier. All rights reserved. Articles and comments are copyright their respective authors, submission implies license to publish on this web site. Contents of the archive prior to April 2nd 2004 as well as images and HTML templates were copied from the fabulous original deadly.org with Jose's and Jim's kind permission. This journal runs as CGI with httpd(8) on OpenBSD, the source code is BSD licensed. undeadly \Un*dead"ly\, a. Not subject to death; immortal. [Obs.]