Contributed by jcr from the why-do-bears-hibernate dept.
I only had three days I could budget for Berlin (18-20 Oct), and I spent the time continuing hibernate work on amd64. i386 had been solid for some time, and preliminary amd64 support on uniprocessor had been committed at the n2k13 hackathon in Dunedin, New Zealand, so multiprocessor support got the attention at the t2k13 general hackathon in Toronto, Canada earlier this year and again in Berlin (b2k13).
In the weeks leading up to Berlin, I had amd64 multiprocessor hibernate working in qemu and bochs to the point where they could be put into a loop, ZZZing (apm(8)) and resuming hundreds of times, but real hardware was still proving to be a challenge on some machines. Especially problematic was my x230, which refused to do any unpacking at all after loading the compressed image into memory. Berlin proved to be a lucky hackathon, as we discovered two issues that turned out to have trivial fixes:
- During hibernate unpack, we use our own page tables, but we weren't previously flushing global TLB entries from the resuming kernel. That caused strange inconsistencies in the resume path, leading to stream corruption or reboots.
- On Ivy Bridge and later CPUs, we enable SMEP (Supervisor Mode Execution Prevention) by default. This feature prevents user mode code from being executed by mistake in ring 0. The resume-time page tables we were constructing were set to use user mode pages, when we really didn't need that. Removing those bits allowed the x230 to resume for the first time.
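To make the two fixes above concrete, here is a minimal sketch in C. The bit positions are the architectural x86 ones (PTE present, writable, user/supervisor; CR4.PGE and CR4.SMEP), but the helper names `make_resume_pte`, `smep_blocks_kernel_exec` and `cr4_pge_off` are hypothetical and are not taken from the OpenBSD source. The sketch only models the rules in plain integer arithmetic and looks at a single page-table level; the real check considers the user bit at every level of the hierarchy.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86 page-table entry bits (architectural). */
#define PTE_P     (1ULL << 0)            /* present */
#define PTE_RW    (1ULL << 1)            /* writable */
#define PTE_US    (1ULL << 2)            /* 1 = user-accessible page */
#define PTE_FRAME 0x000ffffffffff000ULL  /* physical frame address bits */

/* CR4 control bits (architectural). */
#define CR4_PGE   (1ULL << 7)            /* global pages enabled */
#define CR4_SMEP  (1ULL << 20)           /* supervisor mode execution prevention */

/*
 * Hypothetical helper: build a PTE for the temporary resume-time
 * page tables. Passing user = true reproduces the old behaviour
 * described above; user = false is the fixed, kernel-only mapping.
 */
static uint64_t
make_resume_pte(uint64_t pa, bool user)
{
	uint64_t pte = (pa & PTE_FRAME) | PTE_P | PTE_RW;

	if (user)
		pte |= PTE_US;
	return pte;
}

/*
 * Hypothetical helper: with SMEP on, a ring 0 instruction fetch
 * from a present, user-accessible page faults.
 * (Simplified to one level of the paging hierarchy.)
 */
static bool
smep_blocks_kernel_exec(uint64_t cr4, uint64_t pte)
{
	return (cr4 & CR4_SMEP) != 0 &&
	    (pte & PTE_P) != 0 && (pte & PTE_US) != 0;
}

/*
 * Hypothetical helper: compute the CR4 value with PGE cleared.
 * On real hardware, a CR4 write that changes PGE invalidates all
 * TLB entries, global ones included; the actual write needs ring 0
 * and is not shown here.
 */
static uint64_t
cr4_pge_off(uint64_t cr4)
{
	return cr4 & ~CR4_PGE;
}
```

Under these assumptions, `smep_blocks_kernel_exec(CR4_SMEP, make_resume_pte(pa, true))` is true while the same call with `user = false` is false, which matches the x230 symptom: the unpack code could not run from pages marked as user mode.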
There were also various other cache-related bugs fixed to allow the x230 and other machines to start working in Berlin, but I wouldn't say things are 100% solid on these machines yet.
So, where does this leave hibernate presently? On i386, most of the machines I've tested work, both MP and UP, but I'm sure there are many machines that still have issues. I'd like to know about those. amd64 is still more of a hit-and-miss proposition, with UP being more solid than MP. I'd say amd64 right now has probably a 50/50 success rate. The main problems you'll see are reboots or hangs on resume. There are also some uncommon situations where the suspend itself will fail; these are likely problems in lower-level I/O routines that will need to be shaken out in the months to come. You'll still need swap at least as large as physical memory until we relax our swap space estimator. We discussed several ways to do that later, once we have a grasp on how much space we really need in the general case. And we still don't have RLE (Run Length Encoding) re-enabled, which means you should expect things to still be slow.
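For readers unfamiliar with RLE, the following is a generic byte-oriented sketch in C of why it can shrink an image full of repeated data. It is an illustration only: the post does not describe the scheme the hibernate code uses, and the names `rle_encode` and `rle_decode` and the (count, value) pair format are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Encode src as (count, value) byte pairs, with count in 1..255.
 * Returns the number of bytes written to dst, or SIZE_MAX if dst
 * is too small.
 */
static size_t
rle_encode(const uint8_t *src, size_t len, uint8_t *dst, size_t cap)
{
	size_t in = 0, out = 0;

	while (in < len) {
		uint8_t value = src[in];
		size_t run = 1;

		/* Extend the run while the byte repeats, up to 255. */
		while (in + run < len && src[in + run] == value && run < 255)
			run++;
		if (cap - out < 2)
			return SIZE_MAX;
		dst[out++] = (uint8_t)run;
		dst[out++] = value;
		in += run;
	}
	return out;
}

/*
 * Decode (count, value) pairs back into raw bytes.
 * Returns the number of bytes written to dst, or SIZE_MAX on
 * malformed input or if dst is too small.
 */
static size_t
rle_decode(const uint8_t *src, size_t len, uint8_t *dst, size_t cap)
{
	size_t out = 0;

	if (len % 2 != 0)
		return SIZE_MAX;
	for (size_t in = 0; in < len; in += 2) {
		size_t run = src[in];

		if (run == 0 || cap - out < run)
			return SIZE_MAX;
		memset(dst + out, src[in + 1], run);
		out += run;
	}
	return out;
}
```

With this pair format, a zero-filled 4096-byte page encodes to 34 bytes (16 runs of 255 plus one run of 16), while data with no repeats doubles in size. That trade-off is why a scheme like this pays off mainly on memory with long runs of identical bytes.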
A few of us also discussed what we should start thinking about once the amd64 situation gets more solid. For example, how to lock the machine after hibernate and other interesting things we could evolve the subsystem to do, including doing a hibernate before all suspend to RAM operations to avoid the problem of running out of battery while sleeping. Of course, that can't happen until we put RLE in and get the hibernate times down significantly.
All things considered, it was a fun and productive three days. Thanks to the #b2k13 crew for a great hackroom and coordination!
Thanks Mike for the report. It's always interesting to see what you're working on.