Contributed by pitrh from the sleep and rude awakenings dept.
I had an abbreviated trip to Toronto and was only able to participate in the first four days of t2k13. I had originally planned to rework the amd64 and i386 MTRR (memory type range register) code, since the way MTRRs are presently handled is difficult to understand, and I could not convince myself that they were even being handled properly in the first place. I was looking at MTRRs before the hackathon anyway, since I was trying to track down a reboot issue in hibernate resume, and MTRRs seemed like a possible suspect.
As it turned out, MTRRs weren't the cause of the particular issue I was tracking down. Instead, I ended up fixing a bad assumption in the amd64 hibernate resume code, which expected the kernel text to still be at its old physical address (where it lived before we moved it to 16MB a few months ago). Once that was fixed, amd64 was back to the shape it was in after the Dunedin hackathon in January.
I then moved on to amd64 and i386 MP hibernate resume. i386 had been working for a while (since Budapest last year), but amd64 had about a 5% success rate. Through careful debugging, Theo and I realized that we were making too many assumptions about the state of the halfway spun-up APs at the time of image unpack. In a hibernate resume, we boot a machine almost to the point where the rootfs gets mounted. That means the APs are spun up and sitting idle. But then when we detect a hibernate signature, we read the image into memory, IPI halt the APs, then unpack on top of ourselves. The problem with that approach, as it turned out, was that the halted APs would then be rehatched during resume from a state that the resuming kernel wasn't prepared for (a cold-boot halted AP is in a much different state than what we were presenting to the resuming kernel). That resulted in strange, unpredictable behavior, and usually a machine reboot.
I wrote a routine to more fully park an AP - we now demote the AP completely back to real mode. This more closely matches how an AP would be configured during cold boot. And at that point, we had our first reliably unhibernating amd64 MP machine, a Dell Inspiron Duo with 4 threads. It unhibernated over 50 times before I got tired of testing and stopped.
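To make the difference between halting and parking concrete, here is a toy model in C. It is only a sketch of the reasoning above, not the kernel code: every name in it (ap_state, ap_ipi_halt, ap_park, ap_can_rehatch, all_aps_rehatchable) is hypothetical, and the real work of course happens in privileged machine-dependent code rather than in an enum. It assumes the resuming kernel can only start an AP that looks like a cold-boot AP, which is how the problem is described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not OpenBSD kernel identifiers. */
enum ap_state {
	AP_REAL_MODE_HALTED,	/* what a cold-boot, not-yet-started AP looks like */
	AP_LONG_MODE_IDLE,	/* spun up by the booting kernel, sitting idle */
	AP_LONG_MODE_HALTED	/* stopped by a halt IPI, but still in long mode */
};

/* Old approach: the halt IPI stops the AP but leaves it in its spun-up mode. */
static enum ap_state
ap_ipi_halt(enum ap_state s)
{
	return (s == AP_LONG_MODE_IDLE) ? AP_LONG_MODE_HALTED : s;
}

/* New approach: demote the AP all the way back to real mode before halting. */
static enum ap_state
ap_park(enum ap_state s)
{
	(void)s;	/* whatever state it was in, it ends up parked */
	return AP_REAL_MODE_HALTED;
}

/* Assumption: the resuming kernel only expects cold-boot style APs. */
static bool
ap_can_rehatch(enum ap_state s)
{
	return s == AP_REAL_MODE_HALTED;
}

/* True if every AP is in a state the resuming kernel is prepared for. */
static bool
all_aps_rehatchable(const enum ap_state *aps, size_t n)
{
	assert(aps != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (!ap_can_rehatch(aps[i]))
			return false;
	return true;
}
```

In this model, an idle AP that is merely IPI halted fails the ap_can_rehatch() check, while one that goes through ap_park() passes it, which mirrors the change in behavior we saw on real hardware.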
There are still issues, however. After I left, I don't think many other people had success with amd64 MP, so we're still missing something. Research continues, but this is basically the same style of progress we made with early suspend-to-ram, which works pretty well for most people now.
As mentioned in pirofti@'s earlier report, I helped him with some high-level understanding of how the subr_hibernate code works, and he pointed out some machine-dependent (MD) bits that had crept into that code, which we refactored to be machine-independent (MI). All in all, it was a fun and productive 4 days.
Finally, I'll also echo the thanks to krw@, UofT, and our BBQ host!
And thank you, Mike, for this interesting report on important work!