Contributed by marco on from the IBM-cant-read-spec dept.
    typedef struct {
            u_int8_t   smipmi_if_type;        /* IPMI Interface Type */
            u_int8_t   smipmi_if_rev;         /* BCD IPMI Revision */
            u_int8_t   smipmi_i2c_address;    /* I2C address of BMC */
            u_int8_t   smipmi_nvram_address;  /* I2C address of NVRAM
                                               * storage */
            u_int64_t  smipmi_base_address;   /* Base address of BMC (BAR
                                               * format) */
            u_int8_t   smipmi_base_flags;     /* Flags field:
                                               * bit 7:6 : register spacing
                                               *   00 = byte
                                               *   01 = dword
                                               *   02 = word
                                               * bit 4 : Lower bit BAR
                                               * bit 3 : IRQ valid
                                               * bit 2 : N/A
                                               * bit 1 : Interrupt polarity
                                               * bit 0 : Interrupt trigger */
            u_int8_t   smipmi_irq;            /* IRQ if applicable */
    } __packed smbios_ipmi_t;

So there are 2 things wrong here. First, look at base_address: per the spec, if it is IO and not memory mapped the value shall be odd. The IBM box uses IO and not memory mapped IO (this was determined after MANY reboots!). Also wrong is the register spacing. It is set to 0x01 but really should have been 0x00, or else the other calculated register offset will be wrong. In this case we would be poking in 0xca4 instead of 0xca3. Another nice thing is that bit 2 is set; the spec explicitly prohibits setting reserved bits.
The MSI board (the BMC that talks IPMI) is some sort of Taiwanese board that IBM dropped into this server. I found that out while hunting for Linux or some other code to use as a reference. We did find some absolutely awfully written pile of poo driver. After reading through some of that code I understand how the spec could have been completely misinterpreted. Clearly there was some sort of cranium deficiency at work here.
So I am all happy, jumping up and down that we are no longer crashing and are getting values out of the BMC, only to see that it is not working reliably. Damn it, back to the drawing board. During this failure I see some familiar error mechanisms, so I go back to the timeout code that I wrote a few weeks ago and sure enough, there it was. The IO mechanism is several times slower than the memory mapped IO equivalent, so the timeout values were off. Ah, at least one easy fix :-)
I resume the jumping up and down activity, only to shortly run into the next snafu. The IPMI poll seems hung; no values are being updated. Argh!! This is getting old, now what?
By now Jordan is being summoned, so he leaves without committing any code (I'll blog about his activities later), and I receive an NMI of the SIGWIFE type. Dinner, movie, etc.
A very long movie later I resume hacking on this thing. I added a whole bunch of debug goo into the driver, basically to see that I had been overzealous in my previous timeout commit. I did fix the cold (during boot) timeout code but screwed up the normal timeout path. I fix this and get ready for the next disappointment: it is still not working. Now I start disabling random devices that poke into IO space and magically IPMI starts working. Many reboots later I figure out that it is the nsclpcsio and gpio drivers that are causing this. This is Alexander's stuff, and it was too late for me to look at it and too early for him to be awake.
In the morning I found Alexander on ICB and talked to him about it, and he went off and confirmed and fixed the bug. Now I have all the pieces to create a fix for i386 on this box. Some cleanup later the code goes in. There are still 2 things for me to look at on this box. First, this needs to be validated on amd64 as well, and second, the fans are reporting 0 RPM, so there is still something broken. More on this later.
By Anonymous Coward (84.92.159.114) on
Massive kudos to you guys for sticking with this bullshit.
By anon et. al. (80.213.132.8) on
code found in the Linux kernel then, I guess