[PATCH 1/8] pseries: phyp dump: Docmentation

Tue Jan 15 02:21:22 EST 2008

On 13/01/2008, Olof Johansson <olof at lixom.net> wrote:
>
> How do you expect to have it in full production if you don't have all
> resources available for it? It's not until the dump has finished that you
> can return all memory to the production environment and use it.

With the PHYP dump, each chunk of RAM is returned for
general use immediately after being dumped; so its not
an all-or-nothing proposition.  Production systems don't
often hit 100% RAM use right out of the gate, they often
take hours or days to get there, so again, there should
be time to dump.

> This can very easily be argued in both direction, with no clear winner:
> If the crash is stress-induced (say a slashdotted website), for those
> cases it seems more rational to take the time, collect _good data_ even
> if it takes a little longer, and then go back into production. Especially
> if the alternative is to go back into production immediately, collect
> about half of the data, and then crash again. Rinse and repeat.

Again, the mode of operation for the phyp dump  is that you'll
always have all of the data from the *first* crash, even if there
are multiple crashes. That's because the the as-yet undumped
RAM is not put back into production.

> really surprises me that there's no way to reset a device through PHYP
> though. Seems like such a fundamental feature.

I don't know who said that; that's not right. The EEH function
certainly does allow you to halt/restart PCI traffic to a particular
device and also to reset the device.  So, yes, the pSeries
kexec code should call into the eeh subsystem to rationalize
the device state.

> I think people are overly optimistic if they think it'll be possible
> to do all of this reliably (as in with consistent performance) without
> a second reboot though.

The NUMA issues do concern me. But then, the whole virtualized,
fractional-cpu, tickless operation stuff sounds like a performance
tuning nightmare to begin with.

> At least without similar amounts of work being
> done as it would have taken to fix kdump's reliability in the first place.

:-)

> Speaking of reboots. PHYP isn't known for being quick at rebooting a
> partition, it used to take in the order of minutes even on a small
> machine. Has that been fixed?

Dunno.  Probably not.

>  If not, the avoiding an extra reboot
> argument hardly seems like a benefit versus kdump+kexec, which reboots
> nearly instantly and without involvement from PHYP.

OK, let me tell you what I'm up against right now.
I'm dealing with sporadic corruption on my home box.

About a month ago, I bought a whizzy ASUS M2NE
motherboard & an AMD64 2-core cpu, and two sticks
of RAM, 1GB per stick. I have one new hard drive,
SATA, and one old hard drive, from my old machine,
the PATA.  The two disks are mirrored in a RAID-1
config. Running Ubuntu.

During install/upgrade a month ago, I noticed some of
the install files seemed to have gotten corrupted, but
that downloading them again got me a working version.
This put a serious frown on my face: maybe a bad ethernet
card or connection !?

Two weeks ago, gcc stopped working one morning, although
it worked fine the night before. I'd done nothing in the interim
but sleep. Reinstalling it made it work again. Yesterday,
something else stopped working.  I found the offending
library, I compared file checksums against a known-good
version, and they were off. (!!!) Disk corruption?

Then apt-get stopped working. The /var/lib/dpkg/status file
had randomly corrupted single bytes. Its ascii, I hand
repaired it; it had maybe 10 bad bytes out of 2MB total size.

I installed tripwire. Between the first run of tripwire, and the
second, less than an hour later, it reported several dozen
files have changed checksums. Manual inspection of some
of these files against known-good versions show that, at least
this morning, that's no longer the case.

System hasn't crashed in a month, since first boot.  So
what's going on? Is it possible that one of the two disks
is serving up bad data, which explains the funny checksum
behaviour? Or maybe its bad RAM, so that a fresh disk
read shows good data?  If its bad ram, why doesn't the
system crash?  I forced fsck last night, fsck came back
spotless.

So ... moral of the story: If phyp is doing some sort of
hardware checks and validation, that's great. I wish I could
afford a pSeries system for my home computer, because
my impression is that they are very stable, and don't do
things like data corruption.  I'm such a friggin cheapskate
that I can't bear to spend many thousands instead of many
hundreds of dollars. However, I will trade a longer boot
for the dream of higher reliability.

--linas