[PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)

Thu Feb 24 12:05:38 EST 2005

On Wed, Feb 23, 2005 at 01:34:56PM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
> > +#define CONFIG_SCSI_IPR_EEH
> 
> This will obviously need to get cleaned up. This config option should
> go away eventually.. I am assuming it is your way of flagging the
> parts of code you have changed...

Yes, its temporary scaffolding ...

> I really don't like this polling.

Yes, I didn't either, but I couldn't tell how else to do it in IPR.

> this function should start the ipr reset job. You need to prevent
> ipr from talking to the device. Maybe something like:

Yes, and that is what I couldn't figure out how to do.
And now I do ... I'll try this shortly.

> > +static void ipr_eeh_frozen (struct pci_dev *pdev, void * data)
> > +{
> > +	struct ipr_ioa_cfg *ioa_cfg = data;
> 
> Probably don't need the second arg - void * data. You can get the
> ioa_cfg pointer with pci_get_drvdata(pdev)

OK, well, that's worth general discussion.  For the IPR, no,
it may not be needed.  For general C-based OO style coding,
this is the standard style for passing "user data" aka 
"pointer to 'self'" aka "pointer to 'this'".  I notice that 
more and more OO style is creeping into the kernel, and so 
I thought I'd add the standard convention for this here.  

I note that this is *not* the convention currently used in 
struct pci_driver; it uses the "pci_get_drvdata(pdev)" style,
but that is frowned upon in standard OO-style circles.

> > +static void ipr_eeh_perm_failure (struct pci_dev *pdev, void * data)
> > +{
> > +	ipr_cmd->job_step = ipr_reset_shutdown_ioa;
> 
> This needs to "bringdown" the adapter, but not actually touch it. Basically
> stuff like unblocking requests so that we fail them instead of hanging, etc.

Yes.  Actually, right after this, I unconfig the pci slot,
which calls pci_remove_bus_device() .. pci_destroy_dev() .. 
and eventually pci_driver->remove() and so IPR finds out about this
sooner or later anyway.  The goal of this function was to provide
an alternate and/or earlier warning that the device is going away.

Maybe its superfluous.  I tend to add callbacks like this because
I know that sooner or later someone will want one ... 

> > +
> > +static void ipr_register_eeh_handlers (struct ipr_ioa_cfg *ioa_cfg)
> > +{
> > +	/* XXX borken memory management; this malloc not managed */
> > +	struct eeh_recovery_ops *eeh_ops;
> > +	eeh_ops = kmalloc (sizeof(struct eeh_recovery_ops), GFP_KERNEL);
> > +	memset (eeh_ops, 0, sizeof(struct eeh_recovery_ops));
> > +	eeh_ops->frozen = ipr_eeh_frozen;
> > +	eeh_ops->post_reset = ipr_eeh_post_reset;
> > +	eeh_ops->perm_failure = ipr_eeh_perm_failure;
> > +	eeh_ops->data = ioa_cfg;
> > +	eeh_register_recovery_ops (ioa_cfg->pdev, eeh_ops);
> > +}
> 
> If this was generalized a bit more, you could add these function
> pointers into the pci_driver struct so they could be statically
> initialized in each driver.

Yes, that is the intent.  I just stuck it in its own structure 
"for now".  

> > +#ifndef CONFIG_SCSI_IPR_EEH
....

> > +#else
> > +	if (!ipr_reset_allowed(ioa_cfg) && ipr_cmd->u.time_left 
> > +	    && !eeh_slot_is_isolated (ioa_cfg->pdev)) {
> > +			  
> > +		ipr_cmd->u.time_left -= IPR_CHECK_FOR_RESET_TIMEOUT;
> > +		ipr_reset_start_timer(ipr_cmd, IPR_CHECK_FOR_RESET_TIMEOUT);
> > +	} else {
> > +		if (eeh_slot_is_isolated (ioa_cfg->pdev)) {
> > +			ipr_cmd->job_step = ipr_reset_poll_eeh_recovery;
> > +		} else {
> > +			ipr_cmd->job_step = ipr_reset_start_bist;
> > +		}
> > +		rc = IPR_RC_JOB_CONTINUE;
> > +	}
> > +#endif
> 
> Not sure what you are trying to accomplish with this bit of code. Does
> not seem necessary to me.

I want to make sure that the IPR driver holds off from starting the BIST
until after the PCI slot has been reset.  The callback notification
given above is async to the detection of the EEH error, and so the IPR
driver may have already decided that something is wrong, and started a
bist, before the "slot is frozen" callback arrives.  So I wanted to
prevent this.

In other words, the IPR "bist" code should make sure that EEH is not to
blame, and, if it is, wait for the EEH error to clear before continuing
with the reset.

--linas