Hi Richard,
OK, it's taken me a while to learn enough to respond, at least partially, to your suggestions.
-I have cases open with both HP and NetApp
-The storage admins and our HP SE assure me that the LUNs are all configured with mag level availability, with the root disk (yes, this cluster boots from the T800 also) configured with cage availability.
-I have produced a diagram of the Fibre Channel connections between the T800 and the V3170:
https://vishnu.fhcrc.org/3par/3par-netapp-diagram.pdf
I believe this configuration meets your suggestions: the hosts are suitably distributed between the node pairs, for example.
-No zoning, as the V3170 is direct-attached
The HP TAC is pointing to confused SCSI reservation behavior from the V3170. The NetApp TAC is pointing to SCSI timeouts (along with Fibre Channel errors) between the V3170 and the T800. [I can post the details of each TAC's analysis, if you like -- EvtLog, EMS Log, and Syslog extracts showing storms of SCSI CLEAR followed by REGISTER-IGNORE, with no REGISTER, not even after the failover, plus various fci.device.timeouts.]
At the moment, I'm interested in understanding the LESB reports I see in the logs.
Link Failure
Loss of Sync
Loss of Signal
Invalid Transmission Word
Invalid CRC
I've crawled through the last several years of logs and produced both a history
https://vishnu.fhcrc.org/3par/parse-3par.txt
and a summary
https://vishnu.fhcrc.org/3par/parse-3par-stats.txt
Have any insights into how 'normal' these rates are? I understand Link Failure when we reboot a host ... but Link Failure shows up other times, too (perhaps when Tungsten resets an HBA ... I have no story around the Link Failures between the Brocade switches and the T800, though). And there is a steady dribble of Loss of Sync and Loss of Signal with the hosts, and Invalid Transmission Words with the cages.
==> What do Loss of Sync and Loss of Signal mean?
==> Suggestions on where I can read up about these?
==> Know any independent consultants who play in this space (reading logs of SCSI command/response and interpreting the result)?
==> What are your disk failure rates? 5% AFR for us (i.e. we're losing ~5% of our ~500 1-2TB SATA drives annually, in this T800) [As far as I can tell, industry averages range from 2-8%.]
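For anyone comparing numbers, here is a minimal sketch of the AFR arithmetic I have in mind, so we're all dividing by the same thing. The function name and the counts (25 failures among 500 drives over one year) are illustrative assumptions chosen to match the rough ~5% figure above, not exact values from our logs.

```python
def annualized_failure_rate(failures, drive_count, days_observed):
    """Return AFR as a fraction: failures per drive-year of operation."""
    # Total exposure, expressed in drive-years.
    drive_years = drive_count * days_observed / 365.0
    return failures / drive_years


# Hypothetical counts, not taken from the actual logs.
afr = annualized_failure_rate(failures=25, drive_count=500, days_observed=365)
print(f"AFR = {afr:.1%}")
```

If your observation window is shorter or longer than a year, the drive-years denominator normalizes it, so the figures stay comparable.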
--sk