What happens during disk replacement?

skendric · **Joined:** Wed Jan 18, 2012 6:58 am **Posts:** 5

We have a T800 running v2.3.1, loaded with ~500 1-2TB SATA drives. Four clients (a clustered NetApp v3140, a Solaris x86 box, and a Win2008 box).

Does anyone else experience hiccups when a tech replaces a disk?

We lose disks of course: ~1-3 / month. On three occasions over the last two years, while the HP tech is replacing the failed drive in the T800 (Cobalt), one of the NetApp heads (same NetApp head each time: Tungsten-A) panics and hands its services to its partner (Tungsten-B). [Regrettably, that process doesn't work cleanly -- some services require manual intervention before they start on Tungsten-B ... ouch.]

Tungsten-B doesn't notice the event (other than receiving services from Tungsten-A). The Solaris and Windows boxes don't notice the event -- nothing in their logs.

A few tidbits from syslog (I'm leaving most lines out of course):

[I think this is where the HP tech pulls the magazine]
Jan 10 12:13:41 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Degraded (Offloop_Req_Via_Admin_Interface)
Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] Counters: (Invalid transmission word) ALPAs: a3, e0, a9
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:0,sw_pd:225 Magazine 17:2:0, Physical Disk 225 Degraded (Notready, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt comp_state_change hw_cage_sled:17:2:3,sw_pd:228 Magazine 17:2:3, Physical Disk 228 Failed (Invalid Media, Smart Threshold Exceeded, Not Available For Allocations, Missing A Port, Missing B Port, Sysmgr Spundown)
Jan 10 12:13:45 cobalt disk_state_change sw_pd:228 pd 228 wwn 5000C500196CF4C4 changed state from valid to missing because disk gone event was received for this disk.
Jan 10 12:17:54 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Failed (Missing)
Jan 10 12:18:25 cobalt cli_cmd_err sw_cli {3parsvc super all {{0 8}} -1 140.107.42.192 20329} {Command: servicemag start -dryrun -mag 17:2 Error: } {}

[I think the tech has replaced the disk and has reinserted the magazine.]
Jan 10 12:21:15 cobalt comp_state_change sw_port:0:0:3 Port 0:0:3 Normal (Online)
Jan 10 12:21:15 cobalt comp_state_change sw_port:1:0:3 Port 1:0:3 Normal (Online)
Jan 10 12:21:20 cobalt comp_state_change hw_cage:17,hw_cage_sled:2:0:0 Cage 17, Magazine 2:0:0 Normal

[Various Tungsten clients start complaining]
Jan 10 12:31:10 hamster-1 MSSQL$SPS: 833: SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on ...
Jan 10 12:31:18 tungsten-a-svif1 [echodata@tungsten-a: iscsi.notice:notice]: ISCSI: Initiator (iqn.1991-05.com.microsoft:csssql1.[...]) sent LUN Reset request, aborting all SCSI commands on lun 0

[I don't understand this section.]
Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE
Jan 10 12:32:00 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Degraded (Errors on B Port)
Jan 10 12:32:12 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port a0 on 0:0:3: scsi abort/sick/hwerr status TE_ABORTED
Jan 10 12:32:32 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports All used chunklets on this disk will be relocated.

[Tungsten-A gives up]
Jan 10 12:33:17 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:34:41 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled

[More Cobalt error messages]
Jan 10 12:35:01 cobalt comp_state_change hw_cage_sled:17:7:1,sw_pd:231 Magazine 17:7:1, Physical Disk 231 Failed (Invalid Media, Smart Threshold Exceeded, No Valid Ports, Errors on A Port, Errors on B Port)
Jan 10 12:35:01 cobalt dskfail sw_pd:231 pd 231 failure: drive SMART threshold exceeded Internal reason: Smart code 0x00 : Unknown SMART code. All used chunklets on this disk will be relocated.

[More Tungsten clients complaining]
Jan 10 12:52:47 hamster-2 iScsiPrt: 63: Can not Reset the Target or LUN. Will attempt session recovery.

[Tungsten is deep into its failover procedure ... dang this failover takes a while]
Jan 10 12:46:12 tungsten-a-svif1 [tungsten-a: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-b enabled
Jan 10 12:52:36 tungsten-b-svif1 [tungsten-b: cf.fsm.takeoverOfPartnerEnabled:notice]: Cluster monitor: takeover of tungsten-a enabled

==> Does anyone else notice hiccups during disk replacement?
==> Suggestions on where to look to better understand what happened?
==> Suggestions for monitoring we could put in place, to capture more data during the next event?
==> Pointers to URLs to read on interpreting T800 log messages?
==> I'm building a diagram of which ports on the clients plug into which ports on the T800, including how each port is configured. Suggestions on what parameters to include in the diagram?
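On the monitoring question, one low-effort option is to tally array events by type from the forwarded syslog, so that the next replacement can be compared against a quiet day. Here is a minimal Python sketch. It assumes every line follows the "timestamp host event_type component message" layout seen in the excerpts above; `parse_line` and `count_events` are illustrative names, not part of any 3PAR tooling.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts in this post:
#   "Mon DD HH:MM:SS host event_type component free-text message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<event>\S+)\s+(?P<component>\S+)\s+(?P<message>.*)$"
)

def parse_line(line):
    """Split one syslog line into its fields, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None

def count_events(lines, host="cobalt"):
    """Count event types (comp_state_change, lesb_error, ...) logged by one host."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["host"] == host:
            counts[rec["event"]] += 1
    return counts

# Three lines copied from the excerpt above.
sample = [
    "Jan 10 12:13:42 cobalt lesb_error sw_port:0:0:3 FC LESB Error Port ID [0:0:3] "
    "Counters: (Invalid transmission word) ALPAs: a3, e0, a9",
    "Jan 10 12:32:00 cobalt dskabrt hw_disk:5000C5001987AC14;sw_pd:231 pd 231 port b0 "
    "on 1:0:3: scsi abort/sick/hwerr status TE_NORESPONSE",
    "Jan 10 12:32:32 cobalt dskfail sw_pd:231 pd 231 failure: drive has no valid ports "
    "All used chunklets on this disk will be relocated.",
]

event_counts = count_events(sample)
print(event_counts)
```

Running the same tally over a window around each disk replacement and over a normal window would show whether lesb_error or dskabrt counts actually change during service.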

--sk

Stuart Kendrick
FHCRC

Richard Siemers · **Posted:** Tue Jan 24, 2012 12:52 am

Hello Stuart.

We also have NetApp V3140s serving up 3PAR disk; ours are diskless and boot from the 3PAR.
I have never had any issues like the ones you describe. I would recommend taking full advantage of your support options and getting a failure analysis performed.

The clues that stand out to me: Windows and Solaris have no issue, only the NetApp does, and it's always the Tungsten-A node.

First and foremost, audit the LUNs assigned to the Tungsten-A node, make sure they are all properly configured with a valid raid level and either "cage availability" or "mag availability".

Then check the SAN connections to the NetApp, generally zoning. Each NetApp should have no more, and no fewer, than 2 ports zoned per 3PAR, so in your V3140 cluster you will have a total of 4 paths to a single 3PAR. It's also important that you use node pairs on the 3PAR side. A node pair is a horizontal pair of nodes: your T800 could have anywhere from 2 to 8 nodes, or 1 to 4 pairs of nodes. Pairs are side by side and share disk shelves. For the NetApp to have proper redundant access to disk, it needs to be connected to each side of a 3PAR node pair. So let's say you have 4 nodes, 4, 5, 6 and 7, on your T800. Tungsten-A should have one path zoned to node 4 and one path to node 5. Tungsten-B should go to 6 and 7. Zoning to 4 and 6 (vertical pairing) is not supported. The main issue at hand here is that the NetApp does not do MPIO: each attached LUN is active on one path and passive on the other. Each additional LUN added flip-flops which path is active/passive.
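The pairing rule above can be checked mechanically against a cabling diagram. This is a rough Python sketch, assuming nodes are paired as (0,1), (2,3), (4,5), (6,7), which matches the 4/5 and 6/7 example; `node_pair` and `check_head` are hypothetical helper names.

```python
def node_pair(node):
    # Assumption: horizontal pairs are (0,1), (2,3), (4,5), (6,7).
    return node // 2

def check_head(nodes):
    """nodes: the 3PAR node numbers that one NetApp head is connected to.

    Returns a list of problems; an empty list means the head follows the rule
    of exactly two paths, one to each side of a single node pair.
    """
    problems = []
    if len(nodes) != 2:
        problems.append(f"expected 2 paths, found {len(nodes)}")
    elif nodes[0] == nodes[1]:
        problems.append("both paths land on the same node")
    elif node_pair(nodes[0]) != node_pair(nodes[1]):
        problems.append("paths span two node pairs (vertical pairing)")
    return problems

print(check_head([4, 5]))  # layout recommended above for Tungsten-A
print(check_head([4, 6]))  # the unsupported vertical pairing
```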

Out of curiosity, what is the driving factor for using the Netapp for iSCSI instead of directly from the 3PAR?

Hope this helps,

Richard

skendric · **Joined:** Wed Jan 18, 2012 6:58 am **Posts:** 5

Hi Richard,

OK, it has taken me a while to learn enough to respond, at least partially, to your suggestions.

-I have cases open with both HP and NetApp
-The storage admins and our HP SE assure me that the LUNs are all configured with mag level availability, with the root disk (yes, this cluster boots from the T800 also) configured with cage availability.
-I have produced a diagram of the Fibre Channel connections between the T800 and the V3170: https://vishnu.fhcrc.org/3par/3par-netapp-diagram.pdf I believe this configuration meets your suggestions: the hosts are suitably distributed between the node pairs, for example.
-No zoning, as the V3170 is direct-attached

The HP TAC is pointing to confused SCSI reservation behavior from the V3170. The NetApp TAC is pointing to SCSI timeouts (along with Fibre Channel errors) between the V3170 and the T800. [I can post the details of each TAC's analysis, if you like -- EvtLog, EMS Log, and Syslog extracts showing storms of SCSI CLEAR followed by REGISTER-IGNORE, with no REGISTER, not even after the failover, various fci.device.timeouts.]

At the moment, I'm interested in understanding the LESB reports I see in the logs.
Link Failure
Loss of Sync
Loss of Signal
Invalid Transmission Word
Invalid CRC
I've crawled through the last several years of logs and produced both a history
https://vishnu.fhcrc.org/3par/parse-3par.txt
and a summary
https://vishnu.fhcrc.org/3par/parse-3par-stats.txt

Have any insights into how 'normal' these rates are? I understand Link Failure when we reboot a host ... but Link Failure shows up other times, too (perhaps when Tungsten resets an HBA ... I have no story around the Link Failures between the Brocade switches and the T800, though). And a steady dribble of Loss of Sync and Loss of Signal with the hosts, and Invalid Transmission Words with the cages.

==> What do Loss of Sync and Loss of Signal mean?
==> Suggestions on where I can read up about these?
==> Know any independent consultants who play in this space (reading logs of SCSI command/response and interpreting the result)?
==> What are your disk failure rates? 5% AFR for us (i.e. we're losing ~5% of our ~500 1-2TB SATA drives annually, in this T800) [As far as I can tell, industry averages range from 2-8%.]
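For anyone comparing rates, the AFR arithmetic is simple enough to sanity-check. A short Python sketch using only the figures already quoted in this thread (~500 drives, 1-3 failures per month); the function name is illustrative.

```python
def annualized_failure_rate(failures, drives, months):
    """Failures observed over `months`, scaled to a 12-month rate per drive."""
    return failures / drives * (12 / months)

DRIVES = 500  # approximate drive count quoted in this thread

low = annualized_failure_rate(1, DRIVES, 1)    # 1 failure per month
high = annualized_failure_rate(3, DRIVES, 1)   # 3 failures per month
print(f"AFR range: {low:.1%} to {high:.1%}")

# A 5% AFR on ~500 drives corresponds to about this many failures per year.
failures_per_year = 0.05 * DRIVES
print(f"5% AFR is roughly {failures_per_year:.0f} drives per year")
```

So 1-3 failures a month brackets the 5% figure, which works out to roughly two drives a month.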

--sk

skendric · **Joined:** Wed Jan 18, 2012 6:58 am **Posts:** 5

Hi Richard,

And to follow up on a couple of other thoughts in your post:

-We notice that the NetApp hosts are clustered, whereas the Solaris & Windows hosts are not -- meaning, the NetApp boxes can more easily engage in SCSI reservation conflicts.

-Also, the NetApp hosts use the T800 constantly, ~10,000 IOPS average per 24 hr period, whereas the Solaris and Windows hosts will occasionally burst to a few thousand IOPS but are generally silent.

-The thinking behind sending iSCSI traffic through the NetApp rather than directly to the T800 has two components:
(a) Mostly, the iSCSI hosts want snapshotting (so they tend to load SnapManager for Exchange and SnapManager for SQL Server)
(b) Seemed simpler conceptually to us: push all the IP-based protocols through the NetApp heads; leave the T800 speaking only Fibre Channel.
Most of the IO originates from NFS and CIFS clients, with the iSCSI SQL Server clusters close on their heels.

--sk

Richard Siemers · **Posted:** Sat Feb 25, 2012 1:48 pm

Cabling diagram looks good, excellent documentation on your part.

I had recurring LESB errors from a node/controller to a disk shelf. It never caused an outage, but it generated alerts. We replaced the cable, the SFPs and the controller on the disk shelf, and the problem resolved. I have never seen LESB errors on a host port, but all my ports are attached to a switch. Even direct-connected to a NetApp, those ports should not be resetting or reinitializing by themselves. I assume you are running at 4G speed; are you using the 50/125 OM3 aqua cables or orange cables?

I have also noticed the high IOPS from the NetApp. I've opened and escalated several cases with NetApp to no avail. It's as if the NetApp is doing aggressive disk scrubbing and CRC checksumming with idle cycles on the 3PAR LUNs. NetApp offered no way to throttle that back or disable it. 3PAR documents recommend dedicating front-end ports to the NetApp, which you have done, and which helps mitigate this issue. In my measurements, the IOPS are high but the MB/s are very low.

Something else that comes to mind is the number and size of LUNs assigned to the NetApps. I believe that when you assign LUNs to the NetApp, it decides which path will be primary and which one will be the standby; normally it flip-flops these to load balance the disk workload. You may want to check your disks on the NetApp to make sure they're split about 50-50 across the 2 paths.

skendric · **Joined:** Wed Jan 18, 2012 6:58 am **Posts:** 5

Hi Richard,

OK, so, after more digging:
-We see LESB errors on one of the host ports ... and, per your thought, are suspecting physical layer issues (cabling, optics) on that path. All the rest of the LESB reports come from cage ports. [Yup, we use 50 micron (aqua) multimode glass for this application, yup, 4Gb FC.]

-WRT ONTAP and V-Series: according to my understanding (take this with salt; I don't claim to be an expert here), WAFL still performs its write optimization behavior, even with V-Series. i.e. it shuffles blocks around on the backend, in order to build up lots of 128K contiguous segments of blocks. This is one of the ways in which WAFL achieves its strong write performance: since it can 'Write Anywhere' (the WA part of WAFL), it can convert random IO into sequential IO ... assuming that the backend has 128K chunks of contiguous blocks available to receive these bursts of IO. Of course, this is a total waste of CPU and IO on a V-Series ... since the physical block layout is abstracted by the backend (3Par in our case) ... so WAFL spends all this energy lining up 128K segments of contiguous /logical/ blocks ... into which it dumps its writes ... but that logical arrangement of blocks has no relevance to the physical block structure on the 3Par box, so it doesn't benefit anyone. I offer this because perhaps it explains some of the low-throughput /hi-count IO you are seeing. In a perfect world, ONTAP would disable this portion of WAFL's optimization behavior on V-Series ... but presumably, this would involve a non-trivial change in code and so hasn't been done yet.

-Returning to the issue which attracts my attention these days: whenever we fry a disk, we see Service Time latencies for that disk spike, into the 2-4 second range
https://vishnu.fhcrc.org/3par/Physical- ... rmance.pdf
Now, 2-4 seconds doesn't sound bad ... but, as I understand it, System Reporter /averages/ Service Time across a minute to produce that 2-4 second number ... so ... it is quite likely that we're seeing Service Times spiking much higher than that ... into the 5-15 second range which ONTAP uses for its Fibre Channel / SCSI timers. This smells like a problem to me.

-One can see the FC and SCSI timers firing in the EMS logs (/etc/log/ems*)
https://vishnu.fhcrc.org/3par/2012-02-1 ... ms-log.txt
As one might guess from this log, the T800 fried a disk on Feb 16 and then again on Mar 20
But ... we see these timeouts intermittently, every week or two, even when we aren't frying disks.
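To illustrate the averaging concern above: a one-minute mean in the 2-4 second range is compatible with a few individual IOs taking far longer. The Python sketch below uses made-up per-IO service times (not measurements from this array) and takes the 5-second figure from the low end of the ONTAP timer range mentioned above.

```python
from statistics import mean

# Hypothetical service times (seconds) for the IOs one disk completed in a minute:
# most are fast, two are very slow. These numbers are invented for illustration.
service_times = [0.02] * 8 + [15.0] * 2

TIMEOUT_S = 5.0  # low end of the 5-15 second timer range cited above

avg = mean(service_times)
worst = max(service_times)
over_timeout = sum(1 for t in service_times if t > TIMEOUT_S)

print(f"one-minute average: {avg:.3f} s")
print(f"worst single IO:    {worst:.1f} s")
print(f"IOs over {TIMEOUT_S:.0f} s:       {over_timeout}")
```

Here the average lands near 3 seconds even though two IOs sat at 15 seconds, which is why a per-IO histogram is more telling than an averaged Service Time.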

==> Have you ever taken System Reporter PD Perf snapshots after a disk failure and seen similarly high Service Times?
==> Do the EMS logs on your NetApps report intermittent Fibre Channel timeouts?
[BTW: I convert my EMS logs from their native XML format into what I call 'syslog format' using a script: https://vishnu.fhcrc.org/3par/convert-o ... -to-syslog]

--sk

Richard Siemers · **Posted:** Thu Mar 22, 2012 2:24 am

What do your VV service times look like during that PD failure?

Can you describe what LUN 173 is, and how it is configured? What hardware has been replaced thus far to try and eliminate the LESB errors?

skendric · **Joined:** Wed Jan 18, 2012 6:58 am **Posts:** 5

We haven't tackled the host port LESB reports ... but I'm not too concerned about that right now, as that particular host (plugged into the port producing intermittent LESB reports) isn't reporting difficulties (Solaris box). [Of course, it may be /having/ difficulties but not be logging anything! But no user complaints.]

On the NetApp side, here is a view into VLUN latencies across the last week:
https://vishnu.fhcrc.org/3par/VLUN-Histogram/

Yesterday (2012-03-21) was a good day for latencies -- three disks fried. In the 16384ms bucket (which, correct me if I'm mistaken, contains latencies in the 16384 - 65535ms range), I see 48 Read IOs and 8 Write IOs. Every one of those would have triggered an ONTAP timeout.

Of course, even on days when we don't fry disks, we'll see a handful of IOs landing in that bucket:
https://vishnu.fhcrc.org/3par/VLUN-Hist ... togram.pdf
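Counting the IOs that land at or beyond a timeout threshold can be scripted from the histogram buckets. This is a minimal Python sketch, assuming each bucket is keyed by its lower bound in milliseconds. Only the 16384 ms row comes from the 2012-03-21 figures above; the other rows are placeholders, and `slow_ios` is an illustrative name.

```python
def slow_ios(histogram, threshold_ms):
    """Sum (reads, writes) over buckets whose lower bound is at or above threshold_ms."""
    reads = sum(r for lower, (r, w) in histogram.items() if lower >= threshold_ms)
    writes = sum(w for lower, (r, w) in histogram.items() if lower >= threshold_ms)
    return reads, writes

# bucket lower bound (ms) -> (read IOs, write IOs)
histogram = {
    1024: (120, 30),   # placeholder counts, not from the report
    4096: (15, 4),     # placeholder counts, not from the report
    16384: (48, 8),    # figures quoted above for 2012-03-21
}

# Every IO in a bucket starting at 16384 ms has exceeded a 15-second timer.
reads, writes = slow_ios(histogram, 15000)
print(f"IOs past 15 s: {reads} reads, {writes} writes")
```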

WRT LUN173 ... I'm afraid I'm so new to this space that I don't really know what you're asking. I believe ... plse correct me if I'm fumbling the lingo here ... each NetApp LUN is configured as a 14+2 RAID6 set. On the 3Par side, we configure ~1TB (more recently, 2TB) VLUNs ... which, of course, show up as LUNs on the NetApp side. Were you looking for something different than that?

--sk

Richard Siemers · **Posted:** Thu Mar 22, 2012 10:13 am

skendric wrote:

WRT LUN173 ... I'm afraid I'm so new to this space that I don't really know what you're asking. I believe ... plse correct me if I'm fumbling the lingo here ... each NetApp LUN is configured as a 14+2 RAID6 set. On the 3Par side, we configure ~1TB (more recently, 2TB) VLUNs ... which, of course, show up as LUNs on the NetApp side. Were you looking for something different than that?

So the 3PAR is doing the RAID6 14+2, not the NetApp, correct? Are the NetApp aggregates' RAID levels set to raid0?

How many days of hi-res data is your System Reporter configured to provide? I changed mine to 7 days; I think the default was 24 hours. So you have been watching PDs and VVs and having a hard time zeroing in on a culprit. Try looking at what the FC ports to the shelves are doing. I once had a problem where the NL drives on just one shelf were misbehaving while the other 13 shelves were working fine. It was hard to identify the problem. Try a System Reporter query that looks like:

High-Res PD Performance vs Time
--------------------------------------------------------------------------------
Current Selection
Systems: <Your 3PAR system> ;
PDIDs : --All PDIDs-- ;
Ports (n:s:p) : --All Ports-- ;
Disk Speeds : --All Disk Speeds-- ;
Disk Types : NL ; <---- first try this as ALL Disks, see note below
Compare : n:s:p <---- this is important
Select Peak : total_svctms

All your N:S:P lines should be within a couple of ms of each other and very symmetrical. If you see a line way out of tolerance from the rest of the group, re-run the query for each disk type to see if the non-compliant N:S:P is specific to one type of disk. This could help narrow down issues to a single shelf, or show that workload is not evenly balanced across shelves.
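If the per-port numbers are exported from the report, the "out of tolerance" comparison can be done in a few lines. This is a rough Python sketch, assuming one peak service time per N:S:P and a tolerance of 2 ms around the group median; the port values are invented for illustration and `flag_outliers` is a hypothetical helper.

```python
from statistics import median

def flag_outliers(svct_by_port, tolerance_ms=2.0):
    """Return the ports whose service time differs from the group median by more
    than tolerance_ms, sorted by port name."""
    mid = median(svct_by_port.values())
    return sorted(
        port for port, ms in svct_by_port.items() if abs(ms - mid) > tolerance_ms
    )

# Hypothetical peak total service times (ms) per back-end port (n:s:p).
svct_by_port = {
    "0:0:3": 12.0,
    "1:0:3": 12.5,
    "2:0:3": 11.8,
    "3:0:3": 31.0,
}

outliers = flag_outliers(svct_by_port)
print(outliers)
```

The median is used rather than the mean so that a single bad port does not drag the reference point toward itself.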
