HPE Storage Users Group

A Storage Administrator Community




 Post subject: Physical Disk Failures
PostPosted: Mon Feb 10, 2014 12:58 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
PD failures are common... I wanted to discuss/share/learn how to properly audit and verify the PD failure and recovery process. I have seen drives fail, come back online, then fail again a week later. I am curious what the workflow is for a PD failure: where in that workflow does the system try to re-spin the drive, or move readable chunklets off the drive vs. rebuild them from parity? At what point does it stop trying to re-spin the drive and just rebuild everything from parity?

How many different ways does a PD fail, and how does the system react to each? I can think of a few: a port-A or port-B failure, both ports failing, more than 5 chunklets going bad on the drive (media errors), and a failed drive that can no longer be read from at all.

So at the point of a drive failure, after an alert is sent out to the customer and HP... what happens next?

To see which drives are failed:
showpd -failed -degraded

Shows chunklets that have moved, are scheduled to move, or are moving:
showpdch -mov

Shows chunklets that have moved, or are moving, from a specific PD:
showpdch -from <pdid>

It appears that "showpdch -sync" may reveal which chunklets are being rebuilt from parity.

It appears that "showpdch -log" may show which chunklets are offline, but being serviced through parity reads, and logged writes, as in what happens during a service mag to the other 3 drives on a 4 drive magazine.


One thing I would like to be able to do is confirm for the field tech that the system is ready for him to come onsite. What I currently do to "check" this is a couple of things, because I am not 100% confident the first few are absolute.

1: showpd -space <failed pd #>
Code:
ESFWT800-1 cli% showpd -space 285
                         -----------------(MB)------------------
 Id CagePos Type -State-   Size Volume Spare Free Unavail Failed
285 7:9:1   FC   failed  285440      0     0    0       0 285440
----------------------------------------------------------------
  1 total                285440      0     0    0       0 285440

If I don't see Volume at 0, then I assume the drive evac/rebuild is not complete yet.

2: showpdch -mov <failed pd #>
Code:
ESFWT800-1 cli% showpdch -mov
Pdid Chnk                 LdName LdCh  State Usage Media Sp Cl     From  To
  42  584          tp-2-sd-0.144  514 normal    ld valid  Y  N  285:793 ---
  42  792           tp-2-sd-0.69  726 normal    ld valid  Y  N  285:488 ---
  42 1084           tp-2-sd-0.86  478 normal    ld valid  Y  N  285:521 ---
 102  574          tp-2-sd-0.140  917 normal    ld valid  Y  N  285:785 ---
 102  771           tp-5-sd-0.31  181 normal    ld valid  Y  N  285:190 ---
 102 1085           tp-2-sd-0.41  438 normal    ld valid  Y  N  285:418 ---
 109  580            tp-5-sa-0.3   42 normal    ld valid  Y  N   285:47 ---
 109  771          tp-2-sd-0.130  696 normal    ld valid  Y  N  285:697 ---
...
...
---------------------------------------------------------------------------
Total chunklets: 824

If I see any chunklets still on PDID 285 (the failed one), or rows with data in the To field, I assume the rebuild/evac is not done yet.


Is there any way to view the tasks that relocate/rebuild these chunklets? I don't see anything in my showtask history.
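One thing worth trying may be showtask -all, which also lists system-initiated tasks, though I am not sure chunklet relocations are ever surfaced as tasks:
Code:
showtask -all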

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Feb 11, 2014 7:00 am 

Joined: Wed Nov 09, 2011 12:01 pm
Posts: 392
I tend to see three methods:

1) Disk fails: little warning, and an auto rebuild from parity.
2) Disk failing: sometimes get warnings, and the system auto-moves data elsewhere.
3) Disk not happy: maybe a few warnings, or the disk is no longer available for allocations, but it requires manual servicing to start the data rebuild/move process before the engineer arrives.

Support are typically aware whether the disk is ready for replacement, but I am not sure what info from the SP uploads they check for that; I suspect the estimates they sometimes give are generic, based on disk type/size.

The fun tends to begin when the extra load from the rebuild fails another disk, and/or when inserting the new disk doesn't go to plan. Three different service companies and over a dozen different engineers in 5 years have led to random events during replacements, but no data loss. ;)


 Post subject: Re: Physical Disk Failures
PostPosted: Fri Feb 14, 2014 3:26 am 

Joined: Tue Oct 30, 2012 10:05 am
Posts: 26
Hi,

Run a showpd -c pdid (e.g. showpd -c 285) and check that the following columns are zero:
* NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
* SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE

If any of them is not zero, the drive is not ready to be swapped.
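For a quick pre-flight before telling the tech to drive over, both checks from this thread can be run back to back (285 being the failed PD from the earlier example):
Code:
showpd -space 285
showpd -c 285

Volume should be 0 in the first, and the Used/Fail/Free counts listed above should all be 0 in the second.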


Cheers


 Post subject: Re: Physical Disk Failures
PostPosted: Mon Feb 24, 2014 5:57 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
Thanks for that feedback.

What determines which sort of servicemag they will do? I have seen cases where they used logging and the servicemag wasn't initiated until the tech and the part were onsite, and other cases where they did a full servicemag several hours before the tech arrived.

I presume it's based on the activity of the system; how does one determine which to use, and when?

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Feb 25, 2014 5:52 am 

Joined: Wed Nov 09, 2011 12:01 pm
Posts: 392
It's been a while since I've seen a full evac of a mag; I'd guess performance, load and % full would be considered. Logging seems to be the norm now. I know they used to have concerns regarding how long you could run with logging on, but we've had failed inserts of new disks that left us running with logging for several hours until the engineer was able to get hold of someone who knew enough about 3PAR under the hood to work around the problem.
It may be that I've only seen a full evac when replacing an entire mag (there was a time when FC450 disks weren't available, so any failures were replaced with FC600 disks, and all the disks in a mag had to be the same size), or on some early disk replacements where the mag was only maybe 10% full and logging was still a new feature.
I have also had to start the servicemag manually and tell the engineer to come back in a few hours when Support have forgotten, or had them ask me to do it because certain Support staff were using some remote portal that broke often (other teams appeared to have access to better tools at the time and didn't have a problem :) ).


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Feb 25, 2014 9:48 am 

Joined: Fri Feb 14, 2014 2:26 pm
Posts: 20
eve wrote:
Hi,

Run a showpd -c pdid (e.g. showpd -c 285) and check that the following columns are zero:
* NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
* SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE

If any of them is not zero, the drive is not ready to be swapped.


Cheers


For a drive to be ready for replacement, only USED OK and USED FAIL need to be 0. HP doesn't care about the others.


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Feb 25, 2014 9:53 am 

Joined: Fri Feb 14, 2014 2:26 pm
Posts: 20
Our reporting shows that a drive has failed.

I double-check on the InServ via the command line.
Code:
showpd -failed -degraded


I check servicemag to make sure nothing is currently running:
Code:
servicemag status


I issue the command below to get the model number of the drive, since HP will not have it on file for the company I work for:
Code:
showpd -i <PD#>


I issue the following to get the drive position, drive state, and chunklet status:
Code:
showpd -c <PD#>


The following two commands are also needed by support:
Code:
showversion

Code:
showsys


If replacing a single drive, I issue
Code:
servicemag start -log -pdid <PD#>

The InServ will begin preparing to take the magazine offline and will log writes normally bound for this magazine to other magazines in the system.

Verify the magazine is ready to be pulled by issuing
Code:
servicemag status

You should see SUCCEEDED when it's ready to be pulled. The orange indicator light on the magazine will be lit.

Replace the drive in the magazine, put the magazine back in the InServ, and wait for the orange light to go away, or make sure all of the lights on the magazine are green and NOT blinking. Blinking lights indicate the drives are still spinning up after the magazine is first inserted.

Back at the command line, type in:
Code:
cmore showpd

You should see the drive placement at the top, with a state of NEW. This just shows that the InServ sees the new drive and is ready to go.

Issue the following to have the servicemag script resume the magazine.
Code:
servicemag resume <CAGE#> <MAGAZINE#>


That is how it is done here.
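For anyone who wants the whole sequence in one place, here is the walkthrough above condensed into a single annotated cheat-sheet, using PD 285 in cage 7, magazine 9 (matching the CagePos 7:9:1 example earlier in the thread); substitute your own IDs, and treat this as a sketch rather than gospel:
Code:
showpd -failed -degraded          # confirm which PD is failed/degraded
servicemag status                 # make sure no servicemag is already running
showpd -i 285                     # drive model/serial for the support case
showpd -c 285                     # position, state, chunklet counts
showversion
showsys
servicemag start -log -pdid 285   # prepare the magazine for removal
servicemag status                 # wait for SUCCEEDED before pulling the mag
# (physically swap the drive; wait for solid green LEDs)
cmore showpd                      # new drive should show state NEW
servicemag resume 7 9             # resume the magazine
servicemag status                 # confirm the resume completes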


 Post subject: Re: Physical Disk Failures
PostPosted: Thu Mar 06, 2014 2:41 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
Excellent, thank you for the step-by-step write-up!

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Mar 11, 2014 6:07 am 

Joined: Tue Oct 30, 2012 10:05 am
Posts: 26
corge wrote:
For a drive to be ready for replacement, only USED OK and USED FAIL need to be 0. HP doesn't care about the others.





HP does care about the others.
I have been servicing 3PAR for quite some years, and I know from experience that you may run into issues if they are not all zero.


 Post subject: Re: Physical Disk Failures
PostPosted: Tue Mar 11, 2014 6:18 am 

Joined: Tue Oct 30, 2012 10:05 am
Posts: 26
corge wrote:
[full drive-replacement walkthrough, quoted from the post above]
A few things to add:

* showpd -c
Be sure to check that the following columns are zero:
- NORMAL USED OK, NORMAL USED FAIL, NORMAL UNUSED FREE
- SPARE USED OK, SPARE USED FAIL and SPARE UNUSED FREE
If any of them is not zero, the drive is not ready to be swapped.

* servicemag start -log -pdid xx
The -log option is only needed on S, T and V-class systems, where you have four drives on a magazine.
-log will divert the write IO for the three remaining drives on that mag to other disks; the logged writes are played back to the disks during the resume (read IO is served from parity for all four drives).
-log is the 3PAR-recommended option for large drives.
If the -log option is left out on S, T and V-class systems, you will issue a full servicemag, which copies all data from the three remaining drives on that mag to other disks. This will take hours.

* If you run a showpd after replacing the failed drive, the new drive may show status "degraded" instead of "new". This means the drive is running old firmware. Just continue with the servicemag resume, as the drive will be upgraded first during the resume.

If the drive shows "failed", try a reseat.
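To make the distinction above concrete, the two variants for the failed PD 285 from earlier in the thread would be something like this (from memory, so verify against your InForm OS version):
Code:
servicemag start -log -pdid 285   # logging: divert/log writes while the drive is out
servicemag start -pdid 285        # full servicemag: relocate all chunklets first; takes hours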

