Hey folks - see below for the long story, or take the short version in this first paragraph - the new Gold 764W power supplies that are now shipping in new StoreServ 7200s for the controller nodes do not seem to be able to handle the power fluctuation when UPSes (or at least HP R5000 units) switch from utility to battery during a brownout or blackout, and it causes **all** the drives in cage 0 to go offline immediately, failing any VVs that are on them.
In my case, this customer's new 7200 consisted of two SFF cages - the controller cage and one additional cage, for 48 x 450GB SFF drives in total. Both cages were plugged into a pair of brand new R5000 UPSes (the left power supplies of both cages were plugged directly into the back of one R5000 on load segment 2, and the right power supplies of both cages were plugged directly into the back of the other R5000 on load segment 2).
On load segment 1 of each R5000 were 4 DL380 Gen9 servers, a StoreOnce 2700, and a ProCurve 5406R (each device having its power supplies split between the two R5000s), which resulted in a load of approximately 26% to 28% on each R5000 (including the StoreServ cages). The customer location has a 100 kVA standby generator as well, which would kick in after approximately 1 minute of utility power being offline.
Last Thursday afternoon, just as we had finished Storage vMotioning the customer's datastores off their old EVA P6300 to the 7200, the power flickered, and suddenly everything appeared to go down. When I went into the computer room to check, I found all 24 drives in the controller cage showing amber lights, and IMC suddenly reported all VVs were failed because 24 drives were failed. Everything else running on the R5000s stayed up, and cage 1 stayed up. I figured it must have been a really bad power spike, so we proceeded to restart the customer's entire environment, and once both StoreServ cages were power cycled, everything came back, so I didn't really think anything else of the episode.
On Saturday afternoon, after having pulled the EVA out of the rack and installed the StoreServ, we powered the StoreServ back up, and while waiting for it to finish starting, I was in the back of the rack finishing plugging in the power cables for the new DL380 Gen9 servers. Apparently the power browned out again (I didn't notice, but I found out later that one of my onsite deployment team noticed lights dimming in the office), and suddenly cage 0 was all amber again. This time I figured it had to have been me in the back of the cabinet somehow causing a power issue. A quick power cycle of the StoreServ and it was back up.
When the deployment team left the customer site Sunday night around 9 pm, the last thing they did was stick their heads into the computer room and do an amber light scan - it was running fine. When we arrived onsite Monday morning at 7:15, the first thing I did was head to the computer room, only to find the controller cage showing amber lights on all drives. After swearing, I proceeded to reboot the StoreServ, then opened a ticket with HP.
At first they were rather skeptical of my story, and insisted I must not have things installed correctly (despite the fact that I'm a certified 3PAR installer). Eventually they sent out a field tech who verified the configuration, and the ticket got bumped up to level 2 support. The rest of Monday and Tuesday were uneventful.
Then Wednesday at noon, that 100 kVA standby generator spun up to perform its weekly failover test, and at 12:05, the automatic transfer switch flipped over to the generator, causing a 1.5 to 2 second building blackout... And WHAM!!! Cage 0 goes down - but nothing else does (and nothing registers as an issue in the Gen9 iLOs either). At this point, after swearing loud enough that everyone in the customer's office heard me, I start to get wise. I go restart just the StoreServ this time and bring it back online, but I leave all the other hosts and devices offline, knowing that in about 35 more minutes the generator is going to end its self test, and that we'll have another blackout when it fails us back to utility power. Sure enough, 35 minutes later, off goes the power for about 3/4 of a second, and along with it, cage 0.
After getting everything back online and roping in level two support again, I hear back from them that they are sending me another field engineer (whom I happen to have worked with in the past) to check out the R5000s and to double check my work again. Then I find out the case has been pulled from level two and has now been given to engineering (it was also referred to as level 4 support a couple of times)... When the field engineer contacted me, I advised him the end customer was rather unhappy and that we needed to take steps to fix this. Based on what I had seen, I was pretty confident the issue was not with the R5000s, since everything else stayed up (including the StoreServ controllers - it was only the drives in cage 0 that were going offline), so I requested he bring a pair of 764W power supplies for cage 0 with him.
Once he arrived and finished double checking my installation work (and tracing power cables - not very hard with a 2' power cable running from each cage straight into the UPS immediately below it), we set out to replace the power supplies.
When the FE pulled the first power supply and set it on the desk next to the ones he had brought so he could remove the battery, the first thing I noticed was the writing on the power supply that had shipped with the StoreServ versus the one the FE brought. First, it had a different revision number on it (202 versus 201); second, it had a different spare part number on it (727286-001 versus 683239-001); and third, it had the words "Gold Series" written in yellow on the handle under the words 764W PCM.
Now I've been around the industry in both pre-sales and post-sales tech support long enough to recognize that when something power-related is labeled gold or platinum, it carries an Energy Star efficiency rating, which is not always for the better. I've also been around the StoreServ 7200 long enough to know that every other StoreServ 7000 I had installed did not come with Gold Series power supplies. And when we checked cage 1, the 560W power supplies did not have Gold on them either. I then proceeded to double check my install photo archive of the last couple of 7200s I had installed (all on R5000s) and found not one of them had Gold Series power supplies.
I then proceeded to call a couple of my customer sites to verify the spare part number of the power supplies in their 7200s, and every other 7200 had a 764W PCM spare part number of 683239-001, not 727286-001 like these. The FE and I finished swapping out the two Gold power supplies with the two "non-Gold" power supplies he had brought with him, and then we waited for our test window at noon (we had arranged with the customer to take the entire company down from 12 to 1 pm to run a generator self test again).
At 12, I shut down and powered off all the customer's VMware hosts. Then we went to the generator control panel and initiated a self test. After waiting the 60 seconds for the generator to warm up, the automatic transfer switch flipped over and we experienced a 1.5 to 2 second blackout - just like the day before, but this time the drives in cage 0 stayed online. We then terminated the generator self test, and 2 minutes later we got another blackout lasting about 3/4 of a second when the switch back to utility power occurred. Again cage 0 stayed online. We performed this cycle one more time, and during each blackout, cage 0 stayed online, so right now I'm concluding that there is a major issue with these Gold power supplies and that we fixed this customer's issue by swapping them out for the older, non-Energy Star rated power supplies.
In the end, we were able to correlate the brownouts and blackouts in the UPSes' logs with the times that the StoreServ drives went offline the last 3 times (logging hadn't been enabled on the UPSes at the time of the first two events - we had previously planned to do that the afternoon that the 3rd outage occurred).
As a follow up on why the FE brought the older version power supplies: his spare parts list for the 7200 was a few months old and did not contain these new power supplies. The spare parts list I downloaded from CSN to compare with was about 1 month old, and it has both power supplies listed. Further to that, when I spec'd this 7200 in early September in SBW, there was no option for power supplies, so I was rather surprised when I saw there were now two different power supplies.
Sorry for the length of this post, but hopefully this long tale will save someone else's bacon while we wait on HP to either recall these new Gold power supplies or adjust their sensitivity via firmware.
dcc