Hey folks - see below for the long story, or take the short version in this first paragraph - the new Gold 764W power supplies that are now shipping in new StoreServ 7200s for the controller nodes do not seem to be able to handle the power fluctuation when UPSes (or at least HP R5000 units) switch from utility to battery during a brownout or blackout, and it causes **all** the drives in cage 0 to go offline immediately, failing any VVs that are on them.
In my case, this customer's new 7200 consisted of two SFF cages - the controller cage and one additional cage, for 48 x 450GB SFF drives in total. Both cages were plugged into a pair of brand new R5000 UPSes (the left power supplies of both cages were plugged directly into the back of one R5000 on load segment 2, and the right power supplies of both cages were plugged directly into the back of the other R5000 on load segment 2).
On load segment 1 of each R5000 were 4 DL380 Gen9 servers, a StoreOnce 2700, and a ProCurve 5406R (each device having its power supplies split between the two R5000s), which resulted in a load of approximately 26% to 28% on each R5000 (including the StoreServ cages). The customer location has a 100 kVA standby generator as well, which would kick in after approximately 1 minute of utility power being offline.
Last Thursday afternoon, just as we had finished Storage vMotioning the customer's datastores off their old EVA P6300 to the 7200, the power flickered, and suddenly everything appeared to go down. When I went into the computer room to check, I found all 24 drives in the controller cage showing amber lights, and IMC suddenly reported all VVs were failed because 24 drives were failed. Everything else running on the R5000s stayed up, and cage 1 stayed up. I figured it must have been a really bad power spike, so we proceeded to restart the customer's entire environment, and once both StoreServ cages were power cycled, everything came back, so I didn't really think anything else of the episode.
On Saturday afternoon, after having pulled the EVA out of the rack and installed the StoreServ, we powered the StoreServ back up, and while waiting for it to finish starting, I was in the back of the rack finishing plugging in the power cables for the new DL380 Gen9 servers. Apparently the power browned out again (I didn't notice, but I found out later that one of my onsite deployment team noticed lights dimming in the office), and suddenly cage 0 was all amber again. This time I figured it had to have been me in the back of the cabinet somehow causing a power issue. A quick power cycle of the StoreServ and it was back up.
When the deployment team left the customer site Sunday night around 9 pm, the last thing they did was stick their heads into the computer room and do an amber light scan - it was running fine. When we arrived onsite Monday morning at 7:15, the first thing I did was head to the computer room, only to find the controller cage showing amber lights on all drives. After swearing, I proceeded to reboot the StoreServ, then opened a ticket with HP.
At first they were rather skeptical of my story, and insisted I must not have things installed correctly (despite the fact that I'm a certified 3PAR installer). Eventually they sent out a field tech who verified the configuration, and the ticket got bumped up to level 2 support. The rest of Monday and Tuesday were uneventful.
Then Wednesday at noon, that 100 kVA standby generator spun up to perform its weekly failover test, and at 12:05, the automatic transfer switch flipped over to the generator, causing a 1.5 to 2 second building blackout... And WHAM!!! Cage 0 goes down - but nothing else does (and nothing registers as an issue in the Gen9 iLOs either). At this point, after swearing loud enough that everyone in the customer's office heard me, I start to get wise. I go restart just the StoreServ this time and bring it back online, but I leave all the other hosts and devices offline, knowing that in about 35 more minutes the generator is going to end its self test, and that we'll have another blackout when it fails us back to utility power. Sure enough, 35 minutes later, off goes the power for about 3/4 of a second, and along with it, cage 0.
After getting everything back online and roping in level two support again, I hear back from them that they are sending me another field engineer (whom I happen to have worked with in the past) to check out the R5000s and to double check my work again. Then I find out the case has been pulled from level two and has now been given to engineering (it was also referred to as level 4 support a couple of times)... When the field engineer contacted me, I advised him the end customer was rather unhappy and that we needed to take steps to fix this. Based on what I had seen, I was pretty confident the issue was not with the R5000s, since everything else stayed up (including the StoreServ controllers - it was only the drives in cage 0 that were going offline), so I requested he bring a pair of 764W power supplies for cage 0 with him.
Once he arrived and finished double checking my installation work (and tracing power cables - not very hard with a 2' power cable running from each cage straight into the UPS immediately below it), we set out to replace the power supplies.
When the FE pulled the first power supply and set it on the desk next to the ones he had brought so he could remove the battery, the first thing I noticed was the writing on the power supply that had shipped with the StoreServ versus the one the FE brought. First, it had a different revision number on it (202 versus 201); second, it had a different spare part number on it (727286-001 versus 683239-001); and third, it had the words "Gold Series" written in yellow on the handle under the words 764W PCM.
Now I've been around the industry in both pre-sales and post-sales tech support long enough to recognize that when something power-related is labeled gold or platinum, it carries an Energy Star efficiency rating, which is not always for the better. I've also been around the StoreServ 7200 long enough to know that every other StoreServ 7000 I had installed did not come with Gold Series power supplies. And when we checked cage 1, the 560W power supplies did not have Gold on them either. I then proceeded to double check my install photo archive of the last couple of 7200s I had installed (all on R5000s) and found not one of them had Gold Series power supplies.
I then proceeded to call a couple of my customer sites to verify the spare part number of the power supplies in their 7200s, and every other 7200 had a 764W PCM spare part number of 683239-001, not 727286-001 like these. The FE and I finished swapping out the two Gold power supplies with the two "non-Gold" power supplies he had brought with him, and then we waited for our test window at noon (we had arranged with the customer to take the entire company down from 12 to 1 pm to run a generator self test again).
At 12, I shut down and powered off all the customer's VMware hosts. Then we went to the generator control panel and initiated a self test. After waiting the 60 seconds for the generator to warm up, the automatic transfer switch flipped over and we experienced a 1.5 to 2 second blackout - just like the day before, but this time the drives in cage 0 stayed online. We then terminated the generator self test, and 2 minutes later we got another blackout lasting about 3/4 of a second when the switch back to utility power occurred. Again cage 0 stayed online. We performed this cycle one more time, and during each blackout, cage 0 stayed online, so right now I'm concluding that there is a major issue with these Gold power supplies and that we fixed this customer's issue by swapping them out for the older, non-Energy Star rated power supplies.
In the end, we were able to correlate the brownouts and blackouts in the UPSes' logs with the times that the StoreServ drives went offline the last 3 times (logging hadn't been enabled on the UPSes at the time of the first two events - we had previously planned to do that the afternoon that the 3rd outage occurred).
As a follow up on why the FE brought the older version power supplies: his spare parts list for the 7200 was a few months old and did not contain these new power supplies. The spare parts list I downloaded from CSN to compare with was about 1 month old, and it has both power supplies listed. Further to that, when I spec'd this 7200 in early September in SBW, there was no option for power supplies, so I was rather surprised when I saw there were now two different power supplies.
Sorry for the length of this post, but hopefully this long tale will save someone else's bacon while we wait on HP to either recall these new Gold power supplies or adjust their sensitivity via firmware.
dcc