HPE Storage Users Group

A Storage Administrator Community




 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Thu Aug 01, 2013 7:54 am 

Joined: Sun Jul 29, 2012 9:30 am
Posts: 576
Port persistence works great for us; they just rebooted each of our nodes yesterday as they added HBAs to each controller. We also zone all 4 controllers to each host, alternating which node pair is on fabric "A" and which is on fabric "B".

So on a given ESXi host with a dual-port HBA:
port 0 zoned to nodes 0,1
port 1 zoned to nodes 2,3

we alternate that so on next set of hosts we do
port 0 zoned to nodes 2,3
port 1 zoned to nodes 0,1

This means under optimal conditions each host has access to every node, and with persistence they still think they have 4 paths. What I will say is that out of about 80 hosts, usually 2-5 will notice the disruption by posting the datastore redundancy alarm, but the paths are fine; they just notice that very short port failover sometimes.
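
To illustrate, here is a minimal sketch of that alternating zoning layout (the host numbering and helper function are hypothetical; node pairs (0,1) and (2,3) as on a 4-node 7400):

Code:
# Sketch of the alternating zoning scheme described above.
# zoning_for_host() and the host numbering are illustrative only.

NODE_PAIRS = [(0, 1), (2, 3)]  # the two node pairs of a 4-node 7400

def zoning_for_host(host_index):
    """Alternate which node pair each HBA port is zoned to, host by host."""
    flip = host_index % 2  # even hosts: port 0 -> (0,1); odd hosts: port 0 -> (2,3)
    return {"port0": NODE_PAIRS[flip], "port1": NODE_PAIRS[1 - flip]}

for h in range(4):
    print(f"esxi{h:02d}:", zoning_for_host(h))
# esxi00: {'port0': (0, 1), 'port1': (2, 3)}
# esxi01: {'port0': (2, 3), 'port1': (0, 1)}
# ...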


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Wed Aug 07, 2013 12:37 pm 

Joined: Fri Mar 29, 2013 10:10 am
Posts: 5
Here are some questions I asked (and the answers I received) to try to gain a better understanding of how the 4-node 7400 works. The first question is obvious but might help users new to the 3PAR 7400; the second Q&A is relevant to this post.

=======================================================
Q1. What happens in a 7400 2-node system if you lose one controller? I assume everything will continue to work and all data will be accessible because one controller is still working... I presume some sort of ownership transition takes place whereby the physical drives owned by node0 become owned by node1.

A1. Yes everything will continue to run with no interaction required. The idea is that every volume presented to each server is truly active on both controllers, so in the event of a single controller failure you will still have access to the volume via the other controller.
==============================================================

Q2. What happens in a 7400 4-node system if you lose 2 controllers? I assume the array will continue to function.

A2. No, the 7400 is designed to sustain a single controller failure only; what the 4 nodes allow for is greater performance/scalability and persistent cache. Persistent cache guarantees that in the event of a single node failure on a system with 4 nodes or more, the cache will remain online and performance will degrade predictably, i.e. if a four-node 7400 loses one of its controllers, the performance available will be roughly three quarters of that of the full four-node system.
===============================================================
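
To put a number on that degrade, a back-of-the-envelope sketch (the linear (N-1)/N scaling is my assumption based on the answer above, not a published HP formula):

Code:
# Rough surviving performance after node failures, assuming all nodes
# contribute equally ((N-1)/N scaling is an assumption, not HP's figure).

def surviving_fraction(total_nodes, failed_nodes=1):
    remaining = total_nodes - failed_nodes
    if remaining <= 0:
        raise ValueError("no surviving nodes")
    return remaining / total_nodes

print(surviving_fraction(4))  # 0.75 -> three quarters on a 4-node 7400
print(surviving_fraction(2))  # 0.50 -> half on a 2-node system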


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Wed Aug 07, 2013 4:06 pm 

Joined: Wed Aug 07, 2013 3:22 pm
Posts: 254
This is about 3PAR being a closely coupled cluster and the absolute requirement to maintain data integrity throughout failures. The chance of a dual controller failure, whether within or across node pairs, is so small that the system treats it as an edge case and will likely shut itself down to ensure data integrity is maintained. Some of this comes down to the timing between failures, and also to whether it is a hard failure or just an administrator-initiated reboot, in which case it's a controlled process and so wouldn't bring the system down.

Think of it from the array's perspective: while it is recovering from one controller failure, a second occurs. This is not how failures typically happen, and if a third failure were now to occur, all bets are off. It's more likely that something outside the array is causing the failures, such as environmental issues or things being physically unplugged, so the array suspends I/O and shuts the system down until the external issue can be resolved.
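
As a rough illustration of that reasoning, a sketch of the decision as described (this is not actual InForm OS logic; the recovery window and event model are invented):

Code:
# Illustrative decision logic only -- NOT actual InForm OS behaviour.
# RECOVERY_WINDOW_S and the event model are invented for the sketch.

RECOVERY_WINDOW_S = 600  # hypothetical "still recovering" window, seconds

def array_action(events):
    """events: list of (timestamp_s, kind), kind in
    {'controlled_reboot', 'hard_failure'}."""
    hard = sorted(t for t, kind in events if kind == "hard_failure")
    if not hard:
        return "stay online"  # controlled reboots are an orderly process
    # A second hard failure while still recovering from the first points
    # to an external cause, so protect data integrity and stop.
    for a, b in zip(hard, hard[1:]):
        if b - a < RECOVERY_WINDOW_S:
            return "suspend I/O and shut down"
    return "stay online (degraded)"

print(array_action([(0, "controlled_reboot"), (60, "controlled_reboot")]))
# -> stay online
print(array_action([(0, "hard_failure"), (120, "hard_failure")]))
# -> suspend I/O and shut down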

I'm not aware of any single system that will provide this level of availability except in very specific static configurations with multiple mirrors, and even then you'd be hard pressed to get any kind of guarantee beyond a single failure. It's the same with 3PAR: if the answer is "it depends", there's no point setting an incorrect expectation; the worst-case scenario should be presented (3PAR protects against a single point of failure) to ensure there's no ambiguity.

If this is a real problem and you absolutely must protect against multiple node/controller failures, then a second array with replication would be the answer; Peer Persistence, HP CLX, VMware SRM or host-based mirroring could then fully automate the failover to the second array.


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Wed Aug 07, 2013 5:03 pm 

Joined: Tue Jul 30, 2013 9:32 am
Posts: 8
Location: UK
Interesting point. You're right though: if you want to mitigate the risk of multiple node failures, another InServ is the way to go.

So depending on the customer's requirements, two 7200s with the Peer Persistence license sounds like a good option, or even better, two 7400s.


regards
Gareth


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Fri Aug 23, 2013 4:00 am 

Joined: Thu Aug 22, 2013 7:17 am
Posts: 21
What is meant by "major revision upgrade"? On our 10800 system, for an upgrade from 3.1.1 MU2 to 3.1.2 MU1, HP rebooted one node at a time.

Regards,
Mugur


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Fri Aug 23, 2013 3:16 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
By major upgrade, I was referring to when we went from 2.2.4 to 2.3.1 in April 2010. They rebooted all of the odd nodes at one time, which scared the heck out of me, but it was working as intended.

Their health check prior to the upgrade specifically surveyed and flagged any hosts that were "vertically attached", to ensure this vertical reboot of all the evens or all the odds did not cause a host outage. We had to correct that in order to proceed with an online upgrade.

Here is the exact verbiage sent from 3PAR support (John Tolnay) on 4/2/2010

Quote:
3PAR Support has reviewed the host information that was supplied for InServ S/N 1202XXX against the attached Configuration Matrix. From the supplied data we cannot support an online upgrade as the configuration does not match the 3PAR support matrix. We do not recommend an online upgrade for configurations that have not been tested. The successful completion of an online upgrade with an untested configuration cannot be guaranteed.

4. The following hosts are attached to vertical nodes.
These connections are not to 3PAR recommendations and may prevent an online upgrade.
Connections have to be changed to horizontally adjacent node pairs.
ID  Name           WWN               Port
 2  R6KXXXQAA      10000000C9464CXX  5:5:1
 2  R6KXXXQAA      10000000C9464CXX  7:5:1
14  NTFWXXXSQLD1   10000000C93719XX  5:5:2
14  NTFWXXXSQLD1   10000000C93719XX  7:5:2
18  NTFWXXXMISCPD  10000000C92418XX  5:5:2
18  NTFWXXXMISCPD  10000000C92418XX  7:5:2
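
The "vertical" flag above boils down to a parity check on the node numbers (3PAR ports are named node:slot:port, so 5:5:1 and 7:5:1 means nodes 5 and 7, both odd). A sketch of such a check, with one made-up compliant host for contrast:

Code:
# Sketch of the "vertically attached" check the health survey performs.
# Rebooting all odd (or all even) nodes at once drops every path of a
# host whose ports all land on odd (or all even) nodes. GOODHOST is made up.

def is_vertical(ports):
    nodes = {int(p.split(":")[0]) for p in ports}  # node:slot:port -> node
    return len({n % 2 for n in nodes}) == 1  # all odd or all even

hosts = {
    "R6KXXXQAA": ["5:5:1", "7:5:1"],  # nodes 5 and 7: both odd -> vertical
    "GOODHOST": ["4:5:1", "5:5:1"],   # nodes 4 and 5: adjacent pair -> OK
}
for name, ports in hosts.items():
    print(name, "VERTICAL - rezone before online upgrade" if is_vertical(ports) else "OK")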



That said, perhaps they turn something off or on when doing an upgrade to allow 2 nodes to go down without triggering a panicked shutdown. I personally believe, but do not certify or guarantee, that if 2 nodes from different node pairs died, the system would keep on serving.

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Mon Sep 02, 2013 3:55 pm 

Joined: Thu Dec 06, 2012 1:25 pm
Posts: 138
Richard Siemers wrote:
By major upgrade, I was referring to when we went from 2.2.4 to 2.3.1 in April 2010. They rebooted all the of odd nodes at one time, which scared the heck out of me, but was working as intended.


You were scared for a good reason. Rebooting more than one node is not supported, nor ever taught to the CEs. If I saw a CE handle one of my 3PARs like that, I'd have him removed from the premises and he wouldn't be allowed back in the datacenter ever... and that's putting it very nicely...

I've witnessed a double node failure due to stupidity once (nodes 0 and 3 on a V400), and I can tell you first hand the 3PAR WILL panic and WILL go down (same on a V800 by the way, but that is another story).

_________________
The goal is to achieve the best results by following the client's wishes. If they want a house built upside down standing on its chimney, it's up to you to figure out how to do it, while still making it usable.


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Mon Sep 09, 2013 4:03 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
Architect wrote:
Rebooting more than one node is not supported, nor ever taught to the CEs.


Remote "Global Field Deployment Support Engineers" are who do our upgrades, never anyone onsite, and never a CE. This is a part of a email I received yesterday from the remote upgrade team who is reviewing the host audit worksheet I compiled and submitted last week. It clearly states that half the nodes will go offline during the upgrade. I stick by my previous statement that perhaps they change a setting during upgrades that allows this to happen without panic.

Quote:
This will be a multi-hop OS upgrade from 2.3.1.330 to 3.1.2.422.
A Standard OS upgrade from 2.3.1.330 to 3.1.1.448 will be performed first, followed by a Simple OS upgrade from 3.1.1.448 to 3.1.2.422 to complete the process.

For Standard OS upgrade, half of the operating nodes will be individually updated while the other half continues to monitor the system. The upgrade proceeds by loading the new revision of the software on half of the controller nodes in the system. Once the first half of the nodes have successfully loaded the new software, the second half of the controller nodes proceed to reboot and reload the new revision of the software.
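
In pseudocode form, the sequence the email describes looks roughly like this (splitting by node parity to match the odd/even reboots reported earlier in the thread; the function is a sketch, not an HP tool):

Code:
# Sketch of the half-system online update sequence described above.
# The parity split matches the "all odd nodes at once" behaviour
# reported earlier in the thread; names are illustrative.

def half_system_update(nodes, new_version):
    odd = [n for n in nodes if n % 2]
    even = [n for n in nodes if not n % 2]
    for phase, group, watchers in (("phase 1", odd, even), ("phase 2", even, odd)):
        print(f"{phase}: nodes {group} reboot onto {new_version}; "
              f"nodes {watchers} keep serving I/O and monitor the system")

half_system_update([0, 1, 2, 3], "3.1.1.448")
# phase 1: nodes [1, 3] reboot onto 3.1.1.448; nodes [0, 2] keep serving ...
# phase 2: nodes [0, 2] reboot onto 3.1.1.448; nodes [1, 3] keep serving ...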

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Wed Sep 11, 2013 1:02 pm 
Site Admin

Joined: Tue Aug 18, 2009 10:35 pm
Posts: 1328
Location: Dallas, Texas
Also, while in the process of planning an upgrade, support directed me to read the HP 3PAR Operating System Upgrade Pre-Planning Guide:
http://bizsupport1.austin.hp.com/bc/doc ... 660486.pdf

Page 5 describes the difference between an online major upgrade and a maintenance-level upgrade.

Quote:
Half-System Online Update: Used when there is major OS-level change. (For example, updating from HP 3PAR OS 2.3.1 to OS 3.1.1 is a major OS-level change.) Half of the nodes are updated at one time, then the other half of the nodes are updated at one time.

_________________
Richard Siemers
The views and opinions expressed are my own and do not necessarily reflect those of my employer.


 Post subject: Re: How many nodes does 7400 can fail at same time?
PostPosted: Thu Sep 19, 2013 5:20 pm 

Joined: Wed Aug 07, 2013 3:22 pm
Posts: 254
Architect, the half-system online update was a fully supported online update process, but it only applied to InForm OS major updates prior to 3.1.1. From 3.1.1 onward, only single-node restarts with port persistence are used, which also alleviates the dependence on MPIO to handle failover.

As stated above, two nodes going down doesn't necessarily take down the system; it depends on whether it's a controlled reboot/shutdown by an admin, as above, or a hard failure. In terms of failures, timing is important: e.g. if concurrent failures occur within a specified time period, the system will safely shut down (outside influence / environmental issues / stupidity is assumed).

It also depends on which nodes happen to fail, and since there is no way to predict which particular nodes will fail and how far apart (time-wise) those failures will occur, the only safe assumption you can make is that the system will shut down. If you truly want to protect against multiple concurrent failures and guarantee data availability, then you need a BC/DR solution.

It's a bit like saying RAID 10 can survive two failures: yes, it can, under very specific failure conditions, i.e. as long as both members of the same mirrored pair don't fail within the stripe set. However, there's no way to guarantee that order of events; see Murphy's law http://en.wikipedia.org/wiki/Murphy's_law :cry:
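
A quick sketch of that RAID 10 point (the drive-to-pair mapping is illustrative): survival depends entirely on which two drives fail, not on the count.

Code:
# RAID 10 loses data only when both members of the SAME mirrored pair
# fail. The pairing below is illustrative.

def survives(mirror_pairs, failed):
    """True if no mirrored pair has lost both of its members."""
    return all(not pair <= failed for pair in mirror_pairs)

pairs = [frozenset(p) for p in [(0, 1), (2, 3), (4, 5)]]
print(survives(pairs, {0, 2}))  # True: two failures across different pairs
print(survives(pairs, {0, 1}))  # False: both members of one pair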

