Failover behaviour vs. ESXi APDs

mlu · **Posted:** Fri Jul 14, 2017 12:48 pm

Hello all,

first of all, thanks for having this forum here.
It already helped a lot over the time. 8-)

We are currently testing a new 8440 "Cluster" for VMware only.
While doing acceptance tests, we encountered some behaviour we think is quite strange, and which is "not amusing" for the customer.

Setup is as following:

System_A (primary) ---> RCG ---> System_B (secondary)
LUNs in RCG are presented to Host_A from both System_A and System_B.

When simulating a power outage (by pulling all 3par power cords) of the secondary 3par system (that is: the system which is the target for the remote copy group), the ESXi(s) which are only having standby paths to that very system (replicated r/o LUNs are being presented) are experiencing an APD (all paths down) for the LUNs which are on the primary array.

The APDs duration is quite short (1-2s while there is no I/O, maybe longer if there is more I/O) but VMs running off that LUN(s) experience a freeze of I/O for about 15s.

The main question is - is this normal?
Works as designed? Someone seeing the same phenomenon?

To clarify things here:
LUNs which are active on the powered off array and have to be switched over (failovered) to the other site (read: SystemA "breaks" RCG, makes Copy-LUNs active, and accepts IO) are usually _not_ experiencing this delay.
It is the LUNs that are active on the primary array - which is _not_ powered off - which are having this problem.

I just can not wrap my head around this.
HPE Case is in the works, same with VMware.

Best practice guide(s) have been followed, with exception of one "recommendation" (dynamic FC-ID assignment should be "off")

Specs:

Code:

3Par Arrays:   2x 8440 Flash Only
3parOS:      3.2.2 (MU4)
HostPersona:   11 / SATP rule for round robin and iops=1 in place on esx

ESXi:      5.5u3a & 6.0.0u3
HBAs:      Qlogic & Emulex (CNA & "classic" HBAs) (rec. FW lvl)

SAN:         Brocade G620 FOS 8.0.1b (1st fabric) & 5100 FOS 7.4.1d (2nd fabric)
Distance:      between primary/secondary: 10km, 8x DarkFibre

Thankful for every input.

TIA & Cheers,
mlu

koopa79 · **Joined:** Mon Apr 14, 2014 9:07 am **Posts:** 189

So i've not tested like you describe, but i can imagine if you remove power from the secondary array due to the fact you had to have these in sync rep to allow this to work, that when the power is removed, then all i/o on the primary should have a slight stun as anything in sync mode has to receive the ack from the secondary array before writing the next i/o to the primary array. It will take a short time for it decide to give up on sync rep to teh secondary array and create a snapshot on the primary to store the changes which it would sync back to the secondary when it comes back live.

I wouldn't think it should show all paths down on the primary array when this happens.

Obviously i have to ask the simple questions like, do you have WWN on both LUN's and are they presented to the Hosts as the same LUN number.
Also if you do a showrcopy -d can you see the group has auto_recover,auto_failover,path_management

Maybe someone else will provide you more insight than this

mlu · **Posted:** Wed Jul 19, 2017 5:18 am

Quote:

It will take a short time [...] create a snapshot on the primary [...]

I thought so too.
Can not find an official statement by HPE on that.
This is why others experiences would be so important for us right now

We never experience a PDL (APD >140s with default esxi settings)
- so a definite answer like "APD is 'ok' for 1-5s, everything is fine." would be ok too I guess.

Personally I don't think it's ok at all.
From a FC/SAN perspective the information of the secondary storage being offline is available to both the primary storage and the ESXi-Host (at the same site as the primary storage) at the very same time.
Allowing the primary storage then some time to break the RCG and create a snapshot (it us just redirect-on-write, isn't it?) is ok, yes. Lets say its very lazy today and needs 3s for that (very long!) - I would then still assume (from a hosts perspective) to not being impacted by that.. Why?
Because the storage-system itself already knows that it will contine to be able to write this vvol in the near future and it is okay to tell the requesting host that the paths which it is presenting to the host are in fact still _active/optimized_ paths.
I am even not going as far to say it should never pause to accept I/O requests (actually I do think it should never pause.. but getting less and less demanding

), but it should at least answer to host inquiries about path availability in a positive way.

This is not happening as I can see SCSI sense codes of "0x2 0x4 0xa" for all active paths.
AFAIK this translates to
A) "NOT READY", and
B) is usually only used in ALUA when there is an Active Path Transistion.

But turning off the _secondary_ should never lead to this answer being sent by the _primary_.

Would love to being disproved here

VMware L2 Engineer says this "not ready" sense code can be the fault of the used FC HBA (driver, firmware) too.
- honestly I can not (want not) believe that.

Any take on that?

mlu · **Posted:** Wed Jul 19, 2017 6:17 am

koopa79 wrote:

Obviously i have to ask the simple questions like, do you have WWN on both LUN's and are they presented to the Hosts as the same LUN number.
Also if you do a showrcopy -d can you see the group has auto_recover,auto_failover,path_management

Yes, same host LUN id (number) and same LUN WWN.
Yes, group has auto_failover & path_management set.

No, group does not have "auto_recover" set.
(which to my understanding only takes care about automatically resyncing when failed storage comes back online)

koopa79 · **Joined:** Mon Apr 14, 2014 9:07 am **Posts:** 189

mlu wrote:

No, group does not have "auto_recover" set.
(which to my understanding only takes care about automatically resyncing when failed storage comes back online)

Yes i believe you are right, this starts the sync working again, and from my understanding of peer persistence you need this to sync the secondary array back up or visa versa.

Let us know what HP come back with, but i should thinka slight delay is normal.

mlu · **Posted:** Mon Aug 07, 2017 1:58 am

Bumping this as neither HPE nor VMware having a clue whats happening here.
(still unsolved, despite escalated support cases and being a "standard functionality")

If any Partner/Consultant with a lab or a new installation in testing stage is reading this -
it would be very interesting to get the snippet (anonymized) of your ESXi log from the time during failover/"killing" of secondary storage (replication target).

Again, thankful for every input.
(we are out of ideas)

Morbid · **Joined:** Tue Jun 13, 2017 2:57 am **Posts:** 19

We will install 2*8400 with Peer Persistence and esx cluster in september, I will give you my feedback about this issue

koopa79 · **Joined:** Mon Apr 14, 2014 9:07 am **Posts:** 189

does it stay with APD when power is removed, as peer persistence doesn't automatically kick, it has to allow a small amount of time to see if the system comes back up on its own / network issue. So i have seen loss of paths before, but it sorts itself very quickly.

I assume you have a witness in a 3rd site? which is connected to both 3pars?

mlu · **Posted:** Mon Aug 07, 2017 11:24 am

koopa79 wrote:

does it stay with APD when power is removed, as peer persistence doesn't automatically kick,

To my understanding, Peer Persistence should only kick in if the _primary_ paths go down.
Failover of primary to secondary works perfectly fine.
APD only occurs when we kill the secondary storage. :roll:

koopa79 wrote:

it has to allow a small amount of time to see if the system comes back up on its own / network issue. So i have seen loss of paths before, but it sorts itself very quickly.

Fair enough, i have to say the APD situation isn't that long, and never became a PDL.
It is just some seconds..but then again the storage is doing nothing. (1 CPG, 1 VVOL, 1 RCG, 1 ESXi Host, 1 VM, ...)

koopa79 wrote:

I assume you have a witness in a 3rd site? which is connected to both 3pars?

Yes, witness in 3rd site, RTT <2ms.

koopa79 · **Joined:** Mon Apr 14, 2014 9:07 am **Posts:** 189

sorry i should have paid attention when you said teh secondary system.

Yes that seems very odd that losing the secondary paths causing all APD to the primary storage. Have you tried logging a call with vmware as well? just incase its something they can see their end?

Thanks

HPE Storage Users Group

Failover behaviour vs. ESXi APDs

Who is online