I already have tickets in with HP for these issues, but figured I'd reach out to this community just in case anyone has some insight.
Our V400 was upgraded to 3.1.3 MU2 on Wednesday evening. Later that night, our monitoring software alerted that the RCIP ports were no longer available. I started up a CLI session and determined that the ports were up and Remote Copy was transferring properly. I could not, however, ping the Remote Copy ports. Now, I don't know if I was ever able to ping them, but it seemed strange to me.
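For anyone who wants to check the same thing, the verification I did from the CLI was roughly the following (commands from memory, so double-check the flags against the CLI help on your release):

```shell
# Show the RCIP ports, their state, and the configured IPs/gateways
showport -rcip

# Show Remote Copy status: targets, link state, and group sync status
showrcopy
```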
Logging into the IMC, I went to Systems --> Ports --> Remote Copy. Again, everything looked fine in terms of IPs, gateways, etc. I then used the ping function from node 0's RCIP port to ping the gateway: success. Its replication partner's node 0 RCIP port: success. Then I tried pinging my computer's IP. This is where it got fun. Immediately, my IMC froze up, and my CLI sessions shut down. I killed the IMC and tried to log back in, but the 3PAR array wasn't reachable. Pinged the management port... nothing. I asked a coworker, and he could get to the array fine. My PC, however, was completely locked out from communicating with the array. Thinking maybe something had gone wonky with my network adapter, I reset it, flushed my ARP cache, and tried again... no dice.
I logged onto a server where I keep a bunch of management tools, and I was able to access the array. Now everything looked good, except Remote Copy link 0 (node 0 here to node 0 remote) was down. Now's a good time to mention I was running a constant ping to the management IP from my desktop, with 100% packet loss. I disabled RCIP port 0 on the 3PAR, thinking I'd reset it. Instantly, my computer could ping the management port again (which is on an entirely different VLAN from RCIP). Sure enough, I was able to log into the IMC and CLI. I re-enabled port 0, Remote Copy indicated the link was up, data began transferring, and I was still able to reach the 3PAR from my PC.
Figuring it might have been a fluke or some routing issue, I tried the same thing, but this time pinging my management server (which is on the same VLAN as the 3PAR management port) from the Remote Copy port. Same thing! My constant ping to the 3PAR immediately dropped, and resetting the RCIP port restored my access.
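For what it's worth, the array-side ping that the IMC performs can also be run from the CLI. A rough sketch; the exact controlport syntax may vary by release, and the IP and node:slot:port below are placeholders, not my real values:

```shell
# Ping an address from a specific RCIP port (node:slot:port);
# 192.0.2.50 and 0:3:1 are example placeholders only
controlport rcip ping 192.0.2.50 0:3:1
```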
Okay, that's issue number 1, and at least Remote Copy keeps working despite the ports not responding to pings (except when performed from the core). Issue number 2 is even worse: despite a CPG having plenty of room (>25TB), a volume was unable to allocate space.
Event id:     7252821
Node:         1
Cust Alert:   Yes
Svc Alert:    Yes
Severity:     Critical
Event type:   TP VV allocation failure
Alert ID:     554
Msg ID:       270007
Component:    Virtual Volume 31556 3PAR-Volume-Name CPG 11 CPG_Name
Short Dsc:    TP VV 3PAR-Volume-Name allocation failure
Event String: Thin provisioned VV 3PAR-Volume-Name unable to allocate SD space from CPG CPG_Name
This directly impacted the underlying hosts. Luckily, our VMware admin was in the middle of a Storage vMotion, so no data was lost, but it could have been. I used DO (Dynamic Optimization) on the affected volume to tune it back into the same CPG. That grew the free space on the CPG, essentially bypassing the automatic SD space allocation process, and the VV was then able to grow as required... for a while. Once that free space was consumed, the error reappeared, and I had to DO the volume again to create wiggle room.
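In case it helps anyone hitting the same alert, the workaround looked roughly like this. The VV/CPG names are the placeholders from the event above, and you should verify the tunevv syntax on your release before running it:

```shell
# Check the free/estimated space the CPG reports
showspace -cpg CPG_Name

# DO (Dynamic Optimization): re-tune the TPVV's user space back into
# the same CPG, which freed enough space for the VV to keep growing
tunevv usr_cpg CPG_Name 3PAR-Volume-Name
```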
So yeah, fun times. Anyone have any ideas or have an array on 3.1.3 MU2 that they can check for the same RCIP port behavior?
Thanks, Adam
::edit::
Got an update from HP on the CPG issue.
L2 investigation yields that we currently have a SW glitch that is preventing thin-provisioned volumes from grabbing more space from the CPG occasionally, even though there is free space available. The issue is under investigation.
Options at this time are:
1) convert your volumes to fully provisioned, which might not be feasible for you
2) upgrade your current OS 3.1.3MU2 to 3.2.1MU2, as we have not seen the issue in the 3.2.1 code base
There's no way we could accommodate fully provisioning our VVs and we, as an organization, try to steer clear of the bleeding edge, so neither option is really suitable for us. Thanks HP!