node rescue fails

cheese2 · **Joined:** Thu Oct 05, 2017 5:24 am **Posts:** 13

We have a 2-node 7400 running 3.2.2.612-MU4 that was acquired "refurbished" from HP just about a year ago. Recently node 1 failed, as in stone cold dead. Maybe there was a single orange light, I don't really recall.

For various reasons (long story) we are not able to contact HP support about this. However, we purchased a replacement controller and I followed the documented procedure for replacing the failed node. The new node booted up apparently just fine and got to a solid green light, but the node rescue process did not start automatically; it just sat there for several days doing nothing.

Once it was clear it wasn't going to do anything on its own, I signed into node 0 and ran

Code:

startnoderescue -node 1

It ran the process up until

Code:

Kernel on node 1 has started. Waiting to retrieve install details.

and then timed out with error message

Code:

Node 1 rescue failed due to being unable to retrieve installation details over TCP port 80.

I reran this several times - note that the Mgmt Ethernet ports of the two nodes are connected directly into adjacent ports of a switch, which are on the same vlan. Nothing has changed in this respect since it was a functional system.
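
Since the error names TCP port 80, one way to rule out the network is a plain connection test from a machine on the same VLAN. The following is a minimal sketch only: the function name `tcp_port_open` and the address in the usage line are placeholders invented for illustration, and the thread does not say which address actually serves the install details during a rescue.

```python
import socket


def tcp_port_open(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, then we close it.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address (assumption): substitute the address you want to test.
    target = "192.0.2.10"
    state = "reachable" if tcp_port_open(target) else "NOT reachable"
    print(f"{target}:80 is {state}")
```

A failed connection here would point at the switch, VLAN, or a filter in between rather than at the node itself; a successful one only shows that the path is open, not that the rescue will work.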

After several more attempts, with identical results, I replaced the SSD in the new node with the one from the failed node. It now gets past the previous failure but still fails to rejoin the cluster:

Code:

Detailed status:
2017-10-05 11:01:49 BST Created     task.
2017-10-05 11:01:49 BST Updated     Running node rescue for node 1 as 0:30220
2017-10-05 11:01:57 BST Updated     Using IP 169.254.128.170
2017-10-05 11:01:57 BST Updated     Informing system manager to not autoreset node 1.
2017-10-05 11:01:57 BST Updated     Resetting node 1.
2017-10-05 11:02:01 BST Updated     Attempting to contact node 1 via NEMOE.
2017-10-05 11:02:33 BST Updated     Setting boot parameters.
2017-10-05 11:02:54 BST Updated     Waiting for node 1 to boot the node rescue kernel.
2017-10-05 11:03:13 BST Updated     Kernel on node 1 has started.  Waiting for node to retrieve install details.
2017-10-05 11:03:35 BST Updated     Node 1 has retrieved the install details.  Waiting for file sync to begin.
2017-10-05 11:03:48 BST Updated     File sync has begun.  Estimated time to complete this step is 5 minutes on a lightly loaded system.
2017-10-05 11:07:30 BST Updated     Remote node has completed file sync, and will reboot.
2017-10-05 11:07:30 BST Updated     Waiting for node to rejoin cluster.
2017-10-05 11:37:30 BST Error       The node has not rejoined the cluster after 30 minutes.  The rescue was not successful.
2017-10-05 11:37:30 BST Failed      Could not complete task.
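
To see where the time goes in a task log like this, the timestamps can be diffed step by step. This is a minimal sketch, assuming the line layout shown above (date, time, timezone, status, message); the function names `parse_task_log` and `step_durations` are made up for illustration, and the sample lines are copied from the tail of the log.

```python
from datetime import datetime

# Sample lines copied from the end of the task log above.
LOG = """\
2017-10-05 11:03:48 BST Updated     File sync has begun.  Estimated time to complete this step is 5 minutes on a lightly loaded system.
2017-10-05 11:07:30 BST Updated     Remote node has completed file sync, and will reboot.
2017-10-05 11:07:30 BST Updated     Waiting for node to rejoin cluster.
2017-10-05 11:37:30 BST Error       The node has not rejoined the cluster after 30 minutes.  The rescue was not successful.
2017-10-05 11:37:30 BST Failed      Could not complete task.
"""


def parse_task_log(text):
    """Return a list of (timestamp, status, message) for each parsable line."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 4)  # date, time, tz, status, rest of line
        if len(parts) < 5:
            continue  # e.g. the "Detailed status:" header or a blank line
        date, time, _tz, status, message = parts
        try:
            stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # line does not start with a timestamp
        entries.append((stamp, status, message.strip()))
    return entries


def step_durations(entries):
    """Pair each entry's message with the time elapsed until the next entry."""
    return [
        (later[0] - earlier[0], earlier[2])
        for earlier, later in zip(entries, entries[1:])
    ]


if __name__ == "__main__":
    for elapsed, message in step_durations(parse_task_log(LOG)):
        print(f"{elapsed}  {message}")
```

On this log the only long gap is the 30-minute wait for the node to rejoin the cluster; every earlier step finishes within a few minutes.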

At this point in the process the node is all lit up with a solid green status light. The Management Console further reports

Code:

Node 0 Failed to establish link to Node 1 from Node 0 link 3

Can anyone see what's gone wrong? Any ideas how to proceed? Thanks.

JohnMH · **Joined:** Wed Nov 19, 2014 5:14 am **Posts:** 505

Sorry can't help on the node rescue, but if it was a genuine HPE renew then you would receive the same warranty as on a new system.

cheese2 · **Joined:** Thu Oct 05, 2017 5:24 am **Posts:** 13

Yes, I realise it's a strange situation, so I'm going to try to explain why we believe it has no warranty and no support:

Company A acquired the 3PAR on behalf of their wholly owned subsidiary, Company B. Company A then spun off Company C and ownership of Company B was passed to Company C. Company C then spun off Company D and Company B spun off Company E, which was transferred to ownership of Company D, while Company B stayed with Company C. Company D was then acquired by Company F and the combined entity changed its name to Company G. Ownership of the hardware now resides with Company E, having been transferred through this process by means of "all assets" clauses. However, the original acquisition and support arrangement was based on the business relationship between Company A and HP rather than a specific support contract. It isn't clear to us (yet) whether Company F had a similar relationship with HP, whether it still applies to Company G, and if so whether it would also cover assets owned by Company E, via Company F's acquisition of Company D.

Ladies and Gentlemen, I give you... Modern Capitalism.

MammaGutt · **Posted:** Thu Oct 05, 2017 10:01 am


Log a case with the serial number. You shouldn't run into any problems until you want to extend your support or re-issue your license.

As long as there is a trail of the company structure and changes, you should be fine there as well.

My feeling is that your replacement node is either tied to another serial number, hence not joining, or does not have the exact same patch level (there is a thin line between 3.2.2 MU4 with patches and 3.2.2 eMU4 with the same patches).

cheese2 · **Joined:** Thu Oct 05, 2017 5:24 am **Posts:** 13

Does the node rescue process not reimage the new node from the good one? I would have thought that would guarantee it's at the identical patch level. Likewise, swapping in the drive from the failed node would put it back exactly where it was (and the good node certainly hasn't been updated since the failure). My understanding is it can rescue a completely blank disk. The process looks to me to be completing all but the final step. How does the rescued node communicate its status to the good node?

I have no idea where our purchasing department acquired the replacement node - everything is in turmoil there and I can't get a straight answer. It did arrive in a sealed package, however, so at the very least I think it's another refurb.

Because the system was acquired through a business relationship with HP, rather than purchased, there never was a warranty or a support contract for it. Indeed it was issued on the basis of 'no support provided' because we had access, via the business relationship, to HP's internal docs and training. It arrived as a pallet of parts and I did the install and setup myself. That access has now been cut off and afaik, as far as my boss knows, and as far as his boss knows, future support for all our HP/E gear was never discussed at any point in the business transfer process at any level. I am forbidden to talk to HPE support for any of it until this is resolved, the fear being they will happily fix whatever for us and then send us a $10k invoice for the service. It's been this way for months.

cheese2 · **Joined:** Thu Oct 05, 2017 5:24 am **Posts:** 13

Still not getting anywhere with this. I have succeeded in capturing a log of the entire rescue process through the serial port of the node being rescued. From what I can see, everything appears to go smoothly. It ends with a login prompt, identifying itself as 1610528-1 which is exactly what it should be, however node 0 still doesn't see it.

How do the two nodes communicate? Is it over IP? AFAIK there is only the one IP address shared by all the nodes - do they have their own as well? And if so, how do you view/configure that? I know there is some manner of communication through the enclosure - something to do with NEMOE but I don't know what that stands for or how it works.

I've attached the log of the final boot, after the node is successfully reimaged. (It's far too long to post inline) Can anyone see what's gone wrong?

Attachment:

File comment: log of final boot after reimage

boot2.txt [306.62 KiB]

storsnapper · **Joined:** Thu Jul 06, 2017 3:28 pm **Posts:** 41

I had a similar experience refurbing a 7400. Two systems shipped certified from HP, and both arrived with 1 DOA node out of 4. We followed the same replacement procedure and had no luck getting the replacement node online. We even tried the same thing you did, replacing the internal SSD in the new node with the one from the DOA node.

In our case the system looks for the correct node serial number when booting, and since ours didn't match the settings it got kicked out of the cluster. We had to log a ticket for an HPE tech to come onsite and do a remote restore with lab support.

Patrick · **Posted:** Tue Oct 17, 2017 10:39 pm

I've seen this twice before under two different scenarios.

First was a tech not understanding the 7000 series hardware and the cabling required between the controllers and the VSP, i.e. the VSP and both management interfaces need to be on the same network for any possible hope.

The second was around the embedded serial number on the controller. There is a reset process on a controller to reassign the serial number to match that of your dead controller, so that your existing license keys will still work.

With regard to the company A through Z scenario, just open the ticket based on the serial number. If they ask for company info, just provide Company A with your current service address. Afterwards, go through the support portal and reassign the assets to your company name of the month. If the name game is likely to continue long term, possibly look at having a shell corporation for asset holding.

cheese2 · **Joined:** Thu Oct 05, 2017 5:24 am **Posts:** 13

The embedded serial number makes sense. Is there a documented procedure for doing a full factory reset on a node? Presumably they do not ship with a serial number, it is applied during the setup process, otherwise every new system would have this problem out of the box... A bit of googling has revealed the whack commands unset sys_serial and set perm sys_serial=<storage_system_serial_number> but I am reluctant to try these without a proper guide.

I would dearly love to turn this over to HPE support but until our corporate relationship is resolved, contact with HPE by anyone whose job title is less than "Chief <something> Officer" is a fireable offense. It's not like we have multiple data centres stuffed to the gills with HPE gear, on which our business completely depends, or anything like that... :roll:

I also like the idea of the holding company, but unfortunately the people who make these decisions are a continent and at least 12 pay grades away from here. These sorts of issues aren't even on their radar.

adam · **Joined:** Tue Aug 25, 2015 3:01 am **Posts:** 49

I've had a colleague with a 3PAR 7400 that had a node failure. They tried to buy a node off eBay to get it going and ran a node recovery.

Apparently it's not that simple, because when HPE ship a replacement node it's in a clean state.
The nodes from eBay and other sources are not in this 'clean' state and always fail node recoveries.
They ended up paying HPE T&M (time and materials), which resulted in about 4 hours' work to get it all back up and going.

If you do work it out, please share.
