We have a 2-node 7400 running 3.2.2.612-MU4 that was acquired "refurbished" from HP just about a year ago. Recently node 1 failed, as in stone cold dead. Maybe there was a single orange light, I don't really recall.
For various reasons (long story) we are not able to contact HP support about this, however we purchased a replacement controller and I followed the documented procedure for replacing the failed node. The new node booted up apparently just fine - it got to a solid green light, however the node rescue process did not automatically start it just sat there for several days doing nothing.
Once it was clear it wasn't going to do anything on it's own, I signed into node 0 and ran
Code:
startnoderescue -node 1
It ran the process up until
Code:
Kernel on node 1 has started. Waiting to retrieve install details.
and then timed out with error message
Code:
Node 1 rescue failed due to being unable to retrieve installation details over TCP port 80.
I reran this several times - note that the Mgmt Ethernet ports of the two nodes are connected directly into adjacent ports of a switch, which are on the same vlan. Nothing has changed in this respect since it was a functional system.
After several more attempts, with identical results, I replaced the SSD in the new node with the one from the failed node. It now gets passed the previous failure but still fails to rejoin the cluster:
Code:
Detailed status:
2017-10-05 11:01:49 BST Created task.
2017-10-05 11:01:49 BST Updated Running node rescue for node 1 as 0:30220
2017-10-05 11:01:57 BST Updated Using IP 169.254.128.170
2017-10-05 11:01:57 BST Updated Informing system manager to not autoreset node 1.
2017-10-05 11:01:57 BST Updated Resetting node 1.
2017-10-05 11:02:01 BST Updated Attempting to contact node 1 via NEMOE.
2017-10-05 11:02:33 BST Updated Setting boot parameters.
2017-10-05 11:02:54 BST Updated Waiting for node 1 to boot the node rescue kernel.
2017-10-05 11:03:13 BST Updated Kernel on node 1 has started. Waiting for node to retrieve install details.
2017-10-05 11:03:35 BST Updated Node 1 has retrieved the install details. Waiting for file sync to begin.
2017-10-05 11:03:48 BST Updated File sync has begun. Estimated time to complete this step is 5 minutes on a lightly loaded system.
2017-10-05 11:07:30 BST Updated Remote node has completed file sync, and will reboot.
2017-10-05 11:07:30 BST Updated Waiting for node to rejoin cluster.
2017-10-05 11:37:30 BST Error The node has not rejoined the cluster after 30 minutes. The rescue was not successful.
2017-10-05 11:37:30 BST Failed Could not complete task.
At this point in the process the node is all it up with a solid green status light. The Management Console further reports
Code:
Node 0 Failed to establish link to Node 1 from Node 0 link 3
Can anyone see what's gone wrong? Any ideas how to proceed? Thanks.