Channel: SQL Archives - SQL Authority with Pinal Dave
SQL SERVER – Event ID: 1135 – Cluster node ‘NodeName’ was Removed From the Active Failover Cluster Membership

When I work with customers, there are situations where I get a chance to learn something from them. I was recently working on an AlwaysOn availability group engagement and got some interesting information from a customer, which I am sharing here. In this blog post, we will learn how to solve Event ID 1135 – Cluster node ‘NodeName’ was removed from the active failover cluster membership.

Here are two “Critical” errors which you might see in System Event logs:

Event ID: 1135   

Message: Cluster node ‘N2’ was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Event ID: 1177

Message: The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.  Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges.

Based on my knowledge of clustering, Event ID 1135 indicates that heartbeat communication failed between some of the nodes. Most often, the cause is a failed network connection or broken communication among the cluster nodes. Event ID 1177 indicates that the Cluster service shut down because quorum was lost, either due to the loss of network connectivity between some or all nodes in the cluster or due to a failover of the witness disk.
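To see when and where these two events occurred, the System event log can be queried from PowerShell on each node. This is only a sketch: the 7-day window is an arbitrary assumption, so adjust it to match the time frame of your incident.

```powershell
# Read-only query: recent membership-loss (1135) and quorum-loss (1177) events
# from the System log. The 7-day window is an arbitrary choice.
Get-WinEvent -FilterHashtable @{
    LogName   = 'System'
    Id        = 1135, 1177
    StartTime = (Get-Date).AddDays(-7)
} -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, Id, MachineName, Message |
    Format-List
```

The event text recommends the Validate a Configuration wizard. From PowerShell, Test-Cluster -Include "Network" should run just the network validation tests.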

SOLUTION/WORKAROUND

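Of course, your networking team needs to be engaged first to understand the root cause of the network issue. If it is happening randomly and the network team has no clue about it, then here are a few things which a DBA can also do. The first is to relax the cluster heartbeat settings by running the following in PowerShell on any one cluster node: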

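Import-Module FailoverClusters
$cluster = Get-Cluster
# Delay = interval between heartbeats, in milliseconds
# Threshold = number of missed heartbeats tolerated before a node is removed
$cluster.SameSubnetDelay=2000
$cluster.SameSubnetThreshold=10
$cluster.CrossSubnetThreshold=10
$cluster.CrossSubnetDelay=4000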
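It is worth recording the current values before changing them, and confirming the new ones afterwards. The sketch below reads the heartbeat settings and estimates how long a node can stay silent before it is removed, on the assumption that this is roughly the delay multiplied by the threshold. With the values above, that works out to about 20 seconds on the same subnet and 40 seconds across subnets.

```powershell
Import-Module FailoverClusters

# Show the current heartbeat settings (delays are in milliseconds).
$cluster = Get-Cluster
$cluster | Format-List *Subnet*

# Approximate time a node can stay silent before it is removed:
# heartbeat interval x number of missed heartbeats tolerated.
$sameSubnetSeconds  = ($cluster.SameSubnetDelay  * $cluster.SameSubnetThreshold)  / 1000
$crossSubnetSeconds = ($cluster.CrossSubnetDelay * $cluster.CrossSubnetThreshold) / 1000

"Same subnet : $sameSubnetSeconds s"
"Cross subnet: $crossSubnetSeconds s"
```

Keep in mind that a longer tolerance also means the cluster takes longer to react to a node that is really down.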

Along with the cluster settings, one of my clients also told me to disable TCP offloading and a few more properties. According to him, they might cause network delays and intermittent failures. You can run the following commands in CMD (run as administrator) on all nodes.

Netsh int tcp set global chimney=disabled
Netsh int tcp set global rss=disabled
Netsh int tcp set global netdma=disabled
Netsh int tcp set global autotuninglevel=disabled
netsh interface teredo set state disabled
netsh int ipv4 set global taskoffload=disabled
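Before disabling anything, it may help to capture the current state so the change can be reverted later. The commands below only read the settings. On newer Windows Server releases some of these options (for example chimney and netdma) may no longer be available, in which case the corresponding set command is expected to fail.

```powershell
# Run in an elevated prompt on each node; these commands only read settings.
netsh int tcp show global
netsh int ipv4 show global
netsh interface teredo show state
```

Saving this output per node gives you a baseline to compare against after the changes.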

Also, update the NIC drivers, firmware, and teaming software (if any) on all cluster nodes.
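To check whether the nodes are running the same driver versions, something like the following can be run on each node. It is a sketch that assumes the Get-NetAdapter cmdlet is available (Windows Server 2012 or later). It reports driver details only, so firmware and teaming software versions still need to be checked with the vendor's tools.

```powershell
# List each network adapter with its driver provider, version and date.
Get-NetAdapter |
    Sort-Object Name |
    Format-Table Name, InterfaceDescription, DriverProvider, DriverVersion, DriverDate -AutoSize
```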

The above steps have solved the issue for them on several servers, and they gave me permission to blog about it. If the above steps solve the issue for you, please comment and let them know.

Reference: Pinal Dave (https://blog.sqlauthority.com)

First appeared on SQL SERVER – Event ID: 1135 – Cluster node ‘NodeName’ was Removed From the Active Failover Cluster Membership

