Failover Cluster Manager Connection Error Fix

A few days ago I encountered a new error with Failover Cluster Manager.  A couple of servers had been rebuilt to upgrade them from Windows Server 2008 to 2012. They were added back to the cluster successfully. However, one of the servers would not open Failover Cluster Manager properly, and tracking down the solution took a long time.

The problem server successfully joined the cluster, but now it would not connect to the cluster using Failover Cluster Manager. If you opened up the application, it didn’t try to automatically connect, and manually connecting with the fully qualified name failed too. Below is the generated error.


I love how this error has absolutely no useful information to it. Luckily I was able to track Error 0x80010002 down online.

Research indicated that there was some sort of WMI error on the computer. Rebooting didn’t help anything, and after numerous attempts to correct/rebuild the WMI repository, not much was accomplished. Eventually, the server could connect to the cluster, but that only worked about 30% of the time, and it nearly timed out even when it did succeed! The cluster still never connected automatically.

After further poking around on the internet, I found a few suggested solutions, with my ultimate fix closely following this post. I still had to combine everything together and run scripts all over the cluster before things returned to normal.

First of all, this is a condensed version of the Cluster Query from the TechNet post linked above.

1) Cluster Query

$Nodes = Get-ClusterNode
ForEach ($Node in $Nodes)
 If($Node.State -eq "Down")
  { Write-Host "$Node : Node down skipping" }
   $Result = (Get-WmiObject -Class "MSCluster_CLUSTER" -NameSpace "root\MSCluster" -Authentication PacketPrivacy -ComputerName $Node -ErrorAction Stop).__SERVER
   Write-Host -ForegroundColor Green "$Node : WMI query succeeded"
   Write-Host -ForegroundColor Red "$Node : WMI Query failed" -NoNewline
   Write-Host  " //"$_.Exception.Message

Any server that throws an error with the above query needs to have the following scripts ran on it:

2) MOF Parser
This will parse data for the cluster file. 

cd c:\windows\system32\wbem
mofcomp.exe cluswmi.mof

FCM was still not working correctly, so I reset WMI with the following command.

3) Reset WMI Repository

Winmgmt /resetrepository

That will restart the WMI service, so you’ll probably have to try running it multiple times until all the dependent services are stopped. The command shouldn’t take more than a few seconds to process either way though.

After that, the server that failed the Cluster Query (1) was reporting good connections, but FCM still wouldn’t open properly!

I decided to try the two WMI commands (2 & 3) again on the original server that couldn’t connect to FCM. I had already ran those commands there during the initial troubleshooting, so I was starting to think this was a dead end. Still, it couldn’t hurt, so I gave it a shot.

I reopened FCM and voila! Now the cluster was automatically connecting and looking normal.

As a further note, after everything appeared to be working correctly, SQL was having trouble validating connections to each node in the cluster during install, and I had to run commands 2 & 3 on yet another node in the cluster before things worked 100%, even though that node never had a connection error using the Cluster Query (1).


SQL Server Changing Passwords and an SSPI Context Error

The other day I encountered a login error when connecting to a SQL Server. The circumstance seemed strange compared to similar errors described online with many of those seeming rather complicated to find the real solution. Since this server had been around for awhile, it was unlikely that some major Active Directory change would be necessary to resolve the issue.

This SQL Server was part of an Availability Group, and the connection worked fine when connecting using the Server/Instance name, however, when attempting to connect via the Listener, the following error occurred.

Cannot connect to Server/Instance.
The target principal name is incorrect.
Cannot generate SSPI context (Microsoft SQL Server)

Articles online indicated this was an SPN, Kerberos, and/or Active Directory issue, and something needed to be reset, but the only way to know for sure was to continue down a long troubleshooting list. Luckily, the problem was simpler than that, but still very strange.

I had reset the service account passwords the afternoon before this error became apparent. Each service was restarted afterwards to verify the change worked properly and SQL had been successfully connected to. Everything seemed fine from my perspective.

The next day, some users attempted to connect using the Listener and that’s when the errors started. I don’t normally connect via the Listener, so I hadn’t thought to check that, didn’t think it would be necessary.

Troubleshooting the easy solutions first seemed like a good idea, so I decided to try restarting the SQL service, which failed everything to another server in the cluster immediately. The services came online, and now both the instance and Listener could be connected to. OK, well probably sort of solved.

I failed it to a third node in the cluster, everything still worked great. Cool. This was looking even better.

Next I failed it back to the original node.  This time, the SQL Service came online, but not the Listener. Strange, how did it work in the first place? Everything was running on that server before I restarted the service, even if it wasn’t running correctly. I reset the passwords in SQL Configuration Manager, and then restarted the services. Everything worked perfectly now.

In summary, somehow all the services restarted on the server after the password change, but the Listener had a bad password and was not allowing connections. When I attempted to restart the Listener again, it failed until the password was corrected. I still don’t know how this happened, but it’s a good reminder to be especially careful when changing service passwords.  Changing passwords on a cluster can be even more dangerous since you have extra services to update that may not even be running on the server at the time, so verifying everything went smoothly can take a few extra steps.

Getting Cluster Nodes through T-SQL

One of my most frustrating moments is when I need to log into the active node of a cluster. I inevitably choose a passive node first, then wind up opening the Failover Cluster Manager to find the active node then finally log in to the correct server. Well, I finally took some time to get it right by using my trusty Central Management Server and T-SQL.

Running all of these scripts on a CMS will allow you to hit every SQL server in your domain. When I first decided to pull Server data using a script, I immediately thought of @@SERVERNAME. However, this returns the cluster name (for clusters), not the active server node. What I really needed took a minor amount of research.

Another possible choice included sys.dm_os_cluster_nodes, but this had surprising limitations. While it does provide node names, if you try to query all servers using a CMS, the script can fail to merge the result sets, displaying the following error message. In my case, it also failed to return results for one entire cluster.

An error occurred while executing batch. Error message is: The result set could not be merged because the result schema did not match the schema from the first responding server.

The next option was to use SERVERPROPERTY to find the actual server name hosting SQL. When using a CMS, the script will also return the results of @@SERVERNAME as a bonus.

SELECT [ActiveServer] = SERVERPROPERTY('ComputerNamePhysicalNetBIOS')

While this is good enough for some scenarios, I wanted a more targeted answer, so I kept digging. First, I wanted a script that would only show the active nodes in each cluster. I’ll mostly use this query to ensure I’m avoiding the primary server during maintenance and to verify which server is the active node when developers or management start asking questions and want immediate answers. (You know, as opposed to when they aren’t in a hurry…)

To get the exact results I want will require a bit of string parsing. I suggest you review SUBSTRING and CHARINDEX if you aren’t positive on how they work. In short, Substring will display the server name up to where Charindex determines the ‘\’ appears. This will provide a column consisting of only the cluster name. The next column will display the active node. I then filter out all servers other than clusters and exclude default instance names as well.

,[ActiveNode] = SERVERPROPERTY('ComputerNamePhysicalNetBIOS')

This final query returns every node in each cluster, with the last column denoting the active node. Useful when you need to remember which servers belong to what and your documentation has gone missing – or doesn’t exist yet.

,[Nodes] = NodeName
,[IsActiveNode] = CASE WHEN NodeName = SERVERPROPERTY('ComputerNamePhysicalNetBIOS') THEN '1' ELSE '' END
FROM sys.dm_os_cluster_nodes

Summary I shared these scripts with my colleagues, who were very excited about not having to play “Guess the Active Node” anymore. There might be more eloquent ways to determine cluster nodes, but this method is perfect for my environment. I suggest saving the query to a code repository project in Solution Explorer or as a Snippet so that it’s always easily accessible.

I’m participating in the many SQL Blog challenges this month, and I’d like to thank each of them for the push and community support they provide.