Hyper-V VM Network Connectivity Troubleshooting

Last week I detailed my problems in creating a Virtual Machine in Hyper-V after not realizing that I had failed to press any key and thus start the boot process. Well, I had another problem with Hyper-V after that. Getting the internet working on my VM turned out to be another lesson in frustration. Worse, there was no good explanation for the problem this time.

Problem: A new VM has no internet connectivity even though a virtual switch was created and has been specified.

nonetworkaccess

Solution: Getting the internet working on my VM was a multistep process, and I can’t really say exactly what fixed it. Here are the steps I tried though:

RESET EVERYTHING!

Sadly, that is the best advice I can give you. If you have created a virtual switch, and the internet isn’t working correctly, select everything, and then uncheck whatever settings you don’t actually want. It sounds screwy, but it worked for me. This forces the VM to reconfigure the settings and resets connectivity.

Supposedly the only important setting on the virtual switch properties would be to ensure that you have Allow management operating system to share this network adapter. That will allow your computer and your VM to both have internet access. When I first set this, however, the PC lost internet while the VM had an incredibly slow connection. Needless to say, that was not good enough. Disabling the option did nothing but revert back to my original problem though.

For good measure, I then checked Enable virtual LAN identification  for management operating system. Nothing special still, but I left it to continue troubleshooting. Later, I would uncheck that feature, but I wanted results first.

SwitchSettings.png

Next I went into the Network Adapter properties and checked Enable virtual LAN identification. This is another setting I would later turn back off.

LanSetting.png

Finally I restarted my PC, restarted the Virtual Machine, and for some reason, I then had consistent internet on both the VM and the PC.

Ultimately, the problem was that features needed to be reset. I’m still not sure specifically which one had to be turned on and off again, but toggling everything and restarting worked well enough for me in this case. I was just tired of fighting with it by the time it was working.

At least now I have a VM running JAVA so it won’t touch my real Operating System.

Advertisements

Failover Cluster Manager Connection Error Fix

A few days ago I encountered a new error with Failover Cluster Manager.  A couple of servers had been rebuilt to upgrade them from Windows Server 2008 to 2012. They were added back to the cluster successfully. However, one of the servers would not open Failover Cluster Manager properly, and tracking down the solution took a long time.

The problem server successfully joined the cluster, but now it would not connect to the cluster using Failover Cluster Manager. If you opened up the application, it didn’t try to automatically connect, and manually connecting with the fully qualified name failed too. Below is the generated error.

failoverclustermanager_wmierror

I love how this error has absolutely no useful information to it. Luckily I was able to track Error 0x80010002 down online.

Research indicated that there was some sort of WMI error on the computer. Rebooting didn’t help anything, and after numerous attempts to correct/rebuild the WMI repository, not much was accomplished. Eventually, the server could connect to the cluster, but that only worked about 30% of the time, and it nearly timed out even when it did succeed! The cluster still never connected automatically.

After further poking around on the internet, I found a few suggested solutions, with my ultimate fix closely following this post. I still had to combine everything together and run scripts all over the cluster before things returned to normal.

First of all, this is a condensed version of the Cluster Query from the TechNet post linked above.

1) Cluster Query


$Nodes = Get-ClusterNode
ForEach ($Node in $Nodes)
{
 If($Node.State -eq "Down")
  { Write-Host "$Node : Node down skipping" }
 Else
 {
  Try
  {
   $Result = (Get-WmiObject -Class "MSCluster_CLUSTER" -NameSpace "root\MSCluster" -Authentication PacketPrivacy -ComputerName $Node -ErrorAction Stop).__SERVER
   Write-Host -ForegroundColor Green "$Node : WMI query succeeded"
  }
  Catch
  {
   Write-Host -ForegroundColor Red "$Node : WMI Query failed" -NoNewline
   Write-Host  " //"$_.Exception.Message
  }
 }
}

Any server that throws an error with the above query needs to have the following scripts ran on it:

2) MOF Parser
This will parse data for the cluster file. 

cd c:\windows\system32\wbem
mofcomp.exe cluswmi.mof

FCM was still not working correctly, so I reset WMI with the following command.

3) Reset WMI Repository


Winmgmt /resetrepository

That will restart the WMI service, so you’ll probably have to try running it multiple times until all the dependent services are stopped. The command shouldn’t take more than a few seconds to process either way though.

After that, the server that failed the Cluster Query (1) was reporting good connections, but FCM still wouldn’t open properly!

I decided to try the two WMI commands (2 & 3) again on the original server that couldn’t connect to FCM. I had already ran those commands there during the initial troubleshooting, so I was starting to think this was a dead end. Still, it couldn’t hurt, so I gave it a shot.

I reopened FCM and voila! Now the cluster was automatically connecting and looking normal.

As a further note, after everything appeared to be working correctly, SQL was having trouble validating connections to each node in the cluster during install, and I had to run commands 2 & 3 on yet another node in the cluster before things worked 100%, even though that node never had a connection error using the Cluster Query (1).

SQL Server Changing Passwords and an SSPI Context Error

The other day I encountered a login error when connecting to a SQL Server. The circumstance seemed strange compared to similar errors described online with many of those seeming rather complicated to find the real solution. Since this server had been around for awhile, it was unlikely that some major Active Directory change would be necessary to resolve the issue.

This SQL Server was part of an Availability Group, and the connection worked fine when connecting using the Server/Instance name, however, when attempting to connect via the Listener, the following error occurred.

Cannot connect to Server/Instance.
The target principal name is incorrect.
Cannot generate SSPI context (Microsoft SQL Server)

Articles online indicated this was an SPN, Kerberos, and/or Active Directory issue, and something needed to be reset, but the only way to know for sure was to continue down a long troubleshooting list. Luckily, the problem was simpler than that, but still very strange.

I had reset the service account passwords the afternoon before this error became apparent. Each service was restarted afterwards to verify the change worked properly and SQL had been successfully connected to. Everything seemed fine from my perspective.

The next day, some users attempted to connect using the Listener and that’s when the errors started. I don’t normally connect via the Listener, so I hadn’t thought to check that, didn’t think it would be necessary.

Troubleshooting the easy solutions first seemed like a good idea, so I decided to try restarting the SQL service, which failed everything to another server in the cluster immediately. The services came online, and now both the instance and Listener could be connected to. OK, well probably sort of solved.

I failed it to a third node in the cluster, everything still worked great. Cool. This was looking even better.

Next I failed it back to the original node.  This time, the SQL Service came online, but not the Listener. Strange, how did it work in the first place? Everything was running on that server before I restarted the service, even if it wasn’t running correctly. I reset the passwords in SQL Configuration Manager, and then restarted the services. Everything worked perfectly now.

In summary, somehow all the services restarted on the server after the password change, but the Listener had a bad password and was not allowing connections. When I attempted to restart the Listener again, it failed until the password was corrected. I still don’t know how this happened, but it’s a good reminder to be especially careful when changing service passwords.  Changing passwords on a cluster can be even more dangerous since you have extra services to update that may not even be running on the server at the time, so verifying everything went smoothly can take a few extra steps.

Tales of when a Log Fails to Shrink in an Availability Group

I received a report that one of my servers had 7% free space on its log drive. Sounded like something fun to resolve. I checked on what was going on and found a log file that was 99% free and a hundred gb in size. While shrinking a log file is not a good practice and I’m not advocating it by any means because it’s just going to grow again and your storage is there specifically to hold logs, this situation was a out of the ordinary and we needed space.

The problem was, this log would not shrink. It was being extremely uncooperative. I took a backup, log backups, multiple shrink attempts, but it wouldn’t budge. The message returned was a big clue though.

<code>The log for database ‘dbname’ cannot be shrunk until all secondaries have moved past the point where the log was added.</code>

As you might have guessed, this server was a SQL Server 2012 instance and in an Always On Availability Group. The database in question could not shrink because it was participating in the AG.

It wasn’t an ideal fix, but by removing the database from the Availability Group, I was able to perform a log shrink to get the size to a more manageable amount. No, I did not truncate it to minimum size, I adjusted it to a reasonable amount based on its normal work. I didn’t want the log to just have to grow again. The shrink worked flawlessly, and with adequate drive space, I attempted to add the database back to the AG via the wizard.

The AG wizard refused to help. The database was encrypted and the AG wizard will not let you add a database if it is encrypted. No explanation why, it just doesn’t like that. You can add an encrypted database to an AG via script though. You can even script the change from the wizard by using a non-encrypted database then changing the database name in the scripted result. The resulting script is exactly what the AG wizard would do, it just cannot execute it automatically.


ALTER AVAILABILITY GROUP AgName
ADD DATABASE DbName;
GO

With free space and an encrypted database safely back in my AG, I was off to new adventures!

SQL Server Troubleshooting: Token-based Login Failure

In continuation of the forced mirroring failover procedure I posted last week, this post describes the another level of pain you may encounter.

After forcibly failing a mirroring session to the secondary server, users were unable to connect to the SQL Server. The SQL Error Log was full of the same error message.

Error_TokenBasedAuthentication

Login failed for user

 

Our SQL Server uses Windows Based Authentication so that was a major hint. The solution was actually incredibly easy. Originally I assumed that an account was locked out or perhaps missing from the mirror server – who knew how long ago everything had been correctly synched.

There are two likely solutions to this issue.

UAC is blocking the connection attempts

I find this to be a less likely scenario, and since changing UAC settings require a server restart, I highly suggest testing the next method first. It will probably solve your problem without restart. If the issue is UAC, the server is probably a recent setup, otherwise I feel you would have noticed this.

Recreate the User Account

More likely, the account in question needs to be dropped and recreated in SQL Server. If the user connects though SQL using a group or service account, you’ll need to track down the appropriate account that they are a member of in Active Directory. Ask your AD administrator to look up the groups which the user is a member of if you don’t have access yourself, or intuitively already know the answer.

Once you have the correct account, you’ll need to take note of the existing permissions via screenshot, scripting the login as a create to statement, or just remembering everything. Delete the existing account in the server-wide Security as well as Database-level Security. Deleting an account in server Security does not cascade to the Database level. In fact, you’ll be warned of this by SQL Server when you attempt the delete.

Now you’ll just need to recreate the account providing the correct permissions and any relevant mappings. The token-based server access validation errors should be solved now.

Fixing Hotkeys in SQL 2012

A colleague was working in SQL Management Studio as usual today when he tried to hide the results pane using Ctrl + R. It didn’t work though. He hit it a few more times. Nothing. Finally, in frustration, he asked if anyone knew what was going on. I had encountered this exact problem awhile back and had to look up a fix, so I rolled over to help. It took a few moments, but after pulling up the menu, I was able to direct him to the proper place to revert the settings. (and not create a custom hotkey like I did when this happened to me originally)

Starting with SQL Server 2012, Microsoft decided to simplify their hotkeys to follow the style of Visual Studio. The problem is, for us SQL folks, we don’t know those hotkeys. More importantly, most of us are stuck in our ways with no desire to change something that works. Just take a look at the new hotkey. It’s ridiculous.

WindowsHideResults

2 keys or 4, your choice

 

Microsoft, in their finite wisdom, realized that some people might not like this change, and kept a simple interface option to revert back to original hot keys. You can find the fix under the Tools menu -> Options.

ToolsOptions

From there, expand Environment, then select Keyboard. At the top of the dialog box you will see a drop down about keyboard mapping scheme. Select Default or hit Reset. Then press OK and you should be good to go.

ToolsOptionsKeyboard

I always get lost in the options, so I blog so I don’t have to remember

 

Now when you press Ctrl + R, it should once again Show/Hide the Results pane. The default hotkey option will change some others as well, probably most notably Execute goes from Ctrl + Shift + E back to F5 and Execution Plan hot keys get simplified. Sadly, if you use multiple computers to run SSMS, you’ll have to make this adjustment on each of them.

Collection Set Failure Troubleshooting

My shop uses a lot of monitoring. In fact, just about every monitoring feature built into SQL Server we have tried, are using, or have plans to use in the near future. Some of the monitoring overlaps in purpose, and some of it steps on each others’ toes. We get useful data, but we’ve also discovered a lot of unexpected errors. One such issue was uncovered this month when we were updating servers to a new cumulative update.

A little background first
We use Collection Sets. A more obscure feature of SQL Server that you may have never even heard of. I admit, I did not remember that it existed before one of my colleagues suggested trying it. Collection sets essentially provide data audits and have helped us track down some anomalies in the past. The collection sets run nearly constant queries across all of our servers, then report back to the CMS to catalog the data. We’ve altered the jobs a bit, most importantly offsetting the collection set’s upload schedules to avoid latency and blocking issues.

Well, after installing a Cumulative Update this month, all of the collection sets broke. Failures started flowing in stating that the data uploads were not processing and our inboxes exploded. We had to figure out what the problem was and then fix over a hundred servers.

Error Message
…Component name: DFT – Upload collection snapshot, Code: -1073450901, Subcomponent: SSIS.Pipeline, Description: “RFS – Read Current Upload Data” failed validation and returned validation status…

Solution
The first step was to determine how to get the collection set jobs working correctly again. We had to stop each collection set, then stop any collection upload jobs that were still running.

CollectionSetStop

That alone wasn’t enough though, we had to then clear the cache files as well. These are stored in a folder on each server running the collection sets. You can find your cache location in the Object Explorer under Management -> Data Collection -> Right-click Properties.

CacheLocation

Deleting the cache seemed unnecessary to me at first, but when I tried skipping it, the collection still failed after restart.
We had to fix a lot of servers, and it would have taken hours to do this manually. So I scripted all the changes. I needed it done quickly because I was planning to get Korean food with a friend on the day of the break.

Scripts
You can do this all through PowerShell using Invoke-SqlCmd, but I find that process is very slow to make an initial connection, limits feedback, and I just needed this done fast.

-- SQL
-- Stop Collection Sets
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Server Activity'

-- Stop Collection Upload jobs - they keep running after collection stops
-- Change job names as necessary
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_3_upload'
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_2_upload'
## POWERSHELL
# Create file with computer names to connect to, or supply names with commas
$Computers = (Get-Content C:\Users\$Env:UserName\Desktop\Computers.txt)
# Provide path where you save your cache files
$Cache = "D$\Cache\CollectionSet"

$StartTime = Get-Date
ForEach ($Computer in $Computers)
{
$BeginTime = Get-Date
(Get-ChildItem "\\$Computer\$Cache" -Recurse) | Remove-Item -Force
"$Computer complete in $(((Get-Date) - $BeginTime).TotalSeconds) seconds."
}
"Total Runtime: $(((Get-Date) - $StartTime).TotalSeconds) seconds."
-- SQL
-- Start Collection Sets
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Server Activity'

The PowerShell code should run quickly as it will delete files from more than one server at a time. I included the computer name and run time  as feedback so you don’t sit wondering how far along the script is. I had no feedback in my original script, so I sat fretting over how long it would take and how long I’d have to wait to go to lunch.

PowerShell Remote Copy Workaround

Recently I ran into an unexpected problem when trying to copy some files remotely via PowerShell. I could have sworn that I have done this exact copy scenario in the past successfully, but this time, the computers just would not talk. The solution was fairly simple, but not at all intuitive.

Task

Copy a file remotely from one computer to another, both drives being local to their respective computers. Network connectivity, proper permissions, etc., exists between the computers. I wanted to do this completely remote, issuing the command from a third computer.

Script #1 (Remote Copy to Remote) – Failed

This should have been a very fast copy. I knew the cmdlet I needed and even the syntax. I wanted to copy a file from one computer to another, connecting to each from my computer.


$From = "\\Server1\C$\MyFile.txt"
$To = "\\Server2\C$"
Copy-Item -Path $From -Destination $To

However, that did not work. Instead I got an error indicating that my path was bad! “Invalid Path” isn’t a lot of help though. I know it’s valid!

RemoteCopyFail

OK, so maybe I had a typo. After quadruple-checking my paths though, I was finally led to believe that it was a double hop permissions issue causing the remote copy to fail. I decided to try connecting directly to one of the computers before attempting the copy.

Script #2 (Local Copy to Remote) – Failed

I used the same script here, the only difference is I no longer need to make a remote connection to the source file, so the $From parameter is no longer a remote call. Other options that I thought would work here were Invoke-Command or EnterPsSession.


$From = "C:\MyFile.txt"
$To = "\\Server2\C$"
Copy-Item -Path $From -Destination $To

However, it still did not work.

LocalCopyFail

The error was different this time, and even when recreating the error, I got confused and started wondering if I had a valid problem. Even though the error now says “Source and destination path did not resolve to the same provider”, it’s still the same problem and the resolution will be the same. My final attempt was a lot better.

Script #3 (Remote Copy to Remote, Fully Qualified) – Success!

After poking around the internet, I learned what the fully qualified name for PowerShell directories should be.  Providing the fully qualified name is required for any remote call. In other words, if your location starts with \\ then you need to add Microsoft.PowerShell.Core\FileSystem:: as a prefix. If you reference a local drive, you shouldn’t prefix the local drive. This worked perfectly, even remoting to both computers.


$From = "Microsoft.PowerShell.Core\FileSystem::\\Server1\C$\MyFile.txt"
$To = "Microsoft.PowerShell.Core\FileSystem::\\Server2\C$"
Copy-Item -Path $From -Destination $To

Hopefully this will save someone an hour of frustration.