Collection Set Failure Troubleshooting

My shop uses a lot of monitoring. In fact, just about every monitoring feature built into SQL Server we have tried, are using, or have plans to use in the near future. Some of the monitoring overlaps in purpose, and some of it steps on each others’ toes. We get useful data, but we’ve also discovered a lot of unexpected errors. One such issue was uncovered this month when we were updating servers to a new cumulative update.

A little background first
We use Collection Sets. A more obscure feature of SQL Server that you may have never even heard of. I admit, I did not remember that it existed before one of my colleagues suggested trying it. Collection sets essentially provide data audits and have helped us track down some anomalies in the past. The collection sets run nearly constant queries across all of our servers, then report back to the CMS to catalog the data. We’ve altered the jobs a bit, most importantly offsetting the collection set’s upload schedules to avoid latency and blocking issues.

Well, after installing a Cumulative Update this month, all of the collection sets broke. Failures started flowing in stating that the data uploads were not processing and our inboxes exploded. We had to figure out what the problem was and then fix over a hundred servers.

Error Message
…Component name: DFT – Upload collection snapshot, Code: -1073450901, Subcomponent: SSIS.Pipeline, Description: “RFS – Read Current Upload Data” failed validation and returned validation status…

Solution
The first step was to determine how to get the collection set jobs working correctly again. We had to stop each collection set, then stop any collection upload jobs that were still running.

CollectionSetStop

That alone wasn’t enough though, we had to then clear the cache files as well. These are stored in a folder on each server running the collection sets. You can find your cache location in the Object Explorer under Management -> Data Collection -> Right-click Properties.

CacheLocation

Deleting the cache seemed unnecessary to me at first, but when I tried skipping it, the collection still failed after restart.
We had to fix a lot of servers, and it would have taken hours to do this manually. So I scripted all the changes. I needed it done quickly because I was planning to get Korean food with a friend on the day of the break.

Scripts
You can do this all through PowerShell using Invoke-SqlCmd, but I find that process is very slow to make an initial connection, limits feedback, and I just needed this done fast.

-- SQL
-- Stop Collection Sets
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Server Activity'

-- Stop Collection Upload jobs - they keep running after collection stops
-- Change job names as necessary
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_3_upload'
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_2_upload'
## POWERSHELL
# Create file with computer names to connect to, or supply names with commas
$Computers = (Get-Content C:\Users\$Env:UserName\Desktop\Computers.txt)
# Provide path where you save your cache files
$Cache = "D$\Cache\CollectionSet"

$StartTime = Get-Date
ForEach ($Computer in $Computers)
{
$BeginTime = Get-Date
(Get-ChildItem "\\$Computer\$Cache" -Recurse) | Remove-Item -Force
"$Computer complete in $(((Get-Date) - $BeginTime).TotalSeconds) seconds."
}
"Total Runtime: $(((Get-Date) - $StartTime).TotalSeconds) seconds."
-- SQL
-- Start Collection Sets
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Server Activity'

The PowerShell code should run quickly as it will delete files from more than one server at a time. I included the computer name and run time  as feedback so you don’t sit wondering how far along the script is. I had no feedback in my original script, so I sat fretting over how long it would take and how long I’d have to wait to go to lunch.

Advertisements

PowerShell: What Computer is that IP?

Oftentimes I receive an alert about a server, but the only information I have to work with is an IP address. Luckily, with the help of PowerShell and Active Directory, I was able to write a script to pull the computer name and some information on what that computer does.

The script is short, but a bit complicated. The first variable acquires the computer name via the IP address.  That alone is enough if you have few servers, but with the number I deal with, oftentimes I need a bit more information.

The second part of the script splits the hostname up, removing any extra information and leaving just the computer name.  Afterwards that value is piped to Get-AdComputer and then formatted as an easy to read table with minimal information. It returns what Organizational Units the computer is in, which should tell you what purpose your mystery IP serves. If you are lucky enough to have a description listed in AD, it will return that as well.

$IP = "192.168.0.0"
$HostName = ([System.Net.DNS]::GetHostByAddress($IP)).HostName
$Hostname.Split('.')[0] | 
  Get-AdComputer | 
    FT Name,Description,DistinguishedName -AutoSize