Collection Set Failure Troubleshooting

My shop uses a lot of monitoring. In fact, just about every monitoring feature built into SQL Server we have tried, are using, or have plans to use in the near future. Some of the monitoring overlaps in purpose, and some of it steps on each others’ toes. We get useful data, but we’ve also discovered a lot of unexpected errors. One such issue was uncovered this month when we were updating servers to a new cumulative update.

A little background first
We use Collection Sets. A more obscure feature of SQL Server that you may have never even heard of. I admit, I did not remember that it existed before one of my colleagues suggested trying it. Collection sets essentially provide data audits and have helped us track down some anomalies in the past. The collection sets run nearly constant queries across all of our servers, then report back to the CMS to catalog the data. We’ve altered the jobs a bit, most importantly offsetting the collection set’s upload schedules to avoid latency and blocking issues.

Well, after installing a Cumulative Update this month, all of the collection sets broke. Failures started flowing in stating that the data uploads were not processing and our inboxes exploded. We had to figure out what the problem was and then fix over a hundred servers.

Error Message
…Component name: DFT – Upload collection snapshot, Code: -1073450901, Subcomponent: SSIS.Pipeline, Description: “RFS – Read Current Upload Data” failed validation and returned validation status…

Solution
The first step was to determine how to get the collection set jobs working correctly again. We had to stop each collection set, then stop any collection upload jobs that were still running.

CollectionSetStop

That alone wasn’t enough though, we had to then clear the cache files as well. These are stored in a folder on each server running the collection sets. You can find your cache location in the Object Explorer under Management -> Data Collection -> Right-click Properties.

CacheLocation

Deleting the cache seemed unnecessary to me at first, but when I tried skipping it, the collection still failed after restart.
We had to fix a lot of servers, and it would have taken hours to do this manually. So I scripted all the changes. I needed it done quickly because I was planning to get Korean food with a friend on the day of the break.

Scripts
You can do this all through PowerShell using Invoke-SqlCmd, but I find that process is very slow to make an initial connection, limits feedback, and I just needed this done fast.

-- SQL
-- Stop Collection Sets
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_stop_collection_set @name = 'Server Activity'

-- Stop Collection Upload jobs - they keep running after collection stops
-- Change job names as necessary
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_3_upload'
EXEC msdb.dbo.sp_stop_job @job_name = 'collection_set_2_upload'
## POWERSHELL
# Create file with computer names to connect to, or supply names with commas
$Computers = (Get-Content C:\Users\$Env:UserName\Desktop\Computers.txt)
# Provide path where you save your cache files
$Cache = "D$\Cache\CollectionSet"

$StartTime = Get-Date
ForEach ($Computer in $Computers)
{
$BeginTime = Get-Date
(Get-ChildItem "\\$Computer\$Cache" -Recurse) | Remove-Item -Force
"$Computer complete in $(((Get-Date) - $BeginTime).TotalSeconds) seconds."
}
"Total Runtime: $(((Get-Date) - $StartTime).TotalSeconds) seconds."
-- SQL
-- Start Collection Sets
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Query Statistics'
EXEC msdb.dbo.sp_syscollector_start_collection_set @name = 'Server Activity'

The PowerShell code should run quickly as it will delete files from more than one server at a time. I included the computer name and run time  as feedback so you don’t sit wondering how far along the script is. I had no feedback in my original script, so I sat fretting over how long it would take and how long I’d have to wait to go to lunch.

Advertisements

Database Owner Verification

There’s often low hanging fruit once you start looking into servers, or you just need to verify settings are correct.  Recently I had to verify that all databases had the correct owner for auditing purposes. There are many scripts out there to do this, but I wanted to make a few exceptions, and I didn’t need a lot of extra fluff for this specific task either. I also had a plan to convert this to a policy afterwards to ensure settings were not changed later.

I used two different scripts, one for 2012 servers that might be utilizing Availability Groups and one for 2008 servers that might be using Mirroring. I wanted to allow exceptions for databases that were mirrors or secondaries, because without failing them over, I could not update the owner. These will be fixed on the next scheduled failover, but until then, I don’t want the exceptions mucking up my results.

The WHERE clause excludes my allowed owner while also verifying that I don’t have any databases where the ownership has been broken and listed as NULL.  Server roles for AGs and Mirrors are included in the result set so that when I want everything, I can exclude the WHERE clause and quickly see what’s what.

SQL 2012 (Availability Groups)

SELECT
d.name
,[Owner] = suser_sname(owner_sid)
,s.role_desc
FROM sys.databases d
,msdb.sys.dm_hadr_availability_replica_states s
WHERE (suser_name(owner_sid) <> 'SA' OR SUSER_NAME(owner_sid) IS NULL)
AND s.role_desc = 'Primary'

SQL 2008 (Mirrors)

Same as the above query, but while that query above will error when you try to run it on an older version of SQL than 2012 because of the HADR view, this one works while determining if any of your servers have the wrong database owner. Just exclude the AND statement to show the mirrors.


SELECT
d.name
,[Owner] = suser_sname(owner_sid)
,m.mirroring_role_desc
,d.state_desc
FROM sys.databases d
LEFT JOIN sys.database_mirroring m ON m.database_id = d.database_id
WHERE
(suser_name(owner_sid) <> 'SA' OR SUSER_NAME(owner_sid) IS NULL)
AND (m.mirroring_role_desc <> 'mirror' OR m.mirroring_role_desc IS NULL)

Afterwards, I used Policy Based Management to create a policy checking for DB ownership.  I chose the simple way of creating a Database Facet checking that @Owner = 'SA'. (Rename your account, but for demonstration purposes this is the most obvious). The policy will check every database, but automatically skip mirrors/secondaries because it cannot read those databases. The script remains useful for checking the databases that the policy cannot read.

Querying Memory Settings

SQL settings should be verified on a regular basis, and I’m occasionally asked to produce a list of memory configuration settings on all servers for management. It’s a simple process to check this information using Management Studio, but scripting is a much better solution, especially if you need to verify the settings on a large number of servers.

You can verify memory settings quickly for a single system by right-clicking on the server in Object Explorer then selecting Memory on the left hand side.

ServerProperties-Memory

SQL defaults to have access to all the memory on your server (2,147,483,647MB). Best practices suggest a variety of configurations, from 10-20% reserved for the OS, to a minimum of 4GB reserved, to 1GB reserved for every 4-8GB of RAM available. The most important thing to do is leave some space reserved for the OS. Leaving the default memory setting will allow SQL to eat all the available RAM, potentially leaving the server unresponsive as windows may eventually be completely memory starved.

I prefer to verify memory settings against all my servers at once. Not only is it faster, it’s easier to provide a report this way. Open a New Query from your Central Management Server targeting all of your servers for fastest results. Querying sys.configurations and sys.dm_os_sys_memory will provide the information you need, but the results leave a lot to be desired. By pivoting the data and performing a bit of formatting, you can create a report that is much cleaner and appealing.

SET NOCOUNT ON
GO
SELECT
[TotalRamGB] = Ram/1024
,[MinRamGB] = [Minimum size of server memory (MB)]/1024
,[MaxRamGB] = [Maximum size of server memory (MB)]/1024
,[RamLeftForWindowsGB] =
CASE WHEN [Maximum size of server memory (MB)]/1024 = 2097151 THEN 0
ELSE  Ram/1024 - [Maximum size of server memory (MB)]/1024
END
FROM (
SELECT
[Descr] = [Description]
,[Val] = CONVERT(BIGINT,value)
,[Ram] = total_physical_memory_kb/1024
FROM sys.configurations c
,sys.dm_os_sys_memory m
WHERE c.name IN ('min server memory (MB)','max server memory (MB)')
) a
PIVOT
(
MIN(Val) FOR Descr IN ([Minimum size of server memory (MB)],[Maximum size of server memory (MB)])
) b

This script converts all the memory settings to GB, but if you need to work in MB, just remove the GB conversions. The last column indicates how much memory is reserved for the OS. A value of 0 here indicates that SQL’s max memory setting has been left at default, and thus has free reign to steal all the memory. These servers should be addressed as soon as possible. Again, the amount of space you leave available will vary, but I suggest at least 4GB. If your server has such a small amount of RAM that this is a substantial portion of your total RAM, it’s probably time to ask for more RAM too. It’s one of the cheapest and best upgrades possible.

Getting Cluster Nodes through T-SQL

One of my most frustrating moments is when I need to log into the active node of a cluster. I inevitably choose a passive node first, then wind up opening the Failover Cluster Manager to find the active node then finally log in to the correct server. Well, I finally took some time to get it right by using my trusty Central Management Server and T-SQL.

Running all of these scripts on a CMS will allow you to hit every SQL server in your domain. When I first decided to pull Server data using a script, I immediately thought of @@SERVERNAME. However, this returns the cluster name (for clusters), not the active server node. What I really needed took a minor amount of research.

Another possible choice included sys.dm_os_cluster_nodes, but this had surprising limitations. While it does provide node names, if you try to query all servers using a CMS, the script can fail to merge the result sets, displaying the following error message. In my case, it also failed to return results for one entire cluster.

An error occurred while executing batch. Error message is: The result set could not be merged because the result schema did not match the schema from the first responding server.

The next option was to use SERVERPROPERTY to find the actual server name hosting SQL. When using a CMS, the script will also return the results of @@SERVERNAME as a bonus.

SELECT [ActiveServer] = SERVERPROPERTY('ComputerNamePhysicalNetBIOS')

While this is good enough for some scenarios, I wanted a more targeted answer, so I kept digging. First, I wanted a script that would only show the active nodes in each cluster. I’ll mostly use this query to ensure I’m avoiding the primary server during maintenance and to verify which server is the active node when developers or management start asking questions and want immediate answers. (You know, as opposed to when they aren’t in a hurry…)

To get the exact results I want will require a bit of string parsing. I suggest you review SUBSTRING and CHARINDEX if you aren’t positive on how they work. In short, Substring will display the server name up to where Charindex determines the ‘\’ appears. This will provide a column consisting of only the cluster name. The next column will display the active node. I then filter out all servers other than clusters and exclude default instance names as well.

SELECT
[ClusterName] = SUBSTRING(@@SERVERNAME,0,CHARINDEX('\',@@SERVERNAME))
,[ActiveNode] = SERVERPROPERTY('ComputerNamePhysicalNetBIOS')
WHERE SERVERPROPERTY('ComputerNamePhysicalNetBIOS') <> SUBSTRING(@@SERVERNAME,0,CHARINDEX('\',@@SERVERNAME))
AND @@SERVERNAME <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS')

This final query returns every node in each cluster, with the last column denoting the active node. Useful when you need to remember which servers belong to what and your documentation has gone missing – or doesn’t exist yet.

SELECT
[ClusterName] = SUBSTRING(@@SERVERNAME,0,CHARINDEX('\',@@SERVERNAME))
,[Nodes] = NodeName
,[IsActiveNode] = CASE WHEN NodeName = SERVERPROPERTY('ComputerNamePhysicalNetBIOS') THEN '1' ELSE '' END
FROM sys.dm_os_cluster_nodes
WHERE SERVERPROPERTY('ComputerNamePhysicalNetBIOS') <> SUBSTRING(@@SERVERNAME,0,CHARINDEX('\',@@SERVERNAME))
AND @@SERVERNAME <> SERVERPROPERTY('ComputerNamePhysicalNetBIOS')

Summary I shared these scripts with my colleagues, who were very excited about not having to play “Guess the Active Node” anymore. There might be more eloquent ways to determine cluster nodes, but this method is perfect for my environment. I suggest saving the query to a code repository project in Solution Explorer or as a Snippet so that it’s always easily accessible.

I’m participating in the many SQL Blog challenges this month, and I’d like to thank each of them for the push and community support they provide.

Multi-Server Job Scripting Failure

Central Management Servers are probably the single most important thing in my SQL life right now. I manage a lot of servers, so having an organized way to keep track of them is paramount. Multi-Server Jobs are a great way to centralize jobs that need to be on all your boxes. If you are not familiar with the idea, I suggest you look into it if you have more than a handful of servers.

I was building a new CMS the other day and had to get dozens of Multi-Server Jobs transferred to the new instance. Some of the jobs have 30 steps, so using Object Explorer to script out the jobs was the obvious choice. However, I was immediately hit with an ambiguous error stating that SQL Server was “Unable to cast object of type ‘System.DBNull’ to type ‘System.String’.”

Test

My original error had another, equally unhelpful line.

Knowing that some object is Null doesn’t help very much. After browsing through the job steps, everything looked in order. Internet searches provided surprisingly little on this subject, and nothing I found really explained how it happens. I know I was going to be migrating a CMS again, and soon, so I wanted to know how to fix it. Surely it wasn’t just a bug in Management Studio? On to the test case!

First of all, you need to create a Multi-Server job. I’m assuming you already have a CMS set up, if not, you need one.
Right-click Multi-Server Jobs, create a new job, and then assign at least two targets for testing. I’m going to use the name TestJob for the rest of this scenario.

test

Make sure you have at least 2 target servers

Once TestJob is made, (it does not matter what it does; mine just performed SELECT @@VERSION) you need to run the job. Execute the following code:
msdb.dbo.sp_start_job 'TestJob' or just start the job using the GUI.

Test

Your job can fail, I don’t care. Just run it.

Now that you have some job history for 2 servers, let’s look at that history:

SELECT *
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
WHERE j.name = 'TestJob'

This will display one line per targeted server. This query is included for your own reference, we are concerned about the next one’s more targeted results. The query is checking your multi-server job history, specifically the server list, and comparing that to the servers that exist on the CMS. Right now, everything looks fine, and matches.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

Job History Server_Id matches Target Server_Id, awesome!

Again, at this point, all is well, and you can script out the multi-server job to your heart’s content. That would be fine if you saved a copy and never, ever changed the job while it sat on your CMS. Fast forward a few years though, and weird things may have happened. Let’s simulate a weird thing. I’m going to change the Server_Id in the history to an invalid value. That’s right, there are no constraints preventing me from doing this! This is going to simulate removing a server from the CMS after history already exists for it, or someone just going into your history and changing things because they hate you.

You will need to change the numeric values of the server_ids to match invalid and valid server_ids on your CMS, respectively.
UPDATE js
SET Server_id = 7
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
WHERE j.name = 'TestJob' and server_id = 5

Now we are going to rerun the script from earlier to verify the job history information.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

The history now has a server_id that doesn’t exist on the CMS

Great…now we have a null server. If you guessed this is a problem, you guessed correctly. If you try to script the job out now, you will receive a failure notification like the one at the top. Now that we have recreated the issue, we know how to fix it though!

Run the following code to find any null servers and just clean out that history. I’m going to assume if your server doesn’t exist, you don’t care about the history. I didn’t at least.

DELETE FROM js
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob' AND ts.server_id IS NULL

Finally, let’s look at the job history again, I’ve supplied the code for a third time, just in case you forgot it.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

Now we have clean history with no null values

If you try to script the job, it will work, and you can copy your multi-server job over to the new server.

Let’s take it a step further, and expand the script to check all the multi-server jobs. Notice that it is only looking at Category_Id=2, or Multi-Server jobs, and only those with invalid Server_Ids.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.category_id = 2 AND ts.server_id IS NULL

Hopefully you got 0 results, and everything on your CMS is happy.
Even though I changed the Server_Id to recreate this scenario, the original problem occurred due to servers defecting from the CMS, but leaving behind their server history. Anytime a server defects, it needs to automatically remove the history from the sysjobservers table as well, otherwise people will continue to have this problem.