Getting Jobs by Run Time

New month, so I have to make sure I write something about SQL again, instead of babbling about PowerShell all the time! I manage a lot of servers, so scripting out something that I can run on all the servers rather than pulling up the GUI on one server at a time is always important.

Checking to see when jobs ran on a single server is a pretty quick task. You just have to pull up the Job Monitor, and you’ll have times and even more information easily accessible. Doing that on a dozen servers is tedious though. Doing it on a hundred would be a nightmare. Beyond that, checking specific times through the Job Monitor can be a bit of a hassle, as it displays a summary of the last execution times. What if you want to know what happened 5 hours ago, regardless of whether jobs have ran since then?

You can query SQL for job history, and with a lot of time conversions,  you can get a list of jobs that ran within a specific time period. I use this to narrow down what was happening on a box while troubleshooting. If you need to review long running jobs over a period of time, compare run times of jobs across servers, or just see what jobs might have been slowing performance at 3:03 am, this script can help you out.

The script takes a start and end time and checks for any jobs that started or started earlier but were still running by your supplied start time.

Calculating the Start and End times from job history is ugly. You’ll see that the datetime conversions below are disgustingly complicated. The short and sweet explanation is that date, time, and duration are all integer values. Because Microsoft thinks everything should be a number. To properly calculate date and time, then display it in a readable format, you have to stuff in colons and and use modulus division to get the hours, minutes, and seconds displaying correctly. If you really like math (I do!), that’s a technical explanation of modulo.

The run status will tell you if the job succeeded or not. Now, according to TechNet SQL Server 2000, there is a run_status of 4 meaning the job is still In Progress. Newer versions don’t mention a run_status of 4 and I’ve never seen run_status of 4 in a result set, even though I had jobs running at the time. Will it work for you? I really don’t know. I don’t trust it though. For newer versions of SQL, I’d suggest running a system stored procedure instead to get currently executing jobs, the code for which I’ve provided after my script.


DECLARE  @StartTime INT = '195500' --HHMMSS
,@EndTime INT = '195600' --HHMMSS
,@Date  INT = CONVERT(INT,CONVERT(VARCHAR,GETDATE(),112))
,@MinTime INT = 10  --Job Run Time Minimum in Seconds

SELECT
[JobName] = j.name
,[StartDate] = CONVERT(VARCHAR,CONVERT(DATETIME,RTRIM(jh.run_date)),101)
,[StartTime] = CONVERT(TIME(0),STUFF(STUFF(REPLACE(STR(RTRIM(jh.run_time),6,0),' ','0'),3,0,':'),6,0,':'))
,[EndTime] = DATEADD(second,run_duration%100,DATEADD(minute,run_duration / 100%100,DATEADD(hour,run_duration / 10000,CONVERT(TIME(0),STUFF(STUFF(REPLACE(STR(RTRIM(jh.run_time),6,0),' ','0'),3,0,':'),6,0,':')))))
,[Duration] = STUFF(STUFF(REPLACE(STR(jh.run_duration,7,0),' ','0'),4,0,':'),7,0,':')
,[RunStatus] =
CASE jh.run_status
WHEN 0 THEN 'Failed'
WHEN 1 THEN 'Succeeded'
WHEN 2 THEN 'Retry'
WHEN 3 THEN 'Cancelled'
WHEN 4 THEN 'InProgess'
ELSE 'Unknown'
END
FROM msdb.dbo.sysjobs j
INNER JOIN msdb.dbo.sysjobhistory jh ON jh.job_id = j.job_id
WHERE run_date = @Date
AND step_id = 0 -- Entire Job
AND run_duration > @MinTime
AND
(
( run_time >= @StartTime AND run_time <= @EndTime )
OR
( Run_Time < @StartTime
AND
DATEADD(second,run_duration%100,DATEADD(minute,run_duration / 100%100,DATEADD(hour,run_duration / 10000,CONVERT(TIME,STUFF(STUFF(REPLACE(STR(RTRIM(jh.run_time),6,0),' ','0'),3,0,':'),6,0,':'))))) > CONVERT(TIME,STUFF(STUFF(REPLACE(STR(RTRIM(@StartTime),6,0),' ','0'),3,0,':'),6,0,':'))
)
)
ORDER BY Run_Date,Run_Time

Note that jobs that are still executing do not have an entry in job history. So combining this script with the executing jobs system stored procedure is a good idea if you are dealing with a recent time window especially. You can add the parameter @Category_Name = ‘jobs from msx’ to return only the jobs managed by your Central Management Server.


EXEC msdb.dbo.sp_help_job @Execution_Status = 1

Advertisements

Get Failed Jobs with T-SQL

It’s been quite some time since I did a post relating to SQL. I’ve just been sharing a lot of PowerShell scripts lately because almost all my views are related to PowerShell.

Monitoring SQL Jobs can be done in many ways, and I highly suggest your jobs send an email notification to your DBA group on failure. No not just you. What happens when you are on vacation or you leave permanently and your account gets disabled? No notifications. Don’t do that. You can also set up fancy reports to display information about your job status using SSRS, but I always like to have something that is easily portable and can be ran in SSMS quickly.

Uses

In the middle of the night when email isn’t loading and a call comes in that something important failed, but the helpdesk can’t articulate what exactly it is over the phone…this script can help a lot. Suddenly confusion and trying to piece together a string of letters and numbers over the phone turns into a simple, “Oh, yes, I see Server X has failed Job Y 3 times. It ran successfully last time, I’ll go ahead and connect to that server to look into it but I suspect someone might have been doing Z.”

I developed this failed job script to run on a single SQL server or through a multi-server query to see all failed jobs across all servers. I commonly run this script in the mornings as a summary when there are multiple failed job emails that came in over night. It’s a much easier way for me to see the history this way. If a job is continuing to fail and it’s high priority, I start there.

Explanation

The job provides the name, run date, and run time for each failed job, but it also provides some extra information for sorting priority. I’ve used ROW_NUMBER to sort job history and retrieve the latest run results of each failed job. That’s the only purpose of using a CTE here, getting Row 1, or the most recent run result for each job. Using the latest result helps determine how problematic the reported job failure is. Did the latest trans log backup run successfully even though one of them failed in the past? OK, that’s probably a minor issue then. Are they still failing? Well, then we have something a bit more worrisome to consider.

I always limit the results to a 24 hour period to limit the result set. I’ve also included a clause in the past to exclude a specific job that was prone to failing due to network issues or a colleague testing something.

WITH JobInfo AS
(
 SELECT
  j.job_id
  ,j.name
  ,jh.run_status
  ,jh.run_date
  ,jh.run_time
  ,jh.[message]
  ,[PriorStatus] = --'NA' --2008
    LAG(jh.run_status,1) OVER (ORDER BY jh.job_id,jh.run_date,jh.run_time)  --2012
  ,[NextStatus] = --'NA' --2008
    LEAD(jh.run_status,1) OVER (ORDER BY jh.job_id,jh.run_date,jh.run_time) --2012
  ,[RN] = ROW_NUMBER() OVER (PARTITION BY jh.job_id
 ORDER BY jh.job_id DESC, jh.run_date DESC, jh.run_time DESC)
 FROM msdb.dbo.sysjobs j
  INNER JOIN msdb.dbo.sysjobhistory jh ON jh.job_id = j.job_id
 WHERE jh.run_date &amp;amp;gt;= CONVERT(INT,CONVERT(VARCHAR,GETDATE()-1,112))
  AND jh.step_id = 0
)
SELECT
  j.Name
  ,[RunDate] = CONVERT(DATE,CONVERT(VARCHAR,j.run_date,112))
  ,[RunTime] = CONVERT(TIME(0),STUFF(STUFF(REPLACE(STR(RTRIM(j.run_time),6,0),' ','0'),3,0,':'),6,0,':'))
  ,[PriorStatus] = CASE j.PriorStatus WHEN '1' THEN 'SUCCESS' WHEN '0' THEN 'FAIL' ELSE 'N/A' END
  ,[NextStatus] = CASE j.NextStatus WHEN '1' THEN 'SUCCESS' WHEN '0' THEN 'FAIL' ELSE 'N/A' END
  ,[LatestStatus] = CASE ji.run_status WHEN '1' THEN 'SUCCESS' ELSE 'FAIL' END
  ,j.[message]
 FROM JobInfo j
   INNER JOIN JobInfo ji ON ji.job_id = j.job_id AND ji.RN = 1
 WHERE j.run_status = 0
 ORDER BY j.name,[RunDate],[RunTime]

Exceptions

If you are running the script on SQL 2012, you can get a bit more information using LEAD/LAG in the form of the completion status of the job before and after the reported failure. I use this to pattern failures and to further prioritize multiple job failures. All other things being equal, I’m going to be more concerned about the job that has consecutive failure than a job that fails intermittently.

Sadly, IF statements, LEAD/LAG, and SQL 2008R2 or older do not play well with each other. I’ve tried making IF statements that run the LEAD/LAG version of the job only on SQL 2012 while running the NA version of the job otherwise, but SQL always wants to parse the 2012 functions, resulting in errors. I’ve resigned to just commenting out the LEAD/LAG functions when running this script on SQL 2008R2 and older. The benefits of the PriorStatus and NextStatus is fairly minimal, so it’s not a huge loss. I find the LatestStatus to be the most important of the three by far.

Multi-Server Job Scripting Failure

Central Management Servers are probably the single most important thing in my SQL life right now. I manage a lot of servers, so having an organized way to keep track of them is paramount. Multi-Server Jobs are a great way to centralize jobs that need to be on all your boxes. If you are not familiar with the idea, I suggest you look into it if you have more than a handful of servers.

I was building a new CMS the other day and had to get dozens of Multi-Server Jobs transferred to the new instance. Some of the jobs have 30 steps, so using Object Explorer to script out the jobs was the obvious choice. However, I was immediately hit with an ambiguous error stating that SQL Server was “Unable to cast object of type ‘System.DBNull’ to type ‘System.String’.”

Test

My original error had another, equally unhelpful line.

Knowing that some object is Null doesn’t help very much. After browsing through the job steps, everything looked in order. Internet searches provided surprisingly little on this subject, and nothing I found really explained how it happens. I know I was going to be migrating a CMS again, and soon, so I wanted to know how to fix it. Surely it wasn’t just a bug in Management Studio? On to the test case!

First of all, you need to create a Multi-Server job. I’m assuming you already have a CMS set up, if not, you need one.
Right-click Multi-Server Jobs, create a new job, and then assign at least two targets for testing. I’m going to use the name TestJob for the rest of this scenario.

test

Make sure you have at least 2 target servers

Once TestJob is made, (it does not matter what it does; mine just performed SELECT @@VERSION) you need to run the job. Execute the following code:
msdb.dbo.sp_start_job 'TestJob' or just start the job using the GUI.

Test

Your job can fail, I don’t care. Just run it.

Now that you have some job history for 2 servers, let’s look at that history:

SELECT *
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
WHERE j.name = 'TestJob'

This will display one line per targeted server. This query is included for your own reference, we are concerned about the next one’s more targeted results. The query is checking your multi-server job history, specifically the server list, and comparing that to the servers that exist on the CMS. Right now, everything looks fine, and matches.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

Job History Server_Id matches Target Server_Id, awesome!

Again, at this point, all is well, and you can script out the multi-server job to your heart’s content. That would be fine if you saved a copy and never, ever changed the job while it sat on your CMS. Fast forward a few years though, and weird things may have happened. Let’s simulate a weird thing. I’m going to change the Server_Id in the history to an invalid value. That’s right, there are no constraints preventing me from doing this! This is going to simulate removing a server from the CMS after history already exists for it, or someone just going into your history and changing things because they hate you.

You will need to change the numeric values of the server_ids to match invalid and valid server_ids on your CMS, respectively.
UPDATE js
SET Server_id = 7
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
WHERE j.name = 'TestJob' and server_id = 5

Now we are going to rerun the script from earlier to verify the job history information.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

The history now has a server_id that doesn’t exist on the CMS

Great…now we have a null server. If you guessed this is a problem, you guessed correctly. If you try to script the job out now, you will receive a failure notification like the one at the top. Now that we have recreated the issue, we know how to fix it though!

Run the following code to find any null servers and just clean out that history. I’m going to assume if your server doesn’t exist, you don’t care about the history. I didn’t at least.

DELETE FROM js
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob' AND ts.server_id IS NULL

Finally, let’s look at the job history again, I’ve supplied the code for a third time, just in case you forgot it.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.name = 'TestJob'

Test

Now we have clean history with no null values

If you try to script the job, it will work, and you can copy your multi-server job over to the new server.

Let’s take it a step further, and expand the script to check all the multi-server jobs. Notice that it is only looking at Category_Id=2, or Multi-Server jobs, and only those with invalid Server_Ids.

SELECT
[JobName] = j.Name
,[JobServerId] = js.Server_id
,[TargetServerId] = ts.Server_id
FROM msdb.dbo.sysjobservers js
INNER JOIN msdb.dbo.sysjobs j ON j.job_id = js.job_id
LEFT JOIN msdb.dbo.systargetservers ts ON ts.server_id = js.server_id
WHERE j.category_id = 2 AND ts.server_id IS NULL

Hopefully you got 0 results, and everything on your CMS is happy.
Even though I changed the Server_Id to recreate this scenario, the original problem occurred due to servers defecting from the CMS, but leaving behind their server history. Anytime a server defects, it needs to automatically remove the history from the sysjobservers table as well, otherwise people will continue to have this problem.