Who blew up my TempDB?
Who hasn’t experienced a server all of a sudden running out of TempDB space and seeing the following error message?
Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
The difficulty about investigating a TempDB issue like this is that in general the solution chosen to fix a full TempDB is to restart the SQL Server instance. Seeing as how the instance has quite often locked up completely and the customer wants their server to work again ASAP, then a restart is almost inevitable.
A restart is an option to combat the symptom, but not the cause. It also (in)conveniently removes the evidence of the transaction that caused TempDB to fill up in the first place.
To see how to deal with TempDB (or other databases) filling up/filling up the disk it is on please take a look at this blogpost.
Returning to the question at hand: Who filled my TempDB?
One solution I looked into, and have implemented for this exact reason, uses Extended Events.
Extended Events is an event processing framework built inside SQL Server since SQL 2008. It allows you to run very lightweight background monitoring of system events, and in this case can capture information about which command/transaction caused a file growth event to occur.
Catch the culprit
To be able to catch the offending process/user, we will create an Extended Events session on the server and the following code does just that:
CREATE EVENT SESSION [Database_Growth_Watchdog] ON SERVER ADD EVENT sqlserver.database_file_size_change ( ACTION ( sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_name, sqlserver.session_nt_username, sqlserver.sql_text ) WHERE ( [database_id] = ( 2 ) ) -- We filter on database_id=2 to get TempDB growth only ) ADD TARGET package0.event_file ( SET filename = 'D:\Temp\Database_Growth_Wathdog.xel', max_file_size = ( 10 ) ) WITH ( MAX_MEMORY = 4096 KB, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY = 1 SECONDS, MAX_EVENT_SIZE = 0 KB, MEMORY_PARTITION_MODE = NONE, TRACK_CAUSALITY = OFF, STARTUP_STATE = ON ) GO ALTER EVENT SESSION [Database_Growth_Watchdog] ON SERVER STATE = START
Some things to note about the code:
- We are monitoring for data file and log file growth. The event sqlserver.database_file_size_change fires for any file size change. We do this because we want to be informed of any and all file growth just to make sure we don’t miss anything.
- If you have multiple data files for TempDB (like you may for a multi-core environment) you will see one event fire for each file that is growing. E.g. You have 4 data files and the database grows, you will see 4 entries in the extended events output.
- The session is set to flush events to the output file in 1 second intervals (MAX_DISPATCH_LATENCY). This is done to ensure we lose as few entries to the output file as possible. If TempDB fills up, the entire instance can often stop working completely. We want to catch as much information as possible before this happens, so we flush to the output file in very short intervals.
- We start the session at instance startup (STARTUP_STATE). This ensures we have the event active immediately on server startup. As this is monitoring file growth events, we should remain very lightweight so not have to worry about swamping our system with extended events processing.
- We limit the output file to 10MB and allow it to deliver to a total of 5 files. This means we have to have 50MB for the files in the output directory and won’t spam the folder with too much data.
- When the event fires, we collect information about the query/command that caused the file growth to occur. This includes NT username, Hostname, origin database of the query, command text and application name.
The information collected by this session can be vital in pinpointing the cause for the TempDB filling up. However, there is the possibility of false positives in this setup. TempDB may have been almost completely filled by another previous transaction and the transaction causing the growth event is an innocent bystander. This is an unavoidable situation, but needs to be kept in mind when analysing the data. If you don’t catch the exact transaction this way, you are still on the right track.
Analyse the data
Once the data has been collected, we need to load and parse the output files to make sense of what has happened. The following code will parse the xml output that is in the (up to) 5 files.
DECLARE @TraceFileLocation NVARCHAR(255)= N'D:\Temp\Database_Growth_Watchdog*.xel'; WITH FileSizeChangedEvent AS ( SELECT object_name Event, CONVERT(XML, event_data) Data FROM sys.fn_xe_file_target_read_file(@TraceFileLocation, NULL, NULL, NULL) ) SELECT Data.value('(/event/@timestamp)', 'DATETIME') EventTime, Data.value('(/event/data/value)', 'BIGINT') GrowthInKB, Data.value('(/event/action/value)', 'VARCHAR(MAX)') ClientUsername, Data.value('(/event/action/value)', 'VARCHAR(MAX)') ClientHostname, Data.value('(/event/action/value)', 'VARCHAR(MAX)') ClientAppName, Data.value('(/event/action/value)', 'VARCHAR(MAX)') ClientAppDBName, Data.value('(/event/action/value)', 'VARCHAR(MAX)') SQLCommandText, Data.value('(/event/data/value)', 'BIGINT') SystemDuration, Data.value('(/event/data/value)', 'BIGINT') SystemDatabaseId, Data.value('(/event/data/value)', 'VARCHAR(MAX)') SystemDatabaseFileName, Data.value('(/event/data/text)', 'VARCHAR(MAX)') SystemDatabaseFileType, Data.value('(/event/data/value)', 'VARCHAR(MAX)') SystemIsAutomaticGrowth, Data FROM FileSizeChangedEvent;
Please take note of the variable @TraceFileLocation. The example uses a wildcard to allow loading/parsing of multiple files, this is particularly useful if you really do rollover into multiple files.
The results from the query provide a range of information regarding to who the offending process was, what command they had submitted last and a set of information about the client.
So, now we have a starting point to discover who/what is causing TempDB to grow and can discuss this with application owners.
It is extremely simple to to extend the session to monitor all databases and run a separate monitoring solution to inform you of such growth events. In fact, I saw Gianluca Satori ( b | t ) talk about streaming Extended Event data and processing them in near real time at his SQLSat Oslo session. I am certain that this would be a fantastic way of doing some sort of real time monitoring of growth events.
Some homework for you: What could you use this Extended Events session for? How could you modify/improve on it to help with your system monitoring and administration? Let us know in the comments.