Tag Archives: best practice

Who blew up my TempDB?

Who blew up my TempDB?

Who hasn’t experienced a server all of a sudden running out of TempDB space and seeing the following error message?

Could not allocate a new page for database 'tempdb' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

The difficulty about investigating a TempDB issue like this is that in general the solution chosen to fix a full TempDB is to restart the SQL Server instance. Seeing as how the instance has quite often locked up completely and the customer wants their server to work again ASAP, then a restart is almost inevitable.

A restart is an option to combat the symptom, but not the cause. It also (in)conveniently removes the evidence of the transaction that caused TempDB to fill up in the first place.

To see how to deal with TempDB (or other databases) filling up/filling up the disk it is on please take a look at this blogpost.

Returning to the question at hand: Who filled my TempDB?

One solution I looked into, and have implemented for this exact reason, uses Extended Events.

Extended Events is an event processing framework built inside SQL Server since SQL 2008. It allows you to run very lightweight background monitoring of system events, and in this case can capture information about which command/transaction caused a file growth event to occur.

For a full introduction to Extended Events, please take a look at the 31 Days of Extended Events series from Jonathan Kehayias ( b | t ).

Catch the culprit

To be able to catch the offending process/user, we will create an Extended Events session on the server and the following code does just that:

CREATE EVENT SESSION [Database_Growth_Watchdog] ON SERVER
ADD EVENT sqlserver.database_file_size_change (
 ACTION ( sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.database_name, sqlserver.session_nt_username, sqlserver.sql_text )
 WHERE ( [database_id] = ( 2 ) ) -- We filter on database_id=2 to get TempDB growth only
)
ADD TARGET package0.event_file ( SET filename = 'D:\Temp\Database_Growth_Wathdog.xel',
 max_file_size = ( 10 ) )
WITH ( MAX_MEMORY = 4096 KB,
 EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
 MAX_DISPATCH_LATENCY = 1 SECONDS,
 MAX_EVENT_SIZE = 0 KB,
 MEMORY_PARTITION_MODE = NONE,
 TRACK_CAUSALITY = OFF,
 STARTUP_STATE = ON )
GO
ALTER EVENT SESSION [Database_Growth_Watchdog] ON SERVER STATE = START

Some things to note about the code:

  • We are monitoring for data file and log file growth. The event sqlserver.database_file_size_change fires for any file size change. We do this because we want to be informed of any and all file growth just to make sure we don’t miss anything.
  • If you have multiple data files for TempDB (like you may for a multi-core environment) you will see one event fire for each file that is growing. E.g. You have 4 data files and the database grows, you will see 4 entries in the extended events output.
  • The session is set to flush events to the output file in 1 second intervals (MAX_DISPATCH_LATENCY). This is done to ensure we lose as few entries to the output file as possible. If TempDB fills up, the entire instance can often stop working completely. We want to catch as much information as possible before this happens, so we flush to the output file in very short intervals.
  • We start the session at instance startup (STARTUP_STATE). This ensures we have the event active immediately on server startup. As this is monitoring file growth events, we should remain very lightweight so not have to worry about swamping our system with extended events processing.
  • We limit the output file to 10MB and allow it to deliver to a total of 5 files. This means we have to have 50MB for the files in the output directory and won’t spam the folder with too much data.
  • When the event fires, we collect information about the query/command that caused the file growth to occur. This includes NT username, Hostname, origin database of the query, command text and application name.

The information collected by this session can be vital in pinpointing the cause for the TempDB filling up. However, there is the possibility of false positives in this setup. TempDB may have been almost completely filled by another previous transaction and the transaction causing the growth event is an innocent bystander. This is an unavoidable situation, but needs to be kept in mind when analysing the data. If you don’t catch the exact transaction this way, you are still on the right track.

Analyse the data

Once the data has been collected, we need to load and parse the output files to make sense of what has happened. The following code will parse the xml output that is in the (up to) 5 files.

DECLARE @TraceFileLocation NVARCHAR(255)= N'D:\Temp\Database_Growth_Watchdog*.xel';
WITH FileSizeChangedEvent
 AS (
 SELECT object_name Event,
 CONVERT(XML, event_data) Data
 FROM sys.fn_xe_file_target_read_file(@TraceFileLocation, NULL, NULL, NULL)
 )
 SELECT Data.value('(/event/@timestamp)[1]', 'DATETIME') EventTime,
 Data.value('(/event/data/value)[7]', 'BIGINT') GrowthInKB,
 Data.value('(/event/action/value)[2]', 'VARCHAR(MAX)') ClientUsername,
 Data.value('(/event/action/value)[4]', 'VARCHAR(MAX)') ClientHostname,
 Data.value('(/event/action/value)[5]', 'VARCHAR(MAX)') ClientAppName,
 Data.value('(/event/action/value)[3]', 'VARCHAR(MAX)') ClientAppDBName,
 Data.value('(/event/action/value)[1]', 'VARCHAR(MAX)') SQLCommandText,
 Data.value('(/event/data/value)[1]', 'BIGINT') SystemDuration,
 Data.value('(/event/data/value)[2]', 'BIGINT') SystemDatabaseId,
 Data.value('(/event/data/value)[8]', 'VARCHAR(MAX)') SystemDatabaseFileName,
 Data.value('(/event/data/text)[1]', 'VARCHAR(MAX)') SystemDatabaseFileType,
 Data.value('(/event/data/value)[5]', 'VARCHAR(MAX)') SystemIsAutomaticGrowth,
 Data
 FROM FileSizeChangedEvent;

Please take note of the variable @TraceFileLocation. The example uses a wildcard to allow loading/parsing of multiple files, this is particularly useful if you really do rollover into multiple files.

The results from the query provide a range of information regarding to who the offending process was, what command they had submitted last and a set of information about the client.

So, now we have a starting point to discover who/what is causing TempDB to grow and can discuss this with application owners.

Other ideas

It is extremely simple to to extend the session to monitor all databases and run a separate monitoring solution to inform you of such growth events. In fact, I saw Gianluca Satori ( b | t ) talk about streaming Extended Event data and processing them in near real time at his SQLSat Oslo session. I am certain that this would be a fantastic way of doing some sort of real time monitoring of growth events.

Some homework for you: What could you use this Extended Events session for? How could you modify/improve on it to help with your system monitoring and administration? Let us know in the comments.

Full disks and databases

How do you deal with a full hard disk/volume/partition?

I have recently had a number of discussions about dealing with hard drives or databases themselves filling up on SQL Servers. There seems to be a few different opinions on how to deal with this and I’d like to share those here and give my preferred solution.

N.B. (For brevity, I will refer to disks, but mean either disk, volume or partition where applicable)

The main error that people are confronted with when either a disk or a database file has filled to the brim is something like this:

Could not allocate a new page for database 'DatabaseName' because of insufficient disk space in filegroup 'PRIMARY'. Create the necessary space by dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.

There are a few reasons this could happen (not an exhaustive list):

  1. The database file is full but the disk holding the database file does not have enough room to accommodate the file growth required for auto-growth to occur.
  2. The database file is full but the disk holding the database file has a quota activated that has been reached, so does not have enough room to accommodate the file growth required for auto-growth to occur.
  3. The database file is full but database does not have auto-growth setup.

The problem with the first two situations I mentioned is that they are generally not easy/quick situations to fix. If a disk is full, then extending that disk may not be possible, or only possible with a lot of administrative overhead. The same applies for disk quotas, even more so with quotas, as quotas usually have some sort of “internal politics” attached to them.

The third reason occurs surprisingly often. For some reason, certain companies just don’t want their databases to grow automatically. They want to keep everything under human control, so they turn off all sorts of automatic support options like auto-growth. I also heard a customer mention that they preferred the data files to be static in size and it was easier for them to extend these than the disk (again confirming my point about disk/partition/volume extensions being difficult).

Regardless of the reasoning for databases/database files actually filling up and the error occurring, what can be done in an attempt to help prevent this from happening, or at least allow for a speedier resolution of the issue?

Dummy/Buffer files

If you have a volume size limitation and need to ensure that you can quickly arrange for more disk space very quickly, one option is to use “dummy” or “buffer” files. These are files that are of a certain size, but have no actual content. They are created on the disk that you want to have “insurance” for and kept around until disk space is either low (a pre-defined threshold) or the disk is actually full. When either of these situations occurs, you simply delete one or more of these “dummy” files and you immediately have more disk space.

To create these files you can use fsutil a great little tool that is bundled with windows and allows sysadmins to do a number of things with disks. You call fsutil from the command line and provide it with the correct parameters and an “empty” file will be created.

Syntax of fsutil (length is in bytes)
fsutil file createnew <filename> <length>

Example to create a single 50MB file
fsutil file createnew d:\temp\dummyfile 52428800

dummy_file

Once the file(s) has been created you have a buffer of space so that you can quickly react to a disk filling up (even with some sort of scripted automation). Of course, with the consideration that the disk now has less space available for “real” data, meaning a full disk can actually occur quicker.

Using Max Size Limits for a Database

Among the settings that a database can have, you can control in what size “steps” a database should grow by if auto-growth is activated. You can also set a maximum size for the database file(s) to grow to.  This is similar to the idea of a disk quota, but on a database file level. Setting this limit can be useful to prevent a database from growing so large that it will completely fill a disk. I actually prefer the idea of setting an upper limit on the database file(s) rather than allowing unlimited growth. The reason for my preference is down to the fact that it is easier to change the maximum growth limitation than it is to extend a disk.

Setting the maximum size for a database is a simple operation, either in SSMS via the GUI:

database_autogrowth

Or via T-SQL:

ALTER DATABASE [sandbox] MODIFY FILE ( NAME = N'sandbox', MAXSIZE = 102400KB )

With this soft-limit for a database file, you are able to control file growth with greater precision. The main benefit is, that you can increase the maxsize at a moment’s notice and allow normal operations to continue if a database file should accidentally/suddenly fill up. This is not an ideal solution, but still improves control and allows normal operations to resume.

I like dummies!

I prefer the dummy file solution out of the two that I have explained above. Using a dummy file allows you to have auto-growth turned on (to cover those immediate growth requirements), while stilling offering you the chance to also avoid a disk from filling up completely through an auto-growth event.

Using the maxsize solution is ok, so long as you are able to connect to the instance to extend the maxsize. This is often possible for user databases, but what about TempDB? I have experienced systems where the system has locked up because TempDB is full and attempts to connect were fruitless (Remote DAC was not activated and no local access to the SQL instance), so it was not possible to extend TempDB.

Having a few dummy files around would allow me to quickly supply TempDB with the required disk space for an auto-growth and for me to connect to the instance again and take charge of the situation. The other alternative (which customers often chose in the past), was to reboot the server / restart the instance. The issue with this being, the culprit for the “crash” is no longer identifiable, unless you have a solution implemented to do just that (which I will blog about soon!).

Conclusion

In my experience most installations / customers do not have a solution implemented to counteract a disk/database filling up. The two solutions I have explained are good starting points to help avoid catastrophe in future. Please take these ideas and see if you can improve the stability and availability of your systems using them, or even better, provide your ideas/solutions in the remarks of this blogpost.

Thanks for reading and have fun with your diskspace!

SQL Server Simple Recovery Model Requires Log Backups

“SQL Server Simple Recovery Model never requires a log backup!!!” I hear you say. Well sit back, relax and let me show you that it does…… sometimes…… although it shouldn’t!

Issue Description

I ran into a confusing situation yesterday which has a published solution, but I had never heard about it. It is one of those things that should stick in your mind considering how much havoc this may cause.

I am talking about what happens if you setup your model database so that future new databases are created in a way you want. If you stray from the SQL Server defaults, then this bug might catch you out. The issue I am talking about is the recovery model of the model database. By default SQL Server has a model database set to FULL recovery and with (in my opinion) vastly outdated initial size and auto-growth settings.

My department’s policy is to set the model database to SIMPLE and to set initial size and auto-growth to something more sensible than 1MB and 10% (for argument’s sake 128MB initial size and 128MB growth).

With SQL Server 2012 (RTM and SP1), setting the model to SIMPLE looks fine on the outside, but under the covers that is not what happens. A new database will be created using model, setting the new DB to SIMPLE recovery and you won’t immediately notice any issues. However, as soon as you start running DML you can see that something isn’t quite right by interrogating sys.databases and looking at the log_reuse_desc column for the database.

You can tag along for the ride by running the following code on a TEST!!! 2012 RTM or SP1 instance

SET NOCOUNT ON
GO

USE master
GO

IF EXISTS (SELECT * FROM sys.databases AS D WHERE name ='MyTestDB')
 DROP DATABASE MyTestDB
GO

ALTER DATABASE model SET RECOVERY SIMPLE
GO

-- Create a test database using defaults from model
CREATE DATABASE MyTestDB
GO

-- Verify that the database is in SIMPLE recovery model and log reuse is 'NOTHING'
SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';

-- Create a table in the test db and add some data (creating log file entries)
USE MyTestDB
GO
CREATE TABLE dbo.TestTable
 (Col1 int NOT NULL IDENTITY (1,1),
 Col2 char(256) NOT NULL
 DEFAULT REPLICATE('a', 256))
GO

INSERT INTO dbo.TestTable
 (Col2)
VALUES (default)
GO 2000

-- Now look at the log reuse information again - all is fine......
SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';
GO

-- Perform a backup of the database, because that is what we DBAs do regularly (here to nul so we don't fill up disk space)
BACKUP DATABASE MyTestDB TO DISK = 'nul'
GO

-- Add yet more test data
INSERT INTO dbo.TestTable
 (Col2)
VALUES (default)
GO 2000

-- Now look at the log reuse information again - oh look..... apparently we need a log backup!!!
SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';
GO

As you can see from running that code block, the log reuse description is now ‘LOG_BACKUP’ which is not possible for a SIMPLE recovery database! In fact, you can run a log backup if you like too, it will run successfully but won’t clear the log out for re-use!! So if you don’t do something about this, you will either fill your log file (if it isn’t set to auto-grow), or you will fill your disk drive where your log file resides. How fast this will happen depends solely on what activity is going on in that particular database. Your monitoring processes for drive space will warn you, but if the growth is fast enough, even these early warning systems may not be enough.

Solutions to the problem

As I mentioned at the beginning of this post, there is a solution to the problem (or rather, there are multiple solutions).

  1. Don’t set model to SIMPLE in the first place, making sure it is a step you make after creating databases that don’t require FULL recovery.
  2. Continue with model set to SIMPLE, but incorporate a process that “fixes” your databases after creation.
  3. Install at least CU4 for SQL Server 2012 SP1 or CU7 for SQL Server 2012 RTM both of which incorporate the bug-fix for this problem (along with a whole host of other fixes).

The steps required to follow option 2 would be:

  • Set the affected database to FULL recovery model
  • Perform a FULL backup of the database
  • Set the database back to SIMPLE recovery model

Here is the code to achieve that and also prove the title of this blog post:

-- Now look at the log reuse information again - oh look..... apparently we need a log backup!!!
SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';
GO

-- Perform the "required" log backup
BACKUP LOG MyTestDB TO DISK = 'nul'

-- The log has been cleared...... So we have FULL recovery in disguise
SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';
GO

-- How about we fix this permanently. Change to FULL, run a backup, change back to SIMPLE

ALTER DATABASE MyTestDB SET RECOVERY FULL
GO

BACKUP DATABASE MyTestDB TO DISK='nul'
GO

ALTER DATABASE MyTestDB SET RECOVERY SIMPLE
GO

-- Now fill our table with more data and check the log reuse column again
INSERT INTO dbo.TestTable
 (Col2)
VALUES (default)
GO 2000

SELECT name,
 recovery_model_desc,
 log_reuse_wait_desc
FROM sys.databases
WHERE name = 'MyTestDB';
GO

As for option 3: I realise that using RTM is something people just shouldn’t be doing, but the fact that this bug is in SP1 is a nasty one. Also, not everyone will install Cumulative Updates (CUs) because they are not as rigorously tested as Service Packs (SPs). I know of companies that have policies strictly forbidding installing CUs on production machines for this reason alone. This makes it all the more serious that it is a bug that managed to get through to SP1. Obviously, as soon as SP2 for SQL 2012 comes out (but when will that be?) this bug-fix will be available, but until that time, you have to live with CUs or other workarounds.

I have to say again, I am surprised I hadn’t heard about this issue before – it flew under my radar or was just not picked up by the “normal news channels”. Either way, I thought I’d throw this out there to help the google juices for anyone who has similar issues.

And so, that is how the SIMPLE recovery model sometimes does need a log backup……… sort of.

Have fun 🙂