Monthly Archives: May 2013

Windows Failover Cluster Patching using PowerShell

PowerShell is the new standard scripting and automation language for Microsoft products. In this post I will be describing how I used two new Powershell commands to help semi-automate the installation of Windows Updates to a Windows 2012 Failover Cluster.

I run a number of clusters on both Windows Server 2008 and Windows Server 2008R2 and I must say that although Windows Clustering was a new thing to me about 5 years ago, my overall experience has been pretty painless during that time. From what I hear/read Server 2008 was the first outing of a much updated Windows Failover Clustering (including PowerShell integration), so I managed to body-swerve the more painful versions before it.

Be that as it may, Windows Server 2008 was always lacking in terms of scripting/automation and didn’t get much better with Windows Server 2008R2. The previous incarnation of the automation tooling was “cluster.exe” which was a command line tool that allowed for quite detailed control of a cluster. This was a tool that has “grown” over time to cover many areas of Windows Clustering, but was not designed to be as flexible or programmable as PowerShell. The PowerShell command-set for Clustering covered just about everything that “cluster.exe” covered and then some more – a good overview can be found at the Windows File Server Team Blog.

As Windows has evolved, the PowerShell offerings for each feature within Windows has also evolved. I recall hearing that the with Windows Server 2012 you can control the entire O/S from within PowerShell and that the GUI basically constructs and runs these PowerShell commands in the background (my brain may be playing tricks on me though).

With this in mind, I am preparing a server migration/replacement which will move a cluster from Windows Server 2008 to Windows Server 2012.  I took another look at the PowerShell commands available to me on Server 2012 and was rather pleased to see two improvements in the Failover Clustering command-set. The major command addition that I discovered, and had to write about here, was the “Suspend-ClusterNode” command and its brother “Resume-ClusterNode”.

Suspend-ClusterNode allows you to temporarily “pause” a cluster node. This basically removes any currently running resources/services from the specified node and the cluster will not assign any workload to that node until it has been re-instated as an active cluster participant. Resume-ClusterNode brings a previously suspended node back online within the cluster.

You may ask “Why would you want to do this?” or be thinking “This was possible with Windows Server 2008R2”; well dear reader, let’s take a look at those two points.

“Why would you want to do this?”

The short answer: Server maintenance with minimal downtime.

The slightly longer answer: Imagine you have a 3 node cluster. It is patchday and you want to cleanly patch the nodes, one after the other, with a minimum of downtime. You can afford for one node of the three to be offline for maintenance at any one time. This means that you can suspend node 3 while nodes 1 and 2 remain online. The cluster then knows that node 3 is down, but it is not down not due to a failure (so will not start panicking). You can then patch the node, reboot as necessary (who am I kidding, this is a windows server, you’ll have to reboot it!) and then the node is ready to re-join the cluster as an active/available member. You are then able repeat this process on the remaining nodes to install updates across the whole cluster.

“This was possible with Windows Server 2008R2”

The short answer: Yes it was, but required some manual intervention.

The long answer: Windows Server 2008R2 offered the ability to suspend a cluster node, but without the ability to control how the cluster dealt with any resources on the node being suspended. This is where the real magic in the new PowerShell command “Suspend-ClusterNode” comes into play.

Let’s take a quick look at the syntax of both so we can compare:

Suspend-ClusterNode

Old Syntax

Suspend-ClusterNode [[-Name] ] [[-Cluster] ]

New Syntax

Suspend-ClusterNode [[-Name] ] [[-Cluster] ] [[-TargetNode] ] [-Drain] [-ForceDrain] [-Wait] [-Confirm] [-WhatIf]

As we can see, the new syntax offers quite a few extra parameters over the old; the main ones to note are [TargetNode] and [Drain]. [TargetNode] allows us to specify where any resources should be moved to and [Drain] initiates the move of the resources/services. This allows for a much finer control over resources within a cluster during maintenance operations. With the new command it is really easy to perform the 3 node cluster maintenance outlined earlier. We suspend one node after the other, moving any resources they should have to another node of our choosing and can then resume the node after the maintenance has completed. If we now take a look at Resum-ClusterNode, we will see another level of brilliance that becomes available to us that further eases node maintenance work:

Resume-ClusterNode

Let’s compare old vs. new:

Old Syntax

Resume-ClusterNode [[-Name] ] [-Cluster ]

New Syntax

Resume-ClusterNode [[-Name] ] [-Cluster ] [[-Failback] {Immediate | NoFailback | Policy}]

Again, we can see that there is more to decide upon with the new syntax. When you resume a suspended node, you can decide what happens to the resources that were previously running on that node.

The parameter [Failback] has three options:

  • “Immediate” is pretty obvious and will immediately take control of the resources that were previously running on that node before suspension.
  • “NoFailback” will resume the node but leave all resources where they currently are – this is a good idea if you have already failed over to an updated node and don’t want another service outage in this maintenance period.
  • Finally, “Policy” would postpone any failback until a pre-defined failover timeframe is reached.

Once we see how Suspend-ClusterNode and Resume-ClusterNode have been extended, we can understand how the extensions open up better scripting control of clusters and their resources.  I have prepared a script that can be run in our maintenance window that will suspend a node and push all resources to another node.  The suspended node can then receive all Windows Updates and be rebooted if necessary and finally a script can be run to bring the node back online.  The same set of scripts are then run against the second node, reversing the flow back to the node we have just patched.  Using Suspend-ClusterNode and Resume-ClusterNode only reduces the overall code in the PowerShell scripts by a few more lines, but the complexity is reduced drastically.  This makes code maintenance easier and the code is just easier on the eye.

After seeing the improvement to my local code, I am certain that the extension to the PowerShell command-set was internally driven at Microsoft. Imagine how many clusters they have running that needed this extension to improve their automation. As far as I can see it, this is an obvious knock-on effect of their drive to “The Cloud” and a very positive one at that!

As always, think before running any scripts and please don’t blindly setup scripts to suspend and resume nodes in a cluster. This is an aid to your daily work and not a robot you should build to run any/all updates onto a cluster 🙂

Have fun!