Monday 27th August 2007 will, in the future, be euphamistically referred to as "not one of my better days". I arrived this morning to find that all of the scheduled ETL jobs on my current project that had been running without problems for weeks had all been failing since Saturday morning. Not only that, they failed on all environments (dev, test, UAT, live) in exactly the same place with exactly the same error message:
The script threw an exception: A deadlock was detected while trying to lock variables <variable-list> for read access and variables <variable-list> for read/write access. A lock cannot be acquired after 16 attempts. The locks timed out. Task <task-name> failed.
Ho hum. This wasn't going to be a good day at all!
I became suspicious when I realised that they had all failed in exactly the same place and for the same reason and yet they'd been working just fine for ages. Something MUST have changed on those servers inbetween Friday and Saturday morning and sure enough after checking with my trusty server admin guy it turned out that 5 patches has been applied to all of our servers on Friday evening. Here's the details as best I can discern:
Windows 2003 Post-SP2 Hotfix - MS07-31/935840 W2K3 Server
Windows 2003 Post-SP2 Hotfix - MS07-34/929123 W2K3 Server
Windows 2003 Post-SP2 Hotfix - MS07-039/926122 W2K3 Server
Windows 2003 Post-SP2 Hotfix - KB924054 W2K3 Server
2.0 IE Update W2K3 Server
Note that the links are the best I've been able to find. They are not official links on what these patches actually are - I am trying to find out more. I don't know which of these updates (if any) has caused the problem.
[N.B. Don't talk to me about the folly of installing updates onto production servers at the same time as dev servers - you're preaching to the converted. Unfortunately I don't have any input into that particular policy.]
So once I had identified why all my ETL jobs had suddenly started failing I had to work out what I could do to fix it. The fraction of you that read my article "Real Life SSIS Project Experiences" in the November 2006 edition of SQL Server Standard will know that the error above is caused by script tasks and script components locking variables within the package which means they are inaccessible to concurrently executing tasks. There are two things you need to do to mitigate the risk of (not prevent) this happening:
- Never ever ever use the ReadOnlyVariables/ReadWriteVariables properties of the script task/component to lock variables. Instead, lock them within the code; this will mean that the locks are taken for a much smaller length of time. I have some example code showing how to do this here.
- Install the post-SP1 hotfix. [Get more information about SSIS service packs here.] Previous to this hotfix SSIS would always place a write lock on a variable even if all you wanted to do was read it.
I and my team have spent most of today diagnosing and making changes to all our packages. Tomorrow we'll be testing the changes (which are basically to eradicate the use of the ReadOnlyVariables/ReadWriteVariables properties that we foolishly still have lying around in our template package) so I'll know more then about whether these worked or not. Early signs are positive.
I have emailed the SSIS product team within Microsoft so they are aware of the problem and are investigating. Watch this space!
-Jamie
UPDATE: I've just remembered this blog post from Daniel Read where he provides some code that makes locking of variables a breeze. What I really like about Daniel's code is that it locks the variable, does what it needs to do, and then unlocks it again immediately and in the context of the blog entry you're reading right now that is really important.