Limiting Down Time by “Swapping Stacks” in AWS Lambda + Serverless

Justin Kruse
FloSports Engineering
May 1, 2019


Photo by Jarvik Joshi on Unsplash

At FloSports, we use nested stacks in AWS. Sometimes a resource needs to move from one stack to another. To do that, we have to delete the entire stack, because when CloudFormation validates the template changes it sees the moved resource as existing in two stacks.

Deleting a stack, however, can take anywhere from 15 minutes to a couple of hours. That much downtime is unacceptable for a production environment.

To prevent this, we create two versions of each environment, labeled something like prod1 and prod2. The prod2 version remains empty, so when we need to delete the primary environment we can deploy to prod2 first.

However, this presented another problem. There would be a window of time in which both stacks were active, so any Lambda responding to SNS, SQS, scheduled, or other events would run twice whenever such an event came through.

There were three things we needed to be able to do to address this issue:

  1. Disable the Lambdas in one stack and enable them in the other
  2. Point the API’s domain name to the active stack/stage
  3. Deploy a stack with its Lambdas disabled by default

To swap the stacks (flip which stack’s Lambdas are active and which are disabled), we wrote a script using the AWS JavaScript SDK. It gets each of our Lambdas, fetches each function’s policy (which holds its permissions), and changes the action in that policy to either “lambda:InvokeFunction” or “lambda:DisableInvokeFunction”. Once the policies are changed, the same script updates API Gateway to point to the correct stage (prod1 or prod2).
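The core of such a script might look like the sketch below. It is not our actual script: it assumes `lambda` and `apigateway` are AWS SDK for JavaScript v2 clients, that permissions are removed and re-added rather than edited, and that the custom domain uses a single root base path mapping. All function names here are made up for illustration.

```javascript
// Sketch only. `lambda` is assumed to be `new AWS.Lambda()` and `apigateway`
// `new AWS.APIGateway()` from the AWS SDK for JavaScript v2.
const ENABLED_ACTION = 'lambda:InvokeFunction';
const DISABLED_ACTION = 'lambda:DisableInvokeFunction';

// Turn one statement of a function policy into addPermission parameters,
// with the action switched to the enabled or disabled value.
function toPermissionParams(functionName, statement, enabled) {
  const principal = statement.Principal || {};
  const params = {
    FunctionName: functionName,
    StatementId: statement.Sid,
    Action: enabled ? ENABLED_ACTION : DISABLED_ACTION,
    Principal: typeof principal === 'string' ? principal : principal.Service || principal.AWS,
  };
  // Carry over the source ARN condition (for example the SNS topic), if present.
  const arnLike = (statement.Condition || {}).ArnLike || {};
  if (arnLike['AWS:SourceArn']) params.SourceArn = arnLike['AWS:SourceArn'];
  return params;
}

// Remove and re-add every permission on one function with the new action.
async function setFunctionEnabled(lambda, functionName, enabled) {
  const { Policy } = await lambda.getPolicy({ FunctionName: functionName }).promise();
  for (const statement of JSON.parse(Policy).Statement) {
    await lambda
      .removePermission({ FunctionName: functionName, StatementId: statement.Sid })
      .promise();
    await lambda.addPermission(toPermissionParams(functionName, statement, enabled)).promise();
  }
}

// Point the custom domain's root base path mapping at the active API and stage.
// Assumption: prod1 and prod2 may be separate REST APIs, so both values are replaced.
async function pointDomainAtStage(apigateway, domainName, restApiId, stage) {
  await apigateway
    .updateBasePathMapping({
      domainName,
      basePath: '(none)', // API Gateway's placeholder for an empty base path
      patchOperations: [
        { op: 'replace', path: '/restapiId', value: restApiId },
        { op: 'replace', path: '/stage', value: stage },
      ],
    })
    .promise();
}
```

A full swap would loop over the function names in each stack, calling `setFunctionEnabled(lambda, name, false)` for the outgoing stack and `setFunctionEnabled(lambda, name, true)` for the incoming one, then call `pointDomainAtStage` once.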

To accomplish disabling Lambdas by default, I wrote a small plugin for Serverless that runs at the end of the Serverless package lifecycle event (after:aws:package:finalize:mergeCustomProviderResources) and sets each Lambda’s policy action to the appropriate permission.
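A plugin along these lines could be as small as the sketch below. It is an illustration rather than the plugin we run: the class name and the `custom.lambdasEnabled` option are invented for the example, and it assumes the compiled template is available on `serverless.service.provider.compiledCloudFormationTemplate` when the hook fires.

```javascript
// Hypothetical Serverless plugin sketch. `custom.lambdasEnabled` is an
// invented option name; when it is missing, Lambdas are packaged disabled.
class DisableLambdasPlugin {
  constructor(serverless) {
    this.serverless = serverless;
    this.hooks = {
      // Runs after Serverless has merged all resources into the final template.
      'after:aws:package:finalize:mergeCustomProviderResources': () =>
        this.setPermissionActions(),
    };
  }

  setPermissionActions() {
    const service = this.serverless.service;
    const enabled = Boolean((service.custom || {}).lambdasEnabled);
    const action = enabled ? 'lambda:InvokeFunction' : 'lambda:DisableInvokeFunction';
    const resources = service.provider.compiledCloudFormationTemplate.Resources || {};

    // Each AWS::Lambda::Permission resource becomes one function policy statement.
    for (const resource of Object.values(resources)) {
      if (resource.Type === 'AWS::Lambda::Permission') {
        resource.Properties.Action = action;
      }
    }
  }
}

module.exports = DisableLambdasPlugin;
```

Note that this sketch rewrites every `AWS::Lambda::Permission` in the template, including ones granted to API Gateway; a real plugin may want to filter by principal.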

AWS Lambda Console, SNS triggered function

We ran into a few gotchas. The first is the one we always seem to run into: weak AWS documentation. In the AWS Lambda console, each function has an enabled/disabled toggle in the UI. However, this was not documented anywhere, much less how to change it programmatically.

Eventually we discovered that setting the action in the function policy to either “lambda:InvokeFunction” or “lambda:DisableInvokeFunction” is what sets that toggle to enabled or disabled. To view the function policy, look for the key icon just above the highlighted SNS box in the console (I couldn’t capture it in the screenshot above). Clicking it shows the policy and execution role.

Second, not all Lambda events have a function policy. Functions responding to SQS, Kinesis streams, and DynamoDB streams are instead driven by an event source mapping. The script had to be modified not only to update these mappings, but also to catch rejections for functions without a policy (a ResourceNotFoundException is thrown when requesting a policy that does not exist). I should also note that function policies cannot be updated; you must delete and then recreate them.
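Both cases can be handled roughly as in the sketch below. It again assumes an AWS SDK for JavaScript v2 Lambda client passed in as `lambda`; the helper names are made up for this example.

```javascript
// Sketch only. `lambda` is assumed to be an AWS SDK for JavaScript v2
// Lambda client (`new AWS.Lambda()`).

// True when the SDK error means the function has no policy at all.
// SDK v2 puts the error type on `code`; `name` is checked as a fallback.
function isMissingPolicy(err) {
  return (
    Boolean(err) &&
    (err.code === 'ResourceNotFoundException' || err.name === 'ResourceNotFoundException')
  );
}

// Return the policy statements for a function, or an empty list when the
// function has no policy (for example, one triggered only by SQS).
async function getPolicyStatements(lambda, functionName) {
  try {
    const { Policy } = await lambda.getPolicy({ FunctionName: functionName }).promise();
    return JSON.parse(Policy).Statement;
  } catch (err) {
    if (isMissingPolicy(err)) return [];
    throw err; // anything else is a real failure
  }
}

// Enable or disable every event source mapping (SQS, Kinesis, DynamoDB
// streams) attached to a function. Returns the UUIDs that were updated.
async function setEventSourceMappingsEnabled(lambda, functionName, enabled) {
  const updated = [];
  let Marker;
  do {
    const page = await lambda
      .listEventSourceMappings({ FunctionName: functionName, Marker })
      .promise();
    for (const mapping of page.EventSourceMappings || []) {
      await lambda.updateEventSourceMapping({ UUID: mapping.UUID, Enabled: enabled }).promise();
      updated.push(mapping.UUID);
    }
    Marker = page.NextMarker; // undefined once the last page has been read
  } while (Marker);
  return updated;
}
```

Unlike function policies, event source mappings can be toggled in place through their `Enabled` flag, so no delete-and-recreate step is needed for them.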

Finally, when running the policy swap script, our API calls via the JavaScript SDK would get throttled by AWS. We weren’t certain how or when this throttle would rear its ugly head, as it is not exactly documented how fast or how often you can send these requests. So we wait for a short timeout after each call, and if we hit the throttle, we retry after a longer timeout.
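That pacing logic can be sketched as a small wrapper around each SDK call. The timeout values, retry limit, and list of throttle error codes below are assumptions chosen for illustration, not the numbers our script uses.

```javascript
// Sketch only. Error codes AWS commonly uses for throttling; this list is an
// assumption and may not cover every service.
const THROTTLE_CODES = new Set(['ThrottlingException', 'TooManyRequestsException', 'Throttling']);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function isThrottle(err) {
  return Boolean(err) && THROTTLE_CODES.has(err.code || err.name);
}

// Run `fn`, pause briefly after a success, and retry after a longer pause
// when AWS throttles the request. `wait` is injectable so tests can skip
// real sleeping.
async function callWithPacing(fn, options = {}) {
  const { pauseMs = 200, retryPauseMs = 2000, maxRetries = 5, wait = sleep } = options;
  for (let attempt = 0; ; attempt += 1) {
    try {
      const result = await fn();
      await wait(pauseMs); // short timeout after every successful call
      return result;
    } catch (err) {
      // Give up on non-throttle errors or once the retry budget is spent.
      if (!isThrottle(err) || attempt >= maxRetries) throw err;
      await wait(retryPauseMs); // longer timeout before trying again
    }
  }
}
```

A call from the swap script would then be wrapped as, for example, `await callWithPacing(() => lambda.getPolicy({ FunctionName: name }).promise())`.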
