Scaling for Traffic Surges: When auto-scaling is not fast enough

A year and a half ago we had a problem. Our application could not scale up quickly enough to handle real-time surges.

Jason Byrne
FloSports Engineering

--

FloSports is an online sports media company; live streaming is the core of our business. Therefore, it’s a really bad day when we fail on that very basic promise to our customers and go down during one of our live events.

That’s exactly the position we were in far too often a year and a half ago. Our systems were all PHP monoliths. Yes, we had auto-scaling groups that could deal with typical spikes in traffic. However, with live events you may go from 500 viewers to 8,000 viewers in a few minutes. Our automation simply could not bring up new servers quickly enough.

As a result, we set up weekly meetings to review the traffic forecast and try to stay ahead of it. How many servers do we think we’ll need? We might as well have been living back in 2010, losing much of the benefit of the modern cloud!

Not only was it exhausting and more expensive to keep these extra servers around, it was also error-prone, because sometimes we’d forecast incorrectly. The worst part was that these surges happen at precisely the most interesting part of the event, when everyone is coming online to watch and posting about it on social media!

So one afternoon one of our co-founders, Mark Floreani, called me to discuss this problem. He flung the door wide open to examine any crazy idea that would resolve the issue: “Jason, if I said ‘gun to your head, make it so we can’t go down’… how would you do it?”

First I reminded him that “can never go down” is a fairy tale; even the FANGs (Facebook, Amazon, Netflix, Google) go down sometimes. Then I explained how I’d scrap our entire infrastructure and start over. Thankfully, he did not balk at the suggestion, and I got to work architecting our new platform!

For many years we had worked with a small CDN partner who actually served us very well. Although our existing CDN network was not the biggest issue in the legacy stack, it needed to become a strength. So we augmented, and then replaced, the bare-metal edge network they had colocated for us in a dozen datacenters around the continent, partnering instead with Akamai, the most well-respected content delivery network on the planet.

The details of our integration with Akamai would be a blog post of its own. In short, before each event we create single-use primary and backup endpoints on Akamai. Our backend configures those as stream targets on our Wowza origins. We use an HTTP push method, rather than the more typical pull approach, to deliver our transcoded HLS renditions to the Akamai mid-origin.
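To make that per-event sequencing concrete, here is a minimal sketch of the flow in TypeScript. The helper names, URLs, and payload shapes are hypothetical stand-ins for illustration only; this is not our actual integration code, and neither Akamai’s nor Wowza’s real APIs are shown.

```typescript
// A minimal sketch of the per-event flow, using hypothetical helpers.

interface IngestEndpoints {
  primaryUrl: string; // single-use Akamai ingest endpoint (primary)
  backupUrl: string;  // single-use Akamai ingest endpoint (backup)
}

// Hypothetical stub: in reality this would call Akamai's provisioning API
// to create the single-use endpoints for this event.
async function provisionAkamaiEndpoints(eventId: string): Promise<IngestEndpoints> {
  return {
    primaryUrl: `https://primary-ingest.example.com/${eventId}`,
    backupUrl: `https://backup-ingest.example.com/${eventId}`,
  };
}

// Hypothetical stub: in reality this would call the Wowza origin's REST API
// to register the endpoints as push-publish stream targets.
async function addWowzaPushTargets(eventId: string, endpoints: IngestEndpoints): Promise<void> {
  console.log(`Registering push targets for ${eventId}`, endpoints);
}

// Before each event: create single-use endpoints, then configure the Wowza
// origin to push the transcoded HLS renditions to the Akamai mid-origin.
export async function prepareEventStream(eventId: string): Promise<void> {
  const endpoints = await provisionAkamaiEndpoints(eventId);
  await addWowzaPushTargets(eventId, endpoints);
}
```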

We threw our PHP front end, which was running on EC2 instances behind an ALB, out the window and created a pure HTML/CSS/JS solution based on Backbone. Now we have a purely client-side MVC that is hosted on S3, served by CloudFront, and can stand up to any surge.
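For a sense of what a purely client-side MVC looks like in this setup, here is a minimal Backbone sketch in TypeScript (using @types/backbone). The route, view, and element names are illustrative placeholders, not our actual front end.

```typescript
import * as Backbone from 'backbone';

// All markup, styles, and scripts ship as static assets from S3 via
// CloudFront; only the data for a given event is fetched at runtime.
class EventPageView extends Backbone.View<Backbone.Model> {
  render(): this {
    this.$el.html('<h1>Loading live event…</h1>');
    return this;
  }
}

class AppRouter extends Backbone.Router {
  constructor() {
    // Passing routes as constructor options keeps route binding simple.
    super({ routes: { 'events/:id': 'showEvent' } });
  }

  showEvent(eventId: string): void {
    // eventId would drive a fetch against the dynamic API for this event.
    new EventPageView({ el: '#app' }).render();
  }
}

new AppRouter();
Backbone.history.start({ pushState: true });
```

Because every byte of the front end is a static asset, the CDN can keep serving it from cache no matter how quickly the audience grows.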

The final leg was the most difficult: the dynamic portion of our site. Our API does a whole lot for us, including plenty of database queries and interactions with third-party services. This was another PHP application, running on EC2 instances, that we kicked into the sun.

Although it was an unknown commodity to most developers (especially at the time), it was clear to me that the Serverless framework with API Gateway and Lambda was the way to go. AWS seamlessly scales out for any influx of traffic, and we pay only for the compute we actually use. No excess capacity sitting around just in case!
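As an illustration of the pattern (not our actual Live API code), here is roughly what one of these functions looks like: a TypeScript handler behind API Gateway, with hypothetical serverless.yml wiring shown as a comment.

```typescript
// Illustrative only; handler names, routes, and responses are placeholders.
// The function would be wired to API Gateway with a serverless.yml entry
// roughly like:
//
//   functions:
//     getEvent:
//       handler: src/handlers/getEvent.handler
//       events:
//         - http:
//             path: events/{id}
//             method: get
//
import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  const id = event.pathParameters?.id;

  if (!id) {
    return {
      statusCode: 400,
      body: JSON.stringify({ message: 'Missing event id' }),
    };
  }

  // Database lookups and third-party calls would go here. Each concurrent
  // request gets its own Lambda invocation, so surges scale out automatically.
  return {
    statusCode: 200,
    body: JSON.stringify({ id, status: 'live' }),
  };
};
```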

The problem was that none of our developers had ever worked with Serverless before; not only that, our “typeof” Node experience was “undefined.” Thankfully we work with an amazing team of engineers, managers, and dev leads who bought into the vision, were not intimidated by learning a new paradigm, and passionately embraced the challenge.

Today we have transitioned 100% of our live streams to this platform. Looking back over this journey of the last year and a half, I could not be more proud of our team. It has not been without some bumps and bruises, starts and stops, and a lot of iteration… but they have executed this vision amazingly.

Initially, our new Live API was written in JavaScript to run on Node-based Lambda functions. After six months, knowing we could do better, our team took it upon themselves to port the entire codebase to TypeScript. They also maintain near-100% test coverage and a well-oiled deployment process that is simple yet controlled, so we can feel confident in our releases going into each full weekend of action-packed sporting events!
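In that spirit, here is an illustrative Jest test in TypeScript against the hypothetical handler sketched above; the paths and cases are examples of the style, not our actual test suite.

```typescript
// Illustrative Jest tests for the hypothetical getEvent handler above.
import { APIGatewayProxyEvent } from 'aws-lambda';
import { handler } from '../src/handlers/getEvent';

describe('getEvent handler', () => {
  it('returns 400 when the event id is missing', async () => {
    const event = { pathParameters: null } as unknown as APIGatewayProxyEvent;
    const response = await handler(event);
    expect(response.statusCode).toBe(400);
  });

  it('returns the requested event when an id is provided', async () => {
    const event = { pathParameters: { id: '123' } } as unknown as APIGatewayProxyEvent;
    const response = await handler(event);
    expect(response.statusCode).toBe(200);
    expect(JSON.parse(response.body).id).toBe('123');
  });
});
```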

Our fans may never realize all that we have invested. Or know the blood, sweat and tears poured into this new system. It’s one of those things that is hair-on-fire when it fails, but taken for granted when it works. The rest of the company may never fully appreciate the bleeding edge technology employed.

But… I will be forever grateful for a management team that gave us the green light to make this big bet. I am so thrilled to work with this fantastic team of product managers, engineers, QA and DevOps that can take these half-baked speeches, network diagrams and Confluence docs and execute them so beautifully. You fleshed out the vision, added your own talent and creativity, and took it above and beyond.

Congratulations, team! You guys and gals made it happen. Here’s to a fantastic 2019 at FloSports! Continue to embrace challenges and raise the bar.
