Hi everyone,
My name is Rob and I'm Technical Director of services for Rare. I'm here because I wanted to take some time to talk to you about the issues we had over the festive period (and this past weekend), and how we're working to prevent another issue of this nature.
As I'm sure you're all aware, Sea of Thieves has gone from strength to strength in 2020. We've launched on Steam, we've kept releasing monthly content updates, and the seas became a place for people to connect when they couldn't meet in person. This all culminated in an extraordinarily busy Holiday period. Our game proved extremely popular in the run-up to and throughout Christmas 2020, and the Holiday period was the most successful for the title since launch in terms of traffic.
Alongside everything else we shipped to Sea of Thieves in 2020, we also introduced campaigns, which allow us to schedule events for players to experience between updates.
Around 8pm on the 28th December 2020, the service that is responsible for tracking campaign progression began falling behind in processing the stream of events that are used to indicate player progression. As we passed through our peak daily player count, many millions of messages were waiting to be processed where ordinarily we would process them all immediately.
Given our popularity over this period, this was the first time that this service had experienced load at this level. As a result, it was taking longer and longer for the service to record and report completion of an event by the player. Ordinarily, we have several mitigations that we use to adjust the performance of a service in response to the load applied to it. However, in this case those mitigations had little-to-no effect on the number of events that the service processed, and because of this the queue of messages kept growing.
Throughout the hours and days that followed, our engineers shipped several performance updates to the affected service in an effort to resolve the incident or at least minimise the impact. Whilst these updates did make improvements, they were not sufficient to meet the demand being placed on the service, and the problem persisted.
As our analysis and incident response continued, it became apparent that no matter what changes we made, we kept hitting a performance ceiling, which meant that something else was actually limiting the amount of work that this service could perform. Eventually we managed to determine that an unrelated, downstream service was causing our events system to limit the amount of work that could be completed by the impacted service.
This unrelated service is a new service, still in development, that we were auditioning behind the scenes to test it under load ahead of releasing new functionality in 2021. The purpose of auditioning the service was to validate that it would perform in retail conditions. It had been deployed in late November 2020, long before we saw any issues, and our telemetry was giving no indication that it was struggling to keep up or that it was quietly applying back pressure to upstream services.
As the service that was causing the issue was only being auditioned and not actually in use by players yet, we disabled it and the impacted service immediately responded by clearing down the backlog and returning to normal operating performance. However, when we switched the service back on last week, we saw the same scenario unfold again despite the mitigations we had taken against it.
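To make the back pressure effect described above more concrete, here is a minimal Python sketch. It is a toy, tick-based model and not the actual Sea of Thieves services: the function name, the rates and the bounded buffer between the two services are all assumptions chosen purely for illustration. It assumes the upstream service can only handle an event when there is free space in the buffer in front of the downstream service.

```python
def simulate_backlog(ticks, arrivals, upstream_rate, downstream_rate,
                     buffer_size, downstream_enabled=True):
    """Return the upstream backlog after `ticks` steps of a toy pipeline.

    All names and numbers are hypothetical. Each tick, `arrivals` new events
    join the upstream backlog. The upstream service can handle up to
    `upstream_rate` events per tick, but each handled event must also be
    placed in a bounded buffer that the downstream service drains at
    `downstream_rate` events per tick.
    """
    backlog = 0   # events waiting for the upstream service
    buffered = 0  # events waiting in the downstream service's buffer
    for _ in range(ticks):
        backlog += arrivals
        if downstream_enabled:
            # The downstream service drains what it can from its buffer.
            buffered -= min(buffered, downstream_rate)
            # Back pressure: upstream may only handle as many events as
            # there is free space in the downstream buffer.
            room = buffer_size - buffered
            handled = min(backlog, upstream_rate, room)
            buffered += handled
        else:
            # With the downstream service switched off, only the upstream
            # service's own capacity limits throughput.
            handled = min(backlog, upstream_rate)
        backlog -= handled
    return backlog


with_downstream = simulate_backlog(100, 100, 150, 60, 200,
                                   downstream_enabled=True)
without_downstream = simulate_backlog(100, 100, 150, 60, 200,
                                      downstream_enabled=False)
print(with_downstream, without_downstream)
```

In this model the backlog keeps growing while the slow downstream service is enabled, even though the upstream service has spare capacity, and raising `upstream_rate` alone makes no difference. Switching the downstream service off removes the limit, which mirrors the behaviour described above.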
Here's how we are moving forward from this:
- Carrying out a retrospective and root cause analysis of this type of incident.
- Monitoring new services more closely and treating them with natural suspicion during an incident.
- Developing a better understanding and visibility of how services under pressure affect other services.
- Reviewing our architecture to break the chain of impact where one service can degrade another's performance.
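As one illustration of what breaking the chain of impact can look like in general, the sketch below hands events to an optional consumer without ever blocking the sender. This is an assumption about one common approach, not a description of the changes Rare is making; `forward_optional` and the buffer size are hypothetical names and values. It uses Python's standard `queue.Queue`, whose `put_nowait` raises `queue.Full` when a bounded queue has no free slot.

```python
import queue


def forward_optional(buffer: queue.Queue, event) -> bool:
    """Try to hand `event` to an optional consumer without blocking.

    Returns True if the event was buffered, False if the buffer was full
    and the event was skipped. The caller carries on either way, so a slow
    optional consumer cannot slow down the caller.
    """
    try:
        buffer.put_nowait(event)
        return True
    except queue.Full:
        return False


# Hypothetical bounded buffer in front of a service that is not yet live.
audition_buffer = queue.Queue(maxsize=2)
results = [forward_optional(audition_buffer, n) for n in range(3)]
print(results)
```

The trade-off in this design is that the optional consumer may miss events while it is overloaded, which is usually acceptable only for a service that players do not depend on yet.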
This is one of the highest-impact incidents we have had on Sea of Thieves since launch, and there's a lot to learn from what we experienced over this period. We know it wasn't a great period for Sea of Thieves players, and we're working hard to ensure the game's stability in future.