Metaplay’s Manifesto: Four Design Principles Guiding Our Scalable Game Backend Technology
Tuomo Jorri
Learn about the four design principles guiding Metaplay's cloud infrastructure design. Discover how stateful design, cherry-picked compute consumption, persisted state and data, and observability and openness contribute to scalable and cost-effective infrastructure for large games.
In previous articles, we broke down the differences between IaaS, PaaS, and SaaS, and their cost dynamics for your infrastructure stack.
Now it’s time to relate this back to our own infrastructure design at Metaplay.
From the outset, Metaplay’s objective has been to build an offering that scales to very large games. This means making technology choices that let us be as efficient as possible without making sacrifices in important areas.
Let’s dive into some concrete examples around what we've designed and why, and how it fits into the broader cost discussion. Here are our four core principles:
1. Stateful Design Works
Our backend design at Metaplay stands out for its stateful nature, a departure from the past decade’s prevailing trend toward stateless backends, favored for their simplicity.
In the context of gaming, we believe there are significant advantages to be gained from stateful designs. In practice, this approach employs an actor-based model for games, with individual game servers serving as the authoritative source of state once an actor (e.g., a player or guild) is woken up from the database.
Data is then periodically persisted back to the database to prevent data loss, while all actors interact with one another within the clustered game server backend.
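As an illustrative sketch only (Metaplay’s actual implementation is not shown here, and the `Database` class below is a hypothetical stand-in for a real key-value-style store), the wake-up and periodic-persistence lifecycle of such an actor might look like this:

```python
import json
import time


class Database:
    """Hypothetical stand-in for a key-value-style persistent store."""

    def __init__(self):
        self._rows = {}

    def load(self, key):
        return self._rows.get(key)

    def save(self, key, value):
        self._rows[key] = value


class PlayerActor:
    """Holds the authoritative in-memory state for one player.

    The actor reads the database exactly once, when it wakes up;
    after that, all reads and writes hit memory, and state is
    flushed back periodically to bound potential data loss.
    """

    PERSIST_INTERVAL = 30.0  # seconds between flushes (assumed value)

    def __init__(self, player_id, db):
        self.player_id = player_id
        self.db = db
        # Wake up: load persisted state once, on actor start.
        raw = db.load(player_id)
        self.state = json.loads(raw) if raw else {"gold": 0}
        self._last_persist = time.monotonic()

    def handle_action(self, action):
        # Mutate in-memory state; no database read is needed here.
        if action == "earn_gold":
            self.state["gold"] += 10
        self._maybe_persist()

    def _maybe_persist(self):
        # Periodically write state back so a crash loses little.
        now = time.monotonic()
        if now - self._last_persist >= self.PERSIST_INTERVAL:
            self.db.save(self.player_id, json.dumps(self.state))
            self._last_persist = now


db = Database()
db.save("player:1", json.dumps({"gold": 100}))
actor = PlayerActor("player:1", db)
actor.handle_action("earn_gold")
print(actor.state["gold"])  # 110
```

Note how the database is touched only on wake-up and at flush time; every in-between action is served from memory, which is what shifts the read/write ratio discussed below.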
Critics may argue that scaling becomes more complex with this approach. However, this design yields substantial efficiency gains and empowers us to create a more robust framework for long-running actors.
Efficiency gains are evident in our reduced reliance on the database, which now functions as a simplified key-value store. This eliminates the need for every game operation to query the database, significantly reducing the database workload.
For instance, instead of the typical 9:1 ratio of read to write operations, we often achieve a 1:1 ratio, eliminating the need for eight additional read replicas.
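To make the replica arithmetic concrete (with illustrative, assumed throughput numbers, not measurements), consider how many read replicas are needed beyond the writer node to absorb the read load at a given read/write ratio:

```python
import math

# Assumed illustrative numbers, not measurements.
writes_per_sec = 1_000
node_read_capacity = 1_000  # reads/sec a single node can serve


def read_replicas_needed(read_write_ratio):
    """Replicas needed beyond the writer to absorb the read load."""
    reads_per_sec = writes_per_sec * read_write_ratio
    # The writer itself can absorb up to node_read_capacity reads.
    extra = reads_per_sec - node_read_capacity
    return max(0, math.ceil(extra / node_read_capacity))


print(read_replicas_needed(9))  # 8 extra read replicas at a 9:1 ratio
print(read_replicas_needed(1))  # 0 at a 1:1 ratio
```

Under these assumptions, dropping from a 9:1 to a 1:1 ratio removes the eight extra read replicas entirely.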
To address scalability, we support the expansion of the database layer by sharding it into multiple database clusters. This ensures that we can scale database clusters without hitting capacity limits.
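A common way to implement this kind of sharding (a generic sketch, not necessarily Metaplay’s exact scheme; the cluster names are hypothetical) is to route each entity key to a database cluster via a stable hash:

```python
import hashlib

# Hypothetical cluster endpoints; a real deployment would list its own.
DB_CLUSTERS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(entity_key: str) -> str:
    """Pick a database cluster deterministically from the entity key.

    A stable hash (not Python's per-process randomized hash()) ensures
    the same player or guild always maps to the same shard, in every
    game server process.
    """
    digest = hashlib.sha256(entity_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(DB_CLUSTERS)
    return DB_CLUSTERS[index]


# The same key always lands on the same shard.
assert shard_for("player:12345") == shard_for("player:12345")
```

One caveat of a plain modulo scheme is that changing the cluster count moves data around; fixed virtual shards or consistent hashing are common ways to mitigate that.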
Furthermore, this design simplifies our architecture. When database performance becomes a concern, many opt for in-memory caching systems like Redis or Memcached. In our architecture, we can eliminate these components entirely. Our game server application layer ensures that actors always reflect the latest, up-to-date state of the game world, removing the need for additional caching layers.
2. Compute Consumption Should Be Cherry-Picked
Often the cheapest form of compute is available as IaaS. It is also the approach that offers the most flexibility and control (e.g., if you need UDP support, which AWS Lambda lacks, or direct access to the execution environment to take ad hoc memory dumps).
IaaS means that under the hood, the game servers are running on virtual machines. For operational reasons, however, we would rather not spend too much time and effort monitoring and observing the health of those virtual machines ourselves.
This is why, despite consuming IaaS virtual machines, we leave their orchestration and coordination to dedicated components like Karpenter, and all virtual machines are registered to Kubernetes.
The tradeoff here is the added complexity that Kubernetes brings to the table. Anyone who has worked with Kubernetes can attest to the platform’s constant evolution and the running required just to stay in place.
However, at the same time, Kubernetes is an increasingly compelling and extensible container orchestration platform, and allows us much easier and granular control over the stateful payloads that we run.
All the above leads us to a balancing act: we want the control and cost-efficiency of IaaS compute, but we will gladly consume the Kubernetes control plane as a PaaS offering in the form of Amazon Elastic Kubernetes Service (EKS).
Building on this, we outsource the control plane and its operations and updates to AWS for a fixed price of about 72 USD per month per cluster, gaining peace of mind in return.
But Why Not Lambda?
We occasionally get asked why we do not use AWS Lambda or similar serverless functions. By now, the stateful nature of our payloads should offer the main clue.
Additionally, because of the steady stream of actions processed by game backends and the potentially long-lived nature of actors, serverless functions tend not to be an ideal execution platform.
Lambda also requires at least 128 MB of memory per execution context and has traditionally had non-zero wakeup times. For our core payloads, AWS Lambda is therefore not an optimal solution.
3. Persisted State and Data Are Critical
We already alluded to it earlier, but in our simplified worldview the most important item in the backend is the game data, which in our case lives in MySQL.
In the same manner as with Kubernetes, managing MySQL clusters is something we would rather not do ourselves, and we suggest our customers avoid it as well.
So instead of IaaS, we take the PaaS approach here and leverage AWS’s Aurora MySQL backend.
This not only gives us peace of mind in knowing that it just works, but also provides tools that simplify operations like backups, recovery, and rollbacks.
This decision is further supported by the easy integration with AWS’s broader tooling ecosystem, such as AWS Backup, which also lets us easily back up other key resources in our stacks.
4. Observability and Openness Must Be Assured
Auxiliary systems like logs, metrics, and tracing can get fairly expensive fairly quickly when PaaS/SaaS offerings are used. Yet observability is a critical area, absolutely necessary when operating, developing, and debugging software.
Typically, more data at higher granularity directly improves a technical team’s ability to react to issues quickly. This guides our decision to prefer solutions that allow us to ingest larger amounts of data while keeping financial costs low.
We also recognize that observability tooling is an area where opinions run strong. This makes it a good example of our philosophy: find the right balance in how a set of tools is consumed, while leaving a side door open so that an opinionated customer can easily deactivate our defaults and move to something entirely different.
While our default infrastructure stack relies heavily on tools like Prometheus, Grafana, and Loki, we also make it easy to opt out of them and instead configure the data collectors (log collectors, metric scrapers) to send data to a third-party destination.
We realize that especially bigger organizations might already have an observability solution in place which their operations teams use, and in cases like these we want to make the integration as easy as possible.
Final takeaways
Metaplay's mobile game backend design principles prioritize efficiency and cost-effectiveness without compromising essential functionality. By embracing stateful designs, cherry-picking compute options, leveraging managed services, and ensuring observability, Metaplay delivers scalable solutions for very large games. These principles reflect our commitment to delivering high-quality infrastructure for the mobile gaming industry.
To learn more about server costs for mobile games, read our comprehensive guide.