
Architecting For Dedicated Game Servers With Unreal, Part 1

In the age of cloud infrastructure, dedicated servers are being chosen over peer-to-peer with increasing frequency for multiplayer games. You can read more about the pros and cons of each model elsewhere (e.g. here). What gets discussed less frequently are the many traps developers can fall into once they've committed to integrating dedicated servers into their game's online platform. I have fallen into many of these traps over the years, and I'm sharing Part 1 here to save others from the same fate. Many years ago, while working on The Maestros, an indie game built in Unreal Engine, my team was faced with incorporating dedicated servers into our back-end services ecosystem. The issues we faced were the same ones I would later see in games using Unity, Unreal, or custom engines. In Part 2, I'll discuss the major choices you'll face when running dedicated servers for your own game: datacenter vs cloud, bare metal vs VMs vs containers, and so on.

Getting Into a Game - The Flow

Let me briefly show you how a player gets into a game of The Maestros, so you have a sense of what we are discussing:

1 - Create a lobby
2 - Players join and choose characters
3 - Wait for a game server to start
4 - Join the game server

Phase 1: Make It Work

We were clear about what we wanted, so we began building on our tech stack of choice: Node.js, Windows (required for Unreal at the time), and Microsoft Azure cloud VMs. First, the maestros.exe executable on a player's machine made HTTP calls to a Node.js web service called Lobbies. These requests would create or join a lobby and choose characters. When all the players were connected and ready, the Lobbies service made an HTTP call to another Node.js service called the Game Allocator, which would start another process, the Game Server, on its own VM. In Unreal, a game server is just another maestros.exe process launched with some special parameters, like so:

maestros.exe /Game/Maps/MyMap -server

Our Game Allocator then watched for the Game Server to complete startup by searching the Game Server's logs for a string like "Initializing Game Engine Completed". When it saw the startup string, the Game Allocator would send a message back to the Lobbies service, which would then pass the Game Server's IP & port along to players. Players, in turn, would connect to the Game Server, emulating the open 192.168.1.99 you might type into the Unreal console.
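As a rough sketch of what this first pass might have looked like, here is a Node.js Game Allocator spawning the Unreal server process and polling its log file for the startup string. The log path, polling interval, and helper names are hypothetical; the real service carried more bookkeeping around it.

    // Hypothetical sketch of the first-pass Game Allocator: spawn the server,
    // then infer readiness by scraping its log file.
    const { spawn } = require('child_process');
    const fs = require('fs');

    const STARTUP_STRING = 'Initializing Game Engine Completed';

    function startGameServer(map, port, onReady) {
      // Launch the Unreal dedicated server as a child process.
      const proc = spawn('maestros.exe', [`/Game/Maps/${map}`, '-server', `-port=${port}`]);

      // Poll the log file until the engine reports it has finished initializing.
      const logPath = `Logs/Server_${port}.log`; // hypothetical log location
      const timer = setInterval(() => {
        if (!fs.existsSync(logPath)) return;
        const contents = fs.readFileSync(logPath, 'utf8');
        if (contents.includes(STARTUP_STRING)) {
          clearInterval(timer);
          onReady({ port, pid: proc.pid }); // report back so Lobbies can hand out the IP & port
        }
      }, 1000);

      proc.on('exit', () => clearInterval(timer));
      return proc;
    }

Note how much of this works by observing the process from the outside; that indirection is exactly what comes back to bite us in Phase 3.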
Phase 2: Scaling Up

At this point, we could actually play a game. With a couple more lines of JavaScript, our Game Allocator was also able to manage multiple Game Server processes simultaneously on its VM. Eventually, though, we would need to run more game server processes than one VM could handle, and we wanted the redundancy of multiple game server VMs as well. So we ran multiple VMs whose Game Allocators would periodically report their status to the Lobbies service, and the Lobbies code would then determine the best Game Allocator on which to create a new game.

Phase 3: Software Bug Fixing

This architecture worked and served us well through many years of development. It's also similar to how many developers implement game server allocation on their first try. Unfortunately, it's plagued with problems. For The Maestros, we kept running into issues that required us to intervene manually. Despite our cleverest code, we dealt with Game Server processes that never exited, game server VMs getting overloaded, and games being assigned to VMs that were in a bad state or even shut down (Azure performs regular rolling maintenance). Our engineers would have to manually kill the game instances, restart them, or restart the entire VM. These headaches have been reported on many different games, so let's examine the root causes.

The first problem is that starting new processes is messy. Unreal is a slow process that loads a lot from disk, and any process can fail for a variety of reasons (e.g. insufficient RAM, CPU, or disk). There is no structural fix for this other than testing extensively and writing the best code we can.

The second problem is that we were constantly observing these processes from a distance. To tell that a Game Server process had completed startup, we read its logs (yuck). To detect that a game had ended, we used Node to parse the output of wmic commands (God save our souls). Even more problematic, Lobbies makes the decisions about which game server VMs can handle a new game, yet it is a separate process on a separate VM, and even in the most ideal case its decisions take several milliseconds to act on. If your heart rate hasn't risen to a dangerous level by this point, you haven't dealt with networked race conditions before. Even if the Game Allocator parsed the OS information about a Game Server process correctly, the Game Server's state could change before the Game Allocator acted on it. What's more, even if the Game Server's state didn't change before the Game Allocator reported it to Lobbies, the game server VM could get shut down by Azure while Lobbies was assigning it a game. Scaling up the Lobbies service for redundancy would only make things worse: multiple Lobbies instances could each assign games to the same Game Allocator without noticing each other's games, overloading the machine.

We tried fixes for a couple of months, but we couldn't beat the race conditions until we changed our thinking. The breakthrough came when we put decision-making power in the hands of the process with the best information. When it came to game startup, the Game Server process had the best information about when it was done initializing, so we let the Game Server tell the Game Allocator it was ready (via a local HTTP call) instead of snooping through its logs. When it came to determining whether a game server VM was ready to accept new games, the Game Allocator had the best information, so Lobbies placed a game-start task on a message queue (RabbitMQ), and a Game Allocator pulled tasks off the queue when it was ready, instead of being told what to do by another process acting on out-of-date information.
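To make the inversion concrete, here is a minimal sketch of a Game Allocator along those lines, assuming hypothetical route, queue, and port names and using the amqplib RabbitMQ client: the Game Server announces its own readiness with a local HTTP call, and the allocator pulls game-start tasks from the queue only when it actually has headroom.

    // Hypothetical sketch of the Game Allocator after inverting control.
    const express = require('express');
    const amqp = require('amqplib');

    const MAX_GAMES = 4;   // hypothetical per-VM capacity
    let activeGames = 0;   // incremented on spawn; decremented when the process exits (omitted)

    // Spawns maestros.exe <map> -server, as in the Phase 1 sketch (stubbed here).
    function startGameServer(task) {
      activeGames += 1;
      /* child_process.spawn('maestros.exe', [...]) */
    }

    // 1) The Game Server reports its own readiness with a local HTTP call,
    //    so nobody has to scrape logs to infer startup.
    const app = express();
    app.use(express.json());
    app.post('/game-ready', (req, res) => {
      const { gameId, port } = req.body;
      // ...notify Lobbies that game <gameId> is reachable at <public IP>:<port>...
      res.sendStatus(200);
    });
    app.listen(8080); // reachable only from localhost in practice

    // 2) The allocator pulls game-start tasks off a RabbitMQ queue when it has
    //    capacity, instead of being handed games by a remote service acting on
    //    stale information.
    async function pollForGameStartTasks() {
      const conn = await amqp.connect('amqp://localhost');
      const channel = await conn.createChannel();
      await channel.assertQueue('game-start', { durable: true });

      setInterval(async () => {
        if (activeGames >= MAX_GAMES) return;             // no headroom: don't pull
        const msg = await channel.get('game-start', { noAck: false });
        if (!msg) return;                                 // queue is empty
        startGameServer(JSON.parse(msg.content.toString()));
        channel.ack(msg);
      }, 1000);
    }

    pollForGameStartTasks();

Pulling work this way (rather than having games pushed at the allocator) keeps the capacity check and the decision to take on a game inside the one process that actually knows its own load.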
With this design, we were able to add multiple Lobbies instances without worrying about race conditions, and manual intervention on game servers dropped from weekly to a couple of times a year.

Phase 4: Bug Fixing in the... Hardware

The next problem we noticed was a nastier one. During our regular Monday night playtests, we saw great performance from our game servers: units were responsive and hitching was rare. When we playtested with alpha testers on weekends, however, hitching and latency were unacceptable. Our investigations found that packets weren't making it to our clients, even those with strong connections. The Maestros is fairly bandwidth-intensive, but according to their specifications, our Azure virtual machines should have had plenty of CPU and bandwidth to spare. We optimized wherever we could, only to have the problem resurface in the next weekend's playtest.

The only thing that seemed to eliminate the issue completely was using huge VMs that promised 10x the bandwidth we needed, and those were vastly less cost-efficient on a per-game basis than a handful of small/medium instances. Over time we grew suspicious. What differed between our regular playtests and our external playtests wasn't location or hardware (devs participated in both tests); it was the timing. We played during development testing hours, while alpha tests were always scheduled for peak times to attract testers. More poking and prodding seemed to confirm the correlation. Our hypothesis was that our VMs' network was not performing as advertised when traffic in the datacenter became heavy, presumably because other tenants were saturating it. This is the often-discussed "noisy neighbor" problem. Many argue it doesn't really matter, because you can dynamically allocate more servers, and because providers like Microsoft Azure mitigate it by overprovisioning. Unfortunately, neither answer works for our Unreal game servers, which are single processes with latency-sensitive network traffic: they can't be distributed across machines, and they certainly can't be interrupted mid-game.

With plenty of evidence but little means of confirming it, we decided to run a test. We bought unmanaged, bare metal servers from a provider and ran them alongside Azure VMs during peak-time playtests. The bare metal games ran smoothly despite doubled latency (from 40ms to 80ms), whereas the Azure VMs had near-unplayable lag. The switchover seemed inevitable, but it came with pros and cons. For one, there was a full-day turnaround on getting new servers from our provider, so going all in on bare metal would cost us the ability to scale up quickly to meet demand. On the other hand, bare metal was already cost-effective, at roughly 50% of the cost per game. We decided to provision enough bare metal servers to support our daily load, and to use larger, more costly Azure VMs when we needed more capacity.
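For illustration only, the resulting capacity logic might look something like the sketch below, with entirely hypothetical host names, pool sizes, and a made-up provisioning helper: fill the bare metal pool that covers daily load first, and burst onto Azure VMs only once it is exhausted.

    // Hypothetical illustration of the bare-metal-first, cloud-burst strategy.
    const bareMetalPool = [
      { host: 'bm-01.example.net', maxGames: 8, runningGames: 0 },
      { host: 'bm-02.example.net', maxGames: 8, runningGames: 0 },
    ];
    const cloudBurstPool = []; // larger Azure VMs, spun up only when needed

    // Hypothetical stand-in for whatever provisions a new Azure VM.
    function provisionAzureVm() {
      return Promise.resolve({ host: 'azure-xx.example.net', maxGames: 4, runningGames: 0 });
    }

    function pickServerForNewGame() {
      // Prefer the fixed bare metal servers that cover our daily load.
      const server =
        bareMetalPool.find((s) => s.runningGames < s.maxGames) ||
        cloudBurstPool.find((s) => s.runningGames < s.maxGames);
      if (server) return server;

      // Everything is full: provision a (more expensive) Azure VM and let the
      // game-start task wait on the queue until it comes online.
      provisionAzureVm().then((vm) => cloudBurstPool.push(vm));
      return null;
    }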

Conclusion and Future Topics

I hope our story helps you and other developers looking to use dedicated servers for your games. In Part 2, I'll discuss the trade-offs in cost, maintenance, and complexity among the major choices for dedicated game server architectures, including datacenters vs cloud, containers vs VMs, and existing solutions like Google's new container-based dedicated server offering, Agones.