Question Details

No question body available.

Tags

design database distributed-system redis chess

Answers (5)

Accepted Answer
September 8, 2025 Score: 20 Rep: 18,648 Quality: Expert Completeness: 30%

How many parallel games do you want to handle? Millions? Or more likely thousands? If millions, see how you can make money from this, and hire real experts rather than asking random people on Stack Exchange :-)

If thousands, I'd go for a much simpler architecture: forget about Redis if you're concerned about its availability. Instead, use sharded game servers where each game lives on exactly one server. A game state does not consume much space, so it should be easy to handle hundreds if not thousands of games per server. For each ongoing game, just record the moves in a file to enable recovery in case the game server goes down (handling that will take a short while, but so will likely any other availability solution).
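
A minimal sketch of that per-game move log, assuming a Node.js game server written in TypeScript; the `games/` directory, file naming, and `Move` shape are my own illustrations, not something the answer prescribes:

```typescript
// Append-only move log per game: write each move before acknowledging it,
// replay the file on restart to rebuild the in-memory state.
// Assumes a pre-existing "games/" directory on local disk.
import { appendFileSync, readFileSync, existsSync } from "node:fs";

interface Move {
  gameId: string;  // which game this move belongs to
  ply: number;     // half-move number
  uci: string;     // e.g. "e2e4"
}

function logMove(move: Move): void {
  appendFileSync(`games/${move.gameId}.log`, JSON.stringify(move) + "\n");
}

function recoverMoves(gameId: string): Move[] {
  const path = `games/${gameId}.log`;
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as Move);
}
```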

This approach will enable the whole move handling to happen within one server, minimizing latency considerably as you normally don't need to serialize/deserialize game states.

If you want to open source the move verifier while keeping the game server itself closed source, you could publish it as a library to either be linked into other projects or wrapped by a thin RPC layer if people want to keep it in a separate process.
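
For the "thin RPC layer" variant, something like the following could wrap the library; `verifyMove` stands in for whatever the open-sourced verifier exposes, and the port and JSON shape are purely illustrative:

```typescript
// Thin HTTP wrapper around the verifier library, for users who prefer a
// separate process over linking the library directly.
import { createServer } from "node:http";

// Assumed to come from the open-sourced verifier library.
declare function verifyMove(fen: string, uci: string): boolean;

createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => (body += chunk));
  req.on("end", () => {
    try {
      const { fen, uci } = JSON.parse(body);
      res.writeHead(200, { "Content-Type": "application/json" });
      res.end(JSON.stringify({ legal: verifyMove(fen, uci) }));
    } catch {
      res.writeHead(400).end();
    }
  });
}).listen(9090);
```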

September 8, 2025 Score: 10 Rep: 85,137 Quality: High Completeness: 70%

Really, only one of your challenges is challenging, and that is scalable, low-latency move communication.

Each player will have to have an open websocket or similar channel to receive push messages from the server. Open channels are a bottleneck so you will want to design to minimise their use.

  • Only have push channels open for bullet and blitz games (see the sketch after this list)
  • Pre-calculate valid moves on the client, so there is no round trip for legality checks
  • Send premoves to the server in advance, so it can respond immediately without involving the opponent's client
  • Have some server-side adaptive scaling and sharding for the open-channel connections, i.e. spin up more servers or use a managed cloud service.
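
As a rough sketch of what the push side might look like, here is a minimal WebSocket hub assuming Node.js and the `ws` package; authentication, reconnects, and sharding are omitted, and all names are illustrative:

```typescript
// One open websocket per player in a bullet/blitz game; the game logic calls
// pushMove() to push the opponent's move without the client polling.
import { WebSocketServer, WebSocket } from "ws";

const sockets = new Map<string, WebSocket>(); // playerId -> open channel

const wss = new WebSocketServer({ port: 8080 });
wss.on("connection", (ws, req) => {
  // A real server would take the player id from an authenticated token;
  // here it is read from the query string purely for illustration.
  const url = new URL(req.url ?? "/", "http://localhost");
  const playerId = url.searchParams.get("player") ?? "";
  sockets.set(playerId, ws);
  ws.on("close", () => sockets.delete(playerId));
});

function pushMove(playerId: string, moveUci: string): void {
  const ws = sockets.get(playerId);
  if (ws && ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({ type: "move", uci: moveUci }));
  }
}
```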

All the other queuing, databases, etc. should barely touch the limits of an out-of-the-box Postgres database.

Clarification/explanation of Push channel limitations.

A standard website will receive a request for a page from a client, send that page and close the connection. The connection is only open briefly, so 100s or even thousands of users can be served pages, while at any given time only a few network connections are open.

The downside is, if a client wants an update, it must initiate a network request to the server: "Have you any updates for me?"

If you want a Push message from the server the connection must be kept open so the client can immediately pick up on any new messages sent over it.

This means that your 100s or 1000s of users now have 100s or 1000s of open connections.

SSE is better than websockets, but you might find you hit limits sooner than you expect.
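
For comparison, an SSE endpoint using nothing but Node's built-in http module might look like the sketch below; note that every subscribed player still holds one open connection, which is exactly the limit described above (the route and names are illustrative):

```typescript
// Server-Sent Events variant: each subscribed player keeps one long-lived
// HTTP response open, over which the server writes move events.
import { createServer, ServerResponse } from "node:http";

const streams = new Map<string, ServerResponse>(); // playerId -> open response

createServer((req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost");
  if (url.pathname === "/events") {
    const playerId = url.searchParams.get("player") ?? "";
    res.writeHead(200, {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      Connection: "keep-alive",
    });
    streams.set(playerId, res);
    req.on("close", () => streams.delete(playerId));
  } else {
    res.writeHead(404).end();
  }
}).listen(8080);

function pushMoveSse(playerId: string, moveUci: string): void {
  streams.get(playerId)?.write(`event: move\ndata: ${JSON.stringify({ uci: moveUci })}\n\n`);
}
```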

https://www.linkedin.com/blog/engineering/archive/instant-messaging-at-linkedin-scaling-to-hundreds-of-thousands-

September 9, 2025 Score: 3 Rep: 4,368 Quality: Medium Completeness: 60%

You appear to have approached this by identifying requirements and then picking a standard product for each requirement. However, given the scale at which you are operating, you should have the engineering budget to go beyond standard products and leverage the specifics of your problem to simplify the design, reducing operational complexity and cost. I would therefore not start by picking products, but by picking a deployment architecture that ideally supports the requirements, and only then consider which products can help implement that architecture.

Requirements

Good job specifying the number of concurrent users and the maximum interaction frequency. However, the considerable variation of interaction frequency (some are playing blitz, others take their time with every move) means that we have only a vague idea of average interaction frequency, and how many requests per second the system will need to handle. In the following, I am going to assume the worst case, i.e. 1 million requests per second.

Also good job specifying the reliability requirements, though I would have liked a little more precision about what "the server dies" means. What's the extent of the damage? Is this just the server process restarting, or a permanent hardware failure? Also, how often does this happen, i.e. how many servers can "die" at the same time? And how often are you allowed to lose game states? You may also want to explore what makes the servers die, and reduce that, because failover will likely exceed your stringent latency limits and therefore be visible to users. For the remainder of this post, I will assume servers fail independently, relatively rarely, and totally.

Deployment Architecture

1 million concurrent users with low latency implies 1 million open network connections, which definitely requires horizontal scaling of the servers the clients are talking to.

Moreover, we may be confronted with up to 1 million requests per second. This probably requires horizontal scaling of everything needed for turn processing.

The main challenge is reliability in the presence of server failure, meaning that we need to replicate moves before acknowledging them. Since server failures are total, we can't replicate to a local file system, but must replicate to a different server, which either takes over if our server dies, or enables another server to take over. The former allows the replacement to be ready faster, since it already has all the data, so we should favor that design.

Therefore, a game server should replicate each move to k different game servers (which we call its designated alternates), and acknowledge only once they have all acknowledged. Moreover, the clients should know the identity of the designated alternates, and switch to those if the original server doesn't respond in time.

Since reliability is achieved by replication, data need not be stored durably, i.e. game servers are free to keep it in memory, where it can be accessed very quickly.

In designing the replication protocol, it's worth noting that chess moves are tiny, probably about 10 bytes, far smaller than the packet size of most networks. By batching replication messages (sending a single message with all moves that happened in the last, say, 0.5 seconds), we can make much more efficient use of the available bandwidth. To enable this, the designated alternates should be the same for all games mediated by a game server. It's also easy to piggyback a heartbeat signal on these messages, whose absence allows alternates to detect that a server went down with minimal overhead.
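
A sketch of how that batched, heartbeat-carrying replication loop could look; `sendToAlternate` is a placeholder for whatever transport is used, and the 500 ms window comes from the figure above:

```typescript
// Buffer moves and flush them to the k designated alternates every 500 ms.
// An empty batch still goes out, doubling as the heartbeat whose absence
// tells the alternates that this game server has died.
interface ReplicatedMove {
  gameId: string;
  ply: number;
  uci: string; // ~10 bytes per move, so batching amortises per-packet overhead
}

// Placeholder transport; assumed to resolve once the alternate has acknowledged.
declare function sendToAlternate(addr: string, payload: string): Promise<void>;

const alternates = ["alt-1:9000", "alt-2:9000"]; // k = 2 in this example
let pending: ReplicatedMove[] = [];

// The game logic calls this for every move it accepts.
function enqueueForReplication(move: ReplicatedMove): void {
  pending.push(move);
}

setInterval(async () => {
  const batch = { heartbeat: Date.now(), moves: pending };
  pending = [];
  // Only after all k alternates have acknowledged may the moves in this batch
  // be acknowledged to the players (that bookkeeping is omitted here).
  await Promise.all(
    alternates.map((addr) => sendToAlternate(addr, JSON.stringify(batch)))
  );
}, 500);
```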

Discovery and management of game servers is handled by lobby servers, where players meet and pair up to play chess. Since lobby servers are not involved in making moves (even if a game server has died), they can be far less numerous than game servers.

Comparison to your deployment architecture

In your architecture, you have separate high-availability clusters for NATS, web servers, Redis, the Move Verifier, and Postgres. Those are 5 different clusters, each with their own management tools, resulting in rather high operational complexity. Moreover, a request to process a move flows through 4 different clusters (NATS, web server, Redis, and Move Verifier), creating challenges for latency and significantly increasing overall bandwidth consumption even if all these channels support batching. (In particular, shipping the game state from Redis to the web server to the move verifier for every move is quite wasteful. I get that you want to make the move verifier stateless and reusable, and you can, but you should be content with a logical separation, such as a shared library, rather than requiring a separate deployment.)

In my architecture, the communication topology is greatly simplified: the client communicates directly with the game server, which communicates with k other game servers. Only a single cluster is involved once the game begins, minimizing the latency and bandwidth required.

September 9, 2025 Score: 2 Rep: 183 Quality: Low Completeness: 30%

I really doubt that you need such an infrastructure on a centralized server. Chess is, from the rules point of view, a very simple game and only 1 vs. 1.

Validating the allowed moves is very simple; it should be less effort than the code that handles the communication or displays the moves on the screen.
Why not keep the validation on every client?
It does not hurt to validate every move on both clients; that way a client can also ensure that it does not receive any invalid data.

As chess is a 1-vs.-1 game, there is no need for the complex synchronization of many-to-many connections with parallel inputs that would make a server worthwhile.
I don't see why you need a server to handle the game moves.
For a simple 1-to-1 connection you can let the clients connect directly to each other and play on their own.
If you like, you can still send a backup of the entire game history to the server every few seconds, but for that you don't require any fancy server performance or latency. The amount of data needed to save an entire game history is tiny and nothing to worry about; it is even small enough that a client can easily keep the history of many games.
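
A tiny sketch of what that optional periodic backup could look like; the `/backup` endpoint and payload are hypothetical, and this assumes a runtime with a global `fetch` (e.g. a browser or Node 18+):

```typescript
// Every few seconds, ship the (tiny) full move history to a central server.
// Latency does not matter here, so failures can simply be retried later.
const history: string[] = []; // UCI moves appended as the game progresses

async function backupGame(gameId: string): Promise<void> {
  await fetch("https://central.example/backup", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ gameId, moves: history }),
  });
}

setInterval(() => {
  backupGame("game-123").catch(() => { /* retry on the next tick */ });
}, 5000);
```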

My recommendation:
Let the clients communicate and play directly with each other, and use the server only to help the clients find each other before they establish a direct connection.
This way you remove the server as a bottleneck and don't have to worry about server performance or latency. A delay of several seconds to establish a new connection between two players is usually fine. I don't even see a real need to keep a backup of the game history on the server unless you want to allow spectators.

Edit: Thanks @kaya3 for pointing out that direct peer-to-peer connections may be problematic with firewall settings on the listening side, and that they reveal the IP addresses of the players, which may not be desired. These can be downsides of this approach.

September 10, 2025 Score: -2 Rep: 107 Quality: Low Completeness: 50%

Don't verify (calculate) every single move

Instead, have a hashtable/lookup of the most common valid moves and only call your verifier on a miss. The vast majority of moves will be valid and will be among the most common 100,000 moves. While it sounds daunting data-wise, chess databases are widely available, and/or you could add this feature after a few months of operation, by which point you would have the early-move data (which covers the most common moves), perhaps with bi-monthly updates to refine the cache population. To get a feel for why this would work, click through any opening analyzer and see how quickly the less popular moves drop to single-digit counts out of a million games. Depending on the numbers, it may make sense to stop consulting the cache either after a fixed number of moves or after the first miss on a valid move (because then the game state is in unpopular territory). Data mining might also suggest disabling it in more error-prone formats like 1|0 bullet (or using a different cache population for those).
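
A hedged sketch of that lookup-before-verify idea; the position key (a FEN string here, though a compact hash would do), the per-game "stop after the first miss" rule, and `fullMoveVerifier` are all assumptions layered on the description above:

```typescript
// Cache of (position, move) pairs known to be legal, pre-populated from game
// databases. The full verifier is only called on a miss, and once a game
// produces a legal move that misses the cache we stop consulting it for that
// game, since the position has left popular territory.
declare function fullMoveVerifier(positionFen: string, moveUci: string): boolean;

const commonLegalMoves = new Set<string>();     // entries shaped like `${fen}|${uci}`
const leftPopularTerritory = new Set<string>(); // game ids past their first miss

function isLegal(gameId: string, positionFen: string, moveUci: string): boolean {
  const key = `${positionFen}|${moveUci}`;
  if (!leftPopularTerritory.has(gameId) && commonLegalMoves.has(key)) {
    return true; // common, known-legal move: no verifier call needed
  }
  const legal = fullMoveVerifier(positionFen, moveUci);
  if (legal) {
    leftPopularTerritory.add(gameId); // first miss on a valid move: stop trying
  }
  return legal;
}
```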

Regarding server maintenance: for (shorter) time-limited formats I would simply allow the games to finish, and maybe implement a recover-game option for formats that can stretch on.

Regarding Redis dying: if you're running Redis as a cluster on high-availability servers, I would simply accept the remaining risk of a full-on failure. It is incredibly low, and guarding against it entails complexity and overhead that, from my perspective, is simply not worth it.

[Edit] There is no need for a centralised server, Redis or whatever

All you need centralised is a single matchmaking server (if even that) and a stat-pulling server to compile definitive scores, etc.

All the individual matches, i.e. anything in progress (and therefore latency-relevant), can run on any secondary server/instance, because for the ecosystem only the results of a game really matter (game histories would be nice too, but none of this has latency constraints). This just follows up on my "just use clusters" point above, but with a far less constrained setup. You could (in theory) spin up a Redis instance and server for each match, thereby limiting the losses on a crash to a single game. That likely won't be economical, but I hope you get the point.

Another benefit would be localised servers, so your server is always (somewhat) in the middle of both players latency-wise. Only after the game do you write the results back to the central servers, with almost no time constraints. This drastically lowers your main-server complexity and engineering effort, overall latency (except for cross-continent matches, I guess), as well as the loss on a Redis crash, because only the games running on that one instance are gone. You can tweak how many that could be via the scale of your server network, including moving tournament or other important matches to single-match instances with clustered failsafes, which may be too costly or complex for everyday operations.