The errors of a design flaw

GetDealt will be live any day now. That means it is safe to admit that a few design flaws will still be there when it launches. (No major flaws, mind you; hindsight is just 20/20.) One flaw has to do with the transition mechanism in the state machine engine.

The game is designed around a state machine implementation (see "State Machines" on Wikipedia) that restricts actions and transitions at the state level instead of the machine level. When it was being designed, it did not occur to me that the game would require automated transitions from one state to another. Once that was discovered, a daemon feature was added which moves a game from state to state over time (i.e., move to state X, wait 3 seconds, move to state Y, wait 1 second, move to state Z). [The next iteration will have a replacement for that mechanism.]
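To make the idea concrete, here is a minimal Python sketch of a state machine where each state, not the machine, decides which transitions are allowed, plus a daemon-style loop that walks a timed schedule. This is not the GetDealt code: the class names, the "Start" state, and the `run_schedule` helper are all assumptions made up for illustration. Only the X, Y, Z sequence and its delays come from the example above.

```python
import time


class InvalidTransition(Exception):
    """Raised when the current state does not allow the requested move."""


class State:
    """A state that owns its own list of permitted next states."""

    def __init__(self, name, allowed_next=()):
        self.name = name
        self.allowed_next = frozenset(allowed_next)

    def can_move_to(self, target):
        return target in self.allowed_next


class Game:
    def __init__(self, states, start):
        self.states = {state.name: state for state in states}
        self.current = self.states[start]

    def transition(self, target):
        # The machine only delegates; the current state decides.
        if not self.current.can_move_to(target):
            raise InvalidTransition(f"{self.current.name} -> {target}")
        self.current = self.states[target]


def run_schedule(game, schedule, sleep=time.sleep):
    """Daemon-style loop: wait `delay` seconds, then move to `target`."""
    for delay, target in schedule:
        sleep(delay)
        game.transition(target)


def build_demo_game():
    # Hypothetical states: "Start" is invented; X, Y, Z mirror the text.
    states = [State("Start", ["X"]), State("X", ["Y"]),
              State("Y", ["Z"]), State("Z")]
    return Game(states, "Start")


# Move to X, wait 3 seconds, move to Y, wait 1 second, move to Z.
SCHEDULE = [(0, "X"), (3, "Y"), (1, "Z")]
```

Calling `run_schedule(build_demo_game(), SCHEDULE)` would step the demo game through X, Y, and Z with the pauses described. Note that the daemon goes through the same `transition` call a user action would, so an invalid automated move raises the same exception as an invalid user move.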

There are many ways to achieve high availability for real-time applications, all of which depend heavily on the budget allocated for them. For us, it meant multiple load-balanced nodes for every layer. When you connect to the game server called “alpha”, you are really connecting to one of at least two computers (likely 3 or 4). Each of these computers will eventually spin up its own game engine to process your requests. That tiny detail is where the design flaw takes shape: it is possible for multiple computers to “automatically move the game from state to state”.

Since multiple computers will be trying to move the game from one state to another, it is highly probable that the game itself will get corrupted, and we can’t have the game getting corrupted! This, however, is not a new problem in computing. In fact, it is a very easy thing to fix with a dash of synchronization code, a semaphore, and maybe a little coffee. So, you might ask, if figuring out how to synchronize the game wasn’t a big deal, then what was? To answer swiftly: errors.

Not everyone logs every error that occurs in their application. And of the people that do, not many process every error that occurs. Most simply wait for a user complaint, then bring up the logs related to the event and use them to isolate the issue. I am the type that looks at every error that occurs, with or without a user complaint.

Sneak peek at the GetDealt error console


The problem I was having while browsing the logged errors was the constant stream of errors when multiple daemons were trying to move the game forward. Server 1 would start to move it forward, then not be able to acquire the most recent version, because it was changing too fast, and throw an exception. With one game, this is not an issue. With 1000 games running in the load test, this was a huge issue. How can one possibly find a “real” error amongst so many “expected” errors?
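One plausible way such "expected" errors arise is a version check on save: each server reads the game along with a version number, and a save is rejected if someone else saved first. The Python sketch below is only an assumption about the general shape of the problem, not a description of how GetDealt stores games. `GameStore`, `StaleVersionError`, and the in-memory storage are hypothetical stand-ins for whatever shared storage the servers actually use.

```python
import threading


class StaleVersionError(Exception):
    """The game changed after this worker read it."""


class GameStore:
    """In-memory stand-in for the storage shared by all game servers."""

    def __init__(self, state):
        self._lock = threading.Lock()
        self._state = state
        self._version = 0

    def load(self):
        # Return the state together with the version it was read at.
        with self._lock:
            return self._state, self._version

    def save(self, new_state, expected_version):
        # Reject the write if another worker saved since our read.
        with self._lock:
            if expected_version != self._version:
                raise StaleVersionError(
                    f"expected v{expected_version}, "
                    f"store is at v{self._version}"
                )
            self._state = new_state
            self._version += 1
            return self._version
```

If two daemons both load the game at version 0 and both try to save the move to state Y, the first save succeeds and the second raises `StaleVersionError`. Nothing is corrupted, but the losing daemon still produces an exception, and with 1000 games that adds up quickly.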

You see, there is no difference to the game engine between an “automated” and a “standard” transition. If a transition is attempted that is invalid, an exception is raised. Period. The error handler logs it, re-writes it, and continues on. Adding a mechanism to the state machine that took into account external contextual environment variables seemed like a hack at best, and a design taboo at worst. That is when I came up with an out-of-the-box solution: named threads.

The error handler is really the problem, not the game engine. The error handler needed to understand the difference in weight between an error during a daemon attempt, and an error during a user attempt. Daemons always execute on a thread that is not authenticated by the requestor. A background thread, to be exact. All that needed to be done was name the threads by their intent.
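As a rough illustration of the idea (not the actual GetDealt implementation), here is a Python sketch where each thread is named by its intent and the error handler reads that name to decide how much weight an error deserves. The `daemon:` and `user:` prefixes, the `ErrorHandler` class, and the helper functions are all hypothetical; `threading.Thread(name=...)` and `threading.current_thread().name` are standard library calls.

```python
import threading

# Hypothetical naming convention: the prefix records the thread's intent.
DAEMON_PREFIX = "daemon:"
USER_PREFIX = "user:"


def start_named(prefix, label, target, *args):
    """Start a thread whose name says who asked for the work."""
    thread = threading.Thread(target=target, args=args,
                              name=f"{prefix}{label}")
    thread.start()
    return thread


class ErrorHandler:
    """Sorts errors by the name of the thread that raised them."""

    def __init__(self):
        self._lock = threading.Lock()
        self.expected = []    # errors raised on daemon threads
        self.unexpected = []  # everything else, including user threads

    def handle(self, exc):
        origin = threading.current_thread().name
        with self._lock:
            if origin.startswith(DAEMON_PREFIX):
                self.expected.append((origin, exc))
            else:
                self.unexpected.append((origin, exc))


def attempt_transition(handler, should_fail):
    """Stand-in for a game transition that may be rejected."""
    try:
        if should_fail:
            raise RuntimeError("invalid transition")
    except RuntimeError as exc:
        handler.handle(exc)
```

Starting one thread with `start_named(DAEMON_PREFIX, "game-1", attempt_transition, handler, True)` and another with `USER_PREFIX` puts the first error in `handler.expected` and the second in `handler.unexpected`. The game engine itself stays unaware of the distinction. This sketch filters on the thread name alone; a stricter filter might also check the exception type so that a genuinely surprising daemon failure is not hidden.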

Voila! There is now a mechanism to determine the difference between a daemon-created thread and a user-created thread. Add in a filter to the error handler and our problem is solved. Instead of 1000 expected errors, there are 30 unexpected errors. The design flaw that caused all of this, however, is not so easily solved.
