
SOE System Architecture

#1 Feb 16 2004 at 10:38 PM Rating: Good
**
405 posts
I'm doing this memo without spell checker, so please be kind.

Last night was not one of SOE's more impressive moments. Around 7:30pm EST tons of users started to complain about 1017 errors when logging in or when a character tried to zone. Like everyone else, I entered chat and had the golden opportunity to "talk tech" with one of the GMs handling the crisis. Since I had absolutely nothing better to do, we got into a little discussion about how the SOE systems operate. Although my understanding of their systems is far from complete, I did find the conversation fascinating and thought I would share a few details.

It may surprise some that each named server is not actually a single server, but rather a federation of physical servers (yes, "federation" is an actual IT term, not something derived from Star Trek). The server "Rallos Zek" is actually a logical name for a collection of approximately 6-8 servers. Each physical server is assigned various zones depending on the size and popularity of each zone and on the actual hardware that makes up the physical server.

When SOE takes down a zone, they are in fact turning off one of the servers in the federation. This architecture allows SOE to work on a zone that may be having a serious problem without affecting the rest of the zones on the logical server. Typically new expansion zones (e.g. the GoD zones) are placed on a new box when they are incorporated into the federation (a.k.a. the logical server). This helps confine any "new" problems that arise from the expansion to one particular server.

Of course, in order to move between servers within the federation there has to be some "front-end" system that communicates with each server in the federation. The front-end system handles authentication, some logging, traffic management, and possibly load balancing. When a character moves between zones, it is the responsibility of the front-end system to figure out which zone and server the character is currently on, and then possibly move the character to a different server based on the desired zone. Although we didn't go into it, my guess is that when this happens you may sometimes be assigned a different TCP port.
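
To make the zoning idea concrete, here is a minimal Python sketch of how such a front-end might look. It is only my reading of the description above, not SOE's actual design: the class names, box names, and zone names are all made up for illustration, and it leaves out authentication, logging, and any port assignment.

```python
class ZoneServer:
    """One physical box in the federation, hosting a fixed set of zones."""

    def __init__(self, name, zones):
        self.name = name
        self.zones = set(zones)
        self.online = True  # set to False when the box is taken down


class FrontEnd:
    """Hypothetical front-end: tracks characters and routes zoning requests."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.location = {}  # character name -> (server name, zone)

    def server_for(self, zone):
        # Find the physical box that hosts the requested zone.
        for server in self.servers:
            if zone in server.zones:
                return server
        raise KeyError(f"no server hosts zone {zone!r}")

    def zone_to(self, character, zone):
        target = self.server_for(zone)
        if not target.online:
            raise ConnectionError(f"{target.name} is down; {zone!r} is unavailable")
        previous = self.location.get(character)
        self.location[character] = (target.name, zone)
        # True when the character had to be handed to a different physical box.
        return previous is not None and previous[0] != target.name


front_end = FrontEnd([
    ZoneServer("box-1", ["old_zone_a", "old_zone_b"]),
    ZoneServer("box-2", ["expansion_zone"]),
])
front_end.zone_to("Hero", "old_zone_a")              # first login, no handoff
same_box = front_end.zone_to("Hero", "old_zone_b")   # both zones live on box-1
moved = front_end.zone_to("Hero", "expansion_zone")  # handed over to box-2
```

Taking "box-2" offline in this sketch only makes "expansion_zone" unreachable; characters in zones hosted on "box-1" are untouched, which matches the isolation benefit described earlier.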

From our conversation (and the inevitable conclusion, since everyone was down), it appears that SOE uses one shared front-end system for all logical servers. In other words, logical servers Rallos Zek and Mithaniel Marr use the exact same front-end system to control zoning and traffic management. Why is this important? Well, 1017 connectivity errors for one. When the front-end system is having problems and can't communicate with the back-end federation servers, users will get this error message. The patch servers may be part of this front-end system, or could be a tier of servers in front of it (creating a three-tiered system).
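
A tiny Python sketch shows why a shared front-end turns one fault into an outage for everybody. This is purely illustrative: the class, the `backend_reachable` flag, and the use of 0 for success are my own assumptions, and the only detail taken from the actual outage is the error number 1017.

```python
ERROR_1017 = 1017  # error number players saw; its exact meaning is assumed here
OK = 0             # hypothetical "success" code for this sketch


class SharedFrontEnd:
    """One front-end sitting in front of every logical server (assumed design)."""

    def __init__(self, logical_servers):
        self.logical_servers = set(logical_servers)
        self.backend_reachable = True  # can the front-end talk to the federations?

    def login(self, logical_server):
        if logical_server not in self.logical_servers:
            raise ValueError(f"unknown logical server {logical_server!r}")
        if not self.backend_reachable:
            # The fault is in the shared tier, so every logical server is affected.
            return ERROR_1017
        return OK


shared = SharedFrontEnd(["Rallos Zek", "Mithaniel Marr"])
before = [shared.login(name) for name in ("Rallos Zek", "Mithaniel Marr")]
shared.backend_reachable = False
during = [shared.login(name) for name in ("Rallos Zek", "Mithaniel Marr")]
```

In this sketch, `before` holds two successes and `during` holds two 1017 results, even though neither federation itself has changed.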

Although we may still want to spit venom at SOE for the recent system outages, it does help to have a bit of an understanding of the overall system and of what is actually causing an outage. If anyone can fill in more detail on SOE's architecture, I'm sure there are plenty of other interested people.
#2 Feb 17 2004 at 12:37 AM Rating: Decent
*
63 posts
I would give anything to see the SOE command post. It would be nice if they would do something along the lines of Willy Wonka and put a golden ticket in one expansion box, with the finder getting an exclusive tour of the facilities.
#3 Feb 24 2004 at 7:53 PM Rating: Default
*
52 posts
Yeah but how do you get the Golden Ticket out of your ***?
#4 Feb 24 2004 at 8:13 PM Rating: Decent
*
140 posts
Quote:
Yeah but how do you get the Golden Ticket out of your ***?



Very carefully, as to avoid paper cuts. =)
#5 Feb 25 2004 at 9:12 AM Rating: Good
I've never heard them called federations, it's always been clusters or farms.

But just about every MMORPG today uses that basic system: one primary machine that controls the cluster and transfers everything between the X other systems in the cluster, with each system handling specific areas of the game. AC's claim to fame at the time was the ability of their clusters to load balance, meaning that the systems in the cluster did not have specific geographic areas; the load was divided equally among all of the systems. No idea how successful they were with that.

Edit to add:

Most duping exploits involve causing a sub-server in a cluster/farm/federation to crash. Character A, holding Item X, gives the item to Character B. B then does something to force the game to save the character (such as logging out or transferring to a different sub-server), and A then does something to cause his own sub-server to crash. When A logs back in, his last valid save still shows him holding Item X. This can usually be detected by giving each item in the game a unique ID, so the system can spot two items existing with the same ID, but for money it's harder, as you can't assign an ID to each plat without making your database several terabytes in size.
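
Here is a rough Python sketch of that sequence and of the unique-ID check. It is a toy model under my own assumptions (plain dictionaries for the saved and live state, a made-up item ID), not how any particular game stores characters.

```python
from collections import Counter


def find_duplicate_item_ids(inventories):
    """Return item IDs that appear more than once across all inventories."""
    counts = Counter(
        item_id for items in inventories.values() for item_id in items
    )
    return sorted(item_id for item_id, n in counts.items() if n > 1)


# Last saved state and current in-memory state for two characters.
# "item-x-0001" is a hypothetical unique ID for Item X.
saved = {"A": ["item-x-0001"], "B": []}
live = {"A": ["item-x-0001"], "B": []}

# 1. A hands Item X to B (only the live state changes).
live["B"].append(live["A"].pop())

# 2. B forces a save, e.g. by logging out or changing sub-server.
saved["B"] = list(live["B"])

# 3. A's sub-server crashes before saving A, so A is restored from the old save.
live["A"] = list(saved["A"])

# 4. The same ID now exists twice, which a periodic scan can flag.
dupes = find_duplicate_item_ids(live)
```

After step 3 both characters hold "item-x-0001", so `dupes` contains that ID. Currency has no per-unit ID in this model, which is why the same scan can't catch duplicated plat.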

Another bit of trivia: Originally in UO (possibly still true), the data for the server was saved in a single flat uncompressed text file. No idea who thought that was a good idea, but it very quickly resulted in server backups taking 3-4 hours. Some players actually liked it that way: once the notification that the server was going to be backed up went out, it meant that the next few hours of play would not be saved, so they could do massive PK wars knowing that nothing that happened would stick :)

Edited, Wed Feb 25 09:28:37 2004 by Toasticle
All times are in CST