I'm doing this memo without a spell checker, so please be kind.
Last night was not one of SOE's more impressive moments. Around 7:30pm EST, tons of users started to complain about 1017 errors when logging in or when a character tried to zone. Like everyone else, I entered chat and had the golden opportunity to "talk tech" with one of the GMs handling the crisis. Since I had absolutely nothing better to do, we got into a little discussion about how the SOE systems operate. Although my understanding of their systems is far from complete, I found the conversation fascinating and thought I would share a few details.
It may surprise some that each named server is not actually a single server, but rather a federation of physical servers (yes, "federation" is an official IT term, not something derived from Star Trek). The server "Rallos Zek" is actually a logical name for a collection of approximately 6-8 servers. Each physical server is assigned various zones depending upon the size and popularity of each zone, and the actual hardware that makes up the physical server.
When SOE takes down a zone, they are in fact turning off one of the servers in the federation. This architecture allows SOE to work on a zone that may be having a serious problem without affecting the rest of the zones on the logical server. Typically, new expansion zones (e.g. GoD zones) are placed on a new box when they are incorporated into the federation (a.k.a. the logical server). This helps confine any "new" problems that arise from the expansion to one particular server.
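To make the federation idea a bit more concrete, here is a rough sketch in Python. This is purely my own illustration and not SOE's code: the class names, box names, and zone names are all made up, and the only things taken from the conversation are that a logical server is a group of physical boxes, each box hosts some zones, and an expansion gets its own box.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalServer:
    """One box in the federation; hosts a subset of the zones."""
    name: str
    zones: set = field(default_factory=set)
    online: bool = True


@dataclass
class LogicalServer:
    """The name players see, e.g. "Rallos Zek"; really a group of boxes."""
    name: str
    boxes: list = field(default_factory=list)

    def box_for(self, zone):
        """Return the box hosting the zone, or None if no box hosts it."""
        for box in self.boxes:
            if zone in box.zones:
                return box
        return None

    def zone_available(self, zone):
        """A zone is up only if some box hosts it and that box is online."""
        box = self.box_for(zone)
        return box is not None and box.online

    def take_down(self, box_name):
        """Turn off one box; zones on the other boxes are untouched."""
        for box in self.boxes:
            if box.name == box_name:
                box.online = False


# Hypothetical layout: box and zone names are placeholders, not real ones.
rallos_zek = LogicalServer("Rallos Zek", [
    PhysicalServer("box-1", {"old_zone_a", "old_zone_b"}),
    PhysicalServer("box-2", {"old_zone_c"}),
    PhysicalServer("box-god", {"god_zone_a"}),  # new expansion on its own box
])

rallos_zek.take_down("box-god")
print(rallos_zek.zone_available("god_zone_a"))  # False
print(rallos_zek.zone_available("old_zone_a"))  # True
```

Taking down the expansion box leaves the older zones reachable, which is the isolation benefit described above.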
Of course, in order to move between servers within the federation, there has to be some "front-end" system that communicates with each server in the federation. The front-end system handles authentication, some logging, traffic management, and possibly load balancing. When a character moves between zones, it is the responsibility of the front-end system to figure out which zone and server the character is currently on, and then, if the desired zone lives on a different server, move the character there. Although we didn't go into it, my guess is that when this happens you may sometimes be assigned a different TCP port.
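Here is how I picture the zoning step, again as a rough Python sketch of my own and not anything the GM showed me. The zone table, host names, and port numbers are invented, and the idea that a zone maps to a host plus a TCP port is just my guess from above.

```python
# Hypothetical lookup table: zone name -> (physical server, TCP port).
ZONE_TABLE = {
    "zone_a": ("box-1", 7001),
    "zone_b": ("box-1", 7002),
    "zone_c": ("box-2", 7001),
}


class FrontEnd:
    """Tracks where each character is and routes zoning requests."""

    def __init__(self, zone_table):
        self.zone_table = zone_table
        self.location = {}  # character name -> current zone

    def login(self, character, zone):
        """Record the zone a character starts in."""
        self.location[character] = zone

    def zone(self, character, target_zone):
        """Work out where the character is now and where it must go."""
        current_host, _ = self.zone_table[self.location[character]]
        new_host, new_port = self.zone_table[target_zone]
        self.location[character] = target_zone
        return {
            "host": new_host,
            "port": new_port,
            # True when the character has to move to a different box.
            "handoff": new_host != current_host,
        }


front_end = FrontEnd(ZONE_TABLE)
front_end.login("Somebody", "zone_a")
print(front_end.zone("Somebody", "zone_b"))  # same box, different port
print(front_end.zone("Somebody", "zone_c"))  # different box, so a handoff
```

Zoning within the same box only changes the port in this sketch, while zoning to a zone on another box is where the front-end would have to hand the character over.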
From our conversation (and the inevitable conclusion, since everyone was down), it appears that SOE uses one shared front-end system for all logical servers. In other words, the logical servers Rallos Zek and Mithaniel Marr use the exact same front-end system to control zoning and traffic management. Why is this important? Well, 1017 connectivity errors, for one. When the front-end system is having problems and can't communicate with the back-end federation servers, users get this error message. The patch servers may be part of this front-end system, or could be a tier of servers in front of it (creating a three-tiered system).
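A small Python sketch shows why a shared front-end takes everyone down at once. As before, this is my own toy model: the class, the `backend_link_up` flag, and the box names are assumptions, and the only details from the conversation are that the logical servers share one front-end and that users see error 1017 when it can't reach the back-end.

```python
ERROR_1017 = 1017  # the connectivity error code users reported


class SharedFrontEnd:
    """One front-end sitting in front of every logical server."""

    def __init__(self, logical_servers):
        # logical server name -> list of (hypothetical) back-end boxes
        self.logical_servers = logical_servers
        self.backend_link_up = True

    def connect(self, logical_server):
        """Try to reach a logical server's back-end boxes via the front-end."""
        if logical_server not in self.logical_servers:
            raise KeyError(f"unknown logical server: {logical_server}")
        if not self.backend_link_up:
            # The front-end cannot talk to any back-end box at all.
            return {"ok": False, "error": ERROR_1017}
        return {"ok": True, "error": None}


front_end = SharedFrontEnd({
    "Rallos Zek": ["rz-box-1", "rz-box-2"],
    "Mithaniel Marr": ["mm-box-1", "mm-box-2"],
})

front_end.backend_link_up = False  # simulate the front-end having problems
for name in ("Rallos Zek", "Mithaniel Marr"):
    print(name, front_end.connect(name))
```

With the link down, both logical servers report 1017 even though none of their own boxes has failed, which matches what we all saw.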
Although we may still want to spit venom at SOE for the recent system outages, it does help to have a bit of an understanding of the overall system and what is effectively causing the outage. If anyone can fill in more detail on SOE's architecture, I'm sure plenty of other people would be interested.