Networking is weird. I just want to put that up front.
The basic idea is pretty simple, and also pretty sound. Having two computers is more or less the same as having one computer with two cores, and it's pretty common knowledge that more cores = less lag. Easy enough. The problem is making sure every connected computer is on the same page. Ideally, we want a decentralized model - where no one server is more important than any other - because that way, one server crash can't bring down the whole network.
I have a lot of time and nothing better to do right now, so I'm just going to write a specification for how you might accomplish this. All of this is going to hinge on one really convenient fact: no user can be in more than one place at any one time. This is a distinct advantage over, say, an IRC network, where one user can receive messages from thousands of channels at once.
I'm gonna call this the first draft, because I probably forgot something.
Spread the load
Assume all servers in the network have all of the world information all the time. We're going to actually solve this problem later, but for now just pretend like we already have.
Each server in the network is responsible for one or more isolated areas of space. An "isolated area of space" is a contiguous group of loaded sectors. So, imagine, for example, that you have two players in 0,0,5, and three players in 5,0,0. They can't interact with each other, because they're too far away - they don't have any loaded sectors in common - so they can be handled by different servers, and no one would ever know the difference.
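Here's a rough Python sketch of how you might find those isolated areas. It's not anything the game actually does. I'm assuming a player loads every sector within one sector of their own (the `radius` parameter is made up), and that "contiguous" means two sectors touch on a face, edge, or corner. The function names are mine too.

```python
from collections import deque
from itertools import product

def loaded_sectors(player_sectors, radius=1):
    """Every sector within `radius` of any player (assumed loading rule)."""
    offsets = list(product(range(-radius, radius + 1), repeat=3))
    return {(x + dx, y + dy, z + dz)
            for (x, y, z) in player_sectors
            for (dx, dy, dz) in offsets}

def isolated_areas(loaded):
    """Split loaded sectors into contiguous groups using a flood fill."""
    neighbours = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    unvisited = set(loaded)
    areas = []
    while unvisited:
        start = unvisited.pop()
        area, queue = {start}, deque([start])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in neighbours:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    area.add(nxt)
                    queue.append(nxt)
        areas.append(area)
    return areas

# The example from above: two players in 0,0,5 and three in 5,0,0.
players = [(0, 0, 5), (0, 0, 5), (5, 0, 0), (5, 0, 0), (5, 0, 0)]
areas = isolated_areas(loaded_sectors(players))
print(len(areas))  # 2
```

Each of the two areas could then be handed to a different server.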
In order to determine what servers handle what players, two heuristics are used:
- "load": How complicated an isolated area is to simulate. Example:
Code:
Number of players * combined mass of all entities * number of projectiles fired in the last 3 minutes
- "power": How powerful the server is. Example:
Code:
(Total RAM * CPU speed * CPU cores) / current average ping
Not gonna lie, heuristics are basically computer magic. The definition is pretty vague and hand-wavy, but picking good ones is really important, which is just about the most annoying thing. I group them with matrices in the category of "math that someone literally just made up". I will fully admit that I suck at heuristics; someone who's better with them might be able to come up with better ones. There's also no reason why you couldn't outsource this expression to a config file.
In any case, the load:power ratio of a server (the combined load of all its isolated areas, divided by its power) indicates how much work it's doing relative to what it can handle.
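For what it's worth, the two example expressions translate into Python pretty directly. This is only a sketch of the examples above, not a recommendation; the class and field names (`Area`, `Server`, `ram_gb`, and so on) and the choice of units are my own assumptions.

```python
from dataclasses import dataclass

@dataclass
class Area:
    players: int
    total_mass: float        # combined mass of all entities
    recent_projectiles: int  # fired in the last 3 minutes

@dataclass
class Server:
    ram_gb: float
    cpu_ghz: float
    cpu_cores: int
    avg_ping_ms: float       # current average ping

def area_load(area):
    """The example "load" heuristic: a plain product of the three terms."""
    return area.players * area.total_mass * area.recent_projectiles

def server_power(server):
    """The example "power" heuristic."""
    return (server.ram_gb * server.cpu_ghz * server.cpu_cores) / server.avg_ping_ms

def load_power_ratio(server, areas):
    """Combined load of a server's areas divided by its power."""
    return sum(area_load(a) for a in areas) / server_power(server)

box = Server(ram_gb=16, cpu_ghz=3.0, cpu_cores=4, avg_ping_ms=50)
fight = Area(players=2, total_mass=1000.0, recent_projectiles=10)
print(load_power_ratio(box, [fight]))
```

One thing this makes obvious: because the example load is a product, an area where nobody has fired anything in the last 3 minutes scores 0 no matter how many players are in it. That's a good argument for letting someone better at heuristics pick the real expression.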
Servers under high load may attempt to pass off some of their isolated areas to servers with a low load - this is called a "netsplit". To do this, they first select an area they wish to pass off (ideally something small). The high load server then broadcasts a message to all servers saying that it wishes to pass off that area, and it includes the load calculation for the area. Other servers in the network reply by either accepting or denying the request. In addition to accepting a request to move a sector group, a server may also request priority. It may do this because it believes it's going to need those clients soon anyway; for example, if the area being moved is close to some sectors the acceptor already has loaded. If at least one server accepts, the high load server selects the acceptor with the lowest load:power ratio and synchronizes all information about the players and entities in those sectors with the accepting server.
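The selection step at the end could look something like the sketch below. The post above doesn't say how a priority request should weigh against the ratio, so I'm assuming priority requests win outright and the ratio only breaks ties among them; the `Reply` record and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Reply:
    server_id: str
    accepted: bool
    ratio: float            # the replying server's current load:power ratio
    priority: bool = False  # server asked for priority on this area

def choose_acceptor(replies):
    """Pick which accepting server gets the area, or None if nobody accepted."""
    candidates = [r for r in replies if r.accepted]
    if not candidates:
        return None
    # Assumption: priority requests come first, then lowest load:power ratio.
    best = min(candidates, key=lambda r: (not r.priority, r.ratio))
    return best.server_id

replies = [
    Reply("alpha", accepted=True, ratio=0.5),
    Reply("beta", accepted=True, ratio=0.2),
    Reply("gamma", accepted=False, ratio=0.1),
]
print(choose_acceptor(replies))  # beta
```

If nobody accepts, the high load server just keeps the area and can try again later.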
Clients are informed of the split and moved to the accepting server. Once the high load server has finished the synchronization, it simply unloads the area. The accepting server ensures everything is loaded before beginning to simulate the area and send information to players.
Periodically, all servers broadcast their load:power ratios to all other servers in the network. If convenient, two servers may elect to move sectors in order to more evenly distribute the load - for example, if one server is new to the network and has no load, the highest load server may try to give it some work.
The entire netsplit process is going to suck for players in the area, but ideally splits should only affect a small portion of players for a brief period of time. There's not really any way around it, though; sometimes you just have to move people. Ideally, being moved should result in your ping going down anyway, since you're moving from a high load server to a low load server.
Synchronization
When you see "server autosaving", all servers write their information to disk and also attempt to synchronize it with each other.
New servers joining the network are always brought up to speed (see Registering new servers), so all we need to do here is inform everyone else about all the stuff that's changed (everything built, blown up, mined, etc.).
The easiest way to do this is to broadcast a list of all recently changed chunks, their SHA-1 hashes, and change timestamps. Servers on the network check this list and determine whether or not they need to download an update. If they do, they request the full chunk files from the originating server.
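A minimal sketch of the "check the list" step, assuming each server keeps a map of chunk id to (SHA-1 hex digest, change timestamp). The map layout and function names are made up for illustration, and so is the rule that the newer timestamp wins when the hashes disagree.

```python
import hashlib

def chunk_digest(data: bytes) -> str:
    """SHA-1 of a chunk file's raw bytes, as a hex string."""
    return hashlib.sha1(data).hexdigest()

def chunks_to_request(announced, local):
    """Return the chunk ids this server should download.

    Both arguments map chunk id -> (sha1 hex digest, change timestamp).
    """
    wanted = []
    for chunk_id, (digest, changed_at) in announced.items():
        mine = local.get(chunk_id)
        if mine is None:
            wanted.append(chunk_id)       # never seen this chunk
        elif mine[0] != digest and mine[1] < changed_at:
            wanted.append(chunk_id)       # ours differs and is older
    return wanted

local = {
    "2_0_1": (chunk_digest(b"old hull"), 10),
    "2_0_2": (chunk_digest(b"asteroid"), 5),
}
announced = {
    "2_0_1": (chunk_digest(b"new hull"), 20),
    "2_0_2": (chunk_digest(b"asteroid"), 9),
    "7_7_7": (chunk_digest(b"someone's base"), 1),
}
print(chunks_to_request(announced, local))  # ['2_0_1', '7_7_7']
```

Chunks whose hash already matches are skipped even if the timestamp is newer, since there's nothing to download.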
This is going to involve a lot of information going a lot of places. To prevent ping spikes, the easiest thing to do is just throttle everything. This isn't a time-sensitive process: every server already has all the information it needs to serve all of its current clients, so they can upload and download relevant information at their leisure. Realistically, if a server is under high load, it can afford to skip a synchronization and just wait for the next one to come around. There's also no reason why every server needs to synchronize at the same time as every other server; they can stagger the process. There's a lot of room for config options here.
Moving clients
Sometimes, a client will have to change servers. This is most likely to happen if a client enters a sector that is loaded by a different server. Imagine going through a warp gate from your base to your friend's base - if your base is on one server, and your friend's base is on another server, you'll have to connect to their server when you go through the gate.
When a server determines that it needs to move a client, the client is given the address for the new server, and the old server closes the connection. The client displays some message and effectively pauses the game (ideally, if you're jumping, you just stay in the warp tunnel). The old server then synchronizes any relevant data with the new server (player inventory, chunk data for the player's ship, etc.). Once the new server has received and loaded all information, it synchronizes with the client. This completes the move.
Players joining
Whenever a player joins, they will contact one server (probably the one on the global server list, but not necessarily). When this happens, there are two possibilities (players that are entirely new to the world are effectively the same as players whose last known location is the spawn point):
- The player's last known position is in a sector that is currently loaded by some server on the network. In this case, if the player's last known location is currently loaded by the server contacted, it is responsible for the new player. Otherwise, the server contacted simply refers the client to the correct server and closes the connection. The client then automatically connects to the correct server, which is responsible for the new player.
- The player's last known position is not in any currently loaded sector. In this case, the server contacted is responsible for the new player if it believes it can handle the load. If it can't, it attempts to pass the connection off to the server with the lowest load. If it finds another server that can handle the request, it refers the client to that server and closes the connection. The client then automatically connects to the other server, which is then responsible for the new player.
The server responsible for the new player informs all other servers on the network that the player has joined, loads in adjacent sectors (if required), and continues normally.
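The two cases boil down to a small routing decision. Here's a sketch, assuming every server knows which server has each sector loaded and everyone's latest load:power ratio. The `max_ratio` cutoff standing in for "believes it can handle the load" is invented, and so is the fallback where the contacted server keeps the player if nobody else has room (the spec above doesn't cover that).

```python
def route_join(contacted, last_sector, sector_owner, ratios, max_ratio=1.0):
    """Return the id of the server that becomes responsible for a joining player.

    sector_owner: loaded sector -> id of the server that has it loaded
    ratios:       server id -> current load:power ratio
    """
    owner = sector_owner.get(last_sector)
    if owner is not None:
        return owner              # case 1: someone already has the sector loaded
    if ratios[contacted] < max_ratio:
        return contacted          # case 2: contacted server takes the player
    lowest = min(ratios, key=ratios.get)
    if ratios[lowest] < max_ratio:
        return lowest             # case 2: refer to the least loaded server
    return contacted              # assumed fallback: nobody has room

sector_owner = {(0, 0, 0): "s2"}
ratios = {"s1": 0.4, "s2": 0.9, "s3": 0.1}
print(route_join("s1", (0, 0, 0), sector_owner, ratios))  # s2
print(route_join("s1", (9, 9, 9), sector_owner, ratios))  # s1
```

If the result isn't the contacted server, that's the "refer the client and close the connection" path.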
Registering new servers
Just to make things easy, a "network" is one or more servers simulating the same universe.
When you have a network already (which may be only one computer), and you want to add a new server to it, the outside server asks one server in the network if it's allowed to join. As part of this handshake process, ideally a password should be involved, so that some random asshole can't connect to your network and pipe all of his traffic to /dev/null.
If the network accepts the outside server, the contacted server sends to the outside server a full list of all servers in the network. The outside server then connects to every other server in the network. The outside server must then synchronize all of its world data the same way you torrent "legal" movies. Each server in the network sends the outside server all of its modified player and sector data. For the sake of simplicity, any data that isn't currently loaded by any server (sectors that just don't have any people in them right now, but still have stuff associated with them) should come from the initially contacted server.
Once the outside server has received all world data, it broadcasts a message to all servers that it is ready to handle requests. This completes the process of adding the server to the network. Note that the server may spend a while doing nothing; this is because it's probably not worth it to cause a netsplit just to load up the new server if the network is currently under mild load. Instead, since its load is effectively 0, the new server will be the first to take sectors from high load servers, or accept new clients.
To keep the global server list clean, each network should only be listed once. Without some kind of round robin style thing going on, the easiest way to accomplish this is to keep the list of all servers in the network ordered by connection time - that is, the first server on the network is first, then the second server to connect, then the third, and so on. The server at the front of the list is responsible for the global listing. If that server goes down, responsibility moves to the second server, and so on. As a side effect, this means that one server will handle the majority of all player join requests, but those aren't hard to handle (nor do they happen very often), so that's no big deal.
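The failover rule is about as simple as it gets. A sketch, assuming each server keeps the same join-ordered list plus some notion of which servers are still alive (how you detect that is out of scope here, and the names are hypothetical):

```python
def listing_owner(servers_by_join_time, alive):
    """First server in join order that is still up handles the global listing."""
    for server_id in servers_by_join_time:
        if server_id in alive:
            return server_id
    return None  # whole network is down

order = ["first", "second", "third"]
print(listing_owner(order, alive={"first", "second", "third"}))  # first
print(listing_owner(order, alive={"second", "third"}))           # second
```

Since every server holds the same ordered list, they should all reach the same answer without having to vote on it.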
Graceful shutdown
If a server is to be terminated, it broadcasts a message that it will shut down soon and needs to move all of its clients. It includes in this message a list of each isolated area, and the load score for each. Servers reply by attempting to take a particular area, ideally the largest they can accept; sectors are then assigned in a way that minimizes the average load:power ratio.
If any sectors are left over (no one accepted them), the closing server reserves the right to assign the sectors to another server in the network (with a message that basically says "you need to take this and it's an emergency"). The assigned server must accept, but it may try to move the sectors again later.
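Actually minimizing the average load:power ratio is a proper optimization problem, so here's a greedy stand-in instead: hand out the biggest areas first, each to whichever willing server ends up with the lowest ratio afterwards. Areas nobody offered to take fall through to the emergency rule and can go to any server. All the names and the dict layout are assumptions, and this won't always find the best assignment.

```python
def assign_areas(area_loads, offers, server_load, server_power):
    """Greedy assignment of a closing server's areas.

    area_loads:   area id -> load score
    offers:       area id -> set of server ids that offered to take it
    server_load:  server id -> current total load (closing server excluded)
    server_power: server id -> power score
    """
    load = dict(server_load)  # work on a copy so the caller's data is untouched
    assignment = {}
    for area_id in sorted(area_loads, key=area_loads.get, reverse=True):
        # No offers means an emergency assignment: every server is a candidate.
        willing = offers.get(area_id) or set(load)
        target = min(
            willing,
            key=lambda s: ((load[s] + area_loads[area_id]) / server_power[s], s),
        )
        load[target] += area_loads[area_id]
        assignment[area_id] = target
    return assignment

result = assign_areas(
    area_loads={"A": 100, "B": 40, "C": 10},
    offers={"A": {"s1", "s2"}, "B": {"s2"}},
    server_load={"s1": 0, "s2": 50},
    server_power={"s1": 10, "s2": 20},
)
print(result)  # {'A': 's2', 'B': 's2', 'C': 's1'}
```

Area C had no takers, so it gets forced onto s1, which is free to try moving it again later.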
Crashes
If any server on the network crashes, all servers on the network must attempt to synchronize the data it was responsible for amongst themselves. It's unlikely that any of them have new information, but it's important for all of them to be on the same page, because the clients that crashed with the server are likely to reconnect. It is, unfortunately, up to each client to reconnect.
Chat
Turns out this is actually really easy. The new chat system uses an IRC backend, and IRC was basically designed for this.
I would bet there's already something in the IRC library that schema is using that would allow users to chat globally across the whole network, although it's not really any big deal if there isn't - just broadcast every chat message to all the other servers. Or, even better, if all servers know all user information all the time - what users are connected to which servers, what chat channels are open, what users are in each chat channel - servers that receive a chat message can determine for themselves which other servers have to be notified of the message.
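The "even better" option is basically a set lookup. A sketch, assuming each server holds a map of channel to members and a map of user to the server they're connected to (both maps and the function name are hypothetical, and this has nothing to do with whatever the real IRC library does):

```python
def servers_to_notify(channel, origin, channel_members, user_server):
    """Servers (other than the origin) that host at least one member of the channel."""
    targets = {
        user_server[user]
        for user in channel_members.get(channel, ())
        if user in user_server  # skip members who are currently offline
    }
    targets.discard(origin)     # the origin already has the message
    return targets

channel_members = {"#faction": ["ann", "bob", "cat"], "#trade": ["bob"]}
user_server = {"ann": "s1", "bob": "s2", "cat": "s1"}
print(servers_to_notify("#faction", "s1", channel_members, user_server))  # {'s2'}
```

If nobody in the channel is on another server, the result is empty and nothing gets sent at all.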
Conclusion
If this seems like a lot of work, it's not actually that bad. People tend to stay pretty dispersed in my experience; most of the time you probably won't ever have to move servers. When you do, it probably won't take long, and you'll probably be jumping - so the warp tunnel is a great loading screen. Unless you frequently fly into other people's space, you're unlikely to hop servers very often. Plus, the system is set up such that if you group a lot of people together, your server is likely to try to ditch its other loads and give you its undivided attention.
This won't protect you from pingspikes from massive fleet battles with 50 titans, but it will protect everyone else from pingspikes when one stupid asshole is mining a planet. And that's worth something.
In any case, this would be the mod to end all mods.