As a European player based solely on Karak-Norn, I’ve been blissfully unaware of the problems that the Badlands players have been experiencing these last x months. That is, until I saw this post on the issue by Andy Belford (why does that name sound so familiar? Coincidence, I think not).
Now I haven’t had to give this out for a while, but you are being given a WALL OF TEXT warning, with a more technical flavour than normal, so don’t press the read more unless you’re really bored.
It was nice to get confirmation of what I was 99% sure of already, which was that the backend is based on multiple servers fronted by a server name, e.g. Badlands, Karak-Norn. It’s something I talked about ages ago in “What Does The Warhammer Server Backend Really Look Like”. Here’s the relevant line from Andy.
Zoning behind the scenes is a process which passes the player from one physical server to another physical server.
From Andy’s long post, it seems that the fundamental problem is that the backup process is interfering with ongoing game server requests: both want disk access at the same time, the disks cannot keep up with either, and so disk access becomes the performance bottleneck.
In the old days, before we European players arrived within the EA Mythic network, there was a Berlin Wall between us, and so the Badlands backups were scheduled to happen in periods of lower server utilisation. Now, as Andy says, there is no quiet time for this poor beleaguered server; it’s hammer and tongs all the time. Such is the beauty of a 24 hour world.
Okay, here comes the stick. If you run a 24/7 service relying on quiet time to perform crucial functions, then you’re f**king crazy. Yes, it makes for a cheaper backup solution, but it’s still crazy. You will get caught with your trousers down, as Mythic have been. This isn’t hindsight, this is common sense.
Here’s the carrot and it’s the main thrust of this post.
This doesn’t mean that the Mythic techs went down this road willingly, because we all know hardware budgets are not infinite. Hell, if they were, we’d all be running gold-plated gaming rigs. Businesses run under the same cost restrictions. As much as we demand otherwise, MMOs are long-term businesses (most of the time) which need to make a profit, and last time I looked EA weren’t listed as a registered charity.
If you don’t already know, which 99.9% of you won’t, storing and backing up data becomes expensive very fast as the amount of data grows, especially when no downtime and only marginal impact on server performance are factored in, which is the situation we have here. The hardware is expensive, but just as expensive is the software that runs it and allows you to find the data you have backed up.
Now, I had to ask some advice on this from a good friend, since I am just a database developer who’s getting a bit long in the tooth. He pointed me to this product page from HP.
To give you a reference frame, his main SAN has 220 disks and 4 incoming fibre connections, with each connection supporting, I think, 4 Gbps of data. A number of servers would be connected to this SAN, each with a fibre connection into a fibre switch, and the SAN supports any combination of data configuration you’d like. Again for reference, we are talking about £500K for that, so when you look at the product page and view the P9000, which is a rebadged Hitachi, he just said that it costs intergalactic money. I would be surprised if Mythic were anywhere near the £500K range. The P2000 is the HP entry-level model at about £10K. So yes, it’s fair to ask Mythic to fix it, but you can’t always throw money at the problem, especially if you don’t have it (subscribers).
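For a rough sense of scale, here is the back-of-envelope arithmetic for that SAN’s incoming links. It is only a sketch: the 4 Gbps per link is the figure I half-remember above, the function names are made up for illustration, and protocol and encoding overhead are ignored, so real throughput would be lower.

```python
# Back-of-envelope ceiling for the SAN's front-end bandwidth.
# Assumptions: 4 links at 4 Gbit/s each (the recalled figure from the text);
# protocol and encoding overhead are ignored, so this is an upper bound.

def aggregate_bandwidth_gbps(links: int, gbps_per_link: float) -> float:
    """Raw combined line rate of all incoming fibre links, in Gbit/s."""
    return links * gbps_per_link


def gbps_to_gigabytes_per_second(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


total_gbps = aggregate_bandwidth_gbps(links=4, gbps_per_link=4.0)
total_gb_per_s = gbps_to_gigabytes_per_second(total_gbps)
print(f"{total_gbps:.0f} Gbit/s raw, about {total_gb_per_s:.1f} GB/s")
```

That ceiling is shared by every server plugged into the fibre switch, which is part of why the price climbs so quickly once you want headroom.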
To be honest, I would only expect the transaction logs that the CSRs use to be dumped directly on the SAN, if one was being used, since performance is crucial. Now remember, the transaction logs could be extensive. For instance, when you ask why that gold bag didn’t turn up, a log would have been created to say what dropped when Malekith died, who won it and who didn’t get theirs at the time.
The main data, which I would refer to as the State of Play, would be held locally. That’s the data which, if a server blade crashed, would be integrated quickly to get people back up and running. Imagine when you boot your PC: you get the OS, the drivers and then the application, but the application remembers what you were doing and what Word doc you had open, only a tad more complicated.
The thing to remember is that a SAN (depending on how much you spent) would be able to back up with no effect on its performance; it would also be able to integrate a new disk into a RAID without any overhead. Those of you with a home/SMB NAS, ever noticed the read/write performance drop when you put in a replacement disk? The better the SAN (£££), the less this comes into play.
So imagine Mythic gets an open cheque book for only 1 server, because that’s what we are talking about, 1 server. This isn’t WoW with about 300+ servers in various locations around the world (it was a year or 2 ago that I counted up the servers). I’d love to have a look at Blizzard’s capital costs, but that’s a wish for another day.
Mythic get their hardware, the kit the techs have longed for from day 1. You now need to change the layout of the hardware and where everything points to, and somehow duplicate the disk access patterns of a Badlands server while it performs a backup, to make sure that when you drop in this new hardware you don’t end up worse off than before. You’d be surprised how a little file, written once every 10 milliseconds but now sitting on a SAN (write time of 50 milliseconds), could shag up your afternoon, because you never realised how crucial this file was to your whole process.
It’s so difficult to duplicate a live environment like Badlands well enough to be 100% sure that you’ve covered all the bases. That’s why, as people have said before, PTS testing is one thing; the real world is another situation entirely.
Of course I haven’t gone near the data and how it’s split, but Andy refers to protecting the integrity of all the data, all the players, all the guilds, and making sure that CSRs have the data they need to maintain a constant service.
You have to ask yourself: surely all that data doesn’t need to be on the tier 1 storage? The only thing you should have on that server are the characters that are currently being played. The rest should be pulled from the tier 2 storage/database when required (running from a SAN).
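To see why that little file hurts, here is a toy simulation of the numbers above. It is a sketch under my own assumptions rather than anything Mythic actually runs: the storage handles one write at a time, in order, and the 5 millisecond local disk figure is invented purely for comparison.

```python
# Toy model: a file is written every interval_ms, and the storage can only
# service one write at a time, each taking write_ms. All numbers except the
# 10 ms interval and 50 ms SAN write time are hypothetical.

def pending_writes(interval_ms: int, write_ms: int, duration_ms: int) -> int:
    """Return how many issued writes are still unfinished at duration_ms."""
    disk_free_at = 0  # time at which the storage finishes its current write
    pending = 0
    for issued_at in range(0, duration_ms, interval_ms):
        start = max(issued_at, disk_free_at)  # wait if the storage is busy
        disk_free_at = start + write_ms
        if disk_free_at > duration_ms:
            pending += 1  # this write has not completed by the cut-off
    return pending


one_second = 1000
print("local disk backlog:", pending_writes(10, 5, one_second))   # 0
print("SAN backlog:", pending_writes(10, 50, one_second))         # 80
```

After a single second, 80 of the 100 writes are still queued on the slower storage, and the queue keeps growing for as long as the writes keep arriving faster than they can be serviced.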
Badlands Server       | CSR Software          | Backup Solution
Tier 1 (Local Disks)  | Tier 2 (SAN)          | Tier 3
State Of Play Data    | Transactional Logs    | For Tier 1 & 2
                      | Archived Characters & |
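The tiering idea can be sketched in a few lines. This is purely illustrative and assumes nothing about Mythic’s real code: the class name, the methods and the character record are all hypothetical, and plain dictionaries stand in for the local disks and the SAN-backed database.

```python
# Hypothetical sketch of the tiering idea: tier 1 only holds characters that
# are currently being played; everything else lives in tier 2 until needed.

class TieredCharacterStore:
    def __init__(self, tier2: dict):
        self.tier1 = {}      # stand-in for fast local disks (active characters)
        self.tier2 = tier2   # stand-in for the SAN-backed database (everyone)

    def login(self, name: str) -> dict:
        """Pull the character into tier 1 on first use and return its record."""
        if name not in self.tier1:
            self.tier1[name] = self.tier2[name]
        return self.tier1[name]

    def logout(self, name: str) -> None:
        """Write the character back to tier 2 and drop it from tier 1."""
        record = self.tier1.pop(name, None)
        if record is not None:
            self.tier2[name] = record


store = TieredCharacterStore({"ExampleDwarf": {"level": 40}})
store.login("ExampleDwarf")
print(sorted(store.tier1))   # only the character being played is in tier 1
store.logout("ExampleDwarf")
print(sorted(store.tier1))   # empty again once they log out
```

The point of the design is that the backup only ever has to fight the game for the small, hot set of data, while the bulk sits on storage that can be backed up without anyone noticing.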
As Andy says, they have made many changes, each one chipping away at the problem. You have to give them time, which I don’t think they have, given the impending arrival of the RvR packs. It’s one time when more users on a server wouldn’t be welcomed.
Do I hear you cry SSDs? Yes, they are fast, but wear rates come into play, which is why those enterprise versions always cost more. We also don’t know how much data currently needs to be stored in tier 1 to maintain the State Of Play, so SSDs may not be a viable solution. They may be useful as a storage place for the transaction logs, before they are moved off to my fictional SANs.
My final thought would be, how did Mythic not see this coming? If I was being generous, I would argue that when Mythic were building the systems, it was a “just get it working” mentality: you can tune later. But later never comes, the prototype goes live, and you end up living with the decisions made at 11pm on a Sunday night.
I know it’s easy to criticise from the sidelines, but I’ve been there, up to my neck in crap, rueing decisions I’d made years earlier on how I would structure my data, knowing that a bit more time on a design could potentially pay dividends in the future. It’s funny, but sometimes what looks hard to fix is easy, and what should be easy to fix is bloody impossible.
If you got this far without being bored shitless, then reward yourself with a trip to the toilet and a cup of cheap coffee, I’m not paying.
EDIT: If this SAN talk has given you wood, then may I say 2 things. 1. You’re very weird. 2. Go to Anandtech and read this article on ZFS, it appeared by magic.