Dyescape - Where Adventures Begin

A unique Minecraft MMORPG experience featuring a vibrant fantasy world to explore, fight, and make friends in.

Dear community, as some of you may have noticed, we've suffered several infrastructure outages over recent weeks. These caused our website, Minecraft servers and internal tooling to be unavailable. We wish to be transparent with everyone about what is going on, why these outages happened and how we plan to tackle them moving forward. The post below involves some technical details; you can skip to the summary at the bottom for a short, simplified version.

Storage
In the hosting industry, and especially in cloud hosting, storage comes in many shapes and sizes, each with its own pros and cons. Some setups value simplicity, some value scalability, others value performance or integrity. When setting up our infrastructure, we had to choose between these options and decided to go with a storage solution that prioritised integrity and scalability. As such, we are running an internal block storage solution called Longhorn, with volume replication across all of our servers in geographically separate data centers. This means that all machines are constantly replicating each other's volumes. This is great for keeping data intact and fault-tolerant, and it lets us quickly move software deployments such as databases to new machines without having to wait for file transfers.

That sounds great, so why bring it up? Everything works great as long as everything remains connected. As soon as nodes lose connection with each other, the machine as a whole is marked as unhealthy. That in itself is fine, because we can then deploy software on different machines and recover quickly. The problem, however, is that a node becoming unhealthy causes the storage solution to shut itself down on that machine. The replica status is lost and no health can be communicated anymore, so the data replica on the unhealthy node is immediately considered degraded. This doesn't mean data loss, nor does it necessarily cause any damage, but it does cause the storage solution to rebuild the entire replica on that particular node from a node that was healthy.

As some of you may know, we replicate data across our American and European servers. Rebuilding a volume across the Atlantic Ocean comes with a big drop in network capacity. To save budget and reserve extra capacity for alpha and beta development, excess hardware has been removed, so replicas for these volumes cannot reliably be placed on multiple machines within the same continent and we're forced to replicate across continents. A full replica rebuild therefore takes time, and doing this with dozens of volumes puts considerable strain on the connection between the two continents. In some cases, volumes cannot be attached to workloads before they are fully healthy again. This was the cause of the lengthy outages over the past few weeks. Why machines lost connection in the first place is covered later in this post.
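To give a feel for why these rebuilds take so long, here is a back-of-the-envelope sketch. The volume size and the usable transatlantic throughput below are illustrative assumptions, not measurements from our cluster; the point is only that a single rebuild already takes hours, and dozens of them compete for the same link.

```java
// Back-of-the-envelope estimate of a full replica rebuild over a constrained link.
// Both numbers are illustrative assumptions, not measured values from our setup.
public class RebuildEstimate {
    public static void main(String[] args) {
        double volumeGiB = 200.0;   // assumed size of a single volume
        double linkMbps = 200.0;    // assumed usable transatlantic throughput per rebuild

        double megabits = volumeGiB * 1024.0 * 8.0;    // GiB -> mebibits (~megabits)
        double hours = megabits / linkMbps / 3600.0;

        System.out.printf("Rebuilding a %.0f GiB replica at %.0f Mbit/s takes roughly %.1f hours%n",
                volumeGiB, linkMbps, hours);
    }
}
```

Even under these generous assumptions that works out to a couple of hours per volume, which quickly adds up when dozens of volumes are rebuilding over the same transatlantic link at once.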

The above wouldn't be a concern if a short connection drop didn't immediately kill the replica instance on the machine that loses connection. The developers of said storage solution agree, and a bug/feature ticket has been created on their GitHub as a result. We were aware of this issue and have been tracking it for some time. Unfortunately, the priority initially wasn't high enough and a resolution has been pushed back to the next major update. Since the outage last weekend, we have tweaked several settings in the storage solution and are monitoring the results. We have an alternative storage plan drawn up in case this solution remains problematic.

Nodes losing connection
As discussed above, issues occur when nodes lose connection. This in turn has several causes, and we'll share three of them here.

First, a worldwide OVH outage on the 13th of October caused a connection loss between all of our servers. Exact details are missing, but the general understanding is that a large-scale routine BGP update at OVH failed, causing all routes to disappear. The same happened during the worldwide Facebook outage not long before that. In a case like this we are powerless, and with outages like this the Minecraft industry as a whole takes a beating. The outage was solved by OVH within a reasonable time, but it caused our storage solution to panic, and all volume replicas had to be rebuilt as a result.

Secondly, we only recently came to understand the magnitude of the bug (or at least, limitation) in our storage solution. We've performed several rolling updates on our orchestration platform in production to keep everything up to date. This can normally be done without any downtime, but because of the storage limitation it does cause downtime. We weren't aware of this until it had happened a few times.

Thirdly, and most recently, there was a problem with our CNI (Container Network Interface) plugin. We are still unsure of the exact cause, but the plugin suddenly crashed on one of our machines, causing a connectivity loss between processes on that particular machine. Rebooting and reinstalling parts of the system did not help. We could not trace the cause, but eventually tackled the issue by updating the kernel and operating system on the machine. We suspect an undiscovered bug triggered by an edge case in our specific combination of kernel, operating system and CNI plugin versions.

Summary
In short, a combination of sudden connectivity drops and a current limitation of an internal storage solution caused several unexpected, lengthy outages. Moving forward, we are adjusting settings appropriately to get the desired behavior of our systems and are continuously monitoring the results. An alternative plan for storage is ready in case this setup remains problematic.

We sincerely apologize for the outages. We are still actively learning from all of the feedback we've gotten since launch, on both a functional and a technical level. Thank you for your continued understanding and patience.
After a tremendously long development period, Dyescape has finally put out a first release: v0.1.0. Those who were active in the Discord or the Twitch livestream yesterday will have already seen that we ran into a few technical struggles. This thread is meant to give the community transparency about what happened, why and how it happened, and what has been done to address it. On a positive note, we'll also list a few technical things that did go well.

Database cluster
In a project like this, being dynamic with content changes is crucial. Software and content are two completely separate complexities, and to ensure they don't become a single, even larger complexity, we've made everything configurable. From items to skills, creatures, quests and random encounters; everything is configurable. To facilitate this, a JSON-document-based storage system was set up. For the past years of development this was done through flat files and SFTP, allowing the content team to quickly make changes, test them and commit them to version control. A few months ago, a MongoDB cluster was set up to make this production-ready.
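As a rough illustration of what such a document-based setup looks like in practice, here is a minimal sketch using the official MongoDB Java driver. The connection string, database, collection and field names are placeholders for the example and do not reflect our actual schema.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

// Minimal sketch: loading a piece of configurable content (an item definition)
// from a MongoDB collection. Names like "dyescape", "items" and "itemId" are
// illustrative placeholders, not our real schema.
public class ContentLoader {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> items = client
                    .getDatabase("dyescape")
                    .getCollection("items");

            // During development the same JSON documents lived in flat files;
            // in production they are served by the database cluster instead.
            Document sword = items.find(eq("itemId", "iron_longsword")).first();
            if (sword != null) {
                System.out.println("Loaded item: " + sword.toJson());
            }
        }
    }
}
```

The appeal of this setup is that content changes are just document edits; no code changes or redeploys are needed.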

Having set up the database cluster, everything was operational. We could import our content, play the game, save characters, swap to a different server, and everything would be there. However, before launch we still had a few important content and software changes to process. While doing so, something in the database got upset, causing cluster sharding initialisation to fail and the cluster to become unusable. This was the first major technical setback we ran into, and it caused the initial release to be two hours late. Thankfully, the team quickly jumped in and we managed to solve the issue.

GEO load balancing
GEO load balancing is a technical setup that automatically routes users to the nearest server. It is set up by our Anti-DDoS & Load Balancing provider. From what we could see, this load balancing worked for most users. Based on earlier playtests, North American users see roughly 40 to 60 ms of ping on average, and western European users around 15 to 25 ms. This GEO load balancing is set up for two main reasons: to give users the best possible connection and reduce lag, and to limit cross-continent bandwidth usage.

However, due to a misconfiguration on the proxy service discovery side, players were often sent to a fallback server in a different region. They would connect to a North American proxy, but to a European fallback server, for instance. In some unlucky cases, people didn't get the expected GEO load balancing on the proxy to begin with. This caused, for example, American users to connect to a European proxy and then to an American fallback server, doubling the already high latency as a result. However, this only affected a handful of people and was likely caused by some fallback servers crashing (more on this later).
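To illustrate the kind of mismatch, here is a minimal sketch of region-aware fallback selection. This is not our actual proxy code; the class and server names are made up. The intent is simply to prefer a fallback server in the same region as the proxy a player connected to, and only cross regions as a last resort.

```java
import java.util.List;
import java.util.Optional;

// Minimal sketch of region-aware fallback selection. Not our actual proxy
// code; class names and server entries are illustrative only.
public class FallbackSelector {

    enum Region { EU, NA }

    record BackendServer(String name, Region region, boolean online) {}

    /** Prefer a fallback in the proxy's own region; only cross regions as a last resort. */
    static Optional<BackendServer> pickFallback(Region proxyRegion, List<BackendServer> fallbacks) {
        return fallbacks.stream()
                .filter(BackendServer::online)
                .filter(s -> s.region() == proxyRegion)
                .findFirst()
                .or(() -> fallbacks.stream().filter(BackendServer::online).findFirst());
    }

    public static void main(String[] args) {
        List<BackendServer> fallbacks = List.of(
                new BackendServer("fallback-eu-1", Region.EU, true),
                new BackendServer("fallback-na-1", Region.NA, false), // crashed (more on this later)
                new BackendServer("fallback-na-2", Region.NA, true));

        // A player on the NA proxy should land on fallback-na-2, not an EU server.
        System.out.println(pickFallback(Region.NA, fallbacks).map(BackendServer::name).orElse("none"));
    }
}
```

With the proxy and fallback setup now fully split per region, as described below, the cross-region branch effectively can no longer trigger.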

In order to fix the latency issues some players were having, two new domains have been created: eu.dyescape.com and na.dyescape.com. To keep things as stable as possible for the time being, the proxy and fallback setup has been fully split into two separate networks, so there's no chance of being routed to a fallback server in the wrong region. The play.dyescape.com domain should still provide accurate GEO load balancing. If not, please contact the team.

Cross-continent bandwidth
We have explicitly set up the infrastructure to have considerably strong network bandwidth: a large private network at our hosting provider consisting of four solid dedicated servers, each with 8 physical cores, 64 GB of memory and a 2 Gbit connection. Despite our initial thoughts yesterday, we currently see no signs of bandwidth falling short. The timeout errors we were seeing yesterday were actually caused by a software issue; the resulting problems for players, however, made us think it was network-related.

Crashing fallback servers & timeouts
While the Dukes were playing, we were notified of numerous timeout and crashing issues. After some investigation, they appeared to be caused by a recent software change in our interactive chat plugin. A code issue caused an infinite loop, exhausting the CPU capacity and killing the instance. This issue was fixed at around 02:15 AM our time, after which the Dukes could continue playing.
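For those curious what such a bug can look like, here is a simplified sketch of the failure mode: a loop whose exit condition is never reached, pegging a CPU core until the instance dies. This is not the actual chat plugin code, just an illustrative example of the same class of bug and the kind of guard that fixes it.

```java
// Simplified illustration of the failure mode, not the actual chat plugin code.
public class PaginationBug {

    // Buggy version: if pageSize is ever 0, 'index' never advances and the
    // loop spins forever, exhausting a CPU core.
    static int countPagesBuggy(int totalLines, int pageSize) {
        int pages = 0;
        for (int index = 0; index < totalLines; index += pageSize) {
            pages++;
        }
        return pages;
    }

    // Fixed version: reject the degenerate input so the loop always terminates.
    static int countPages(int totalLines, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int pages = 0;
        for (int index = 0; index < totalLines; index += pageSize) {
            pages++;
        }
        return pages;
    }

    public static void main(String[] args) {
        System.out.println(countPages(45, 10)); // prints 5
    }
}
```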

Remaining issues & alpha queue
The question we've seen posted in Discord most often since the release is why the queue is not processing. There's a good reason for this: although we've fixed every infrastructure and fatal software issue we've seen, there are still a few in-game issues causing quests to become stuck, blocking progression. These issues are scheduled for a v0.1.1 patch release, which is planned to go live in the upcoming days. The team is working around the clock to get this patch out.

Because these remaining issues prevent gameplay progression, we've decided not to progress the queue until v0.1.1 goes live. We deemed it best to have only Dukes test the game for now: a small group of people who can help us efficiently identify critical issues.

Positive notes
On to some positive notes, because despite the technical issues, there are also multiple compliments worth mentioning. I'll go over some of these below:

Gameplay feedback (from Dukes): while currently only Dukes are able to play, we've received very positive feedback from them. The game is smooth, skill usage is bliss, content is interesting and understandable, and there are multiple quality-of-life features that are very much appreciated. After fixing the regional connections and freezes, ping seems to be good and the average milliseconds per tick looks healthy. Once the in-game issues are fixed in the v0.1.1 release, we can likely start processing the queue.


Conclusion & thank you
We want to close this off with a massive thank you. Albeit rough around the edges, Dyescape has launched. It has taken us well over four and a half years to get to this point, and it has been an incredible ride to reach this state. We've had our ups and we've had our downs. We've seen team members come and go, content be revamped, software be overhauled, and a community grow.

Despite the messy infrastructure launch, in all of our years of playing Minecraft we have not seen a more considerate, friendly and heartwarming community. The support is incredible. We will continue to work hard to get the fixes out and have everyone be able to join. Hope to see everyone on the server in v0.1.1!
Alpha
While some still cannot believe it, we are actually launching on the 7th of August! Note that there are only a limited number of alpha slots available. Sign up before it's too late, and join the Discord to receive all kinds of unique sneak peeks!

[Image]
Thank you for the image, Trook.

Have you seen our alpha teaser yet?
Alpha Release Date
The age-old question: when will Dyescape release? The wait is almost over. Please read the post below and sit tight for just a little longer!

[Image]

The entrance of Clemens, located on Phala.

Alpha Teaser

While some still cannot believe it, we are actually launching! On the 17th of July, we are releasing an alpha teaser. Note that there are only a limited number of alpha slots available. Sign up before it's too late.



If you're looking for the alpha applications (takes 2 min) click here!
And after this alpha teaser, we will be hosting a Q&A, among other things, in the Discord! Questions for the Q&A can be submitted in the Discord. The full schedule for Saturday the 17th of July is:

  • 21:00 - Alpha Teaser
  • 21:05 - Q&A with yours truly, Dennis & Aeky
  • 22:00 - Class vote
  • 22:05 - Skill showcase of the most voted class and general gameplay showcase
  • 15:00 - Alpha Teaser
  • 15:05 - Q&A with yours truly, Dennis & Aeky
  • 16:00 - Class vote
  • 16:05 - Skill showcase of the most voted class and general gameplay showcase
Sunday the 18th of July:
  • 05:00 - Alpha Teaser
  • 05:05 - Q&A with yours truly, Dennis & Aeky
  • 06:00 - Class vote
  • 06:05 - Skill showcase of the most voted class and general gameplay showcase

Class Descriptions
We have revealed some new imagery regarding classes! These images and descriptions can be found on the wiki, thanks to some community members! If you want to stay up to date on development progress and sneak peeks, join the Discord!

[Image]
GIF of the Warrior class.

Progress Percentage
We introduced progress percentages a few months back. The percentage shows the ratio of open to closed tickets. These tickets are not weighted, which means the percentage can quickly go up or stay stagnant for days on end. Even worse, it can go down if we find new issues. The current percentages are:
  • Content = 77%
  • Development = 93%
Release Date Contest
Lastly, there is a small contest going on: predict the release date! If your guess is closest to the actual release date, you will receive the Noble rank! Submit your guess in the Discord!

We hope to see everyone on the 17th of July! Thanks, everyone, for years of support!

Hello! It's been a while since we posted here. Let's catch up, shall we?

Alpha Release Date
The age-old question: when will Dyescape release? Well, for the first time we will shed some light on this and make you a promise: the release will happen in 2021. One might say that this doesn't indicate much. Well, no, it does not. Therefore, we've introduced 'the percentage'.

If you're looking for the alpha applications (takes 2 min) click here!

[Image]
Clemens' Graveyard; a playable region within the alpha
The Percentage
This percentage shows the ratio of open to closed tickets. These tickets are not weighted, which means the percentage can quickly go up or stay stagnant for days on end. Even worse, it can go down if we find new issues. The current percentages are:
  • Content = 65%
  • Development = 86%
These percentages will be updated frequently in our Discord! But let's move on to the more important information.

Alpha Slots
The alpha will have a queue system. While this queue system will aid the launch, there's one edge case it doesn't cover: what happens when there are too many people in the queue?

People have asked us about our plans regarding the upcoming (2021) launch: how will we deal with the masses trying to join all at the same time? To prevent this, we already have a limited number of alpha applications available. These can be submitted on the website: https://www.dyescape.com/threads/how-to-sign-up-for-the-alpha.23/

So how will this work? We will have a queue in place; more precisely, a priority queue. The queue works on a first-come, first-served basis. Positions are not linked to the date you submitted your alpha application; they are based on how quickly you enter one of the hub servers. Upon entering, you will be given a number reflecting your position in the queue. You will be able to play as soon as your number drops low enough and a batch of players is released. Batches of players are let into the server at no fixed interval; it truly depends on how well the servers handle the load at that time. Batches are therefore manually approved and can vary in size. The expected batch size is roughly 25 players at a time. This means that if the queue is 600 players long, you will have to wait for 24 batches, which is simply an undefined amount of waiting time.

Thus, we're introducing a priority queue for our supporters. Every rank comes with a level of queue priority. Well, almost every rank: Dukes completely bypass the priority queue and will immediately be able to play. Now for the rest.

Within every batch there are reserved slots for donators. Let's take the example of 10 reserved slots out of 25.

Scenario 1: There are no donators in the queue, or not enough to fill the donator slots. These slots are then given to normal players.

Scenario 2: There are 300 people in the queue and the last 12 of them are donators. The highest-ranked donators are placed into the batch first, meaning 10 of them will immediately be put into the first batch (thus skipping the nearly 290 people ahead of them in the queue).

Scenario 3: There are 10 viscounts in the queue, but you are a noble at position 11. These 10 viscounts will occupy the donator slots, but the other 15 slots are given out on a first-come, first-served basis. Since you are at position 11, you will still be a part of the first batch.

These scenarios should provide some understanding of the launch queue. Having supported us by purchasing a rank can save you a lot of waiting time. Please do note that we will prioritise the health of the server and the ability to play once on it. Therefore, no indication will ever be given of how long the wait will be.
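For those who like to see things concretely, below is a minimal sketch of how a single batch could be composed under these rules: reserved donator slots filled by the highest-ranked donators first, and the remaining slots handed out strictly by queue position. It mirrors the 10-out-of-25 example and the scenarios above, but it is an illustration only, not our actual queue implementation; all names and numbers in it are made up.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Minimal sketch of batch composition: reserved donator slots first (highest
// rank wins), remaining slots first come, first served. This mirrors the
// scenarios above and is not the actual queue implementation.
public class BatchComposer {

    // Higher number = higher rank; 0 = no rank. Names are illustrative.
    record QueuedPlayer(String name, int queuePosition, int rankLevel) {}

    static List<QueuedPlayer> composeBatch(List<QueuedPlayer> queue, int batchSize, int donatorSlots) {
        List<QueuedPlayer> batch = new ArrayList<>();

        // Fill the reserved slots with the highest-ranked donators in the queue.
        queue.stream()
                .filter(p -> p.rankLevel() > 0)
                .sorted(Comparator.comparingInt(QueuedPlayer::rankLevel).reversed()
                        .thenComparingInt(QueuedPlayer::queuePosition))
                .limit(donatorSlots)
                .forEach(batch::add);

        // Hand out the remaining slots strictly by queue position. Unused
        // donator slots simply become normal slots (scenario 1).
        queue.stream()
                .filter(p -> !batch.contains(p))
                .sorted(Comparator.comparingInt(QueuedPlayer::queuePosition))
                .limit(batchSize - batch.size())
                .forEach(batch::add);

        return batch;
    }

    public static void main(String[] args) {
        // Illustration: ten viscounts (rank 3) deep in the queue, a noble (rank 1)
        // at position 11, and unranked players everywhere else.
        List<QueuedPlayer> queue = new ArrayList<>();
        for (int pos = 1; pos <= 40; pos++) {
            if (pos >= 30 && pos <= 39)  queue.add(new QueuedPlayer("viscount" + pos, pos, 3));
            else if (pos == 11)          queue.add(new QueuedPlayer("noble", pos, 1));
            else                         queue.add(new QueuedPlayer("player" + pos, pos, 0));
        }

        List<QueuedPlayer> batch = composeBatch(queue, 25, 10);
        batch.forEach(p -> System.out.println(p.queuePosition() + " " + p.name()));
    }
}
```

Running the example prints the ten viscounts followed by queue positions 1 through 15, noble included, which matches the behaviour described in scenario 3.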
Thank you for reading, and hopefully we'll see you soon.

The likeliest scenario is that people would not be able to play at all on the day they join the queue for the alpha release. To make sure that everyone can play within the first few days of release, we will be limiting the number of alpha slots available. Over the past 4.5 years we have amassed over 650 alpha applications. To prevent this number from rising too quickly as more information about the release becomes known, the number of available alpha slots has been capped at 1,000. This means that between now and the release, only 350 people will be able to sign up for free. An application takes 2 minutes, so be quick!

[Image]
A street within Clemens; a playable town within the alpha

Performance Testing
With the alpha nearing, our infrastructure is being stress tested. During stress testing we check how well the server handles load from different scenarios, from bots standing AFK in the tutorial to bots playing an entire quest. By putting this load on the server, we can identify inefficient code and bugs that cause poor performance, and by doing this iteratively we keep finding and fixing more issues. The current aim is a soft limit of 150 players per server, which may be reduced on the fly if we foresee issues. Additional slots above this soft limit will be granted to donators.
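As a small illustration, here is a minimal sketch of what such a soft limit with a donator bypass could look like. It is not our actual implementation, just the rule described above expressed in code; the 150-player figure is the current aim.

```java
// Minimal sketch of a per-server soft player limit with a donator bypass,
// matching the description above. Not our actual implementation.
public class SoftLimit {

    private final int softLimit;

    public SoftLimit(int softLimit) {
        this.softLimit = softLimit;
    }

    /** Donators may exceed the soft limit; everyone else is held back once it is reached. */
    public boolean mayJoin(int currentOnline, boolean donator) {
        if (currentOnline < softLimit) {
            return true;
        }
        return donator; // extra slots above the soft limit are reserved for donators
    }

    public static void main(String[] args) {
        SoftLimit limit = new SoftLimit(150); // current aim per server; may be lowered on the fly
        System.out.println(limit.mayJoin(149, false)); // true
        System.out.println(limit.mayJoin(150, false)); // false
        System.out.println(limit.mayJoin(150, true));  // true
    }
}
```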
