WinterNode - MC.LON3 Unavailable – Incident details

mc.lon3.winternode.com under maintenance

MC.LON3 Unavailable

Resolved
Major outage
Started 19 days ago
Lasted 2 days

Affected

London - Minecraft

Major outage from 6:21 AM to 12:36 AM, Operational from 6:21 AM to 7:05 PM, Major outage from 12:36 AM to 5:49 PM, Operational from 7:05 PM to 5:49 PM

mc.lon3.winternode.com

Major outage from 6:21 AM to 5:49 PM

mc.lon5.winternode.com

Operational from 6:21 AM to 5:49 PM

Updates
  • Resolved
    Resolved

    Dear Customers of MC.LON3,

    We are pleased to inform you that the MySQL database has been successfully migrated to MC.LON5.

    If you are using our MySQL Databases, you will need to take the following actions:

    • Reset your MySQL passwords by visiting the Databases page.

    • Reconfigure any plugins or mods to use the new MySQL database host, and update the password to the newly assigned one from the step above. The passwords that were used on MC.LON3 no longer work.

    Your IPs have changed, but your ports remain the same. 💙 If you are using a WinterNode-provided subdomain, you will need to re-create it so that it points to the new IP and port combination assigned to your server. If you are using your own domain, please ensure your DNS records are updated accordingly. Additionally, for customers with a Dedicated IP, we’ve added extra dedicated IP ports for your use. 🤣
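For customers updating their own DNS records, a quick check that a hostname now resolves to the newly assigned IP can be sketched as follows. This is a minimal sketch, not an official WinterNode tool: the hostname `play.example.com` and the address `203.0.113.10` are placeholders, and the function names are hypothetical. It only looks at IPv4 address records; setups that rely on SRV records would need a dedicated DNS library instead.

```python
import socket
from typing import Callable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the set of IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def points_to(
    hostname: str,
    expected_ip: str,
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True if hostname resolves to expected_ip, False if it does not or cannot resolve."""
    try:
        return expected_ip in resolver(hostname)
    except socket.gaierror:
        return False


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute your own domain and new server IP.
    host, new_ip = "play.example.com", "203.0.113.10"
    status = "OK" if points_to(host, new_ip) else "not updated yet"
    print(f"{host} -> {new_ip}: {status}")
```

Cached DNS answers can linger until the record's TTL expires, so a "not updated yet" result shortly after a change does not necessarily mean the record is wrong.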

    As part of our effort to resolve this issue, we will automatically apply a 50% credit of your service renewal amount to your account within the next 24 hours. No action is required on your part, and the credit will be applied to your next invoice, adjusting any automatic payments.

    We are also happy to announce that MC.LON5 boasts improved hardware, offering better performance for your services.

    We truly appreciate your patience and understanding as we worked through this incident over the past few days.

    All access has been restored. If you need any assistance, our Support Team is happy to assist you in our Discord Server.

  • Identified
    Update

    We have successfully switched over the connection of MC.LON3 to MC.LON5. Server files have been restored, and you may be able to see them within the File Manager (not SFTP at the moment, since we still need to make changes on the admin side), assuming the node is not currently in maintenance mode.

    Please do not interact with your server yet. We still need to re-allocate the IPs (unfortunately, you will have new IPs assigned to you) and restore the MySQL database. You will also need to delete and re-create your subdomains so that they point to the new IPs.

    We're crossing our fingers 🤞 and knocking on wood that we are in the home stretch! As always, we will keep you posted.

  • Identified
    Update

    We have successfully subscribed all customers of MC.LON3 to email updates for this incident.

    We are currently restoring MC.LON3 customer data from the August 28th, 2 AM Central Time backup to our new MC.LON5 node. We were also fortunate enough to slowly grab a MySQL backup in between the constant unexpected restarts of MC.LON3, but we would like to remind customers that MySQL backups are not normally taken.

    At this time, we are working with our Panel Provider to restore customer access through our new instance.

    Compensation in the form of account credit is already planned for those customers who are affected by this incident.

    We appreciate your patience and support as we get through this incident together.

    The MC.LON3 machine is still offline as of this update. If it does come back online, we do not recommend making any changes: access is being switched over to MC.LON5, and data on MC.LON3 will not be transferred over.

  • Identified
    Identified

    Bringing this incident up to speed...

    Throughout the night and morning of August 29th, we have been observing numerous machine restarts. Because the unexpected reboots happen so quickly, it is difficult to diagnose the node or to disable services from starting. This issue is still occurring at this time.

    Around 9 AM Pacific Time today, we used every communication channel with our provider to express our frustration and to provide more information, so that they would re-evaluate the issue as NOT software related.

    At around 11:59 AM, our provider's hardware diagnosis flagged the test of rebooting back into the Customer OS as "DOWN!!!!". When they attempted to swap to another spare server, they discovered SMART errors on both drives. You can read more about what SMART errors are in this article provided by Seagate - https://www.seagate.com/support/kb/my-system-reported-a-smart-error-on-the-drive-184619en

    Within a few minutes of receiving this notice, our team made the decision to have the intervention team attempt to replace one drive so we can assess the situation and at least bring the Operating System back online. We also asked, in a separate ticket, for the RAM to be replaced. That request was closed because the drive replacement request was still underway.

    We have just received the following communication regarding the drive replacement request.

    Date 2024-08-29 21:38:05 BST (UTC +01:00), Component replacement:

    After deep troubleshooting, the smart errors in the disks have been caused by raiser card,

    Replaced the raiser card, tested multiple times the disk in the server and no errors have been shown,

    sent back to rescue customer

    ping ok

    ipmi ok

    However, at this time, we are still observing unexpected reboots/machine down alerts. We are still discussing internally our options.
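Because this update mixes Pacific Time with the provider's BST timestamp, the conversion can be sketched as below. This is only an illustration: the BST offset (UTC+01:00) comes from the provider's message, while the Pacific offset of UTC-07:00 is an assumption that Pacific Daylight Time applies in late August. The function name `to_pacific` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Offset stated in the provider's message: BST (UTC +01:00).
BST = timezone(timedelta(hours=1), "BST")
# Assumption: Pacific Time in late August is Pacific Daylight Time (UTC -07:00).
PDT = timezone(timedelta(hours=-7), "PDT")


def to_pacific(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' timestamp given in BST and convert it to PDT."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=BST)
    return local.astimezone(PDT)


if __name__ == "__main__":
    converted = to_pacific("2024-08-29 21:38:05")
    print(converted.strftime("%Y-%m-%d %I:%M:%S %p %Z"))
    # prints "2024-08-29 01:38:05 PM PDT"
```

Under these assumptions, the provider's component replacement note was logged at about 1:38 PM Pacific Time on August 29th.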

    We will keep our customers updated throughout this incident. We apologize for the inconvenience; however, this issue is not in our direct control.

    We recommend subscribing to this incident through email - https://status.winternode.com/cm0fynyjy00271jjf1rhsvohj/subscribe/email

  • Investigating
    Investigating

    Following the previous incident, an automated alert was triggered by our monitoring system at 11:21 PM Pacific Time on August 28th, and our team has been notified. We are currently investigating this incident and monitoring our communications with our service provider to ensure intervention takes place.