Affected
Major outage from 6:27 PM to 9:23 PM, Operational from 9:23 PM to 11:07 PM, Degraded performance from 11:07 PM to 3:30 PM
- Resolved
At 7:20 AM today, our service provider responded that they still believe this is a software-related issue and that we need to consult a "Linux professional" for further assistance.
We have applied the latest patches to the node to reduce the likelihood that a software bug is the cause. However, further intervention may be required on our end.
Since the last reboot, detected at 4:46 PM Pacific Time on August 25th, 2024, we have not received any new notifications of unexpected reboots or issues with customers' services. For this reason, we will be closing this incident and evaluating our options going forward.
- Monitoring
The technical support team of our service provider is only available from 8 AM to 6 PM, Monday to Friday. While we have tried other avenues, we cannot reach the team that handles interventions unless an intervention is active. However, our ticket has been prioritized and should hopefully be answered at the beginning of their shift, about 3 hours from now. Since our last update, a few more unexpected reboots have occurred, but the machine has been stable for the last few hours.
We will still await the response from our service provider to ensure we can reduce the likelihood of another long incident.
- Update
We have requested more information from our service provider about this issue. It appears that a reboot of the machine has re-established network connectivity, allowing customers to turn on their servers. However, we ask that customers on MC.LON3 refrain from turning on their servers if they expect them to stay online for a while, to avoid data loss from non-graceful shutdowns.
We will continue to monitor the situation from our side and relay any additional updates as they arise. We appreciate your patience.
- Investigating
Another automated alert was triggered by our monitoring system at 2:55 PM. An intervention request was automatically opened and subsequently closed at 3:05 PM with the test results and a note that the issue is believed to be "software related and cannot be fixed by the DC technicians".
For context, when the service provider closed their intervention at 1:43 PM Pacific Time, they had replaced the hardware with a "known-working spare server" that has "recently passed our extensive preparation checks" and moved our drives over from the previous machine.
Our team will now take a look at the IPMI and investigate why this might be the case.
- Monitoring
Our service provider concluded their intervention at 1:43 PM Pacific Time. Since then, we have made the necessary changes on our side to restore networking on the machine. We will be monitoring for any additional issues. All customers on MC.LON3 are encouraged to power on their servers and notify our Support Team if they face any issues starting them.
Further information about this incident will be provided in the "Resolved" Incident Update.
- Update
As of 11:33 AM Pacific Time, we are still waiting for intervention by the service provider. Our monitoring system has been reporting ping checks that briefly alternate between up and down. We are unable to remotely check the server status until the intervention has concluded. We apologize for the inconvenience.
- Investigating
An automated alert has been triggered by our monitoring system and our team has been notified. We are currently investigating this incident on our side and with the provider.