Serious downtime on our server on August 5th, 2020

Hello dear clients,
Yesterday, around 8:45 AM, a client reported that they could not connect to the internal messenger system (via Pidgin).
When I investigated the server, everything seemed to be OK, except that I eventually discovered in the logs that the MySQL server was failing to retrieve data from its connection channels.
The server had been running continuously for over 2 months, so I thought it would be good to restart it to clean up the connection pools and so on.

However, at restart, I found that 3 out of the 8 drives installed in the system were (apparently) failing, which was suspicious since they are all SSDs, and it is very unlikely for this to happen.
So I contacted the technician at the datacenter and asked him to remove and re-insert the drives so that they would get “reloaded” (since it is a server machine with plug’n’play functionality, this shouldn’t have been a problem).

But something weird happened: when the drives were re-inserted in the server, the internal RAID system failed to rebuild the mirrors on them, leaving me with 2 damaged partitions (they had the dirty bit set).
So, at startup, Windows tried to check the drives and fix any possible errors, a process that showed me no ETA or progress information.

I attempted 2 restarts of the server, hoping that interrupting the disk check process might speed up getting the server back on its feet, but this attempt was unsuccessful.

At some point around 7 PM last night, I successfully booted the server into a “live OS” to see whether data integrity was a problem and to attempt other means of fixing the issue.
I used Hiren’s BootCD to load various tools and try to fix the issue faster.
Around 2 AM I had the 2 partitions successfully repaired, but unfortunately the Windows partition had serious damage to some files required by Windows, and loading the operating system failed. This left me with the option of reinstalling Windows on the server, which I started.

Around 4 AM my eyes were leaking, so I took a 2-hour nap, as it is better to stay focused than to damage something else due to lack of sleep.

This morning around 6:30 AM I resumed installing and configuring Windows on the server, and I expect that everything will be up and running again in a few hours.

Later edit: on August 6th, around 18:00 CET, the server was back online and serving pages again.
