Hot Swapping

Introduction

With hot swapping, you can replace drives (HDDs/SSDs) while the system is running to minimize server downtime if a drive fails. Please read this article to help you prepare and perform a hot-swap exchange.

Compatibility

The majority of our new server models are hot-swap capable.

You can check whether your server is hot-swap capable on Robot. Go to the server and click on the "Support" tab. Then, at the bottom of the new window, click on "Technical". Under "What kind of technical problem are you facing?", click on "Drive is broken". Now scroll down until you see "Replacement options". If you see the option "Swap while the system is running", your server is hot-swap capable.

Important notes

Before you start the hot-swap process, you should generally remove the drive that you want replaced from the RAID. This helps prevent any further damage to the drive during the exchange. Please also take great care to enter the correct serial number for the defective drive. If you can no longer read the serial number of the defective drive, tell us this clearly and give us the serial numbers of all functional drives instead.

Procedure

Hardware RAID

If you are using a RAID controller with the server, you can exchange the drives via hot-swap; this is true for all operating systems. Currently at Hetzner, we have Adaptec and LSI RAID controllers.

You can find information about the controllers here:

To request a drive exchange, write a support request as normal via your Robot account.

Below are some examples:

Important: These are examples only. You need to adapt the steps and especially the command parameters to YOUR specific system!

LSI controller

Example configuration: Debian installation on a RAID 1 array with two SSDs. The command line tools MegaCli64 and StorCLI are available.

StorCLI:
- You can find "StorCLI", for example, at http://mirror.hetzner.com/tools/LSI/tools/StorCLI/MR_SAS_StorCLI_1.17.08.zip. (You can convert the RPM package to a deb package using alien and then install it).
- Create an alias to make it easier to use:
```
alias storcli='/opt/MegaRAID/storcli/storcli64'
```
In this example, let's imagine that there is a defective SSD at slot 0.
1. You can find the status and serial numbers (Inquiry Data) with the following command, for example:
```
storcli /c0/eALL/sALL show all | egrep 'Device attributes|SN = | Intf | SATA'
```
2. If the defective drive does not yet have the status 'offline', set it to 'offline' with storcli:
```
storcli /c0/e252/s0 set offline
```
3. Now mark the SSD as missing:
```
storcli /c0/e252/s0 set missing
```
4. Now write a support request via Robot and ask for the drive exchange.
5. After our team has exchanged the drive, check the new drive's status:
```
storcli /c0/eall/sall show
```
6. If the rebuild does not start on its own, start the rebuild manually.
```
storcli /c0/e252/s0 start rebuild
```
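The StorCLI steps above always address the drive as `/cX/eY/sZ`, so the values need to be adapted in several places. The following is a minimal sketch of a dry-run helper that only prints the commands from steps 2, 3, and 6 for a given controller, enclosure, and slot. It does not execute anything. The function names are hypothetical, and the values 0, 252, and 0 are taken from the example only.

```shell
#!/bin/sh
# Dry-run helpers (hypothetical names): they print the StorCLI commands from
# the steps above for one drive and never run them.
# Arguments: controller number, enclosure ID, slot number.
storcli_prepare_cmds() {
    drive="/c$1/e$2/s$3"
    echo "storcli ${drive} set offline"
    echo "storcli ${drive} set missing"
}

storcli_rebuild_cmd() {
    echo "storcli /c$1/e$2/s$3 start rebuild"
}

# Example values from this article; replace them with your own.
storcli_prepare_cmds 0 252 0
# prints:
#   storcli /c0/e252/s0 set offline
#   storcli /c0/e252/s0 set missing
```

Review the printed lines against the output of step 1 before you run them. Note that aliases are usually not expanded in non-interactive scripts, so a script would need the full path to `storcli64` instead of the alias.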
MegaCli64:
- You can find MegaCli64 at http://download.hetzner.com/tools/LSI/tools/MegaCLI/8.07.10_MegaCLI_Linux.zip. (You can convert the RPM package to a deb package using alien and then install it).
- The tool is quite tolerant regarding the notation of parameters. You can enter parameters with or without a hyphen, and they are case-insensitive.
- Create an alias to make it easier to use:
```
alias megacli='/opt/MegaRAID/MegaCli/MegaCli64'
```
In this example, let's imagine that there is a defective SSD at slot 0.
1. You can find the status and serial numbers (Inquiry Data) with the following command, for example:
```
megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
```
2. If the defective drive does not yet have the status (firmware state) 'offline', set it to 'offline' with MegaCli:
```
megacli pdoffline physdrv[252:0] a0
```
3. Now mark the SSD as missing:
```
megacli pdmarkmissing physdrv[252:0] a0
```
4. Then prepare it for the exchange:
```
megacli pdprprmv physdrv[252:0] a0
```
5. Now write a support request via Robot and ask for the drive exchange.
6. After our team has exchanged the drive, check the new drive's status:
```
megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
```
7. If the rebuild does not start on its own, start it manually.
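Step 7 does not name a command. MegaCli offers the `PDRbld` option for rebuilds (in its usual notation `-PDRbld -Start -PhysDrv [E:S] -aN`, with `-ShowProg` to display the progress). This is not taken from the steps above, so please check it against the help output of your MegaCli version. The sketch below only prints the two commands in the hyphen-free notation used in this example. The function name is hypothetical.

```shell
#!/bin/sh
# Dry-run helper (hypothetical name): prints the MegaCli commands to start a
# rebuild and to show its progress. Nothing is executed.
# Arguments: adapter number, enclosure ID, slot number.
megacli_rebuild_cmds() {
    drive="physdrv[$2:$3]"
    echo "megacli pdrbld start ${drive} a$1"
    echo "megacli pdrbld showprog ${drive} a$1"
}

# Example values from this article; replace them with your own.
megacli_rebuild_cmds 0 252 0
# prints:
#   megacli pdrbld start physdrv[252:0] a0
#   megacli pdrbld showprog physdrv[252:0] a0
```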

Adaptec controller

Example configuration: Debian installation on a RAID 1 array with two drives.

You need the command line tool arcconf. You can find this tool and the required C++ library at http://download.hetzner.com/tools/Adaptec/tools/.
The defective drive is connected to slot 0.

You can find the status and serial numbers with the following command, for example:

arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"

If the defective drive does not yet have the status 'failed', set this status:
```
arcconf setstate 1 device 0 0 ddd
```
Now write a support request via Robot and ask for the drive exchange.

After our team has exchanged the drive, check the new drive's status:

arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"

If the rebuild does not start on its own, start it manually.

Software RAID

In principle, hot swapping is also possible for drives on the SATA controller. The operating system recognizes the change of the connection status at the respective controller and recognizes the new drive as soon as it is connected. The steps you need to take differ depending on the operating system and configuration.

Below are some examples:

Important: These are just examples. You need to adjust the steps and especially the command parameters to YOUR specific system!

Linux

You can find information and a detailed example scenario for replacing drives in Linux software RAID at: Hard disk replacement in software RAID

Windows

Important: With Windows, it is not possible to hot-swap the plex that the system was started from. Therefore, you need to boot the system from the intact plex before starting the hot-swap process. (Microsoft also refers to mirroring as plexing, so a "plex" is one part of a mirrored volume.)

In the following example, let's imagine that the server has a Hetzner standard installation of Windows Server in UEFI mode with two drives and mirroring. The defective drive is disk 1 (secondary plex). The system was started from the primary plex.

Remove HDD/SSD from the RAID.

In Disk Management (diskmgmt.msc), open the context menu of Volume C: and select "Remove Mirroring".

Read the serial number of the defective or intact HDD/SSD with diskid32.exe.
Make a support request and ask our team to replace the drive (hot swapping).
After our team has exchanged the drive, start diskpart.
Prepare drive / create partitions based on the intact HDD/SSD.

If the replacement HDD/SSD is not detected:
```
DISKPART> rescan
```
Display the drives:
```
DISKPART> list disk
```
If the defective drive is displayed as M1 (missing):
```
DISKPART> select disk M1
DISKPART> delete disk
```
Convert the replacement drive to a dynamic disk with GPT.
Create and format the EFI partition and assign drive letter E to it.

Add HDD/SSD to mirror C and wait until synchronization is complete.

DISKPART> select disk 1
DISKPART> convert gpt
DISKPART> create partition efi size=200
DISKPART> format fs=fat32 quick
DISKPART> assign letter=e
DISKPART> convert dynamic
DISKPART> select volume c
DISKPART> add disk 1 wait

Assign the letter x to the EFI partition of the intact HDD/SSD.

DISKPART> select disk 0
DISKPART> select part 1
DISKPART> assign letter=x
DISKPART> exit

EFI partition and boot manager:

In the example, the EFI partitions have been assigned the following drive letters:
- x: existing EFI partition
- e: newly created EFI partition on the replacement drive

First of all, you should back up the system BCD store (here to the file BCD_backup in the current directory), so that you can later undo any changes using bcdedit /import:
```
bcdedit /export BCD_backup
```
Recursively copy the EFI partition, but skip the BCD store and the System Volume Information folder:
```
robocopy x:\ e:\ * /e /copyall /dcopy:t /xf BCD.* /xd "System Volume Information"
```
Now export the system BCD store to the replacement drive with bcdedit:
```
bcdedit /export e:\EFI\Microsoft\Boot\BCD
```

Both boot managers can now start either of the two boot plexes.

Under certain circumstances, you may need to make further adjustments to the BCD store (e.g. if there is still an orphaned boot entry). You can find more information at: http://download.microsoft.com/download/6/E/E/6EE26977-FAA0-41CC-8BDA-7A0C5E6EB9CC/Configuring%20Disk%20Mirroring%20for%20Windows%20Server%202012.docx.

FreeBSD

gmirror + UFS:

Example configuration: FreeBSD installation with UFS and gmirror with the following arrays:
```
/dev/mirror/boot (ada0p1 + ada1p1)
/dev/mirror/swap (ada0p2 + ada1p2)
/dev/mirror/root (ada0p3 + ada1p3)
```
The defective HDD/SSD is ada1.
1. Remove the defective HDD/SSD from the RAID.
- Check the status:
```
gmirror status
```
- Disable partitions of the defective HDD/SSD if necessary:
```
gmirror deactivate boot ada1p1
gmirror deactivate swap ada1p2
gmirror deactivate root ada1p3
```
- "Forget" partitions of the defective HDD/SSD:
```
gmirror forget boot
gmirror forget swap
gmirror forget root
```
2. Find the serial number of the defective HDD/SSD:
- For example, with smartctl from the smartmontools package:
```
smartctl -a /dev/ada1 |grep -i serial
```
- Or using camcontrol:
```
camcontrol identify /dev/ada1 |grep -i serial
```
3. Now write a support request via Robot and ask for the drive exchange.
4. After the exchange is complete, copy the partition table from ada0 to ada1:
```
gpart backup ada0 | gpart restore ada1
```
NOTE: Currently, there appears to be a bug in FreeBSD 11 that prevents FreeBSD from restoring the partition table, which may prevent booting from the replaced drive. If you encounter this problem, please see the FreeBSD Forum post.
5. Add the partitions of the replacement HDD/SSD to gmirror:
```
gmirror insert boot ada1p1
gmirror insert swap ada1p2
gmirror insert root ada1p3
```
6. Install the boot code on the replacement HDD/SSD:
```
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada1
```
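After `gmirror insert`, the mirrors need some time to synchronize, and `gmirror status` shows their state. The following is a minimal sketch of a filter that prints only the mirrors that are not yet complete. It assumes the usual three-column layout of `gmirror status` (Name, Status, Components). The function name is hypothetical, and the sample text is illustrative, not output captured from a real server.

```shell
#!/bin/sh
# Reads `gmirror status` output on stdin and prints the name of every mirror
# whose Status column is not COMPLETE. Continuation lines that list only a
# component are skipped because they do not start with "mirror/".
gmirror_incomplete() {
    awk '$1 ~ /^mirror\// && $2 != "COMPLETE" { print $1 }'
}

# Illustrative sample only, not output captured from a real server.
sample_status='       Name    Status  Components
mirror/boot  COMPLETE  ada0p1 (ACTIVE)
                       ada1p1 (ACTIVE)
mirror/root  DEGRADED  ada0p3 (ACTIVE)
                       ada1p3 (SYNCHRONIZING, 12%)'

printf '%s\n' "$sample_status" | gmirror_incomplete   # prints: mirror/root
```

On the server, you would run `gmirror status | gmirror_incomplete`. Empty output means that all mirrors report COMPLETE.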

ZFS

Sample configuration: FreeBSD installation using ZFS with the following arrays:

/dev/mirror/boot (ada0p1 + ada1p1)
/dev/mirror/swap (ada0p2 + ada1p2)

ZFS pool zroot with mirroring via gpt/root0 (GPT label for ada0p3) and gpt/root1 (GPT label for ada1p3)

The defective HDD/SSD is ada0.

(The two gmirror mirrors boot and swap are handled according to the above procedure).

If ZFS handles the mirroring, you also need to check the state of the mirror before the drive is replaced and, if necessary, set the corresponding partition (gpt/root0 in the following example) to offline:

zpool status
 pool: zroot
state: ONLINE
 scan: none requested
config:
       NAME           STATE     READ WRITE CKSUM
       zroot          ONLINE       0     0     0
         mirror-0     ONLINE       0     0     0
           gpt/root0  ONLINE       0     0     0
           gpt/root1  ONLINE       0     0     0
zpool offline zroot gpt/root0
zpool status
 pool: zroot
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
       Sufficient replicas exist for the pool to continue functioning in a
       degraded state.
action: Online the device using 'zpool online' or replace the device with
       'zpool replace'.
 scan: none requested
config:
       NAME                     STATE     READ WRITE CKSUM
       zroot                    DEGRADED     0     0     0
         mirror-0               DEGRADED     0     0     0
           8894732708877724737  OFFLINE      0     0     0  was /dev/gpt/root0
           gpt/root1            ONLINE       0     0     0

gmirror deactivate boot ada0p1
gmirror deactivate swap ada0p2
gmirror forget boot
gmirror forget swap

If you use GPT labels like in the example, you can find the assignment to the drive using gpart:

gpart list | grep -Ei 'geom|label'
Geom name: ada0
label: boot0
label: swap0
label: root0
Geom name: ada1
label: boot1
label: swap1
label: root1

Find the serial number of the defective HDD/SSD:

For example, with smartctl from the smartmontools package:
```
smartctl -a /dev/ada0 |grep -i serial
```

Or via camcontrol:

camcontrol identify /dev/ada0 |grep -i serial
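Because the serial number goes into the support request, it can help to extract just the value instead of copying it by hand. The following is a minimal sketch that assumes the `Serial Number:` line that `smartctl -a` prints for the drive. The function name is hypothetical, and the sample input is made up for illustration.

```shell
#!/bin/sh
# Reads `smartctl -a` output on stdin and prints only the value of the
# first "Serial Number:" line (hypothetical helper name).
serial_from_smartctl() {
    awk -F': *' '/^Serial [Nn]umber:/ { print $2; exit }'
}

# Made-up sample input for illustration only.
printf 'Device Model:     Example SSD\nSerial Number:    EXAMPLE123\n' \
    | serial_from_smartctl   # prints: EXAMPLE123
```

On the server, you would run `smartctl -a /dev/ada0 | serial_from_smartctl`. Always compare the result with the full output before you send it to us.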

Write a support request via Robot and ask our team to replace the drive. Make sure to include the correct serial number of the drive. After the exchange, transfer the partition table via gpart, repair the gmirror mirrors, and install the boot code:
```
gpart backup ada1 | gpart restore ada0
gmirror insert boot ada0p1
gmirror insert swap ada0p2
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
```
Then adjust the GPT label of the ZFS partition (in this case the third, i.e. ada0p3) of the replacement drive (gpt/root0):
```
gpart modify -i 3 -l root0 ada0
```

The new device can now replace the failed part of the mirror:

zpool replace zroot gpt/root0
zpool status -x
all pools are healthy
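After `zpool replace`, the pool needs some time to resilver, so `zpool status -x` may not report `all pools are healthy` right away. The following is a minimal sketch of a polling helper built around that message. The function names and the `POLL_SECONDS` variable are hypothetical, and the default interval of 60 seconds is an arbitrary assumption.

```shell
#!/bin/sh
# Succeeds only if stdin is exactly the message that `zpool status -x`
# prints when nothing is wrong (hypothetical helper name).
pools_healthy() {
    [ "$(cat)" = "all pools are healthy" ]
}

# Runs the given command repeatedly until its output passes pools_healthy.
# POLL_SECONDS is a hypothetical variable; 60 seconds is an arbitrary default.
wait_for_healthy() {
    until "$@" | pools_healthy; do
        sleep "${POLL_SECONDS:-60}"
    done
}

# Usage on the server (not executed here):
#   wait_for_healthy zpool status -x && echo "resilver finished"
```

The loop runs until the message appears, so if the pool has an unrelated problem, you need to stop it manually and check `zpool status`.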

For detailed information on configuring and managing the ZFS file system, see the Oracle documentation: Oracle ZFS Documentation (English)