Hot Swapping

Last change on 2023-01-30 • Created on 2020-03-20 • ID: RO-B3311

Introduction

With hot swapping, you can replace drives (HDDs/SSDs) while the system is running to minimize server downtime if a drive fails. Please read this article to help you prepare and perform a hot-swap exchange.

Compatibility

The majority of our new server models are hot-swap capable.

You can check whether your server is hot-swap capable on Robot. Go to the server and click on the "Support" tab. Then, in the new window, click on "Technical" at the bottom. Under "What kind of technical problem are you facing?", click on "Drive is broken". Now scroll down until you see "Replacement options". If you see the option "Swap while the system is running", your server is hot-swap capable.

Important notes

As a rule, remove the drive that you want replaced from the RAID before you start the rest of the hot-swap process. This helps prevent any further damage to the drive during the exchange. Also make sure that you enter the correct serial number for the defective drive. If you can no longer read the serial number of the defective drive, tell us this clearly and give us the serial numbers of all drives that are still functional.
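Collecting the serial numbers of all drives before you open the request makes it easier to name the right one. The following is a minimal sketch, assuming a Linux system with smartmontools installed and SATA drives that appear as /dev/sda, /dev/sdb, and so on. The function names extract_serial and list_serials are hypothetical helpers, not part of any tool.

```shell
# Sketch only: helper names and the /dev/sd? device pattern are assumptions.

# Read `smartctl -i` output on stdin and print the value of the serial number field.
extract_serial() {
    awk -F': *' 'tolower($1) == "serial number" { print $2; exit }'
}

# Print "<device> <serial>" for every drive matching /dev/sd?.
# A drive that no longer answers shows an empty serial. That is the case
# in which you would report the serials of the working drives instead.
list_serials() {
    for dev in /dev/sd?; do
        [ -e "$dev" ] || continue
        printf '%s %s\n' "$dev" "$(smartctl -i "$dev" 2>/dev/null | extract_serial)"
    done
}
```

Running list_serials as root prints one line per drive. Adjust the device pattern if your drives use other names, for example NVMe devices.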

Procedure

Hardware RAID

If you are using a RAID controller with the server, you can exchange the drives via hot swap, regardless of the operating system. Currently at Hetzner, we have Adaptec and LSI RAID controllers.

You can find information about the controllers here:

To request a drive exchange, write a support request as normal via your Robot account.

Below are some examples:

Important: These are examples only. You need to adapt the steps and especially the command parameters to YOUR specific system!

LSI controller

Example configuration: Debian installation on a RAID 1 array with two SSDs. The command line tools MegaCli64 and StorCLI are available.

  • StorCLI:

    In this example, let's imagine that there is a defective SSD at slot 0.

    1. You can find the status and serial numbers (Inquiry Data) with the following command, for example:

      storcli /c0/eALL/sALL show all | egrep 'Device attributes|SN = | Intf | SATA'
    2. If the defective drive does not yet have the status 'offline', set it to 'offline' with storcli:

      storcli /c0/e252/s0 set offline
    3. Now mark the SSD as missing:

      storcli /c0/e252/s0 set missing
    4. Now write a support request via Robot and ask for the drive exchange.

    5. After our team has exchanged the drive, check the new drive's status:

      storcli /c0/eall/sall show
    6. If the rebuild does not start on its own, start the rebuild manually.

      storcli /c0/e252/s0 start rebuild
  • MegaCli64:

    • You can find MegaCli64 at http://download.hetzner.com/tools/LSI/tools/MegaCLI/8.07.10_MegaCLI_Linux.zip. (You can convert the RPM package to a deb package using alien and then install it).
    • The tool is quite tolerant regarding the notation of parameters. You can enter parameters with or without a hyphen, and they are case-insensitive.
    • Create an alias to make it easier to use:
      alias megacli='/opt/MegaRAID/MegaCli/MegaCli64'

    In this example, let's imagine that there is a defective SSD at slot 0.

    1. You can find the status and serial numbers (Inquiry Data) with the following command, for example:

      megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
    2. If the defective drive does not yet have the status (firmware state) 'offline', set it to 'offline' with MegaCli:

      megacli pdoffline physdrv[252:0] a0
    3. Now mark the SSD as missing ...

      megacli pdmarkmissing physdrv[252:0] a0
    4. ... and prepare it for the exchange:

      megacli pdprprmv physdrv[252:0] a0
    5. Now write a support request via Robot and ask for the drive exchange.

    6. After our team has exchanged the drive, check the new drive's status:

      megacli pdlist a0 | grep -Ei 'enclosure|slot|firmware state|inquiry'
    7. If the rebuild does not start on its own, start it manually.
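Because the controller, enclosure, and slot numbers differ between systems, it can help to preview the exact commands before running anything. The sketch below only prints the preparation commands from the two LSI examples above with your values filled in; it does not execute them. The function names are hypothetical, and the defaults (controller 0, enclosure 252, slot 0) are simply the example values.

```shell
# Sketch only: prints the commands from the examples above, does not run them.
# Arguments: controller, enclosure, slot (defaults are the example values).

storcli_prepare_plan() {
    ctrl="${1:-0}"; encl="${2:-252}"; slot="${3:-0}"
    printf 'storcli /c%s/e%s/s%s set offline\n' "$ctrl" "$encl" "$slot"
    printf 'storcli /c%s/e%s/s%s set missing\n' "$ctrl" "$encl" "$slot"
}

megacli_prepare_plan() {
    ctrl="${1:-0}"; encl="${2:-252}"; slot="${3:-0}"
    for action in pdoffline pdmarkmissing pdprprmv; do
        printf 'megacli %s physdrv[%s:%s] a%s\n' "$action" "$encl" "$slot" "$ctrl"
    done
}

# Preview the commands for the drive in enclosure 252, slot 0 on controller 0.
storcli_prepare_plan 0 252 0
megacli_prepare_plan 0 252 0
```

Compare the printed slot against the serial number reported by the status command before you run any of the lines yourself.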

Adaptec controller

Example configuration: Debian installation on a RAID 1 array with two drives.

  1. You can find the status and serial numbers with the following command, for example:

    arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"
  2. If the defective drive does not yet have the status 'failed', set this status:

    arcconf setstate 1 device 0 0 ddd
  3. Now write a support request via Robot and ask for the drive exchange.

  4. After our team has exchanged the drive, check the new drive's status:

    arcconf getconfig 1 pd | egrep "Device #|State\>|Reported Location|Reported Channel|Serial|S.M.A.R.T. warnings"
  5. If the rebuild does not start on its own, start it manually.

Software RAID

In principle, hot swapping is also possible for drives on the SATA controller. The operating system detects the change in connection status at the respective controller and recognizes the new drive as soon as it is connected. The steps you need to take differ depending on the operating system and configuration.

Below are some examples:

Important: These are just examples. You need to adjust the steps and especially the command parameters to YOUR specific system!

Linux

You can find information and a detailed example scenario for replacing drives in Linux software RAID at: Hard disk replacement in software RAID
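Before and after the exchange, it is useful to know which arrays are running without a member. The following is a minimal sketch, assuming the standard Linux md driver, which reports array state in /proc/mdstat with a member map such as [UU] (all members present) or [U_] (one member missing). The helper name degraded_md_arrays is hypothetical.

```shell
# Sketch only: `degraded_md_arrays` is a hypothetical helper name.
# Reads /proc/mdstat-style text on stdin and prints each array whose
# member map (for example [U_]) contains an underscore, i.e. a missing member.
degraded_md_arrays() {
    awk '
        /^md/ { name = $1 }                  # remember the array of the current block
        /\[[U_]*_[U_]*\]/ { print name }     # member map with at least one "_"
    '
}
```

Usage: degraded_md_arrays < /proc/mdstat. Empty output means that no array reports a missing member.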

Windows

Important: With Windows, it is not possible to hot-swap the plex that the system was started from. Therefore, you need to boot the system from the intact plex before starting the hot-swap process. (Microsoft also refers to mirroring as plexing, so a "plex" is one part of a mirrored volume.)

In the following example, let's imagine that the server has a Hetzner standard installation of Windows Server in UEFI mode with two drives and mirroring. The defective drive is disk 1 (secondary plex). The system was started from the primary plex.

  1. Remove the HDD/SSD from the RAID.

    In Disk Management (diskmgmt.msc), open the context menu of Volume C: and select "Remove Mirroring".

  2. Read the serial number of the defective or intact HDD/SSD with diskid32.exe.

  3. Make a support request and ask our team to replace the drive (hot swapping).

  4. After our team has exchanged the drive, start diskpart.

  5. Prepare the drive and create partitions based on the intact HDD/SSD.

  • If the replacement HDD/SSD is not detected:

    DISKPART> rescan
  • Display the drives:

    DISKPART> list disk
  • If the defective drive is displayed as M1 (missing):

    DISKPART> select disk M1
    DISKPART> delete disk
  • Convert the replacement drive to a dynamic disk with GPT.

  • Create and format the EFI partition and assign drive letter E to it.

  • Add the HDD/SSD to mirror C and wait until synchronization is complete.

    DISKPART> select disk 1
    DISKPART> convert gpt
    DISKPART> create partition efi size=200
    DISKPART> format fs=fat32 quick
    DISKPART> assign letter=e
    DISKPART> convert dynamic
    DISKPART> select volume c
    DISKPART> add disk 1 wait
  • Assign the letter x to the EFI partition of the intact HDD/SSD.

    DISKPART> select disk 0
    DISKPART> select part 1
    DISKPART> assign letter=x
    DISKPART> exit
  6. EFI partition and boot manager:

    In the example, the EFI partitions have been assigned the following drive letters:

    x: existing EFI partition
    e: newly created EFI partition on the replacement drive

  • First, back up the system BCD store (here to the file BCD_backup in the current directory) so that you can undo any later changes using bcdedit /import:

    bcdedit /export BCD_backup
  • Recursively copy the EFI partition, but skip the BCD store and the System Volume Information folder:

    robocopy x:\ e:\ * /e /copyall /dcopy:t /xf BCD.* /xd "System Volume Information"
  • Now export the system BCD store to the replacement drive with bcdedit:

    bcdedit /export e:\EFI\Microsoft\Boot\BCD

You can now start the boot manager from either of the two boot plexes.

Under certain circumstances, you may need to make further adjustments to the BCD store (e.g. if there is still an orphaned boot entry). You can find more information at: http://download.microsoft.com/download/6/E/E/6EE26977-FAA0-41CC-8BDA-7A0C5E6EB9CC/Configuring%20Disk%20Mirroring%20for%20Windows%20Server%202012.docx.

FreeBSD

  • gmirror + UFS:

    Example configuration: FreeBSD installation with UFS and gmirror with the following arrays:

    /dev/mirror/boot (ada0p1 + ada1p1)
    /dev/mirror/swap (ada0p2 + ada1p2)
    /dev/mirror/root (ada0p3 + ada1p3)

    The defective HDD/SSD is ada1.

    1. Remove the defective HDD/SSD from the RAID.
    • Check the status:

      gmirror status
    • Disable partitions of the defective HDD/SSD if necessary:

      gmirror deactivate boot ada1p1
      gmirror deactivate swap ada1p2
      gmirror deactivate root ada1p3
    • "Forget" partitions of the defective HDD/SSD:

      gmirror forget boot
      gmirror forget swap
      gmirror forget root
    2. Find the serial number of the defective HDD/SSD:
    • For example, with smartctl from the smartmontools package:

      smartctl -a /dev/ada1 | grep -i serial
    • Or using camcontrol:

      camcontrol identify /dev/ada1 | grep -i serial
    3. Now write a support request via Robot and ask for the drive exchange.

    4. After the exchange is complete, copy the partition table from ada0 to ada1:

      gpart backup ada0 | gpart restore ada1

    NOTE: Currently, there appears to be a bug in FreeBSD 11 that prevents FreeBSD from restoring the partition table, which may prevent booting from the replaced drive. If you encounter this problem, please see the FreeBSD Forum post.

    5. Add the partitions of the replacement HDD/SSD to gmirror:

      gmirror insert boot ada1p1
      gmirror insert swap ada1p2
      gmirror insert root ada1p3
    6. Install boot code on the replacement HDD/SSD:

      gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 ada1
  • ZFS

    Sample configuration: FreeBSD installation using ZFS with the following arrays:

    /dev/mirror/boot (ada0p1 + ada1p1)
    /dev/mirror/swap (ada0p2 + ada1p2)

    ZFS pool zroot with mirroring via gpt/root0 (GPT label for ada0p3) and gpt/root1 (GPT label for ada1p3)

    The defective HDD/SSD is ada0.

    (The two gmirror mirrors boot and swap are handled according to the above procedure).

    1. If you use ZFS for mirroring, also check the state of the mirror before the replacement and, if necessary, set the corresponding partition (gpt/root0 in the following example) to offline:

      zpool status
       pool: zroot
      state: ONLINE
       scan: none requested
      config:
             NAME           STATE     READ WRITE CKSUM
             zroot          ONLINE       0     0     0
               mirror-0     ONLINE       0     0     0
                 gpt/root0  ONLINE       0     0     0
                 gpt/root1  ONLINE       0     0     0
      zpool offline zroot gpt/root0
      zpool status
       pool: zroot
      state: DEGRADED
      status: One or more devices has been taken offline by the administrator.
             Sufficient replicas exist for the pool to continue functioning in a
             degraded state.
      action: Online the device using 'zpool online' or replace the device with
             'zpool replace'.
       scan: none requested
      config:
             NAME                     STATE     READ WRITE CKSUM
             zroot                    DEGRADED     0     0     0
               mirror-0               DEGRADED     0     0     0
                 8894732708877724737  OFFLINE      0     0     0  was /dev/gpt/root0
                 gpt/root1            ONLINE       0     0     0
      
      gmirror deactivate boot ada0p1
      gmirror deactivate swap ada0p2
      gmirror forget boot
      gmirror forget swap
    2. If you use GPT labels like in the example, you can find the assignment to the drive using gpart:

      gpart list | grep -Ei 'geom|label'
      Geom name: ada0
      label: boot0
      label: swap0
      label: root0
      Geom name: ada1
      label: boot1
      label: swap1
      label: root1
    3. Find the serial number of the defective HDD/SSD:

    • For example, with smartctl from the smartmontools package:

      smartctl -a /dev/ada0 |grep -i serial
    • Or via camcontrol:

      camcontrol identify /dev/ada0 |grep -i serial
    4. Write a support request via Robot and ask our team to replace the drive. Make sure to include the correct serial number of the drive. After the exchange, transfer the partition table via gpart, repair the gmirror mirrors, and install the boot code:

      gpart backup ada1 | gpart restore ada0
      gmirror insert boot ada0p1
      gmirror insert swap ada0p2
      gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
    5. Then adjust the GPT label of the ZFS partition (in this case the third, i.e. ada0p3) of the replacement drive (gpt/root0):

      gpart modify -i 3 -l root0 ada0
    6. The new device can now replace the failed part of the mirror:

      zpool replace zroot gpt/root0
      zpool status -x
      all pools are healthy
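
While the mirror is resilvering, zpool status -x does not yet report a healthy pool. To see which entries are still affected, the config table of zpool status can be filtered. The following is a minimal sketch based on the output format shown in the example above; the helper name zpool_not_online is hypothetical.

```shell
# Sketch only: `zpool_not_online` is a hypothetical helper name.
# Reads `zpool status` output on stdin and prints "<name> <state>" for every
# entry in the config table whose state is not ONLINE.
zpool_not_online() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }    # table header
        NF == 0 || $1 == "errors:" || $1 == "pool:" { in_config = 0 }
        in_config && $2 != "ONLINE" { print $1, $2 }
    '
}
```

Usage: zpool status zroot | zpool_not_online. Empty output means that every entry in the table is ONLINE.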

    For detailed information on configuring and managing the ZFS file system, see the Oracle documentation: Oracle ZFS Documentation (English)
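
Step 2 of the ZFS example maps GPT labels to drives by reading the gpart output. When there are many partitions, the lookup can be scripted. The following is a minimal sketch that relies on the "Geom name:" and "label:" lines shown in that step; the helper name disk_for_label is hypothetical.

```shell
# Sketch only: `disk_for_label` is a hypothetical helper name.
# Reads `gpart list` output on stdin and prints the disk (Geom name) that
# carries the GPT label given as the first argument.
disk_for_label() {
    awk -v want="$1" '
        $1 == "Geom" && $2 == "name:" { disk = $3 }          # current disk
        $1 == "label:" && $2 == want  { print disk; exit }   # first match wins
    '
}
```

Usage: gpart list | disk_for_label root0. With the labels from the example, this prints ada0, the drive whose serial number belongs in the support request.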
