              LINUX SCSIRASTOOLS USER GUIDE DOCUMENT

----------------------
   CONTENTS
----------------------
1.0  Overview
2.0  Dependencies
3.0  Tools
  3.1  sgmode
  3.2  sgdskfl
  3.3  sgdefects
  3.4  sgdiag
  3.5  sgraidmon
  3.6  sgdiskmon
4.0  Software RAID Configuration
5.0  Use Case
  5.1  Statistics
6.0  Problems
  6.1  Frequently Asked Questions
7.0  More Information

----------------------
   1.0 OVERVIEW
----------------------
The scsirastools were designed to improve the serviceability of SCSI
devices under Linux, so that the system does not have to be rebooted or
taken out of service to perform common maintenance or service functions.

The following tools are currently provided, and man(8) pages for each
of the tools are also included.
  sgdefects   = tool to read the primary and grown defect lists
  sgdskfl     = tool to load disk firmware to SCSI disks under Linux
  sgdiag      = tool to perform format & other test functions
  sgmode      = tool to get and set SCSI mode pages
  sgraidmon   = tool to monitor software RAID disks for
                hot-insertion/removal
  mdadm       = tool to administer and configure md raid devices
  alarms      = tool to set disk fault LEDs on mBMC platforms
  sgdefects.8 = man page for sgdefects
  sgdskfl.8   = man page for sgdskfl
  sgdiag.8    = man page for sgdiag
  sgmode.8    = man page for sgmode
  sgraidmon.8 = man page for sgraidmon
  mdadm.8     = man page for mdadm
Other scripts are also referenced:
  getmd = tool to parse /etc/raidtab for devices (used in mdmon)
  mdmon = script to monitor md raid for remove/insert events (uses mdadm)
          Note that this does not handle insertion events.
  mdevt = script to take actions when remove/insert events occur
          (uses mdadm)

----------------------
   2.0 DEPENDENCIES
----------------------
Each of the Linux scsirastools depends on the CONFIG_CHR_DEV_SG
parameter being on (y or m) in the kernel configuration
(see /usr/src/linux/.config), which enables the SCSI Generic interface.
The default kernel config files supplied with most distributions should
already have this parameter turned on.
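The CONFIG_CHR_DEV_SG requirement above can be checked before running any
of the tools. The following is a minimal sketch, not part of the
scsirastools package: the function name sg_config_state is made up for
illustration, and it assumes the kernel configuration is available as a
plain-text .config file.

```shell
# Report how a kernel .config sets CONFIG_CHR_DEV_SG.
# Prints "y" (built in), "m" (module), or "off" (not set or file missing).
# sg_config_state is a hypothetical helper name, not a scsirastools command.
sg_config_state() {
    cfg="$1"
    # Only an explicit =y or =m line counts; "# ... is not set" does not match.
    val=$(sed -n 's/^CONFIG_CHR_DEV_SG=\([ym]\)$/\1/p' "$cfg" 2>/dev/null | head -n 1)
    echo "${val:-off}"
}
```

For example, "sg_config_state /usr/src/linux/.config" prints the state for
the kernel source tree mentioned above. If the result is "m", the sg module
may still need to be loaded before the /dev/sg* devices respond.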

The utility mdadm is used by the mdevt & mdmon scripts for sgraidmon.
This makefile will build mdadm version 1.3.0 in a subdirectory.
This version is available at:
  http://www.cse.unsw.edu.au/~neilb/source/mdadm/mdadm-1.3.0.tgz
  http://www.cse.unsw.edu.au/~neilb/source/mdadm/RPM/mdadm-1.3.0-1.i386.rpm
For version updates, etc., see
  http://www.cse.unsw.edu.au/~neilb/source/mdadm/

The utility alarms is used by the mdevt script to set disk LEDs.
The source for it is available from http://ipmiutil.sourceforge.net.

If you have an Adaptec 79XX U320 chipset, you should use the aic79xx
driver version 1.3.2 or greater.
See http://people.freebsd.org/~gibbs/linux/ for the current aic79xx
version.

There are also scsiras kernel patches (kern/*) associated with this
project, which have additional hardening and enhancements, but the
scsirastools utilities are not dependent upon the scsiras kernel
patches.

Note that the sgraidmon utility assumes that the partition tables on
each disk in the raid are the same. It is possible to configure
/etc/raidtab so that this is not the case. If /etc/raidtab shows
partitions on the mirrored disks that do not match, sgraidmon will
display an error message.

The maximum number of devices for the scsirastools utilities is defined
in src/sgcommon.h as MAX_DEVLIST_DEVS. It defaults to 128, and can
easily be increased for larger configurations. Memory consumption is
140 bytes per device in the table.

----------------------
   3.0 TOOLS
----------------------

----------------------
   3.1 SGMODE
----------------------
Note that when the mode pages are updated with this utility, the changes
go into effect immediately and no reset is required. The changes will
not affect the data on the disk unless a decrease in capacity is made
in the Block Descriptor of the mdf file.

The sgmode utility can use the following *.mdf files supplied with
this software package:
  atlasu32.mdf = mode pages for MAXTOR  ATLASU320_36_SCA
  ddys-t09.mdf = mode pages for IBM     DDYS-T09170M
  dpss-309.mdf = mode pages for IBM     DPSS-309170N
  dk32dj-1.mdf = mode pages for HITACHI DK32DJ-18MW
  dk32dj-3.mdf = mode pages for HITACHI DK32DJ-36MW
  man3184m.mdf = mode pages for FUJITSU MAN3183MP
  man3367m.mdf = mode pages for FUJITSU MAN3367MP
  map3147n.mdf = mode pages for FUJITSU MAP3147NC
  st39173w.mdf = mode pages for SEAGATE ST39173W
  st39173l.mdf = mode pages for SEAGATE ST39173LW
  st318406.mdf = mode pages for SEAGATE ST318406LC
  st318452.mdf = mode pages for SEAGATE ST318452LW
  st336605.mdf = mode pages for SEAGATE ST336605LW
  st336752.mdf = mode pages for SEAGATE ST336752LW
  st373405.mdf = mode pages for SEAGATE ST373405LW

* Note that all mdf files included herein have certain parameters set
  for maximum reliability of the disk device in a server environment.
  These settings are:
     01 0a c4       <-- AWRE, ARRE, PER on
     08 12 00       <-- WCE off, not 04
     1c 0a 88 00    <-- SMART off, MRIE off
  See #3 below for explanations.

New *.mdf files can be created by
1) Using "sgmode -r -a" to display the existing mode pages
2) From the screen output use cut/paste, or get the output from
   sgmode.log and save the lines beginning with numbers, starting with
   the ":Block Descriptor" line and ending with the last mode page
   (usually "1c 0a ..."), not including any trailing lines that begin
   with "00 ...".
3) Modify the following three settings, if they are not already set:
     01 0a c4       <-- AWRE, ARRE, PER on
        The top two bits are the most important (0xc0). These enable
        automatic bad-spot recovery. If other bits are different in
        the factory defaults, it should be ok.
     08 12 00       <-- WCE off, not 04
        Write-Cache-Enable will sometimes enhance performance for
        benchmarks, but often does not help for real transaction
        performance, and introduces risk for data integrity if a
        power fault occurs.
     1c 0a 88 00    <-- SMART off, MRIE off
        SMART Informational Exceptions are disk-resident functions
        that attempt to predict disk-based faults. The algorithms used
        are often too aggressive, plus they introduce additional
        complexity to the disk firmware processing that can impact
        performance and even cause problems. Further, most systems
        are not running software that is capable of correctly
        recognizing SMART exceptions if they are sent.
4) Save these lines into a file named xxxxxxxx.mdf, where xxxxxxxx is
   the first 8 characters of the device model (in lower case). The
   device model name is displayed by sgmode, and in sgmode.log.
5) Put the *.mdf files into the /usr/share/scsirastools directory for
   convenience. At run time, sgmode looks first in the current
   directory for the *.mdf files, then in /usr/share/scsirastools.

For more information about SCSI mode pages, refer to the following
official SCSI documentation at http://www.t10.org/drafts.htm.
   SCSI-3 Block Commands (SBC), section 7.1.3, Mode parameters
   SCSI-3 Primary Commands (SPC), section 8.3, Mode parameters

SGMODE(8)                                                    SGMODE(8)

NAME
       sgmode - SCSI Mode Page utility

SYNOPSIS
       sgmode [-aeinorswxI -f modefile -m diskmodel]

DESCRIPTION
       Sgmode is a program that uses the SCSI Generic interface to
       send specific SCSI commands to read the device mode page
       configuration and also write mode pages. The SCSI Generic
       interface requires that the kernel .config file have
       CONFIG_CHR_DEV_SG set. A log file, named sgmode.log, is created
       in the current directory which logs the status of the functions
       and any errors.

       Below is the sequence of events for this utility:

       *  List each device on the system with firmware versions.

       *  User selects a device for mode page (automatic if using -m
          or -a).

       *  Attempt to read the mode page file for the selected disk
          (see option -f).

       *  Verify that the disk is present and ready.

       *  Write the mode pages via mode select (unless read-only is
          specified).

       *  If the '-m' or '-a' option was used, repeat writing the
          pages for each specified disk.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgmode.

       -a     Causes the functions to be performed automatically on
              all connected SCSI devices.

       -e     Causes any file writes to be avoided, such as the log
              file. Usually a log file is created and written to.

       -f     Specify this filename for the mode pages. Normally, this
              option is not used and the filename is formed using the
              first 8 characters of the model, with the ".mdf"
              extension. For example: "st39173w.mdf". This utility
              will look for the mdf file first in the current
              directory, then in the default directory
              (/usr/share/scsirastools).

       -i N   Autorun for the device at Index N, as shown in the "Num"
              column.

       -m     Automatically update all drives that match this model
              string.

       -n     Naming. By default, the device names are displayed as
              numeric sequences (/dev/sg0). This option shows the
              device names as alphabetic sequences (/dev/sga).

       -o     Allow it to overwrite the disk capacity, if different.

       -r     Read-only. Do not try to write any mode pages. By
              default, if -a or -m is specified, the utility will
              attempt to open an .mdf file for each drive, and write
              mode pages if a matching file exists.

       -s     Automatically turn off SMART on all disks. This reads in
              each disk mode page, turns off SMART on page 1c, and
              writes the changes to each disk.

       -w N   Automatically sets the WriteCacheEnable bit on or off on
              all disks. If N = 0, turn off WCE, or if N = 1, turn on
              WCE. This reads in each disk mode page, sets WCE on page
              08, and writes the changes to each disk.

       -x     Causes extra debug messages to be displayed.

       -I     Starts the utility in Interactive mode, requiring user
              input.

----------------------
   3.2 SGDSKFL
----------------------
The sgdskfl utility has been tested with the following firmware image
files for various disk models.
  atlasu32.lod = disk firmware ver B300 for MAXTOR  ATLASU320_36_SCA
  dpss-309.lod = disk firmware ver S9HA for IBM     DPSS-309170N
  dk32dj-1.lod = disk firmware ver AAAA for HITACHI DK32DJ-18MW
  dk32dj-3.lod = disk firmware ver AAAA for HITACHI DK32DJ-36MW
  man3184m.lod = disk firmware ver 0107 for FUJITSU MAN3183MP
  man3367m.lod = disk firmware ver 0107 for FUJITSU MAN3367MP
  st39173l.lod = disk firmware ver 6632 for SEAGATE ST39173LW
  st39173w.lod = disk firmware ver 6621 for SEAGATE ST39173W
  st39173w.tms = disk servo    ver 6015 for SEAGATE ST39173W
  st318452.lod = disk firmware ver 0002 for SEAGATE ST318452LW
  st336752.lod = disk firmware ver 0002 for SEAGATE ST336752LW
  st373405.lod = disk firmware ver 0003 for SEAGATE ST373405LW

Note that these firmware versions could easily become out of date and
are listed here for baseline test purposes. Current disk firmware
image files can be obtained from the respective disk manufacturer.
They can be used with "sgdskfl -f imagefile", or named xxxxxxxx.lod to
be automatically recognized by sgdskfl, where xxxxxxxx is the first 8
characters of the device model. The sgdskfl utility will look for
these files first in the current directory, then, if not found, will
look for them in the default installed directory, which is
/usr/share/scsirastools.

SGDSKFL(8)                                                  SGDSKFL(8)

NAME
       sgdskfl - SCSI Disk Firmware Load utility

SYNOPSIS
       sgdskfl [-aenrx -d devname -f imagefile -m diskmodel
               -t secdelay ]

DESCRIPTION
       Sgdskfl is a program that uses the SCSI Generic interface to
       send specific SCSI commands to download firmware to disk and
       tape devices. The appropriate algorithm is chosen, depending on
       the vendor and product ID of each SCSI device. The SCSI Generic
       interface requires that the kernel .config file have
       CONFIG_CHR_DEV_SG set. A log file, named sgdskfl.log, is
       created in the current directory which logs the status of the
       functions and any errors.
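
The file naming and lookup rule described above (first 8 characters of
the device model, then the current directory before
/usr/share/scsirastools) can be sketched in shell. This is an
illustration of the documented convention only, not the code sgdskfl
itself uses; the function names image_name_for_model and find_image are
hypothetical, and folding to lower case is assumed from the file names
listed above.

```shell
# Build the default image file name for a device model string,
# e.g. "ST336752LW" with extension "lod" gives "st336752.lod".
# Both function names here are made up for this sketch.
image_name_for_model() {
    model="$1"
    ext="${2:-lod}"
    # Keep the first 8 characters and fold them to lower case.
    stem=$(printf '%s' "$model" | cut -c1-8 | tr '[:upper:]' '[:lower:]')
    printf '%s.%s\n' "$stem" "$ext"
}

# Search the current directory first, then the default install directory.
# Prints the path of the first match; returns 1 if no file is found.
find_image() {
    name=$(image_name_for_model "$1" "$2")
    for dir in . /usr/share/scsirastools
    do
        if [ -f "$dir/$name" ]
        then
            echo "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

The same helper covers the sgmode files by passing "mdf" as the second
argument. A file found this way could be handed to "sgdskfl -f" to make
the choice of image explicit.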

       Below is the sequence of events for this utility:

       *  List each device on the system with firmware versions.

       *  User selects a device for firmware load (automatic if using
          -m).

       *  Read the firmware image file for the selected disk and
          verify that it is valid.

       *  Verify that the disk is present and ready.

       *  Close all open files, flush the adapter, sync any data to
          the SCSI disks.

       *  Write the firmware image to the disk using one or more
          'write buffer' SCSI commands.

       *  Wait 5 (or specified number) seconds.

       *  Verify that the disk comes ready again using SCSI
          test_unit_ready commands, and start_unit or scsi_reset to
          recover if not.

       *  If the '-m' or '-a' option was used, repeat writing the
          firmware for each specified disk.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgdskfl.

       -a     Causes the functions to be performed automatically on
              all connected SCSI devices.

       -d     Specify a unix device name. Only perform these functions
              on the specified unix device name.

       -e     Causes any file writes to be avoided, such as the log
              file. Usually a log file is created and written to, up
              until the firmware download begins on root.

       -f     Specify this filename for the firmware image. Normally,
              this option is not used and the filename is formed using
              the first 8 characters of the model, with the ".lod"
              extension. For example: "st39173w.lod". Note that this
              utility uses the raw firmware image without any added
              headers. The utility will look for the firmware image
              file first in the current directory, then in the default
              directory (/usr/share/scsirastools).

       -m     Automatically download all drives that match this model
              string.

       -n     By default, the device names are displayed as alphabetic
              sequences (/dev/sga). This option shows the device names
              as numeric sequences (/dev/sg0).

       -r     Try to recover a non-ready drive by updating its
              firmware. Don't test if the drive is ready or not.

       -t     Specifies the number of seconds to delay after the
              firmware is written, before the program tests whether
              the unit is ready again. Default is 10 seconds.

       -x     Causes extra debug messages to be displayed.

----------------------
   3.3 SGDEFECTS
----------------------

SGDEFECTS(8)                                              SGDEFECTS(8)

NAME
       sgdefects - get SCSI device defect lists

SYNOPSIS
       sgdefects [-aelnvx]

DESCRIPTION
       Sgdefects is a program that uses the SCSI Generic interface to
       send specific SCSI commands to obtain the device defect lists.
       This is useful for analyzing a device's grown defects over time
       to predict when a failure may occur. The SCSI Generic interface
       requires that the kernel .config file have CONFIG_CHR_DEV_SG
       set. A log file, named sgdefects.log, is created in the current
       directory which logs the status of the functions and any
       errors.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgdefects.

       -a     Causes the functions to be performed automatically on
              all connected SCSI devices.

       -e     Causes any file writes to be avoided, such as the log
              file.

       -l     By default, the primary defect list is only obtained if
              the kernel version is 2.4 or greater. If this option is
              specified, both the primary defect list and the grown
              defect list will be obtained, regardless of the kernel
              version.

       -n     By default, the device names are displayed as alphabetic
              sequences (/dev/sga). This option shows the device names
              as numeric sequences (/dev/sg0).

       -v     Causes the defect values to be listed in addition to the
              defect counts. The hex values will be written to the log
              file sgdefects.log.

       -x     Causes extra debug messages to be displayed.

----------------------
   3.4 SGDIAG
----------------------
Note that for certain functions that cause the disk to be busy for any
length of time, such as scsi format and disk firmware load, the
following procedure is recommended for service.
1) Remove the disk from active participation in any RAID.
   This would entail using 'raidhotremove' for Linux software RAID.
   Often the partitions must be marked faulty before they can be
   removed. See "mdevt Fail ..." for this procedure.
2) Perform the function (such as sgdskfl, or sgdiag option f).
3) If the disk has been formatted, and you are using software RAID,
   use 'fdisk' or 'sfdisk' to set up the partitions just as they were
   before. This is not necessary for sgdskfl, or if the disk is in a
   hardware RAID configuration with a RAID adapter.
4) If you are using either software or hardware RAID, restore the
   disk to active participation in the RAID. This would be done via
   'raidhotadd' for Linux software RAID. Note that "mdevt Insert ..."
   can do this automatically.

SGDIAG(8)                                                    SGDIAG(8)

NAME
       sgdiag - Do Diagnostic functions on SCSI devices

SYNOPSIS
       sgdiag [-aentx]

DESCRIPTION
       Sgdiag is a program that uses the SCSI Generic interface to
       send specific SCSI commands to specified SCSI devices. The SCSI
       Generic interface requires that the kernel .config file have
       CONFIG_CHR_DEV_SG set. A log file, named sgdiag.log, is created
       in the current directory which logs the status of the functions
       and any errors.

       Several different functions can be performed with this utility.

       c = Compose Command to send
              This function allows the user to compose a custom SCSI
              command to send to one or more devices. It prompts the
              user for the command length, each command byte, the
              write data length, each write data byte, and the receive
              data length.

       r = Reset SCSI bus
              This function will send a bus reset signal to the SCSI
              bus of the selected device. This may be useful to free
              up a hung device.

       i = Special Ser# Inquiries
              This issues several special SCSI inquiries to obtain the
              serial number of the device. Normally this is not
              needed, since the serial number of most modern devices
              is included in the standard inquiry.

       f = Format SCSI disk
              This function issues a low-level SCSI format command to
              the selected device. This will remap any media defects.

       w = Wipe SCSI disk
              This function issues a low-level SCSI format command to
              the selected device. This will NOT include the previous
              grown defect list. This may be useful in testing to
              cause a disk to expose some media defects. This is
              dangerous and should not be used on production systems.

       d = Send Diagnostic self-test
              This function issues a SCSI Send Diagnostic self-test
              command to the selected device. The device returns a
              sense error if the self-test does not succeed.

       s = Start Unit
              This function issues a SCSI Start Unit command. Some
              disk drives may not be configured to automatically start
              after power up. A reported sense of 02-04-02 indicates
              that this command is needed.

       t = Stop Unit
              This function issues a SCSI Stop Unit command. This
              might be used in testing to simulate a device failure.

       1 = Do bug 1 (sense_len = 18)
              The standard sg structure only defines 16 bytes of sense
              data, whereas 18 bytes are needed to parse some illegal
              request sense keys. This function attempts to request 18
              bytes of sense data from the standard sg structure,
              which should cause a 5/24/00 sense error, but should not
              cause the driver or the utility to hang or stop working.

       2 = Do bug 2 (INQ hang)
              In versions of the old adaptec driver that were
              delivered with RedHat 6.2, this function will hang the
              driver by requesting a receive buffer less than 96
              bytes. Newer versions of the driver should handle this
              function without error (status = 0).

       3 = Do bug 3 (format w short timeout)
              A SCSI format command can take several minutes to
              complete. If the SCSI timeout is set too short, it
              should cause several retries, followed by marking the
              SCSI device offline. The device should return to normal
              after a reboot.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgdiag.

       -a     Automatically send command for all devices. If this
              option is specified, a default illegal SCSI LOG_SENSE
              command will be sent to each device.

       -e     Causes any file writes to be avoided, such as the log
              file.

       -n     Naming. By default, the device names are displayed as
              numeric sequences (/dev/sg0). This option shows the
              device names as alphabetic sequences (/dev/sga).

       -t     Timeout for the format operation, in minutes. The
              default is 150 minutes, which should suffice for 150 GB
              disks. Larger disks require more time. Figure
              approximately 1 minute per GB of disk capacity.

       -x     Causes extra debug messages to be displayed.

----------------------
   3.5 SGRAIDMON
----------------------
You can use the instructions provided in the RAID CONFIGURATION
section below to configure your system in a RAID-1 root mirror
configuration.

The sgraidmon can be run as a foreground application or as a daemon.
It uses the /sbin/mdevt script to control what actions take place when
an insertion or removal event occurs. The default action is to
automatically partition and remirror the Linux partitions on any disk
that is inserted, which is similar to what a hardware RAID adapter
would do. This behavior can be changed by editing /sbin/mdevt to
comment out or modify the "Insert" case as desired.

One additional function that may be desired in mdevt is to use
'snmptrap' to send an SNMP trap to a network control center. Another
custom function might be to explicitly copy any non-Linux partition
contents (such as a Service Partition) to the new disk using the 'dd'
command. If mdevt is changed to do nothing on hot-insertion events,
the administrator would need to manually set up and remirror the disk.

To set up sgraidmon to autostart as a daemon, do the following:
   ln -s /etc/rc.d/init.d/sgraid /etc/rc.d/rc3.d/S84sgraid
   ln -s /etc/rc.d/init.d/sgraid /etc/rc.d/rc5.d/S84sgraid
   ln -s /etc/rc.d/init.d/sgraid /etc/rc.d/rc0.d/K84sgraid
   ln -s /etc/rc.d/init.d/sgraid /etc/rc.d/rc6.d/K84sgraid
The 'chkconfig' utility, or the SuSE 'install_init.d' utility, sets up
these links automatically.
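
On systems without chkconfig or install_init.d, the four links listed
above can be created in one loop that is safe to run more than once.
This is only a sketch: the function name make_sgraid_links and its
optional root-prefix argument are illustrative additions, and the paths
are the ones given in the commands above.

```shell
# Create the rc3/rc5 start links and rc0/rc6 kill links for sgraid.
# An optional first argument is prepended to every link path, so the
# result can be tried out under a scratch directory first.
# make_sgraid_links is a hypothetical helper, not part of scsirastools.
make_sgraid_links() {
    root="${1:-}"
    target=/etc/rc.d/init.d/sgraid
    for link in rc3.d/S84sgraid rc5.d/S84sgraid rc0.d/K84sgraid rc6.d/K84sgraid
    do
        dest="$root/etc/rc.d/$link"
        mkdir -p "$(dirname "$dest")"
        # -f replaces an existing link instead of failing on a rerun
        ln -sf "$target" "$dest"
    done
}
```

Calling "make_sgraid_links" with no argument (as root) writes the links
under /etc/rc.d; calling it with a directory such as /tmp/rctest only
touches that directory.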
To start it manually as a daemon, you can run
   /etc/rc.d/init.d/sgraid start

Note that, while a failure of a disk is equivalent to hot-removal,
hot-insertion of a disk cannot occur unless the hardware supports it.
Hardware that supports hot-insertion has either:
  - A system with a hot-swap disk backplane (SAF-TE), or
  - A system connected to an external disk unit with its own power.

If the partition tables on each disk do not match in /etc/raidtab,
sgraidmon will display an error.

SGRAIDMON(8)                                              SGRAIDMON(8)

NAME
       sgraidmon - SCSI Generic RAID Monitor

SYNOPSIS
       sgraidmon [-bermnx -t time]

DESCRIPTION
       Sgraidmon is a program that uses the SCSI Generic interface to
       monitor SCSI devices which may be part of a software RAID, so
       that hot-removal and hot-insertion events can be detected and
       acted upon. It also queries the hot-swap backplane, if present,
       with SAF-TE commands to detect hot-insertions. If an insertion
       or removal event occurs, this utility invokes a script
       (/sbin/mdevt). For Insert, mdevt partitions the new disk and
       remirrors the Linux partitions. The mdevt script can be
       customized as desired.

       The SCSI Generic interface requires that the kernel .config
       file have CONFIG_CHR_DEV_SG set.

       A log file, named /var/log/sgraidmon.log, is created which logs
       the status of the sgraidmon functions and any errors. A
       separate log file, named /var/log/mdevents, logs messages and
       errors from the mdevt action script.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgraidmon.

       -b     Causes this utility to run in background as a daemon.

       -e     Causes any file writes to be avoided, such as the log
              file.

       -m     Max serial number length of 12 instead of 8 (default).
              This may be needed if you have Fujitsu drives.

       -n     Naming. By default, the device names are displayed as
              numeric sequences (/dev/sg0). This option shows the
              device names as alphabetic sequences (/dev/sga).

       -r     Run once.
              By default, this utility runs in a continuous loop. This
              option causes it to only run one pass.

       -t     Time interval. The default polling interval is 5
              seconds. This option can set the interval to a different
              number of seconds.

       -x     Causes extra debug messages to be displayed.

----------------------
   3.6 SGDISKMON
----------------------

NAME
       sgdiskmon - SCSI Generic RAID Monitor

SYNOPSIS
       sgdiskmon [-bermnx -t time]

DESCRIPTION
       Sgdiskmon is a program that uses the SCSI Generic interface to
       monitor SCSI devices which may be part of a software RAID, so
       that hot-removal and hot-insertion events can be detected and
       acted upon. It also queries the hot-swap backplane, if present,
       with SAF-TE commands to detect hot-insertions. If an insertion
       or removal event occurs, this utility invokes a script
       (/sbin/mdevt). For Insert, mdevt partitions the new disk and
       remirrors the Linux partitions. The mdevt script can be
       customized as desired.

       The SCSI Generic interface requires that the kernel .config
       file have CONFIG_CHR_DEV_SG set.

       A log file, named /var/log/sgdiskmon.log, is created which logs
       the status of the sgdiskmon functions and any errors. A
       separate log file, named /var/log/sgevents, logs messages and
       errors from the sgevt action script.

OPTIONS
       Command line options are described below.

       -?     This option displays a summary of the commands accepted
              by sgdiskmon.

       -b     Causes this utility to run in background as a daemon.

       -e     Causes any file writes to be avoided, such as the log
              file.

       -m     Max serial number length of 12 instead of 8 (default).
              This may be needed if you have Fujitsu drives.

       -n     Naming. By default, the device names are displayed as
              numeric sequences (/dev/sg0). This option shows the
              device names as alphabetic sequences (/dev/sga).

       -r     Run once. By default, this utility runs in a continuous
              loop. This option causes it to only run one pass.

       -t     Time interval. The default polling interval is 5
              seconds.
              This option can set the interval to a different number
              of seconds.

       -x     Causes extra debug messages to be displayed.

ERRORS
       Please supply the contents of the /var/log/sgdiskmon.log and
       /var/log/sgevents files with any problem report. Note that
       sgdiskmon has an option for additional debug output (-x). See
       http://scsirastools.sourceforge.net for a bug list, ChangeLog,
       and any later versions of sgdiskmon.

       There are some unique error codes defined in sgdiskmon as
       follows:

       -1     is an error from a generic system function, check errno
              value.

       -2     means that the SCSI request returned a check condition
              (sense error).

       -3     means that the SCSI request failed while writing the
              command.

       -4     means that the SCSI request failed while reading the
              response.

       -5     means that the device did not respond to basic SCSI
              ioctl functions.

       -6     means that the SCSI Inquiry returned an invalid size, so
              that device may be an impostor (failed).

SEE ALSO
       sgdiag(8) sgdskfl(8) sgmode(8) sgraidmon(8)

WARNINGS
       See http://sourceforge.net/projects/scsirastools/ for a bug
       list and any later versions of this utility.

----------------------------------
   4.0 SOFTWARE RAID CONFIGURATION
----------------------------------

Setting up a software RAID-1 root mirror

Some Linux distributions offer the option to specify the root devices
as RAID devices during the initial setup when the disks are
partitioned. Otherwise, you can configure an existing Linux system for
root mirroring via the following procedure.

This procedure shows a RAID-1 configuration for a system that only
contains two disks. Other configurations may vary. Note that partition
numbers on your system may be different, and various device names
would be different for DEVFS as well.

4.1) Install Linux on the first disk.
     Make sure that the Linux installation includes the software RAID
     option in the kernel. Usually this includes the raidtools package.
#--------------------------------------------------------
# Script 1 - md1.sh
# Make sure the system was installed with raidtools
rpm -qa |grep raidtools
if [ $? -eq 0 ]
then
   echo "OK, continue"
   echo ""
else
   echo "Error: raidtools should be installed."
   exit 1
fi
exit 0
# end script 1
#--------------------------------------------------------

4.2) Build a new kernel with md & raid1 enabled.
#--------------------------------------------------------
# Script 2 - md2.sh
# Check if md is linked into the running kernel
cat /proc/ksyms | grep md_wakeup_thread
if [ $? -eq 0 ]
then
   echo "CONFIG_BLK_DEV_MD=y"
   echo "OK, continue"
   echo ""
else
   echo "Error: CONFIG_BLK_DEV_MD must be =y, rebuild kernel."
   exit 1
fi
cat /proc/ksyms | grep gamap_
if [ $? -eq 0 ]
then
   echo "Warning: CONFIG_SCSIFCHOTSWAP & CONFIG_GAMAP should be off."
   echo "Check for compatible versions, or rebuild kernel with these off."
   exit 1
fi
if [ -d /dev/md ]
then
   echo "DEVFS is configured"
fi
exit 0
# end script 2
#--------------------------------------------------------
# If a kernel rebuild is indicated above, first
# make sure the kernel-source is installed, then:
cd /usr/src/linux
# Save .config file
make mrproper
# Restore .config file
# run make menuconfig or check that the .config file contains:
   CONFIG_CHR_DEV_SG=m    (or =y)
   CONFIG_MD=y
   CONFIG_BLK_DEV_MD=y
   CONFIG_MD_RAID1=y      (or =m)
   CONFIG_SCSI_RESCAN=y   (if external raid enclosure without SAF-TE)
# Note that the kernel must be built with CONFIG_BLK_DEV_MD=y
# (instead of =m) in order to use the md driver for root mirroring.
# Also check that the following features are turned off, i.e.:
#   CONFIG_SCSIFCHOTSWAP is not set
#   CONFIG_GAMAP is not set
# These two are incompatible (at present) with root mirroring,
# but this may change in subsequent versions.
make menuconfig
make oldconfig
make dep
make bzImage
make modules
make modules_install
mkinitrd -f /boot/initrd-${kver}.img ${kver}
make install
reboot
# Footnote:
# CONFIG_SCSI_RESCAN -- SCSI rescan on reset
# This parameter is added by the scsi_rescan patch.
# If you have an external SCSI disk enclosure that allows hot-insertion
# of SCSI devices and does a SCSI bus reset when the hotplug occurs,
# say Y here to allow the kernel to rescan the bus on which the reset
# happens and attach any new devices that may have been inserted.

4.3) Partition the second disk like the first disk
     Then set the partitions to be mirrored to type 0xfd (Linux raid),
     including the swap partition.
#--------------------------------------------------------
# Script 3 - partition /dev/sdb
if [ -d /dev/scsi ]
then
   deva=/dev/scsi/host1/bus0/target0/lun0/disc
   devb=/dev/scsi/host1/bus0/target1/lun0/disc
else
   deva=/dev/sda
   devb=/dev/sdb
fi
tmpf=/tmp/fdisk.in
tmps1=/tmp/sda.sfdisk
tmpr=/tmp/raid.sfdisk
echo "Repartitioning $devb ..."
# wipe any existing partitions on sdb
dd if=/dev/zero of=$devb bs=512 count=1
cat /proc/mdstat | grep md0
if [ $? -eq 0 ]
then
   echo "Reboot to stop pre-existing raid, then do md3.sh again."
   exit 1
fi
# save partition info from sda
sfdisk -d $deva >$tmps1
# set the selected partition types to fd (Linux raid)
sed -e 's/Id=83/Id=fd/' -e 's/Id=82/Id=fd/' $tmps1 >$tmpr
# write partition info to sdb
sfdisk --force --no-reread $devb <$tmpr
# end script 3
#--------------------------------------------------------
The sfdisk utility will create the partitions on /dev/sdb, and the
following messages are normal/expected:
   sfdisk: ERROR: sector 0 does not have an msdos signature
   /dev/sdb: unrecognized partition
When this completes successfully, you will see this message:
   Successfully wrote the new partition table
When you are done, the "fdisk -l /dev/sdb" should look something like
this:
   Device Boot    Start       End    Blocks   Id  System
/dev/sdb1   *         1         9     72261   fd  Linux raid autodetect
/dev/sdb2            10      1115   8883945    5  Extended
/dev/sdb5            10        26    136521   fd  Linux raid autodetect
/dev/sdb6            27      1115   8747361   fd  Linux raid autodetect

4.4) Build the /etc/raidtab file.
     It should look something like this example.
#--------------------------------------------------------
# Script 4: md4.sh - build the raidtab
bootpart=2
rootpart=3
swappart=4
if [ -d /dev/scsi ]
then
   deva=/dev/scsi/host1/bus0/target0/lun0/part
   devb=/dev/scsi/host1/bus0/target1/lun0/part
   devm=/dev/md/
else
   deva=/dev/sda
   devb=/dev/sdb
   devm=/dev/md
fi
rtabfile=/etc/raidtab
cat - <<%%% >$rtabfile
# raidtab
# md0 is the root array
raiddev                 ${devm}0
raid-level              1
nr-raid-disks           2
chunk-size              32
nr-spare-disks          0
persistent-superblock   1
device                  ${devb}${rootpart}
raid-disk               0
device                  ${deva}${rootpart}
failed-disk             1
# md1 is the /boot array
raiddev                 ${devm}1
raid-level              1
nr-raid-disks           2
chunk-size              32
nr-spare-disks          0
persistent-superblock   1
device                  ${devb}${bootpart}
raid-disk               0
device                  ${deva}${bootpart}
failed-disk             1
# md2 is the swap array
raiddev                 ${devm}2
raid-level              1
nr-raid-disks           2
chunk-size              32
nr-spare-disks          0
persistent-superblock   1
device                  ${devb}${swappart}
raid-disk               0
device                  ${deva}${swappart}
failed-disk             1
# raidtab end
%%%
# end script 4
#--------------------------------------------------------

4.5) Set up the raid partitions
#--------------------------------------------------------
# Script 5 - example of setup raid partitions
mkraid -R /dev/md/0
mkraid -R /dev/md/1
swapoff -a    # intended to disable all swapping temporarily
mkraid -R /dev/md/2
cat /proc/mdstat
# end script 5
#--------------------------------------------------------
Note that the /dev/sdb* partitions will succeed and show the raid
superblock address "raid superblock at ...", but the /dev/sda*
partitions will return "failed", since they are marked failed in the
raidtab above.

4.6) Make filesystems on the raid partitions
#--------------------------------------------------------
# Script 6 - example of making filesystems
if [ -d /dev/md ]
then
   mdev=/dev/md/
else
   mdev=/dev/md
fi
mke2fs ${mdev}0
mke2fs ${mdev}1
mkswap ${mdev}2
swapon -a
# end script 6
#--------------------------------------------------------

4.7) Mount the raid partitions
#--------------------------------------------------------
# Script 7 to mount the raid partitions
if [ -d /dev/md ]
then
   mdev=/dev/md/
else
   mdev=/dev/md
fi
mkdir -p /mnt/b
mount -t ext2 ${mdev}0 /mnt/b
mkdir -p /mnt/b/boot
mount -t ext2 ${mdev}1 /mnt/b/boot
# end script 7
#--------------------------------------------------------

4.8) Copy the Linux files to the raid partitions.
#--------------------------------------------------------
# Script 8 to copy Linux files to mounted tree
todir=/mnt/b
notf=/tmp/dirnot.tmp
cat - <<%%% >$notf
tmp
proc
mnt
boot
lost+found
%%%
# dirs="bin dev etc home lib misc opt root sbin usr var"
dirs=`ls / |grep -vf $notf`
cd /
# Copy each directory in the list
# could use "find . -print | cpio -pdumv" if grep -v extra stuff.
for d in $dirs
do
   echo "Copying $d to $todir/$d..."
   cp -a $d $todir
done
# Do /tmp, /proc, /mnt, and /boot specially
echo "Creating $todir/tmp, mnt, proc..."
mkdir -p $todir/tmp
chmod 777 $todir/tmp
chmod +t $todir/tmp
mkdir -p $todir/proc
# makes /mnt directories like cdrom and floppy
for i in `ls -l /mnt |grep "^dr" |cut -c57-80`
do
   mkdir -p $todir/mnt/$i
done
echo "Copying boot to $todir/boot..."
cp -f /boot/* $todir/boot 2>/dev/null
# This will usually give an "omitting lost+found" stderr message.
# end script 8
#--------------------------------------------------------

4.9) Modify the /etc/fstab on the RAID device to reflect the new mount points:
Sample result in /etc/fstab:
   /dev/md0    /        ext2    defaults    1 1
   /dev/md1    /boot    ext2    defaults    1 1
   /dev/md2    swap     swap    defaults    0 0
#--------------------------------------------------------
# Script 9 to modify fstab
f=/mnt/b/etc/fstab
tmp=/tmp/f
tmpe=/tmp/ed
echo "Modifying new fstab for md devices"
if [ -d /dev/md ]
then
   # DEVFS, so add another '/'
   sed -e 's/.* \/boot /\/dev\/md\/1 \/boot /' \
       -e 's/.* \/ /\/dev\/md\/0 \/ /' \
       -e 's/.* swap/\/dev\/md\/2 swap/' $f >$tmp
else
   sed -e 's/.* \/boot /\/dev\/md1 \/boot /' \
       -e 's/.* \/ /\/dev\/md0 \/ /' \
       -e 's/.* swap/\/dev\/md2 swap/' $f >$tmp
fi
cp $tmp $f
# end script 9
#--------------------------------------------------------

4.10) Enable mirrored swap
      (/etc/rc.sysinit uses "swapon -a" from /etc/fstab)
#--------------------------------------------------------
# Script 10
if [ -d /dev/md ]
then
   swapon /dev/md/2
else
   swapon /dev/md2
fi
# end script 10
#--------------------------------------------------------

4.11) Modify lilo to include the new raid partition.
#--------------------------------------------------------
# Script 11 - edit the lilo files to boot to md0
kver=`uname -r`
tmpe1=/tmp/ed1
tmpe2=/tmp/ed2
lilo1=/etc/lilo.conf
lilo2=/mnt/b/etc/lilo.conf
devm=/dev/md
initrdmsg="initrd=/boot/initrd-${kver}.img"
if [ -d /dev/scsi ]
then
   # assume MV CGE 3.0
   devm=/dev/md/
   ktag=`uname -r |cut -f2 -d'-'`
   kver="intel-${ktag}"
   initrdmsg=""
fi
cat - <<%%% >$tmpe1
H
$
a
image=/boot/vmlinuz-${kver}
        label=linux-md0
        root=${devm}0
        $initrdmsg
        read-only
.
w
q
%%%
cat - <<%%% >$tmpe2
H
1
/boot=
c
boot=${devm}1
.
/default=
c
default=linux-md0
.
$
a
image=/boot/vmlinuz-${kver}
        label=linux-md0
        root=${devm}0
        $initrdmsg
        read-only
.
w
q
%%%
ed $lilo1 <$tmpe1
ed $lilo2 <$tmpe2
# end script 11
#--------------------------------------------------------

4.12) Run lilo with the raid lilo.conf to write boot sectors, etc.
#--------------------------------------------------------
# Script 12
lilo                            # run lilo sda version
lilo -C /mnt/b/etc/lilo.conf    # run lilo md0 version from sdb
# end script 12
#--------------------------------------------------------

4.13) Dismount and reboot to linux-md0 on sdb.
      (or test first by changing SCSI scan to boot to the other hard disk)
#--------------------------------------------------------
# Script 13
umount /mnt/b/boot
umount /mnt/b
swapoff /dev/md/2
raidstop /dev/md/2
raidstop /dev/md/1
raidstop /dev/md/0
shutdown -r now
# end script 13
#--------------------------------------------------------
At the lilo prompt, make sure to enter the "linux-md0" label.
If the kernel wasn't built correctly, the linux-md0 label will not be able
to mount root.  If this happens, reboot to the other label and start over
at step 2.

4.14) Partition the first disk to set the raid partition type (0xfd).
#--------------------------------------------------------
# Script 14
if [ -d /dev/scsi ]
then
   deva=/dev/scsi/host1/bus0/target0/lun0/disc
   devb=/dev/scsi/host1/bus0/target1/lun0/disc
else
   deva=/dev/sda
   devb=/dev/sdb
fi
sfdisk -d $devb >/tmp/sdb.sfdisk
sfdisk --force $deva </tmp/sdb.sfdisk
[...] >/tmp/x
cp /tmp/x /etc/raidtab
raidhotadd ${devm}1 ${deva}${bootpart}
raidhotadd ${devm}0 ${deva}${rootpart}
raidhotadd ${devm}2 ${deva}${swappart}
cat /proc/mdstat
# Run 'mdevt Save' to save sfdisk snapshot for use by sgraidmon later.
mdevt Save $discb
# May need to run lilo -M, depending on the version of lilo.
# Also need this if errors occurred in md12.sh.
lilo -V |grep 22 >/dev/null
newlilo=$?
if [ $newlilo -eq 0 ]
then
   lilo -M $discb
fi
# wait for bootpart recovery to run a bit
echo "waiting ..."
sleep 15
if [ $newlilo -eq 0 ]
then
   lilo -M $disca
fi
lilo
# end script 15
#--------------------------------------------------------
The result should look something like this:
   Personalities : [raid1]
   read_ahead 1024 sectors
   md1 : active raid1 sdb1[0] sda1[1]
         72192 blocks [2/2] [UU]
   md0 : active raid1 sdb5[0] sda5[1]
         7686976 blocks [2/2] [UU]
   md2 : active raid1 sdb6[0] sda6[1]
         136448 blocks [2/2] [UU]
   unused devices: <none>

----------------------
5.0  USE CASE
----------------------

Suppose a disk is reporting SCSI errors in the system log (or event log),
and the cause is determined to be a fault in the disk firmware, which now
needs to be upgraded on all of the disks of this type.

Previously, all systems would need to be taken out of service and manually
upgraded (expensive labor cost, lots of downtime).

Now, this upgrade can take place automatically while Linux is running, and
all services are maintained.  For each disk (2 per system) a script could
be executed to:
  1) Remove the disk from the active software RAID-1
     (via raidhot* and/or mdadm commands)
  2) Run sgdskfl to download the firmware to the disk
  3) Add the disk back into the software RAID-1
     (via raidhot* or mdadm commands)
  4) When the remirror is complete (check with /proc/mdstat),
     repeat for the other disk.

Since the root disk is protected by RAID-1, the system can remain in
service during the firmware upgrades, and the Linux tools allow this to
occur without user intervention, via a script.

----------------------
5.1  SCSI Statistics
----------------------

Some anomalies are counted by the scsi layer if the SCSIRAS patch is
included in the Linux kernel, usually for the Adaptec driver.
These incidents are counted, but the counts are not persistent across a
reboot.  See the sample output below.
# cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: FUJITSU  Model: MAN3184MP  Rev: 0107  Ser#: UFS0P150
  Type:   Direct-Access              ANSI SCSI revision: 03
  Tallies: timeouts 0 resets 4 par_errs 0 disk_errs 0 trans_errs 0 user_errs 4
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: SEAGATE  Model: ST318452LW  Rev: 0002  Ser#: 3EV00PRK
  Type:   Direct-Access              ANSI SCSI revision: 03
  Tallies: timeouts 0 resets 4 par_errs 0 disk_errs 0 trans_errs 0 user_errs 4
#

Where:
  timeouts   = number of SCSI timeouts
  resets     = number of SCSI resets
  par_errs   = number of SCSI parity errors
  disk_errs  = number of disk errors reported
               i.e. SCSI sense errors with keys 1, 3, 4, b, e
  trans_errs = number of transient errors reported
               i.e. SCSI sense errors with keys 2, 5, 6, 8, a
  user_errs  = number of user errors reported
               i.e. SCSI sense errors with keys 7, 9, c, d, f

----------------------
6.0  PROBLEMS
----------------------

Please supply the contents of the /var/log/sg*.log file with any problem
report.  Note that each tool has an option for additional debug output (-x).

There are some unique error codes defined in the scsirastools as follows:
  -2  means that the SCSI request returned a check condition (sense error)
  -3  means that the SCSI request failed while writing the command
  -4  means that the SCSI request failed while reading the response
  -5  means that the device did not respond to basic SCSI ioctl functions
  -6  means that the SCSI Inquiry returned an invalid size, so that device
      may be an impostor (failed).

See http://sourceforge.net/projects/scsirastools/ for a bug list and any
later versions of this utility.

--------------------------------------
6.1  FREQUENTLY ASKED QUESTIONS (FAQ)
--------------------------------------

Q: I hot-inserted a new disk into my external disk cabinet (brand X), and
   the new disk did not show up on any scsirastools nor in
   /var/log/messages.  What should I do to make it show up?
A: The SCSIRAS kernel rescan of the bus for new devices is currently
   triggered only when a SCSI reset occurs.  Some external disk cabinets do
   not generate a reset or any other SCSI event when a hot-insertion
   occurs.  You can use sgdiag to cause a SCSI reset, and the new device
   will then be rescanned by the kernel.  If you have one of these disk
   cabinets, and do not have an existing device for sgdiag to use to cause
   a reset, then you may be restricted to hot-inserting only devices with
   the same SCSI ID as the device that was removed.

Q: I inserted a new disk and it is visible in sgdiag menus and in
   'fdisk -l', but it didn't automatically remirror.  Why not?
A: First, make sure that sgraidmon is running (see section 3.5).
   When you insert a new disk, if it has the same SCSI ID as the disk that
   was previously removed, it will show up as the same device name and be
   remirrored.  If it has a different SCSI ID, the new disk will have a new
   device name and will only be remirrored if the new device name is also
   pre-configured in the /etc/raidtab file as part of the RAID.
   Also, it is assumed that there will be more than 5 seconds between the
   hot-removal and the hot-insertion (5 sec is the default, see
   sgraidmon -t to change this).

Q: I don't want a disk that I hot-insert to automatically be remirrored
   because I want to do something else with it.
A: You can do this in one of several ways:
   - Stop the sgraidmon service temporarily by doing
     "/etc/init.d/sgraid stop"
   - If this is frequent, or you wish to have a different policy for what
     to do when hot-insertion occurs, edit the Insert case in the
     /sbin/mdevt script.
   - Insert the new disk with a SCSI ID that is different from the disks
     that are configured in /etc/raidtab.

Q: When starting one of the scsirastools, I get the message
   "Cant open any SCSI devices".  How do I fix this?
A: The scsirastools depend on the CONFIG_CHR_DEV_SG kernel configuration
   parameter.
   Do "cat /proc/ksyms |grep sg_big_buff" to see if the SG driver is
   configured in your kernel.  If not, the kernel needs to be rebuilt with
   this parameter set.  It can be set to either y or m.

Q: I configured a RAID-1 root mirror and rebooted, then I get a panic with
   "VFS: Unable to mount root fs on ...".  How do I fix this?
A: Most likely this is due to a problem in the kernel configuration.
   See section 4.0, step 2 for how to build a kernel with RAID enabled.
   If the kernel is properly configured, there have been rare cases, in a
   kernel without the SCSIRAS patches, where the raid superblock becomes
   corrupted, causing this same symptom.  If this occurs, boot to either
   the emergency diskette or the Linux installation CD and use mdadm to
   repair the RAID configuration.  You would then want to apply the SCSIRAS
   patches to your kernel and rebuild it.

Q: I want to make sure that reconstruction of a mirror doesn't slow my
   system down.  How can I tune the reconstruction?
A: Reconstruction is performed by the mdrecoveryd thread, which detects
   when the IO subsystem is idle to minimize the impact, so most systems
   should not need tuning.
   The current RAID-1,4,5 parallel reconstruction 'guaranteed speed limit'
   is 100 KB/sec, so the extra system load is usually not noticeable.
   Increase it if you want to have more _guaranteed_ speed.  Note that the
   RAID driver will use the maximum available bandwidth if the IO subsystem
   is idle.  There is also an 'absolute maximum' reconstruction speed
   limit, in case reconstruction slows down your system despite idle IO
   detection.  You can change these limits via
   /proc/sys/dev/raid/speed_limit_min and speed_limit_max.
   e.g.: "echo 100 >/proc/sys/dev/raid/speed_limit_min"
         "echo 100000 >/proc/sys/dev/raid/speed_limit_max"

----------------------
7.0  MORE INFORMATION
----------------------

scsirastools project
        http://sourceforge.net/projects/scsirastools/
Carrier-Grade Linux Enhancements
        http://developer.osdl.org/
Intel Telecomm Linux Technology
        http://carrierlinux.org/
Intel iSCSI project
        http://www.sourceforge.net/projects/intel-iscsi
Intel RAID adapter drivers
        http://support.intel.com/support/motherboards/server/srcu31/index.htm
        http://support.intel.com/support/motherboards/server/srcu31l/index.htm
        http://support.intel.com/support/motherboards/server/srcu21/index.htm
SCSI Draft Standards
        http://www.t10.org/drafts.htm
Linux RAID
        http://linas.org/linux/raid.html
Linux LVM
        http://www.sistina.com/products_lvm.htm
Justin Gibbs Adaptec driver
        http://people.freebsd.org/~gibbs/linux/
Linux SCSI Generic Driver
        http://gear.torque.net/sg/
Linux mdadm utility
        http://www.cse.unsw.edu.au/~neilb/source/mdctl/
Linux SG utility "SCU"
        http://www.zk3.dec.com/~rmiller/scu.html
        (another useful tool for SCSI testing & debug)
Linux SCSI subsystem
        http://www.kernel.org
        http://mirrors.kernel.org/LDP/
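
Supplementary example for section 5.0, step 4: a script that upgrades
disks one at a time has to wait until the remirror is complete before
touching the second disk.  The following is only a minimal sketch of such
a check.  The function name md_in_sync and its optional file argument are
hypothetical and are not part of scsirastools.  It assumes the
/proc/mdstat format shown at the end of section 4.0, where a complete
mirror reports "[UU]", a missing member appears as an underscore, and a
rebuild in progress adds a line containing "recovery" or "resync".

```shell
# md_in_sync [file] - hypothetical helper, not part of scsirastools.
# Returns 0 if no md array listed in the mdstat file is degraded or
# rebuilding, 1 otherwise.  The file defaults to /proc/mdstat.
md_in_sync()
{
   mdstat=${1:-/proc/mdstat}
   # a degraded array shows an underscore in its status, e.g. [U_]
   if grep '\[[U_]*_[U_]*]' "$mdstat" >/dev/null
   then
      return 1
   fi
   # a rebuild in progress shows a line with "recovery" or "resync"
   if grep -E 'recovery|resync' "$mdstat" >/dev/null
   then
      return 1
   fi
   return 0
}
```

A calling script could poll with "until md_in_sync; do sleep 30; done"
before removing the other disk from the mirror.  The 30 second interval
is an arbitrary choice for illustration.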