Proactively monitoring hard drive health using smartd


    Monitoring Hard Drive Health on Linux with smartmontools

    S.M.A.R.T. is a system in modern hard drives designed to report conditions that may indicate impending failure. smartmontools is a free software package that can monitor S.M.A.R.T. attributes and run hard drive self-tests. Although smartmontools runs on a number of platforms, I will only cover installing and configuring it on Linux.

    Why Use S.M.A.R.T.?

    Put simply, S.M.A.R.T. may give you enough warning to back up all your data before your hard drive dies. Information on the internet conflicts about how reliable the warnings are. The best research I found is a paper from Google that describes an internal study of hard drive failure. A quick summary: certain events, including reallocation events and failed self-tests, greatly increase the chance of hard drive failure, but only about 60% of the drives that failed in the study had any negative S.M.A.R.T. attributes. In other words, roughly 40% failed with no warning, so nothing replaces regular backups.

    A good source for more information is the S.M.A.R.T. Wikipedia page.

    Installation

    On Debian or Ubuntu systems:

    $ sudo apt-get install smartmontools

    On Fedora:

    $ sudo yum install smartmontools

    Capabilities and Initial Tests

    smartmontools comes with two programs: smartctl, which is meant for interactive use, and smartd, which continuously monitors S.M.A.R.T. data in the background. Let’s look at smartctl first:

    $ sudo smartctl -i /dev/sda

    Replace /dev/sda with your hard drive’s device file in this command and all subsequent commands. If there’s only one hard drive in the system, it should be /dev/sda or /dev/hda. If this command fails, you may need to let smartctl know what type of hard drive interface you’re using:

    $ sudo smartctl -d TYPE -i /dev/sda

    where TYPE is usually one of ata, scsi, or sat (SCSI to ATA Translation, which applies to many Serial ATA drives). See the smartctl man page for more information. Note that if you need -d here, you will need to add it to all smartctl commands. Once the command succeeds, it should print information similar to:

    === START OF INFORMATION SECTION ===
    Model Family: SAMSUNG SpinPoint T133 series
    Device Model: SAMSUNG HD300LJ
    Serial Number: S0D7J1UL303628
    Firmware Version: ZT100-12
    User Capacity: 300,067,970,560 bytes
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: 7
    ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
    Local Time is: Fri Jan 2 03:08:20 2009 EST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
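
    If the plain command fails, the interface types mentioned above can be tried in turn. The following is a minimal sketch, not something that ships with smartmontools: probe_type is a made-up helper name, and the SMARTCTL variable is an assumption added only so the command can be swapped out. It relies on smartctl returning a nonzero exit status when it cannot open the device with the given type.

```shell
# Hypothetical helper: print the first -d TYPE that lets "smartctl -i" succeed.
# SMARTCTL is an assumed override hook; it defaults to the real smartctl.
SMARTCTL=${SMARTCTL:-smartctl}

probe_type() {
    dev=$1
    for type in ata scsi sat; do
        # Discard the output; only the exit status matters here.
        if "$SMARTCTL" -d "$type" -i "$dev" >/dev/null 2>&1; then
            echo "$type"
            return 0
        fi
    done
    return 1   # none of the listed types worked
}
```

    From a root shell, probe_type /dev/sda would print the first type that works, which is then the value to pass to -d in later commands.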

    Now that smartctl can access the drive, let’s turn on some features. Run the following command:

    $ sudo smartctl -s on -o on -S on /dev/sda

    -s on: This turns on S.M.A.R.T. support or does nothing if it’s already enabled.
    -o on: This turns on offline data collection. Offline data collection periodically updates certain S.M.A.R.T. attributes. Theoretically this could have a performance impact. However, from the smartctl man page:

    Normally, the disk will suspend offline testing while disk accesses are taking place, and then automatically resume it when the disk would otherwise be idle, so in practice it has little effect.

    -S on: This enables “autosave of device vendor-specific Attributes”.

    The command should return:

    === START OF ENABLE/DISABLE COMMANDS SECTION ===
    SMART Enabled.
    SMART Attribute Autosave Enabled.
    SMART Automatic Offline Testing Enabled every four hours.

    Next, let’s check the overall health:

    $ sudo smartctl -H /dev/sda

    This command should return:

    === START OF READ SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
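
    A script can make the same check by looking for PASSED in this output. This is only a sketch: health_passed is a hypothetical helper name, and it assumes the ATA-style output line shown above (other drive types may word the result differently).

```shell
# Hypothetical helper: read "smartctl -H" output on stdin and succeed
# only if the overall health line reports PASSED.
health_passed() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Example use (needs root and a real drive, so it is left commented out):
# smartctl -H /dev/sda | health_passed || echo "Back up /dev/sda now" >&2
```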

    If it doesn’t return PASSED, you should immediately back up all your data; your hard drive is probably failing. Next, let’s make sure that the drive supports self-tests. I have yet to see a drive that doesn’t, but the following command also gives time estimates for each test:

    $ sudo smartctl -c /dev/sda

    I won’t list the complete output because it’s somewhat lengthy. Make sure “Self-test supported” appears in the “Offline data collection capabilities” section. Also, look for output similar to:

    Short self-test routine
    recommended polling time: ( 2) minutes.
    Extended self-test routine
    recommended polling time: ( 127) minutes.

    These are rough estimates of how long the short and long self-tests will take, respectively. Let’s run the short test:

    $ sudo smartctl -t short /dev/sda

    On my drive, this test should take 2 minutes, but this obviously varies. You can run:

    $ sudo smartctl -l selftest /dev/sda

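    to check results. The result appears in this log once the test finishes, so rerun the command until it shows up. On many drives, the “Self-test execution status” field printed by smartctl -c also reports how much of a running test remains. A successful run will look like: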

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed without error       00%     21472         -
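
    The log can also be checked from a script. The sketch below assumes the log layout shown above, where entry “# 1” is the most recent test; latest_selftest_ok is a hypothetical helper name. While a test is still running, entry “# 1” may be an older test, so compare the LifeTime(hours) column to be sure the entry is new.

```shell
# Hypothetical helper: read "smartctl -l selftest" output on stdin and
# succeed only if the most recent entry (# 1) completed without error.
latest_selftest_ok() {
    # Keep only the "# 1" line, then look for the success status in it.
    grep -E '^[[:space:]]*# *1[[:space:]]' | grep -q 'Completed without error'
}

# Example use (needs root and a real drive, so it is left commented out):
# smartctl -l selftest /dev/sda | latest_selftest_ok || echo "Self-test problem" >&2
```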

    Now, do the same for the long self-test:

    $ sudo smartctl -t long /dev/sda

    The long test can take a significant amount of time. You might want to run it overnight and check the results in the morning. If either test fails, you should immediately back up all your data and read the last section of this guide.

    Configuring smartd

    We’ve now enabled some features and run the basic tests. Instead of repeating the previous section daily, we can set up smartd to do it all automatically. If your system has an /etc/smartd.conf file, check for a line that begins with DEVICESCAN. If you find one, comment it out by adding a ‘#’ to the beginning of the line. DEVICESCAN doesn’t work on my system, and specifying a device file is easy. Add the following line to /etc/smartd.conf:

    /dev/sda -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner

    Here’s what each option does:

    /dev/sda: Replace this with the device file you’ve been using in smartctl commands.
    -a: This enables some common options. You almost certainly want to use it.
    -d sat: On my system, smartctl correctly guesses that I have a serial ata drive. smartd on the other hand does not. If you had to add a “-d TYPE” parameter to the smartctl commands, you’ll almost certainly have to do the same here. If you didn’t, try leaving it out initially. You can add it later if smartd fails to start.
    -o on, -S on: These have the same meaning as the smartctl equivalents.
    -s (S/../.././02|L/../../6/03): This schedules the short and long self-tests. The pattern is a regular expression that smartd matches against the test type, month, day of month, day of week, and hour (T/MM/DD/d/HH). In this example, the short self-test will run daily at 2:00 A.M. The long test will run on Saturdays at 3:00 A.M. For more information, see the smartd.conf man page.
    -m root: If any errors occur, smartd will send email to root. On my system, mail for root is forwarded to my normal email account. If you don’t have a similar setup, replace root with your normal email address. This option also requires a working email setup, which many Linux distributions provide out of the box; the test notification described later in this section confirms whether yours does.
    -M exec /usr/share/smartmontools/smartd-runner: This last part may be specific to the Debian and Ubuntu smartmontools packages. Check if your system has /usr/share/smartmontools/smartd-runner. If it doesn’t, remove this option. Instead of sending email directly, “-M exec” makes smartd run a different command when errors occur. On Debian, smartd-runner will run each script in /etc/smartmontools/run.d/, one of which emails the user specified by the “-m” option.
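
    To see how the -s pattern selects times, it can be tried against sample timestamps with grep. This is a rough sketch and not how smartd itself is invoked: due_tests is a hypothetical helper, and it assumes the pattern must match the whole T/MM/DD/d/HH string, with d running from 1 (Monday) to 7 (Sunday).

```shell
# The schedule from the configuration line above.
SCHEDULE='(S/../.././02|L/../../6/03)'

# Hypothetical helper: given "MM/DD/d/HH", print each test type (S or L)
# whose T/MM/DD/d/HH string matches the schedule pattern.
due_tests() {
    for type in S L; do
        if printf '%s\n' "$type/$1" | grep -Eq "^$SCHEDULE\$"; then
            echo "$type"
        fi
    done
}

due_tests "01/02/5/02"            # a Friday at 2 A.M.: short test
due_tests "01/03/6/03"            # a Saturday at 3 A.M.: long test
due_tests "$(date +%m/%d/%u/%H)"  # the current hour
```

    Changing the pattern and rerunning the sample calls is a quick way to check a new schedule before restarting smartd.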

    If you have more than one hard drive in your system, add a line for each one replacing /dev/sda with a different device file.
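
    With several drives, the lines can be generated rather than typed. The sketch below simply repeats the example options for each device; smartd_lines is a hypothetical helper, and /dev/sdb is only an illustrative second device. Adjust -d and the other options per drive as needed.

```shell
# Hypothetical helper: print one smartd.conf line per device argument,
# reusing the options from the example line above.
smartd_lines() {
    for dev in "$@"; do
        printf '%s -a -d sat -o on -S on -s (S/../.././02|L/../../6/03) -m root -M exec /usr/share/smartmontools/smartd-runner\n' "$dev"
    done
}

smartd_lines /dev/sda /dev/sdb
```

    Review the printed lines before pasting them into /etc/smartd.conf.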

    Update on 2009-01-06:

    Thanks to commenter robert for pointing out an omission on my part. If your system has the file /etc/default/smartmontools, uncomment the “#start_smartd=yes” line by removing the “#”.
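
    The same edit can be made with sed instead of a text editor. This is a small sketch under the assumption that the line looks exactly like the one quoted above; enable_smartd is a hypothetical helper name, and it only filters text, so the original file is not touched.

```shell
# Hypothetical filter: uncomment the start_smartd line in text read on stdin.
enable_smartd() {
    sed 's/^#[[:space:]]*start_smartd=yes/start_smartd=yes/'
}

# Example use on a copy of the file (review the result before replacing the original):
# enable_smartd < /etc/default/smartmontools > /tmp/smartmontools.new
```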

    Finally, restart smartd:

    $ sudo /etc/init.d/smartmontools restart

    If this command fails, the end of /var/log/daemon.log should have some diagnostic information. If smartd started fine, we should still test that email notifications are working. Add “-M test” to the end of the configuration line in /etc/smartd.conf. This will make smartd send out a test notification when it’s next started. Once again, restart smartd:

    $ sudo /etc/init.d/smartmontools restart

    You should receive an email similar to:

    This email was generated by the smartd daemon running on:

    host name: polar
    DNS domain: shadypixel.com
    NIS domain: (none)

    The following warning/error was logged by the smartd daemon:

    TEST EMAIL from smartd for device: /dev/sda

    For details see host's SYSLOG (default: /var/log/syslog).

    Afterward, you can delete “-M test”.

    Source: http://blog.shadypixel.com/monitorin...smartmontools/