OneFS Drive Statistics

Received the following question from the field recently:



“What text command shows me if a disk is failed or how busy it is and what it’s doing?”



Fortunately, OneFS offers several tools to inspect and report on both drive health and performance, and we’ll take a quick look at some of these in this article. Let’s start with some drive failure and wear reporting tools.



The following cluster-wide command lists any drives that are not reported as healthy, such as those marked smartfail, empty, stalled, or down (L3 cache drives are filtered out too):



# isi_for_array -sX 'isi devices list | egrep -vi "healthy|L3"'



Usually, any node that requires a drive replacement will have an amber warning light on the front display panel. Also, the drive that needs swapping out will typically be marked by a red LED.



Alternatively, isi_drivenum shows the drive bay location of each drive, plus a variety of other disk-related information:



# isi_for_array -sX 'isi_drivenum -A'



This next command provides drive wear information for each node’s flash (SSD) boot drives:



# isi_for_array -sSX "isi_radish -a /dev/ad* | grep -e FW: -e 'Percent Life' | grep -v Used"



However, the output is in hex. This can be converted to a decimal percent value using the following shell command, where <value> is the raw hex output:



# echo "ibase=16; <value>" | bc
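One caveat with bc: when ibase=16 it only treats uppercase A-F as hex digits, so a lowercase value will not convert as expected. As a minimal sketch, the shell's own printf can do the same conversion regardless of case. The hex_to_dec name is purely illustrative and is not a OneFS tool:

```shell
# hex_to_dec: convert a raw hex value (with or without a 0x prefix) to decimal.
# Illustrative helper, not part of OneFS.
hex_to_dec() {
    v=${1#0x}              # strip a leading 0x, if present
    v=${v#0X}
    printf '%d\n' "0x$v"   # printf parses the 0x-prefixed constant as hex
}

hex_to_dec 5A      # prints 90
hex_to_dec 0x64    # prints 100
```

The raw value from the isi_radish output can be passed in as the single argument.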



Alternatively, the following, uh, lengthy command will do this for you:



# isi_for_array -s 'isi_radish -a /dev/ad[2,3,4,7] | grep -E "^Internal.*|Total Wear|Lifetime Left|Life Remain|^Carrier board.*"' | awk -F '[(]' '{ if(match($0,"Wear")) { printf "%s%d%s\n"," Life remaining: ",100 - ("0x" substr($0,match($0,"/")-2,2)),"% (SanDisk - Firmware issue causes inaccurate SMART wear data)" } else if(match($0,"Life")) { printf "%s%d%s\n"," Life remaining: ","0x" substr($2,15,2),"%" } else { printf ("%s%s%s",substr($0,0,match($0,":")-1)," ",substr($0,match($0,"/")-2,6)) } }'



General disk activity stats are available via the isi statistics command. The following drive statistics can be useful for both performance analysis and troubleshooting purposes. For example:



# isi statistics system --nodes=all --oprates --nohumanize



This output gives the per-node operation rates for protocol, network, and disk. On the disk side, the sum of DiskIn (writes) and DiskOut (reads) gives the total IOPS for all the drives in each node.



For the next level of granularity, the following drive statistics command provides individual SATA disk info. The sum of OpsIn and OpsOut is the total IOPS per drive in the cluster.



# isi statistics drive -nall --long --type=sata --sort=busy | head -20



And the same info for SSDs:



# isi statistics drive -nall --long --type=ssd --sort=busy | head -20



The primary counters of interest in the drive stats data are usually TimeInQ, Queued, OpsIn, OpsOut, and the Busy percentage of each disk. If most or all of the drives have high busy percentages, this indicates a uniform resource constraint, and there is a strong likelihood that the cluster is spindle bound. If, say, the top five drives are much busier than the rest, this suggests a workflow hot-spot.
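To make the uniform versus hot-spot distinction concrete, here is a rough sketch that reads "<drive> <busy%>" pairs and prints a verdict. The busy_profile name, the two-column input format, the 80% "high" mark, and the "twice the rest" rule are all assumptions chosen for illustration, not OneFS guidance:

```shell
# busy_profile: read "<drive> <busy%>" lines on stdin and print a rough verdict.
# Thresholds (top=5, high=80, 2x the rest) are illustrative assumptions.
busy_profile() {
    sort -k2,2nr | awk -v top=5 -v high=80 '
        { busy[NR] = $2 + 0; sum += busy[NR]; if (busy[NR] >= high) hot++ }
        END {
            if (NR == 0) { print "no data"; exit }
            n = (NR < top) ? NR : top                 # size of the "top" group
            for (i = 1; i <= n; i++) topsum += busy[i]
            topavg = topsum / n
            rest = (NR > n) ? (sum - topsum) / (NR - n) : 0
            if (hot > NR / 2)
                print "uniform: most drives are busy, possibly spindle bound"
            else if (NR > n && topavg > 2 * rest)
                print "hot-spot: top drives are much busier than the rest"
            else
                print "no obvious pattern"
        }'
}

# Five busy drives among twenty otherwise quiet ones.
awk 'BEGIN { for (i = 1; i <= 20; i++) print "d" i, (i <= 5 ? 90 : 10) }' | busy_profile
```

In practice the input would come from the drive name and Busy columns of the drive statistics output, whose positions vary by OneFS release.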



# isi statistics pstat



The read and write mix, plus metadata operations, for a particular protocol can be gleaned from the output of the isi statistics pstat command. In addition to disk statistics, CPU and network stats are also provided. The --protocol parameter is used to specify the core NAS protocols such as NFSv3, NFSv4, SMB1, SMB2, HDFS, etc. Additionally, OneFS specific protocol stats, including job engine, platform API, IRP, etc., are also available. For example, the following will show NFSv3 stats in a ‘top’ format, refreshed every 6 seconds by default:



# isi statistics pstat --protocol nfs3 --format top



The uptime command provides the system load average over 1, 5, and 15 minute intervals, which reflects both CPU queue and disk queue stats.



# isi_for_array -s 'uptime'



It’s worth noting that this command’s output does not take CPU quantity into account. As such, a load average of 1 on a single CPU means the node is pegged. However, the same load average of 1 on a dual-CPU system means the node is 50% idle.

The following command will give the CPU count:



# isi statistics query current --nodes all --degraded --stats node.cpu.count
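Putting the load average and the CPU count together is simple division. A minimal sketch follows, where load_per_cpu is a hypothetical helper name and both values are typed in by hand rather than parsed from command output:

```shell
# load_per_cpu LOAD CPUS: normalize a load average by the CPU count.
# Illustrative helper; a result near 1.00 suggests the node is pegged.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus <= 0) { print "cpu count must be positive"; exit 1 }
        printf "%.2f\n", load / cpus
    }'
}

load_per_cpu 1.00 1   # prints 1.00
load_per_cpu 1.00 2   # prints 0.50
```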



The sum of disk ops for each node across the cluster is available via the following syntax:



# isi statistics query current --nodes=all --stats=node.disk.xfers.rate.sum



There are a whole slew of more detailed drive metrics that OneFS makes available for query. These include:



[Image: drive_stats_1.png]



Disk time in queue provides an indication as to how long an operation is queued on a drive. This indicator is key if a cluster is disk-bound. A time in queue value of 10 to 50 milliseconds is concerning, whereas a value of 50 to 100 milliseconds indicates a problem. To obtain the maximum, minimum, and average values for disk time in queue for SATA drives, run the following command:



# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk '/SATA/ {v=$8+0; n++; sum+=v; if (n==1 || v>max) max=v; if (n==1 || v<min) min=v} END {if (n) {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}}'



The following command displays the time in queue for 30 drives sorted highest-to-lowest:



# isi statistics drive list -n all --sort=timeinq | head -n 30
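The time in queue bands described above can be turned into a quick classifier. This is only a sketch: the 10 to 50 ms and 50 to 100 ms bands come from the text, while treating anything under 10 ms as "ok" and anything over 100 ms as a problem is an assumption, and timeinq_status is a made-up name:

```shell
# timeinq_status MS: classify a time-in-queue value given in milliseconds.
# Bands follow the article; <10 ms = ok and >100 ms = problem are assumptions.
timeinq_status() {
    awk -v ms="$1" 'BEGIN {
        if (ms < 10)      print "ok"
        else if (ms < 50) print "concerning"
        else              print "problem"
    }'
}

timeinq_status 5     # prints ok
timeinq_status 25    # prints concerning
timeinq_status 75    # prints problem
```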



Queue depth indicates how many operations are queued on drives. A queue depth of 5 to 10 is considered heavy queuing. To obtain the maximum, minimum, and average values for disk queue depth of SATA drives, run the following command.



# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk '/SATA/ {v=$9+0; n++; sum+=v; if (n==1 || v>max) max=v; if (n==1 || v<min) min=v} END {if (n) {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}}'



If there’s a big delta between the maximum number and average number in the queue, it’s worth investigating further to determine whether an individual drive is working excessively.



To display queue depth for twenty drives sorted highest-to-lowest, run the following command:



# isi statistics drive list -n all --sort=queued | head -n 20



The disk percent busy metric can be useful to determine if a drive is getting pegged. However, it does not indicate how much extra work may be in the queue. To obtain the maximum, minimum, and average disk busy values for SATA drives, run the following command:



# isi statistics drive --nodes=all --degraded --no-header --no-footer | awk '/SATA/ {v=$10+0; n++; sum+=v; if (n==1 || v>max) max=v; if (n==1 || v<min) min=v} END {if (n) {print "Min = ",min; print "Max = ",max; print "Average = ",sum/n}}'



For information on SAS drives, substitute SAS for SATA in the min, max, and average commands above.
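Since the min, max, and average commands differ only in the drive type and the column number, they can be folded into one small helper. The sketch below uses a hypothetical drive_col_stats function and made-up three-column sample input; the real column positions depend on the OneFS release and the options passed to isi statistics drive:

```shell
# drive_col_stats TYPE COL: min, max and average of column COL for lines matching TYPE.
# Illustrative helper; TYPE and COL must match the actual drive statistics layout.
drive_col_stats() {
    awk -v type="$1" -v col="$2" '
        $0 ~ type {
            v = $col + 0                      # force a numeric value
            n++; sum += v
            if (n == 1 || v > max) max = v
            if (n == 1 || v < min) min = v
        }
        END {
            if (n == 0) { print "no matching drives"; exit }
            printf "Min = %.2f\nMax = %.2f\nAverage = %.2f\n", min, max, sum / n
        }'
}

# Made-up sample input: drive, type, value.
printf '%s\n' '1:1 SATA 4.0' '1:2 SATA 8.0' '1:3 SSD 0.5' | drive_col_stats SATA 3
```

Only lines matching the requested type count toward the average, so other drive types in the same output do not skew the result.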



Finally, to display disk percent busy for twenty drives sorted highest-to-lowest, run the following command:



# isi statistics drive -nall --orderby=busy | head -n 20

So there you have it – a variety of commands for querying OneFS for drive info.
