Isilon: No response from isi_stats_d accompanied by performance issues

Article Number: 498278 Article Version: 3 Article Type: Break Fix



Isilon

Performance issues reported by the client, up to and including data unavailability. Some troubleshooting methods impede quick resolution by focusing on collecting data about symptoms rather than exposing, during the live engagement, a “common cause” characterized by all or most of the following pattern:

Slow responses and timeouts in the admin WebUI and CLI, especially for commands that require node statistics, with many errors per minute in the messages, job, and celog (OneFS v7.x) or tardis (OneFS v8.x) logs, found by running:

isi_for_array -s "tail /var/log/messages"

displays messages such as

“Failed to send worker count stats”

“No response from isi_stats_d after 5 secs”

“Error while getting response from isi_celog_coalescer” (OneFS v7.x), or “Unable to get response from…”

Hangdumps, especially if there are multiple per hour, identified live in /var/log/messages on any node by running:

grep <today, as in 2017-05-11> /var/log/messages | grep -A1 "LOCK TIMEOUT AT"

The line following each match shows the hangdump being triggered:

isi_hangdump: Initiating hangdump…

Intermittent high CPU load, displayed as the 1-, 5-, and 15-minute load averages with:

isi_for_array -s uptime

High memory utilization by one or more services on one or more nodes, displayed by:

isi_for_array -s "ps -auwx"

Service timeouts and stack traces indicating memory exhaustion, and service or “Swatchdog” timeouts, displayed by:

isi_for_array -s "grep <today, as in 2017-05-11> /var/log/messages | grep -A3 -B1 Stack"
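As a quick sweep of the above, the following counts today's stats-daemon errors and hangdump initiations on every node. This is a minimal sketch that reuses the log strings quoted above and assumes the date format shown in the placeholders.

isi_for_array -s "grep <today, as in 2017-05-11> /var/log/messages | grep -c 'No response from isi_stats_d'"

isi_for_array -s "grep <today, as in 2017-05-11> /var/log/messages | grep -c 'Initiating hangdump'"

Interpret: Nodes with large or rapidly climbing counts are the ones exhausting resources and are where the checks below should start.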

This complex of symptoms indicates node resource exhaustion. It can be caused by long wait times, locking, and unbalanced workflow exceeding the capabilities of one or more nodes, including one of several known causes:

– Uptime bugs

– IB hardware or switch failures

– BMC/CMC gen5 controller unresponsive

– SyncIQ policy when-source-modified configured on an active path

– SyncIQ job degrades performance after upgrade to 8.x with default SyncIQ Performance Rules and limited replication IP pool

– isi commands locking drive_purposing.lock

This KB recommends quickly identifying or eliminating these known performance disruptors before proceeding with more detailed symptom troubleshooting.

Depending on the workflow, the timeouts, memory exhaustion, and stack traces may occur on one service more than another, such as lwio for SMB. When a pattern similar to the above is present, before implementing troubleshooting guides for a particular service (collecting lwio cores, etc.), run the following commands and record the outcome in a case comment to indicate or eliminate a “common cause” (a consolidated sketch follows this checklist).

uname -a

Interpret: Susceptible to 248 or 497 day uptime bugs: OneFS v7.1.0.0-7.1.0.6, 7.1.1.0-7.1.1.5, 7.2.0.0-7.2.0.3, and 7.2.1.0

Susceptible to the drive_purposing.lock condition: OneFS v7.1.1.0-7.1.1.9, 7.2.1.0-7.2.1.2, and 8.0.0.0

isi_for_array -s uptime

Interpret: Uptime at or near 248 or 497 days while uname -a indicates a susceptible version points to an uptime bug.

isi status

Interpret: If the command runs slowly, or statistics time out or display n/a n/a n/a, isi_stats_d is not communicating on one or more nodes.

If uname indicates a version susceptible to drive_purposing.lock, close any multiple WebUI instances and run isi_for_array "killall isi_stats_d"

isi_for_array -s "tail /var/log/ethmixer.log"

Interpret: Many state changes and ports registering as down (changes not reporting “is alive”) indicate an IB cable, card, or switch issue.

A lack of IB errors in the ethmixer log alongside intermittent statistics and service failures suggests refocusing on SyncIQ or the job engine.

/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware

Interpret: If the nodes are Gen 5 (X or NL series 210/410, HD400) and OneFS is below v8.0.0.4, the cluster is susceptible. No version output means the controller is unresponsive.

A responding controller does not fully eliminate this as a cause, since isi_stats_d and dependent services can fail while the controller is unresponsive and remain failed after the controller restarts.

isi sync policies list -v | grep Schedule

Interpret: If any policy shows “when-source-modified”, note the policy <name> and disable it: isi sync policies modify <name> --enabled false

isi sync rule list (for uname indicating 8.x only)

Interpret: If the display is blank, no Performance Rules exist and SyncIQ will run with 8.x defaults.

isi sync policies list -v | grep -A1 "Source Subnet"

Interpret: If blank, there are no IP pool restrictions. If subnet and pool restrictions are listed and the Sync Performance Rules are at defaults, the nodes participating in the limited replication pool may be overtaxed during SyncIQ jobs.
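Where it helps to capture the whole checklist in one pass for the case comment, the commands above can be strung together and saved. This is a minimal sketch using only the commands already listed; the output path is illustrative, and the isi sync rule list line should be skipped on 7.x as noted above.

( uname -a
isi_for_array -s uptime
isi status
isi_for_array -s "tail /var/log/ethmixer.log"
/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware
isi sync policies list -v | grep Schedule
isi sync rule list
isi sync policies list -v | grep -A1 "Source Subnet"
) | tee /ifs/data/Isilon_Support/common_cause_check.txt

Review the saved output against the Interpret notes above before moving on to the detailed sections below.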

More troubleshooting and indicator details on each “common cause”:

Uptime bugs

Indicators: More likely to include one or more nodes offline, recently restarted, or split from the cluster, and full-cluster data unavailability rather than intermittent symptoms.

Check uptime and code version for susceptibility at or near 248 or 497 days:

QuickTip: On any node CLI run

uname -a (OneFS versions susceptible to the 248- or 497-day bugs: 7.2.1.0, 7.2.0.0-7.2.0.3, 7.1.1.0-7.1.1.5, and 7.1.0.0-7.1.0.6)

isi_for_array -s uptime (nodes at or about 248 or 497 days?)
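When many nodes must be scanned, a rough filter can highlight uptimes approaching either threshold. This is a sketch that assumes the usual “up NNN days” wording in the uptime output; widen the day ranges if needed.

isi_for_array -s uptime | egrep "up (24[4-9]|49[0-7]) days"

Interpret: Any node returned by the filter is near a restart threshold and should be cross-checked against the susceptible versions above.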

For more information on uptime bug troubleshooting, and on identifying it as the cause of performance issues and nodes-offline behavior:

ETA 209918: Isilon OneFS: Nodes run for more than 248.5 consecutive days may restart without warning which may lead to potential data unavailability https://support.emc.com/kb/301837

ETA 202452: Isilon OneFS: Nodes that have run for 497 consecutive days may restart without warning https://support.emc.com/kb/301837

ETA 491747: Isilon OneFS: Gen5 nodes containing Mellanox ConnectX-3 adapters reboot after 248.5 consecutive days https://support.emc.com/kb/491747

Infiniband (IB) hardware or switch failure

Indicators: Less likely to be intermittent than to present as one or more nodes offline or split. If all nodes, or all of one IB channel (e.g., ib0), cannot be pinged, the issue is likely a failed or unpowered IB switch.

QuickTip

isi status takes a long time to run and displays one or more nodes with throughput/drive stats as n/a n/a n/a

/var/log/ethmixer.log shows many state changes and ports registering as down (changes not displaying status “is alive”)

ifconfig shows the IP addresses for ib0 and ib1; ping another node's IB address, and a failure indicates an IB cable, card, or switch issue.
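To gauge how widespread the link flapping is, a per-node count of state changes can be compared across the cluster. This is a sketch that assumes the events appear in the log as lines containing “statechange”.

isi_for_array -s "grep -ci statechange /var/log/ethmixer.log"

Interpret: Counts concentrated on one node point toward that node's cable or card; high counts on every node point toward the IB switch.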

For more information on IB errors and troubleshooting them: https://support.emc.com/kb/30183

BMC/CMC unresponsive

Indicators: Job, replication, celog, and gconfig errors, and unaccountably long job run times, when the OneFS version is older than 8.0.0.4 and the BMC firmware is older than version 1.25. If querying the BMC for its firmware version on any node returns no output, an unresponsive BMC/CMC is confirmed and is likely a contributor to the performance issues.

QuickTip

/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware

Note: A responding controller does not eliminate this as a contributing issue. Patches and versions prior to OneFS 8.0.0.4 added features to restart unresponsive management controllers; however, the impact on statistics and other dependent services such as jobs, celog, and replication can persist after the controller restarts. For more information on BMC errors and troubleshooting: https://support.emc.com/kb/466373
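Because the command queries the controller on the node where it runs, wrapping it in isi_for_array shows at a glance which nodes fail to answer; a minimal sketch:

isi_for_array -s "/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware"

Interpret: Any node that returns no firmware line has an unresponsive BMC/CMC.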

Sync policy with Schedule when-source-modified exists on an active path

Indicators: More likely associated with intermittent hangdumps and stack traces that track workflow peaks, with symptoms quiet during off-hours.

QuickTip

isi sync policies list -v | grep Schedule
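To pair each schedule with its policy name before disabling anything, the Name and Schedule fields can be pulled out together. This is a sketch that assumes those field labels appear in the verbose listing.

isi sync policies list -v | egrep "Name:|Schedule:"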

If “when-source-modified” is found on one or more policies, disable each such policy while troubleshooting:

isi sync policies modify <name> --enabled false

If many policies have that attribute, the quickest relief may come from disabling the isi_migrate service:

isi services -a isi_migrate disable

Continue with service troubleshooting and repair, restart node services such as isi_stats_d and isi_stats_hist_d, and change the sync policies to use a schedule rather than when-source-modified before re-enabling the Sync policy or service:

isi services -a isi_migrate enable

Sync job degrades performance after upgrade to 8.x

Indicators: After upgrading to 8.x without adjusting Performance Rules, a cluster without all nodes on the network, or with an administratively limited replication IP pool, can overtax resources on the few participating nodes when SyncIQ jobs run.

QuickTip

isi sync rule list displays nothing (Performance Rules are running at defaults)

isi sync policies list -v | grep -A1 "Source Subnet" (If blank, there are no IP pool restrictions; otherwise the subnet and pool are listed)

If there are no Performance Rules and the replication IP pool is restricted, disable the isi_migrate service until SyncIQ Performance Rules are added or the number of nodes participating in replication is increased, to reduce the imbalance of Sync workers on the participating nodes:

isi services -a isi_migrate disable
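To confirm the change took effect, the service state can be checked afterwards; this is a sketch that assumes the services listing includes isi_migrate:

isi services -a | grep isi_migrate

Interpret: The entry should show the service as disabled until Performance Rules are in place or more nodes participate in replication.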

isi commands locking drive_purposing.lock

Indicators: The issue can be verified offline from logs by reviewing hangdumps, but the key indicators during a live engagement are:

a) most likely to occur when multiple isi_statistics or WebUI commands run at the same time from different nodes, and

b) cluster is at OneFS versions 7.1.1.0-7.1.1.9, 7.2.1.0-7.2.1.2, or 8.0.0.0.

QuickTip

If the above indicators are present, stop any statistics commands and WebUI sessions running on all nodes and restart the isi_stats_d daemon with:

isi_for_array killall isi_stats_d
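After the restart, a quick check that the daemon came back on every node; a minimal sketch (the bracketed character keeps grep from matching its own process):

isi_for_array -s "ps -auwx | grep [i]si_stats_d"

Interpret: Each node should show a running isi_stats_d process, and the statistics in isi status should begin to repopulate.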
