Isilon OneFS 8.0.0.2: Intermittent node unresponsive or node split from the Infiniband Network due to isi_phone_home

Article Number: 494612 Article Version: 6 Article Type: Break Fix



Isilon,Isilon OneFS,Isilon OneFS 7.2.1.0,Isilon OneFS 7.2.1.1,Isilon OneFS 7.2.1.2,Isilon OneFS 7.2.1.3,Isilon OneFS 7.2.1.4,Isilon OneFS 7.2.1.5,Isilon OneFS 8.0.0.0,Isilon OneFS 8.0.0.1,Isilon OneFS 8.0.0.2,Isilon OneFS 8.0.0.3

The cluster generates node offline events, and checking the uptime of the nodes in the cluster shows that the affected nodes were split from the group.

You may also notice a lot of intermittent group changes.

For example:

2017-01-05T01:36:37-06:00 <0.5> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_rtxn.c:1787](pid 72308="kt: gmp-merge")(tid=101118) gmp merge took 33s

2017-01-07T01:55:11-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1734](pid 56721="kt: rtxn_split")(tid=100805) group change: <3,23351>

[up: 19 nodes, down: 1 node] (node 13 changed to down)

2017-01-07T01:55:11-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1735](pid 56721="kt: rtxn_split")(tid=100805) new group: <3,23351>:

{ 1:0-14,18-32, 2:0-14,18-31,36, 3-4:0-14,18-32, 5:0-7,9-10,12-14,18-32,38-39, 6-7:0-14,18-32, 8:0-13,17-31,35, 9:0-10,12-14,18-32,36, 10-12,14-18:0-14,18-32, 19:0-3,5-14,18-32,36, 20:0-14,18-32, down: 13 }

2017-01-07T01:55:11-06:00 <0.5> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_rtxn.c:2020](pid 56721="kt: rtxn_split")(tid=100805) gmp split took 20s

2017-01-07T01:56:20-06:00 <0.5> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_rtxn.c:1598](pid 56826="kt: gmp-merge")(tid=100578) gmp txn <1,1301> error 85 from rtxn_exclusive_merge_lock_get after 44001ms

2017-01-07T01:56:48-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1734](pid 56980="kt: rtxn_split")(tid=103079) group change: <9,23352>

[up: 18 nodes, down: 2 nodes] (node 20 changed to down)

2017-01-07T01:56:48-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1735](pid 56980="kt: rtxn_split")(tid=103079) new group: <9,23352>:

{ 1:0-14,18-32, 2:0-14,18-31,36, 3-4:0-14,18-32, 5:0-7,9-10,12-14,18-32,38-39, 6-7:0-14,18-32, 8:0-13,17-31,35, 9:0-10,12-14,18-32,36, 10-12,14-18:0-14,18-32, 19:0-3,5-14,18-32,36, down: 13, 20 }

2017-01-07T01:56:48-06:00 <0.5> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_rtxn.c:2020](pid 56980="kt: rtxn_split")(tid=103079) gmp split took 25s

2017-01-07T02:00:42-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1734](pid 56826="kt: gmp-merge")(tid=100578) group change: <13,23353>

[up: 19 nodes, down: 1 node] (node 13 changed to up)

2017-01-07T02:00:42-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1735](pid 56826="kt: gmp-merge")(tid=100578) new group: <13,23353>:

{ 1:0-14,18-32, 2:0-14,18-31,36, 3-4:0-14,18-32, 5:0-7,9-10,12-14,18-32,38-39, 6-7:0-14,18-32, 8:0-13,17-31,35, 9:0-10,12-14,18-32,36, 10-18:0-14,18-32, 19:0-3,5-14,18-32,36, down: 20 }

2017-01-07T02:00:42-06:00 <0.5> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_rtxn.c:1787](pid 56826="kt: gmp-merge")(tid=100578) gmp merge took 232s

2017-01-07T02:01:22-06:00 <0.4> ISILON-7(id7) /boot/kernel.amd64/kernel: [gmp_info.c:1734](pid 57751="kt: gmp-merge")(tid=102358) group change: <1,23354>

[up: 20 nodes] (node 20 changed to up)

On the nodes that were split from the group, you may notice that the device worker threads are not progressing and the following error is seen:

For example:

2017-01-05T01:35:02-06:00 <0.5> ISILON-13(id13) /boot/kernel.amd64/kernel: [rbm_core.c:811](pid 43="swi4: clock sio")(tid=100036) ping timeout no dwt progress

2017-01-07T01:54:58-06:00 <0.5> ISILON-13(id13) /boot/kernel.amd64/kernel: [rbm_core.c:811](pid 43="swi4: clock sio")(tid=100036) ping timeout no dwt progress

2015-06-08T09:00:25-05:00 <0.5> ISILON-3(id3) /boot/kernel.amd64/kernel: [rbm_core.c:811](pid 42="swi4: clock sio")(tid=100035) ping timeout no dwt progress

2015-06-08T09:00:27-05:00 <0.5> ISILON-3(id3) /boot/kernel.amd64/kernel: [rbm_core.c:811](pid 42="swi4: clock sio")(tid=100035) ping timeout no dwt progress

2016-12-24T16:20:07-06:00 <0.5> ISILON-9(id9) /boot/kernel.amd64/kernel: [rbm_core.c:811](pid 43="swi4: clock sio")(tid=100036) ping timeout no dwt progress

Another behavior you may see is failed attempts to create kernel threads because too many kernel threads already exist. This will show messages like the following:

/boot/kernel.amd64/kernel: kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: [ktp.c:148](pid 5460="kt: j_syncer")(tid=100378) kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: [ktp.c:148](pid 5460="kt: j_syncer")(tid=100378) kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: [ktp.c:148](pid 79932="kt: dxt3")(tid=107754) [ktp.c:148](pid 80002="kt: dxt0")(tid=107871) kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: [ktp.c:148](pid 79812="kt: dxt2")(tid=107720) kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: kt_new: backoff 1 sec

/boot/kernel.amd64/kernel: [ktp.c:148](pid 80005="kt: dxt1")(tid=107802) kt_new: backoff 1 sec


On the nodes that were split, you will see running processes that were created by the cron execution of the isi_phone_home script, which collects information on a schedule defined in cron.

For example:

root 228 0.0 0.0 26908 3348 ?? I 11:25AM 0:00.00 cron: running job (cron)

root 230 0.0 0.0 7144 1388 ?? Is 11:25AM 0:00.01 /bin/sh -c isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_stats.py

root 232 0.0 0.0 100532 13628 ?? I 11:25AM 0:00.11 python /usr/bin/isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_stats.py

root 234 0.0 0.0 7144 1388 ?? I 11:25AM 0:00.01 sh -c python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_stats.py

root 235 0.0 0.0 127608 17076 ?? I 11:25AM 0:00.16 python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_stats.py

When phone home runs, it tries to acquire an exclusive lock on the file /ifs/.ifsvar/run/phone_home.lock and then writes its process ID and node number to the phone_home.pid file. The phone_home processes pile up because each new instance tries to take this exclusive lock while it is still held by an earlier instance.

# ls -l /ifs/.ifsvar/run



-rw------- 1 root wheel 0 Jul 7 2016 phone_home.lock

-rw-r--r-- 1 root wheel 8 Feb 4 11:45 phone_home.pid

By default, the lock is acquired with a blocking flag: if a new instance of phone home attempts to get the lock on that file while it is already held by the first instance, the second instance will block and remain stuck until the first instance ends.
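The following minimal Python sketch illustrates this mechanism. The file paths are taken from the ls output above, but the function names and logic are illustrative assumptions, not the actual isi_phone_home source. The blocking flock() call is where each cron-launched instance queues up behind the lock holder; a non-blocking variant is shown for contrast:

import fcntl
import os

LOCK_FILE = "/ifs/.ifsvar/run/phone_home.lock"  # lock file from the ls output above
PID_FILE = "/ifs/.ifsvar/run/phone_home.pid"    # records the lock owner

def acquire_lock_blocking(path):
    # Blocking acquisition, as the article describes: a second instance
    # sleeps inside flock() until the first releases the lock, which is
    # how stuck instances pile up under cron.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX)
    return fd

def acquire_lock_nonblocking(path):
    # Non-blocking alternative: fail fast instead of queueing behind
    # the current lock holder.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:  # EWOULDBLOCK: another instance holds the lock
        os.close(fd)
        return None
    return fd

fd = acquire_lock_blocking(LOCK_FILE)
with open(PID_FILE, "w") as f:
    f.write("%d\n" % os.getpid())  # the real script also records the node number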

A permanent fix is already included in OneFS 8.0.1.0 (Halfpipe).

A permanent fix for the 7.2 branch is targeted for 7.2.1.6.

A permanent fix for 8.0.0.x is targeted for 8.0.0.5.

To work around this issue, disable the script as shown here and comment out the crontab entries on the cluster. This will only affect the automatic sending of periodic phone home logs via this feature; it will not affect dial home or any other cluster functions.

To disable the phone home script:

# isi_phone_home -d

or

# isi_phone_home --disable

This places a .disable file in the run directory mentioned earlier, and the script will check for this file before performing most of its operations. However, to completely disable the script, comment out these crontab entries in /etc/mcp/templates/crontab on any node in the cluster (the changes will be propagated to /etc/crontab on each node automatically; a sketch for verifying this follows the crontab block below):

# -- Start section of phone home

#min hour mday month wday who command

*/5 * * * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_stats.py

*/31 * * * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --script-file cluster_full_disk.py

47 11 * * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --script-file nfs_zones.py

13 00 * * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --list-file telemetry.list

0 * * * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --list-file everyhour.list

3 1 */7 * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --delete-data

17 1 */7 * * root isi_ropc -s python /usr/local/isi_phone_home/isi_phone_home --create-package

4 2 */7 * * root isi_ropc -s python /usr/local/isi_phone_home/utils/send_data.py

# -- End section of phone home
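To confirm that the change propagated and no entries remain active, you can scan /etc/crontab on a node. This is a minimal illustrative check written in Python (not an Isilon-provided tool); it assumes every active entry contains the string isi_phone_home, which is true of all the entries listed above:

# Illustrative check: list phone home cron entries that are still active.
with open("/etc/crontab") as f:
    active = [line.rstrip() for line in f
              if "isi_phone_home" in line and not line.lstrip().startswith("#")]

if active:
    print("Active phone home cron entries remain:")
    for entry in active:
        print("  " + entry)
else:
    print("All phone home cron entries are commented out.")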

Note: If there are already nodes in a locked state, it may be necessary to delete the .lock and .pid files and kill the processes holding them. If this is the case, please contact Technical Support for assistance.

Note: In some versions of OneFS 7.2, these cron entries may be in /etc/mcp/override/crontab.smbtime. If this is the case, comment out the entries there.
