High Availability Failovers Due to Missed HA HeartBeats of NetScaler VPX on VMware ESX Hypervisor

Using newnslog events to confirm that the VPX has scheduling issues

Check the failover events in /var/nslog/newnslog*:

nsconmsg -K newnslog -d event | grep -E "node|heartbeat" | more

Here is an example of what is seen for an HA failover due to missed HA heartbeats.

Primary Device:

(The Primary device is now Secondary due to the Secondary device not receiving HA heartbeats)

 2077 7537 PPE-0 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2062 0 PPE-1 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2064 0 PPE-2 self node 192.168.1.10: INIT due to REQUEST from HA peer node Tue Jul 26 10:20:25 2016
 2085 0 PPE-2 self node 192.168.1.10: Secondary Tue Jul 26 10:20:25 2016

Secondary Device:

(This Secondary device did not receive the required HA heartbeats from the Primary, causing an HA failover; it is now Primary. A NetScaler sends an HA heartbeat every 200 milliseconds, so 15 missed heartbeats corresponds to the 3-second dead interval.)

 2630 7529 PPE-0 interface(0/1): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2631 0 PPE-0 interface(1/1): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2632 0 PPE-0 interface(1/2): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2633 0 PPE-0 interface(1/3): No HA heartbeats (Last received: Tue Jul 26 10:20:24 2016; Missed 15 heartbeats) Tue Jul 26 10:20:27 2016
 2634 0 PPE-0 remote node 192.168.1.10: DOWN Tue Jul 26 10:20:27 2016
 2635 0 PPE-0 self node 192.168.1.20: Claiming Tue Jul 26 10:20:27 2016
 2636 0 PPE-0 self node 192.168.1.20: Primary Tue Jul 26 10:20:27 2016

Examining the netio_tot_called counter to confirm that VPX has scheduling issues

In the following logs, we see that counter logging stopped for a few seconds on both VPXs during the HA failover, which means that the VPX virtual machine was scheduled out by the hypervisor.

netio_tot_called – the number of times the netio function has been called. This function is called every time the NetScaler needs to start packet processing. The counter is logged every seven (7) seconds, so ideally the gap between consecutive samples should be seven seconds.

Collector bundle for 192.168.1.10 – /var/nslog/

nsconmsg -g netio_tot_called -d current -K newnslog -s time=26Jul2016:10:20 -s disptime=1 |more

 Index rtime totalcount-val delta rate/sec symbol-name&device-no&time
 0 3585223 287355050 56748 8105 netio_tot_called Tue Jul 26 10:20:08 2016
 1 7002 287381927 26877 3838 netio_tot_called Tue Jul 26 10:20:15 2016
 2 7002 287408841 26914 3843 netio_tot_called Tue Jul 26 10:20:22 2016
 3 7002 287554531 85636 12230 netio_tot_called Tue Jul 26 10:20:34 2016 <-- Here we have a 12-second gap; ideally it should have been just 7 seconds
 4 7002 287593240 38709 5528 netio_tot_called Tue Jul 26 10:20:41 2016
 5 7003 287621530 28290 4039 netio_tot_called Tue Jul 26 10:20:48 2016
 6 7003 287648373 26843 3833 netio_tot_called Tue Jul 26 10:20:55 2016
 7 7001 287676102 27729 3960 netio_tot_called Tue Jul 26 10:21:02 2016
 8 7004 287703248 27146 3875 netio_tot_called Tue Jul 26 10:21:09 2016
 9 7004 287730415 27167 3878 netio_tot_called Tue Jul 26 10:21:16 2016
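As a quick check, the rtime column (column 2, in milliseconds) can be scanned for values well above the 7,000 ms logging interval. Here is a minimal awk sketch (not a Citrix tool); the heredoc sample stands in for real output, so in practice you would pipe the nsconmsg command shown above into the awk filter instead. Note that a gap sometimes shows up only in the timestamps rather than in rtime, so compare the time column as well when rtime looks normal.

```shell
# Flag nsconmsg samples whose inter-sample time (rtime, column 2, in ms)
# exceeds the expected 7-second interval. Sample 0 is skipped because its
# rtime is measured from appliance start, not from the previous sample.
gaps=$(awk '$1 ~ /^[0-9]+$/ && $1 > 0 && $2 > 7500 {
    print "gap of " $2 " ms before sample " $1
}' <<'EOF'
 Index rtime totalcount-val delta rate/sec symbol-name&device-no&time
0 343090 246967167 26729 3817 netio_tot_called Tue Jul 26 10:20:07 2016
1 7001 246994115 26948 3849 netio_tot_called Tue Jul 26 10:20:14 2016
2 7003 247019658 25543 3647 netio_tot_called Tue Jul 26 10:20:21 2016
3 12698 247055240 35582 2802 netio_tot_called Tue Jul 26 10:20:33 2016
EOF
)
echo "$gaps"
```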

Collector bundle for 192.168.1.20 – /var/nslog/

nsconmsg -g netio_tot_called -d current -K newnslog -s time=26Jul2016:10:20 -s disptime=1 |more

 Index rtime totalcount-val delta rate/sec symbol-name&device-no&time
 0 343090 246967167 26729 3817 netio_tot_called Tue Jul 26 10:20:07 2016
 1 7001 246994115 26948 3849 netio_tot_called Tue Jul 26 10:20:14 2016
 2 7003 247019658 25543 3647 netio_tot_called Tue Jul 26 10:20:21 2016
 3 12698 247055240 35582 2802 netio_tot_called Tue Jul 26 10:20:33 2016 <-- Here is the 12-second gap
 4 7012 247125542 70302 10025 netio_tot_called Tue Jul 26 10:20:40 2016
 5 7001 247200102 25784 3682 netio_tot_called Tue Jul 26 10:20:55 2016

Examining the sys_cur_duration_since_start counter to confirm that VPX has scheduling issues

You can also verify this issue with the sys_cur_duration_since_start counter, which is updated every second and should therefore show a delta of seven (7) seconds between samples in the ideal case. Gaps in this uptime-reporting counter clearly indicate lost CPU time.

 9 7001 163.21:23:31 7 0 sys_cur_duration_sincestart Mon Aug 14 13:32:12 2017
 10 12201 163.21:23:43 12 0 sys_cur_duration_sincestart Mon Aug 14 13:32:25 2017 <-- Delta value more than 7
 11 7002 163.21:23:50 7 0 sys_cur_duration_sincestart Mon Aug 14 13:32:32 2017
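The same kind of scan works on this counter's delta column (column 4, in seconds). A minimal awk sketch, again using the sample output above as stand-in input for the real nsconmsg pipeline:

```shell
# Flag sys_cur_duration_sincestart samples whose delta (column 4, seconds)
# exceeds the expected 7-second logging interval.
uptime_gaps=$(awk '$1 ~ /^[0-9]+$/ && $4 > 7 {
    print "sample " $1 ": delta " $4 " seconds"
}' <<'EOF'
9 7001 163.21:23:31 7 0 sys_cur_duration_sincestart Mon Aug 14 13:32:12 2017
10 12201 163.21:23:43 12 0 sys_cur_duration_sincestart Mon Aug 14 13:32:25 2017
11 7002 163.21:23:50 7 0 sys_cur_duration_sincestart Mon Aug 14 13:32:32 2017
EOF
)
echo "$uptime_gaps"
```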

Citrix Documentation – Managing High Availability Heartbeat Messages on a NetScaler Appliance
