NetScaler MAS High Availability Database Sync Failure – Symptoms and Recovery

This article describes the usual symptoms, as well as checks that can be performed on a NetScaler MAS-HA setup running 11.1 or 12.0 based builds, to detect DB sync failure between the MAS nodes in HA. It also provides the steps to follow to recover from this situation.

Causes

  1. If the network connectivity between the nodes in MAS-HA breaks, DB sync will fail. This can lead to a split-brain situation where each MAS node claims to be the PRIMARY MAS node. Even after network connectivity resumes, it is highly likely that DB sync will not be re-established, because MAS runs DDL DB queries in PRIMARY mode, which increases the chances of DDL query replication failure once connectivity resumes.
  2. DDL query replication failure – If DDL queries run by MAS processes on one node fail to be replicated to the other node for some reason, DB sync will also break. This is a limitation of the 'BDR' software presently used by MAS to perform DB sync between the MAS nodes. In the upcoming 12.1 release, MAS will replace 'BDR' with 'Physical Replication' for DB sync between the nodes, thereby eliminating sync failure issues caused by DDL replication failure.

NetScaler MAS Health check

Before we look at the symptoms, let us look at a way to ascertain whether a MAS server node is in the UP state, i.e., all necessary MAS processes are running properly in the VM. In MAS-HA, this check must be applied on each node separately to verify that each node is UP and running properly.

An HTTP GET request to "/mas_health" can be sent to the MAS VM to check its status. If MAS is UP, the response payload contains "statuscode": 0. If MAS is DOWN, either no response is received or the payload contains "statuscode": 1. This GET request works without any password or authentication requirements.

Sample requests and responses

Request: http://10.102.126.243/mas_health

Response: {"statuscode":0, "is_passive":1}

The above response indicates that the node is UP, as the statuscode is 0.

Request: http://10.102.126.238/mas_health

Response: {"statuscode":1, "is_passive":0}

The above response indicates that the node is DOWN, i.e., not all necessary MAS processes are running in the MAS VM, as the statuscode is 1.
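
As a convenience, the shell sketch below polls the health endpoint of each node and reports its state. The node IPs are the example addresses from the samples above, and the sketch assumes curl is available on the machine running the check; adjust both for your environment.

    # Sketch: poll the /mas_health endpoint of each MAS node (example IPs from above).
    for node in 10.102.126.243 10.102.126.238; do
        status=$(curl -s "http://${node}/mas_health" | grep -o '"statuscode":[01]' | cut -d: -f2)
        if [ "$status" = "0" ]; then
            echo "$node: UP (statuscode 0)"
        else
            echo "$node: DOWN or unreachable"
        fi
    done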

Symptoms of DB sync failure

  1. In the MAS GUI, one of the MAS nodes appears in the DOWN state even though it is actually UP as reported by the MAS health check.

  2. Growth of the "/var/mps/db_pgsql/pg_xlog" folder. Usually, this folder is a few GB in size. In the event of DB sync failure, however, the folder keeps growing, depending on the incoming traffic to MAS and the load on MAS. This can result in the MAS disk filling up. The size of this folder can be checked using the command:

    du -h /var/mps/db_pgsql/pg_xlog

    If its size is seen to be more than 30 GB, this is indicative of a DB sync failure.
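
    A small check along these lines is sketched below, assuming the folder path above and the ~30 GB threshold mentioned in this article; du -sk is used so the size can be computed portably.

        # Sketch: warn if pg_xlog has grown past roughly 30 GB (threshold from this article).
        XLOG_DIR=/var/mps/db_pgsql/pg_xlog
        size_kb=$(du -sk "$XLOG_DIR" | awk '{print $1}')
        size_gb=$((size_kb / 1048576))
        if [ "$size_gb" -gt 30 ]; then
            echo "WARNING: $XLOG_DIR is about ${size_gb} GB - possible DB sync failure"
        else
            echo "$XLOG_DIR is about ${size_gb} GB - within the usual range"
        fi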

Checks to confirm DB sync failure

If symptoms of DB sync failure are seen in your MAS-HA setup, you can confirm that DB sync has indeed stopped by running a DB query on MAS using the command below:

/mps/db_pgsql/bin/psql -p 5454 -d mpsdb -U mpsroot -c "select * from pg_replication_slots"

Under the column "active", the value 't' is seen if sync is intact. If sync has stopped, the value 'f' is seen instead.
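
The sketch below is one way to make this check easier to read; it uses the same psql path, port, database, and user as the command above, and the -At flags produce unaligned, tuples-only output that is simple to parse.

    # Sketch: report replication slots whose "active" column is 'f'.
    /mps/db_pgsql/bin/psql -p 5454 -d mpsdb -U mpsroot -At \
        -c "select slot_name, active from pg_replication_slots" |
        awk -F'|' '{ if ($2 == "t") print $1 ": sync active"; else print $1 ": NOT active - possible DB sync failure" }'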

Recovery steps

Before starting recovery, make sure that both MAS nodes in HA are UP and running properly using the "mas_health" NITRO check, verify that the network between the nodes is intact with a simple PING check, and confirm that DB sync is actually failing despite everything else working fine. Then implement the recovery steps below. Essentially, recovery consists of breaking HA and then re-deploying HA. The steps are also described in the following documentation:

https://docs.citrix.com/en-us/netscaler-mas/11-1/netscaler-mas-ha-deployment.html

https://docs.citrix.com/en-us/netscaler-mas/12/deploy-netscaler-mas/ha-deployment.html

  1. Log in to the GUI of both MAS nodes and check which one appears to have all the relevant data. Because DB sync is failing, there might be some disparity in the number of NetScaler instances and other reporting data that each MAS shows. Pick the node that seems to have the expected set of devices and data, perform a "Break" HA from the GUI of this MAS node, and wait for the MAS GUI to come back up.

    Caveat: After break-HA, up to MAS release 12.0 57.19, the node from which break-HA was invoked might become inaccessible due to bug BUG0695723. If you SSH into the MAS and run the command "ps -ax | grep postgres", you will not see the DB processes running. You can apply the following workaround to recover from this bug (a consolidated sketch of these commands is provided after the recovery steps):

    • Run: masd stop
    • Stop the PostgreSQL DB using the command: su -l mpspostgres /mps/scripts/pgsql/stoppgsql.sh
    • Run: echo "track_commit_timestamp = on" >> /var/mps/db_pgsql/data/postgresql.conf
    • Start PostgreSQL using: su -l mpspostgres /mps/scripts/pgsql/startpgsql.sh
    • Run ps -ax | grep postgres and ensure the postgres processes appear in the output.
    • Run: masd start
  2. Go to the hypervisor console of the other node (not the one where break-HA was invoked) and, using the deployment_type.py prompt, register it again as the second server of an HA deployment. The First Server IP will be the same as the IP of the MAS where break-HA was invoked.

    Known issues:

    • The registration of the second server with the first server will fail if the "nsroot" user password has been changed to a non-default password, because the registration process expects the password to be "nsroot".
    • In the first server GUI, go to "System->Users" and edit the password of the "nsroot" user back to the default password "nsroot" in case it had been modified.
    • Set the system password of the second server back to "nsroot" as well, by running these commands:
      • Run pw mod user nsroot -h 0 and then, at the prompt, enter "nsroot"
      • Run pw mod user root -h 0 and then, at the prompt, enter "nsroot"
  3. Go to the GUI of the first MAS server and "Deploy" HA again from the System->Deployment page.
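
The consolidated sketch of the BUG0695723 workaround referenced in step 1 is shown below. It simply runs the commands listed in the caveat in sequence, on the node where break-HA was invoked and the DB processes are missing.

    # Sketch: BUG0695723 workaround after break-HA (commands as listed in the caveat above).
    masd stop
    su -l mpspostgres /mps/scripts/pgsql/stoppgsql.sh
    echo "track_commit_timestamp = on" >> /var/mps/db_pgsql/data/postgresql.conf
    su -l mpspostgres /mps/scripts/pgsql/startpgsql.sh
    ps -ax | grep postgres    # confirm that the postgres processes are running again
    masd start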
