ADM High Availability Database Sync Failure – Symptoms and Recovery

This article describes the usual symptoms of, and the checks that can be performed on, an ADM-HA setup running 11.1 or 12.0 based builds to detect DB sync failure between the ADM nodes in HA. It also provides the steps to follow to recover from this situation.

Causes of DB sync failure


  1. If the network connectivity between the nodes in ADM-HA breaks, DB sync will fail. This can lead to a split-brain situation where each ADM node claims to be the PRIMARY ADM node. Even after network connectivity resumes, it is highly likely that DB sync will not be re-established, because ADM runs DDL DB queries in PRIMARY mode, which increases the chances of DDL query replication failure once connectivity resumes.
  2. DDL query replication failure – If DDL queries run by ADM processes on one node fail to get replicated to the other node for some reason, DB sync will also break. This is a limitation of the ‘BDR’ software presently used by ADM to perform DB sync between the ADM nodes. In the upcoming 12.1 release, ADM will remove ‘BDR’ and instead rely on ‘Physical Replication’ for DB sync between the nodes, eliminating sync failures caused by DDL replication failure.

ADM Health check

Before we look at the symptoms, we will look at a way to ascertain whether an ADM server node is in the UP state, i.e., whether all necessary ADM processes are running properly in the VM. In ADM-HA, this check must be applied on each node separately to verify that each node is UP and running properly.

The HTTP GET request “/mas_health” can be fired at the ADM VM to check its status. If ADM is UP, the response payload will contain “statuscode”: 0. If ADM is DOWN, either no response will be received or the payload will contain “statuscode”: 1. This GET request requires no password or authentication.

Sample requests and responses


Request: GET http://<ADM-IP>/mas_health
Response: {“statuscode”:0, “is_passive”:1}

The above response indicates the node is UP, as the statuscode is ‘0’.


Request: GET http://<ADM-IP>/mas_health
Response: {“statuscode”:1, “is_passive”:0}

The above response indicates the node is DOWN, i.e., not all necessary ADM processes are running in the ADM VM, as the statuscode is ‘1’.
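The health check above can be scripted. A minimal sketch, assuming the payload format shown in the sample responses (the ADM address and the live `curl` call are illustrative placeholders):

```shell
# check_health classifies a /mas_health response payload as UP or DOWN,
# based on the statuscode field described in this article.
check_health() {
  case "$1" in
    *'"statuscode":0'*|*'"statuscode": 0'*) echo "UP" ;;
    *)                                      echo "DOWN" ;;
  esac
}

# On a live setup you would fetch the payload first, e.g.:
#   payload=$(curl -s "http://<ADM-IP>/mas_health")
payload='{"statuscode":0, "is_passive":1}'
check_health "$payload"   # prints UP
```

Run the same check against each node of the HA pair separately, as described above.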

Symptoms of DB sync failure

  1. In the ADM GUI, one of the ADM nodes appears to be in DOWN state even though it is actually UP, as reported by the ADM health check.

  2. Growth of the “/var/mps/db_pgsql/pg_xlog” folder. Usually, this folder is only a few GB in size. In the event of DB sync failure, however, it keeps growing, depending on the incoming traffic to ADM and the load on ADM, and can eventually fill up the ADM disk. The size of this folder can be checked using the command:

    du -h /var/mps/db_pgsql/pg_xlog

    If its size is seen to be more than 30 GB, it is indicative of a DB sync failure.
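The threshold comparison can be automated. A sketch, assuming the 30 GB threshold from this article; the `du -sk` measurement line is commented out so the logic can be shown on sample numbers:

```shell
# xlog_exceeds reports whether a folder size (in kilobytes) is above a
# threshold given in GB, e.g. the 30 GB pg_xlog threshold in this article.
xlog_exceeds() {
  size_kb=$1
  threshold_gb=$2
  if [ "$size_kb" -gt $(( threshold_gb * 1024 * 1024 )) ]; then
    echo "yes"
  else
    echo "no"
  fi
}

# On a live node you would measure the folder first, e.g.:
#   size_kb=$(du -sk /var/mps/db_pgsql/pg_xlog | awk '{print $1}')
xlog_exceeds 41943040 30   # 40 GB in KB -> prints yes
```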

Checks to confirm DB sync failure

If symptoms of DB sync failure are seen in your ADM-HA setup, you can confirm that DB sync has indeed stopped by running a DB query on ADM using the command below:

/mps/db_pgsql/bin/psql -p 5454 -d mpsdb -U mpsroot -c "select * from pg_replication_slots"


Under the column “active”, the value ‘t’ would be seen if sync is intact.

If sync has stopped, the value ‘f’ would be seen instead.

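The “active” column check can be scripted around the psql query above. A sketch: the function below reads one ‘t’/‘f’ value per line (as produced by psql in tuples-only, unaligned mode) on stdin; the live psql invocation is shown only as a comment:

```shell
# sync_status reads the 'active' column of pg_replication_slots
# (one t/f value per line) on stdin and reports the DB sync state.
sync_status() {
  if grep -qx 'f'; then
    echo "DB sync broken"
  else
    echo "DB sync intact"
  fi
}

# On a live node you would feed it real query output, e.g.:
#   /mps/db_pgsql/bin/psql -p 5454 -d mpsdb -U mpsroot -t -A \
#     -c "select active from pg_replication_slots" | sync_status
printf 't\nf\n' | sync_status   # prints DB sync broken
```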

Recovery steps

Once it is confirmed that both ADM nodes in HA are UP and running properly (using the “mas_health” NITRO check), that the network between the nodes is intact (using a simple PING check), and that DB sync is nevertheless failing, the following recovery steps can be implemented. Essentially, recovery consists of breaking HA and then re-deploying HA:

  1. Log in to the GUI of both ADM nodes and check which one has all the relevant data. Because DB sync is failing, there might be some disparity in the number of NSs and other reporting data that each ADM shows. Pick the node that has the expected set of devices and data, perform a “Break” HA from the GUI of that node, and wait for the ADM GUI to come back up.

    Caveat: After break-HA, up to ADM release 12.0 57.19, the node from which break-HA was invoked might become inaccessible. If you SSH into the ADM and run the command “ps -ax | grep postgres”, you will not see the DB processes running. Apply the following workaround to recover:

    • Do: “masd stop”
    • Stop the postgresql DB using the command “su -l mpspostgres /mps/scripts/pgsql/
    • Do: ‘echo “track_commit_timestamp = on” >> /var/mps/db_pgsql/data/postgresql.conf’
    • Start postgresql using: “su -l mpspostgres /mps/scripts/pgsql/
    • Do “ps -ax | grep postgres” and ensure the postgres processes appear in the output.
    • Do: “masd start”
  2. Go to the hypervisor console of the other node (not the one where break-HA was invoked) and, at the prompt, register it again as the second server of an HA deployment. The First Server IP will be the same as the IP of the ADM where break-HA was invoked.

    Known issues:

    • The registration of the second server with the first server will fail if the “nsroot” user password has been changed to a non-default password, because the registration process expects the password to be “nsroot”.
    • In the first server GUI, go to “System->Users” and edit the password of the “nsroot” user back to the default “nsroot” if it had been modified.
    • Set the system password of the second server back to “nsroot” as well, by running these commands:
      • “pw mod user nsroot -h 0” and then, at the prompt, enter “nsroot”
      • “pw mod user root -h 0” and then, at the prompt, enter “nsroot”
  3. Go to the GUI of the first ADM server and “Deploy” HA again from the System->Deployment page.
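The step-1 workaround above can be sketched as a dry-run script. Note that the pgsql stop/start script names below are PLACEHOLDER names, because this article truncates the real paths under /mps/scripts/pgsql/; verify both names on your ADM before running with DRY_RUN=0:

```shell
# Dry-run sketch of the break-HA postgres workaround from step 1.
# With DRY_RUN=1 (the default here) each step is only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

pg_workaround() {
  run masd stop
  run su -l mpspostgres /mps/scripts/pgsql/stoppgsql.sh    # PLACEHOLDER script name
  run sh -c 'echo "track_commit_timestamp = on" >> /var/mps/db_pgsql/data/postgresql.conf'
  run su -l mpspostgres /mps/scripts/pgsql/startpgsql.sh   # PLACEHOLDER script name
  run sh -c 'ps -ax | grep [p]ostgres'
  run masd start
}

pg_workaround   # with DRY_RUN=1 this only prints the command sequence
```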

