Scenario 1 – Postgres database is not started on the secondary node
Symptoms
– MAS daemons are not running on secondary node
– Postgres is not started
– There is no free space left on /var/mps
Probable cause
– Lack of free space on /var/mps
Solution
– Perform the full data resync, using the procedure at the end of this article.
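The free-space symptom above can be confirmed with df before starting the resync. The following is a minimal sketch, not part of the product: the function names parse_used_pct and check_space are hypothetical, and the 90% threshold is an arbitrary example value.

```shell
#!/bin/sh
# Extract the used-space percentage (digits only) from `df -P` output on stdin.
# In POSIX df output, the 5th column of the 2nd line is the capacity, e.g. "97%".
parse_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Report whether the filesystem holding a path is at or above a usage limit.
# $1 = path to check, $2 = limit in percent (assumed default: 90).
# Returns 0 if below the limit, 1 if at or above it, 2 if usage is unreadable.
check_space() {
    path=$1
    limit=${2:-90}
    pct=$(df -P "$path" | parse_used_pct)
    case "$pct" in
        ''|*[!0-9]*) echo "could not read usage for $path" >&2; return 2 ;;
    esac
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}
```

On the secondary node this would be called as check_space /var/mps. A plain df -h /var/mps gives the same information interactively.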
Scenario 2 – Postgres WAL process is not started on the secondary node
Symptoms
– MAS daemons are not running on secondary node
– Postgres is started
– Postgres WAL process is not started
– /var/mps/db_pgsql/data/pg_ctl.log reports that the WAL process cannot be started: “FATAL: timeline 3 of the primary does not match recovery target timeline 2”
Probable cause
The probable cause is an unclean shutdown or reboot of the primary node.
Solution
Perform the full data resync.
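To check for the timeline error described in the symptoms, the log can be searched with grep. This is a minimal sketch: the function name find_timeline_mismatch is hypothetical, and the pattern accepts any timeline numbers, on the assumption that the numbers in the message differ from case to case.

```shell
#!/bin/sh
# Print every timeline-mismatch error found in a Postgres log file.
# $1 = path to the log file.
# Returns 0 if at least one matching line exists, non-zero otherwise.
find_timeline_mismatch() {
    grep -E 'FATAL: +timeline [0-9]+ of the primary does not match recovery target timeline [0-9]+' "$1"
}
```

On the secondary node this would be called as find_timeline_mismatch /var/mps/db_pgsql/data/pg_ctl.log. Any output means that the symptom of this scenario is present.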
Scenario 3 – Postgres WAL is not streaming on the primary node
Symptoms
– MAS daemons are not running on secondary node
– Postgres is started
– Postgres WAL process is started on the secondary node
– No WAL sender process in the streaming state is running on the primary node
Probable cause
The difference in data between the primary and the secondary node is too large for replication to start automatically.
Solution
Perform the full data resync.
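The missing WAL sender on the primary node can be checked from the process list. This is a minimal sketch under one assumption: that the sender's process title contains "wal sender" and "streaming", by analogy with the receiver line shown at the end of this article. The exact title may differ between Postgres versions, and the function name has_streaming_sender is hypothetical.

```shell
#!/bin/sh
# Read `ps` output on stdin and succeed only if a WAL sender in the
# streaming state is listed. The [w] bracket keeps the grep process
# itself from matching when its command line appears in the ps output.
has_streaming_sender() {
    grep -i '[w]al sender' | grep -qi 'streaming'
}
```

On the primary node this would be used as follows: ps -ax | has_streaming_sender || echo "no streaming wal sender found". If the message is printed, the symptom of this scenario is present.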
Procedure to perform the full data resync:
1. Back up /var/mps/db_pgsql/data/pg_log and /var/mps/db_pgsql/data/pg_ctl.log
2. Clean up the /var/mps filesystem
3. Check the free space on /var/mps
4. Check the health of the disk and filesystem
5. Log on to the CLI of the secondary ADM node with the username nsrecover and the password of the nsroot user
6. Run the following command, replacing SecondaryIP and PrimaryIP with the actual IP addresses:
nohup sh /mps/scripts/pgsql/join_streaming_replication.sh SecondaryIP PrimaryIP nsroot > /var/mps/log/join_streaming_replication_console.log 2>&1 &
7. Monitor the output of the above command in /var/mps/log:
tail -f join_streaming_replication_console.log
8. Wait a few hours, then confirm that the HA channel is UP by running the following command on the secondary node:
ps -ax | grep -i wal
9. The following line in the output confirms that the channel is UP:
?? Ss 0:14.14 postgres: wal receiver process streaming
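The final check in steps 8 and 9 can be repeated automatically instead of by hand. This is a minimal sketch, not part of the product: the function names are hypothetical, and the defaults of 72 checks at 300-second intervals (about six hours) are example values chosen to match "a few hours".

```shell
#!/bin/sh
# Read `ps` output on stdin and succeed only if a WAL receiver in the
# streaming state is listed. The [w] bracket keeps the grep process
# itself from matching.
receiver_is_streaming() {
    grep -i '[w]al receiver' | grep -qi 'streaming'
}

# Poll the process list until the WAL receiver is streaming.
# $1 = maximum number of checks (assumed default: 72)
# $2 = seconds to sleep between checks (assumed default: 300)
# Returns 0 once the receiver is streaming, 1 if it never appears.
wait_for_receiver() {
    attempts=${1:-72}
    interval=${2:-300}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if ps -ax | receiver_is_streaming; then
            echo "HA channel is UP: wal receiver is streaming"
            return 0
        fi
        i=$((i + 1))
        sleep "$interval"
    done
    echo "wal receiver is not streaming after $attempts checks" >&2
    return 1
}
```

On the secondary node, wait_for_receiver with no arguments uses the defaults. If it returns 1, check join_streaming_replication_console.log in /var/mps/log for errors.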