How to Fix Database Replication Issues in Various Situations

Scenario 1 – Postgres database is not started on the secondary node

Symptoms

– MAS daemons are not running on the secondary node

– Postgres is not running

– There is no free space left on /var/mps
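To confirm the space symptom before starting a resync, a quick check along these lines may help. This is a minimal sketch, assuming POSIX `df -P` output (the capacity is the fifth column of the data line); the function names `mps_usage_pct` and `mps_is_full` and the 95% default threshold are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: report how full the filesystem holding /var/mps is.
# Assumes POSIX `df -P` output, where the 5th column of the data
# line is the capacity, e.g. "97%".

# mps_usage_pct (hypothetical): reads `df -P` output on stdin and
# prints the used percentage (digits only) from the first data line.
mps_usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5; exit }'
}

# mps_is_full (hypothetical): succeeds when usage is at or above
# THRESHOLD percent; the default of 95 is an illustrative choice.
mps_is_full() {
    threshold=${1:-95}
    pct=$(mps_usage_pct)
    [ -n "$pct" ] && [ "$pct" -ge "$threshold" ]
}
```

For example, `df -P /var/mps | mps_usage_pct` prints the used percentage, and `df -P /var/mps | mps_is_full && echo "/var/mps is full"` flags the condition described above.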

Probable cause

– Lack of free space in /var/mps

Solution

– Perform the full data resync.

Scenario 2 – Postgres WAL process is not started on the secondary node

Symptoms

– MAS daemons are not running on the secondary node

– Postgres is running

– The Postgres WAL process is not started

– /var/mps/db_pgsql/data/pg_ctl.log contains a message saying that the WAL process cannot be started: “FATAL: timeline 3 of the primary does not match recovery target timeline 2”
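To tell Scenario 2 apart from the others, it may be enough to search pg_ctl.log for the timeline message quoted above. This is a minimal sketch: the log path is the one named in this article, and the helper names `has_timeline_mismatch` and `last_timeline_error` are hypothetical. The pattern allows any number of spaces after `FATAL:`, because PostgreSQL logs often pad that prefix.

```shell
#!/bin/sh
# Sketch: detect the Scenario 2 timeline error in pg_ctl.log.
# The default path is the one named in this article; it is a parameter
# so the function can also be pointed at a copy of the log.

# has_timeline_mismatch [LOGFILE] (hypothetical): succeeds if the log
# contains the "does not match recovery target timeline" FATAL message.
has_timeline_mismatch() {
    log=${1:-/var/mps/db_pgsql/data/pg_ctl.log}
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    grep -q 'FATAL: *timeline [0-9][0-9]* of the primary does not match recovery target timeline' "$log"
}

# last_timeline_error [LOGFILE] (hypothetical): print the most recent
# matching line, which shows the timeline numbers involved.
last_timeline_error() {
    log=${1:-/var/mps/db_pgsql/data/pg_ctl.log}
    grep 'does not match recovery target timeline' "$log" | tail -n 1
}
```

If `has_timeline_mismatch` succeeds on the secondary node, the symptoms match this scenario and the full data resync applies.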

Probable cause

The probable cause is an unclean shutdown or reboot of the primary node.

Solution

Perform the full data resync.

Scenario 3 – Postgres WAL is not streaming from the primary node

Symptoms

– MAS daemons are not running on the secondary node

– Postgres is running

– The Postgres WAL process is started on the secondary node

– There is no WAL sender process in the streaming state on the primary node
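Both the Scenario 3 symptom (no streaming WAL sender on the primary) and step 9 of the resync procedure (a streaming WAL receiver on the secondary) come down to reading `ps` output. The sketch below assumes PostgreSQL process titles of the form `postgres: wal sender process ... streaming ...` (9.x style, matching the receiver line shown in step 9) or `postgres: walsender ... streaming ...` (10 and later); the function name `wal_streaming` is hypothetical, and the pattern may need adjusting for other versions.

```shell
#!/bin/sh
# Sketch: check `ps -ax` output for a streaming WAL sender (run on the
# primary) or a streaming WAL receiver (run on the secondary).
# Assumed process titles:
#   "postgres: wal sender process ... streaming 0/3000060"  (9.x)
#   "postgres: walsender ... streaming 0/3000060"           (10+)
# and the receiver equivalents.

# wal_streaming ROLE (hypothetical): ROLE is "sender" or "receiver".
# Reads ps output on stdin; succeeds if a matching postgres process is
# in the streaming state.
wal_streaming() {
    case "$1" in
        sender|receiver) ;;
        *) echo "usage: wal_streaming sender|receiver" >&2; return 2 ;;
    esac
    # Drop grep's own process line, then look for the streaming title.
    grep -v grep | grep -Eiq "postgres: wal ?$1.*streaming"
}
```

On the primary, `ps -ax | wal_streaming sender || echo "no streaming WAL sender"` reports the Scenario 3 symptom; on the secondary, `ps -ax | wal_streaming receiver` succeeds once replication is streaming again.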

Probable cause

The difference in data between the primary and secondary nodes is too large, so replication cannot resume automatically.

Solution

Perform the full data resync.

Procedure to perform the full data resync:

1. Back up /var/mps/db_pgsql/data/pg_log and /var/mps/db_pgsql/data/pg_ctl.log.

2. Clean up the /var/mps filesystem.

3. Check the free space on /var/mps.

4. Check the health of the disk and filesystem.

5. Log on to the CLI of the secondary ADM with the username nsrecover and the password of the nsroot user.

6. Run the following command, replacing SecondaryIP and PrimaryIP with the actual IP addresses:

nohup sh /mps/scripts/pgsql/join_streaming_replication.sh SecondaryIP PrimaryIP nsroot > /var/mps/log/join_streaming_replication_console.log 2>&1 &

7. Monitor the output of the command:

tail -f /var/mps/log/join_streaming_replication_console.log

8. Wait a few hours, then confirm that the HA channel is UP by running this command on the secondary node:

ps -ax | grep -i wal

9. The channel is UP if the output contains a line like this:

?? Ss 0:14.14 postgres: wal receiver process streaming
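Step 1 has to finish before the cleanup in step 2 removes anything, so it can help to script the backup. The following is a minimal sketch, not part of the product: the function name `backup_pg_logs` and the destination directory are hypothetical, and it assumes a `tar` that supports `-z` and `-C` (GNU tar and FreeBSD's bsdtar both do).

```shell
#!/bin/sh
# Sketch of step 1: archive pg_log and pg_ctl.log before /var/mps is
# cleaned up. DEST is a hypothetical location; keep it outside /var/mps
# so the cleanup in step 2 cannot remove the backup.

# backup_pg_logs DEST [DATA_DIR] (hypothetical): writes
# DEST/pg_logs-<timestamp>.tar.gz and prints its path. DATA_DIR
# defaults to the data directory used in this article.
backup_pg_logs() {
    dest=$1
    data_dir=${2:-/var/mps/db_pgsql/data}
    [ -n "$dest" ] || { echo "usage: backup_pg_logs DEST [DATA_DIR]" >&2; return 2; }
    mkdir -p "$dest" || return 1
    archive="$dest/pg_logs-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C stores relative paths in the archive (pg_log/..., pg_ctl.log).
    tar -czf "$archive" -C "$data_dir" pg_log pg_ctl.log || return 1
    echo "$archive"
}
```

For example, `backup_pg_logs /some/other/fs/pg-backup` prints the path of the archive it created; `tar -tzf` on that path lists the saved files before you proceed with the cleanup.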

Related:

Domain Replication NTP (Firewall)


Good Morning,

I’ve deployed NTP (Network Threat Protection) on our Domain Controller, so that only the Microsoft-defined Active Directory ports are allowed. This works great for our workstations, but I noticed our secondary Domain Controller is now failing replication. As a result, I created a new rule to allow All traffic on all ports, inbound and outbound, to the Secondary Domain Controller IP address. Through the SEPM I can see this rule is allowing some types of traffic between the Primary and Secondary DC, yet replication still fails.

I know this is NTP-related, because if I disable the firewall on the primary DC, replication to the secondary DC (which has no firewall) succeeds.

So my question is: what other NTP feature could cause replication to fail despite an explicit rule allowing All traffic between these two servers? I’ve attached a screenshot of the rule that applies to the Primary Domain Controller, where the IP of the secondary DC is added under “Hosts”.

The rules below that one allow specific AD ports for all hosts, plus some blocking rules, which should not apply to the Secondary DC because this rule is first in the sequence, above all others.

Any guidance would be appreciated, I’ve been struggling with this for days now.


Related:

Symantec 14.2 RU1 – Overdeployed


Hi all! 

Good day to everyone. I would like to ask for your advice and help. For the past few weeks, I have been receiving notifications that I have overdeployed clients (inconsistently; the alert mostly triggers during replication).

It’s not a big number, but it usually happens when my two SEPMs replicate.

To help understand the situation, here is my current infrastructure:

1. One US SEPM and one ASIA SEPM (database replication)

2. Database replication every 4 hours

3. 7,400 seats

4. Clients are removed after 3 days

5. Non-persistent clients are removed after 1 day

6. The alert only appears on one SEPM at a time (the US SEPM shows overdeployed on the dashboard while the ASIA SEPM does not)

Something to consider:

– I have no control over when new machines are installed or removed.

Now, my concern: I checked with Symantec support, but honestly the response was slow and no solution was given (I gave them SymDiag output and logs, but they could not properly explain what is happening).

Basically, what I want to understand is:

* Where can I see proof, aside from the alert, that I am really short of licenses (compute status, etc.)? Management will ask me if that is the case, since budgets are tight.

* If this is an issue, where can I check? Is it the database, etc.?

Thank you,


Related:


SEP Replication Partner Failover?


Hi,

Is it possible to configure automatic failover between replication partners?

At the moment, I have a central base with multiple sites, each using a single replication partner to communicate with this central base.

The way it is currently configured means I have a single point of failure at each site: if the replication server at any remote site were to fail, communication would be lost until it is restored.

I have two SEPMs at each site. The first SEPM was installed and configured as a replication partner, and the second one was configured as a site partner.

Can I configure replication failover between both SEPMs at these remote sites?

Thanks,

Jamie


Related:

Replication is failing with partner site


After upgrading from 14 RU1 MP2 to 14.2 MP1, I kept getting an unexpected error in the notification panel on the Admin tab, and for the past couple of days replication with the partner site has been failing with the error below:

March 8, 2019 2:24:05 PM GMT:  Replication from remote site WXWD1PSEM0002 to local site London_WXLN2PSEM0001 finished unsuccessfully  [Site: London_WXLN2PSEM0001]  [Server: WXLN2PSEM0001]
March 8, 2019 2:24:05 PM GMT:  Unable to reach remote Site [WXWD1PSEM0002]: Failed to connect to the server.

Make sure that the server is running and your session has not timed out.
If you can reach the server but cannot log on, make sure that you provided the correct parameters.
If you are experiencing network issues, contact your system administrator. ErrorCode: 0x80020000  [Site: London_WXLN2PSEM0001]  [Server: WXLN2PSEM0001]

I have checked the network connection between the local site and the remote site, and it seems to be fine: the servers can ping each other and can telnet to each other on port 443.

Both sites have a dedicated SQL Server 2012 database.

Please advise.


Related:

Easy, Economical Cloud DR to AWS with RecoverPoint for Virtual Machines

The most recent RecoverPoint for Virtual Machines v5.2.1 release adds the capability to protect VMs directly to AWS S3 object storage, using proprietary snap-based replication with an RPO that can be measured in minutes. This blog recaps the capabilities that Cloud DR 18.4 unlocks for RecoverPoint for Virtual Machines. RecoverPoint for Virtual Machines works with Cloud DR to protect your VMs by replicating them to the AWS cloud. Replicated data is compressed, encrypted, and stored as incremental snapshots on Amazon S3 object storage. You can set parameters around the snap-replication policies for reliable and repeatable Disaster …

Related:

SEPM SQL Issues


Hi Everyone,

When I try to run replication between the SEPM Prod and DR servers, I get the errors below:

February 11, 2019 11:20:24 AM PST:  Replication from remote site XXXX. – DR to local site XXX. finished unsuccessfully  [Site: XXXX.]  [Server: ]
February 11, 2019 11:20:23 AM PST:  Unable to fetch changed data from remote site [XXX – DR]: Failed to load data. If packet size is not too large, please modify it by scm.bcp.packet.size. Detailed message
: SQLState = 08001, NativeError = 21
Error = [Microsoft][ODBC Driver 11 for SQL Server]Encryption not supported on the client.
SQLState = 08001, NativeError = 21
Error = [Microsoft][ODBC Driver 11 for SQL Server]Client unable to establish connection
SQLState = 08001, NativeError = -2146893007
Error = [Microsoft][ODBC Driver 11 for SQL Server]SSL Provider: The client and server cannot communicate, because they do not possess a common algorithm.
SQLState = 08001, NativeError = -2146893007
Error = [Microsoft][ODBC Driver 11 for SQL Server]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.
  [Site: PayPal Inc.]  [Server: XXXX]
February 11, 2019 11:20:19 AM PST:  Client activity logs have been swept.  [Site: XXX.]  [Server: XXXX]
February 11, 2019 11:20:15 AM PST:  Replication data from remote site XXX. – DR is received by local site XXX  [Site: XXX.]  [Server: XXXX]

Please let me know how I can resolve this. I have checked the ODBC connections and they are working fine.

Thanks,

Sundeep
