I’ve read the entire QRadar SIEM High Availability Guide for 7.3.1 and am still struggling to design a disaster recovery solution for our QRadar systems (two 3105 All-in-One appliances). I’ve also read different topics on this subject with very good explanations by JonathanPechtaIBM.
We are looking for a solution which offers **almost no data loss** in case of failure of Site A. Yes, we have a Site A and a Site B.
There are three DR deployment scenarios according to the HA Guide.
Option1: Primary QRadar Console and backup console
Option2: Event and flow forwarding
Option3: Distributing the same events and flows to the primary and secondary sites.
**Option1** describes console failover, where I would have a hot console and a cold standby. In case of failure, I have to manually start the cold console, change the IP, and apply the backup of the failed machine. In this scenario there is NO DATA SYNC; I have to restore data manually. Once my first machine is restored, I must manually copy the delta data back to the primary. Data can be lost during the failover period. Therefore, this option is discarded.
**Option2:** Event and flow forwarding. I have similar deployments on both sites, and both are active. Events and flows have to be forwarded from the first system to the secondary system using:
A) off-site targets (configured under System and License Management)
B) routing rules. There are two modes, Online and Offline, which are configured under “Forwarding Destinations” and “Routing Rules”. (There is a very good explanation here: https://www.ibm.com/developerworks/community/forums/html/topic?id=b8be5e81-d1ed-452b-bf55-7659f78684fb)
Online mode uses best effort, which can cause data loss if there is no communication between sending and listening devices. Therefore, it is discarded.
Offline mode sends the data after it has been written to disk, so there is a sync delay of more than about one minute because Ariel writes data every minute. No data is lost because the offline process uses bookmarks to keep track of the last sent data. This seems to be a good method for fulfilling my requirements.
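To make the bookmark idea concrete, here is a minimal Python sketch of how I understand it. It is only a toy model and not QRadar's actual implementation: `OfflineForwarder`, `Link`, and `forward_pending` are hypothetical names I made up, and the list stands in for data already written to disk.

```python
from typing import Callable, List


class Link:
    """Hypothetical connection to the secondary site; can be up or down."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def send(self, record: str) -> bool:
        # Return False instead of delivering when the link is down.
        if not self.up:
            return False
        self.received.append(record)
        return True


class OfflineForwarder:
    """Toy model: forward records that are already stored, tracking a bookmark."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        self.send = send
        self.bookmark = 0  # index of the next stored record to forward

    def forward_pending(self, stored: List[str]) -> int:
        """Forward everything after the bookmark; stop at the first failure."""
        sent = 0
        while self.bookmark < len(stored):
            if not self.send(stored[self.bookmark]):
                break  # link down: keep the bookmark and retry on the next cycle
            self.bookmark += 1
            sent += 1
        return sent
```

Because the bookmark only advances after a successful send, records stored while the link is down are picked up on the next cycle instead of being skipped, which is what I would not get from best-effort online mode.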
**[Question1]**: What is the difference between off-site targets and routing rules using the offline mode? If there is none, why are there two options?
In the guide it also says on page 33 “Periodically, use the content management tool to update content from the primary QRadar to the secondary”.
**[Question2]**: What is meant by “content” in this sentence? Apps, DSMs? In general, content that is not available in the backup file?
It is also mentioned on page 33: “In the case of a failure at site 1, you can use a high-availability (HA) deployment to trigger an automatic failover”.
**[Question3]:** In this case, one should be aware of the latency limitation between the two sites. Moreover, it is no longer necessary to forward events and flows using either of the two methods mentioned above, right?
**[Question4]**: Using routing rules in “online” mode, it is possible to drop the data and bypass the CRE after it has been forwarded. What are the use cases for that? A system sends events to QRadar and we would like to forward some of them to another system, but do not want to store them on QRadar? Or let the CRE test some of them but just store them for logging reasons?
**Option3**: Distributing the same events and flows to the primary and secondary sites
In this scenario I have a load balancer or another similar component which is responsible for sending data to both sites. If Site A fails, Site B is still active. Both components have different IP addresses, and it is not necessary to forward data or to back up or restore any data. This seems to be the most expensive option, because both sites should have similar architectures and there is the load balancer.
**[Question5]**: The load balancer represents a single point of failure (SPOF) and should therefore be planned redundantly, right? According to picture 3 on page 37, all data is sent to a load balancer on site 1. What happens if it fails?
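As a rough illustration of what I expect from the distributing component, here is a small Python sketch. It is just my mental model, not how any real load balancer or QRadar component works: `distribute` and the site callables are hypothetical names.

```python
from typing import Callable, Dict, List


def distribute(event: str, sites: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send one event to every site; return the names of the sites that accepted it."""
    delivered: List[str] = []
    for name, deliver in sites.items():
        try:
            deliver(event)
            delivered.append(name)
        except ConnectionError:
            # A failed site must not block delivery to the remaining sites.
            continue
    return delivered
```

The point is that each site gets its own copy independently, so a failure of Site A does not affect what Site B receives. The function itself is still the single place everything passes through.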
I know that if a whole site fails, I have more things to worry about than logging, but I would like to go through all the methods.
To sum up: the method to be chosen should be Option2 with offline mode, right?
Thank you in advance
PS: This video (https://www-01.ibm.com/support/docview.wss?uid=swg21997652) provides a wrong definition at 0:27. It says “in **online** mode all data is stored in the database and then forwarded”. This is wrong, right?