How Microsoft Service Witness Protocol Works in OneFS

The Service Witness Protocol (SWP) remote procedure call (RPC)-based protocol. In a highly available cluster environment, the Service Witness Protocol (SWP) is used to monitor the resource states like servers and NICs, and proactively notify registered clients once the monitored resource states changed.

This blog will talk about how SWP is implemented on OneFS.

In OneFS, SWP is used to notify SMB clients when a node is down/rebooted or NICs are unavailable. So the Witness server in OneFS need to monitor the states of nodes/NICs and the assignment of IP addresses to the interfaces of each pool. These information is provided by SmartConnect/FlexNet and OneFS Group Management Protocol (GMP).

The OneFS GMP is used to create and maintain a group of synchronized nodes. GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics. So that Witness service can get the states of nodes from the notification of GMP.

As for the information of IP addresses in each pool, SmartConnect/Flexnet provides the following information to support SWP protocol in OneFS:

  1. Locate Flexnet IP Pool given a pool member’s IP Address. Witness server can be aware of the IP pool it belongs to and get the other pool members’ info through a given IP address.
  2. Get SmartConnect Zone name and alias names through a Flexnet IP pool obtained in last step.
  3. Witness can subscribe to changes to the Flexnet IP Pool when the following changes occur:
    • Witness will be notified when an IP address is added to an active pool member or removed from a pool member.
    • Witness will be notified when a NIC goes from DOWN to UP or goes from UP to Down. So that the Witness will know whether an interface is available.
    • Witness will be notified when an IP address is moved from one interface to another.
    • Witness will be notified when an IP address will be removed from the pool or will be moved from one interface to another initiated by an admin or a re-balance process.

The figure below shows the process of Witness selection and after failover occurs.

Drawing1.jpg

  1. SMB CA supported client connect to a OneFS cluster SMB CA share through the SmartConnect FQDN in Node 1.
  2. The client find the CA is enabled, start the Witness register process by sending a GetInterfaceList request to Node 1.
  3. Node 1 returns a list of available Witness interface IP addresses to which the client can connect.
  4. The client select anyone interface IP address from the list (in this example is Node 2 which is selected as the Witness server). Then the client will send a RegisterEx request to Node 2, but this request will failed as OneFS does not this operation. RegisterEx is a new operation introduced in SWP version 2. OneFS only support SWP version 1.
  5. The client send a Register request to node 2 to register for resource state change notification of NetName and IPAddress (In this example, the NetName is the SmartConnect FQDN and IPAddress is the IP of Node 1)
  6. The Witness server (Node 2) process the request and returns a context handle that identifies the client on the server.
  7. The client sends an AsyncNotify request to Node 2 to receive asynchronous notification of the cluster nodes/nodes interfaces states changes.
  8. Assume Node 1 does down unexpectedly. Now, the Witness server Node 2 is aware of the Node 1 broken and sends an AsyncNotify response to notify the client about the server states is down.
  9. The SMB CA feature forces the client to reconnect to OneFS cluster using the SmartConnect FQDN. In this example, the SMB CA successfully failover to Node 3.
  10. The client sends a context handle in an UnRegister request to unregister for notifications from Witness server Node 2.
  11. The Winess server processes the requests by removing the entry and no longer notifies the client about the resource state changes.
  12. Step 12-17. The client starts the register process similar to step 2-7.

Related:

Leave a Reply