Users unable to relaunch published applications; error: “Citrix Workspace will try to reconnect…”

Keep Alive policy settings:

https://docs.citrix.com/en-us/xenapp-and-xendesktop/7-15-ltsr/policies/reference/ica-policy-settings/keep-alive-policy-settings.html

Session reliability policy settings:

https://docs.citrix.com/en-us/xenapp-and-xendesktop/7-15-ltsr/policies/reference/ica-policy-settings/session-reliability-policy-settings.html

Session reliability on Citrix ADC high availability pair:

https://docs.citrix.com/en-us/citrix-adc/current-release/ns-ag-appflow-intro-wrapper-con/session-reliablility-on-citrix-adc-ha-pair.html


How to Enable Session Reliability on NetScaler in High Availability

This article describes how to enable session reliability on NetScaler in high availability.

Background

When a high availability failover occurs, ICA sessions are disconnected. To avoid ICA session disconnection on high availability failover, you can configure Session Reliability.

Points to Note

  • NetScaler appliances should be running software version 11.1 build 49.16 or later.
  • Do not enable or disable Session Reliability mode while the NetScaler appliances have active connections.
  • Enabling or disabling the feature while connections are still active causes HDX Insight to stop parsing those sessions after a failover occurs, resulting in loss of information about those sessions.


Citrix ADC High Availability Counters

This counter tracks the state of a node, based on its health, in a high availability setup. Possible values are:

  • UP – Indicates that the node is accessible and can function as either a primary or secondary node.
  • DISABLED – Indicates that the high availability status of the node is manually disabled. Synchronization and propagation cannot take place between the peer nodes.
  • INIT – Indicates that the node is in the process of becoming part of the high availability configuration.
  • PARTIALFAIL – Indicates that one of the high availability monitored interfaces has failed because of a card or link failure. This state triggers a failover.
  • COMPLETEFAIL – Indicates that all the interfaces of the node are unusable, because the interfaces on which high availability monitoring is enabled are not connected or are manually disabled. This state triggers a failover.
  • DUMB – Indicates that the node is in listening mode. It does not participate in high availability transitions or transfer configuration from the peer node. This is a configured value, not a statistic.
  • PARTIALFAILSSL – Indicates that the SSL card has failed. This state triggers a failover.
  • ROUTEMONITORFAIL – Indicates that the route monitor has failed. This state triggers a failover.


Deploying A New Surveillance System?

Deploying a new surveillance system? Test for the “what-ifs” of system-wide integration. Accelerate time to deployment, minimize risks, and overcome the complexities of surveillance system integration with the most comprehensive lab validation services in the industry.

“With the solution from Dell, we can guarantee 100% uptime, no data loss, and no service disruption, which means maximum business continuity.” — Enzo Palladini, Sales and Engineering Office Manager, Bettini Video

There can be no questions when it comes to the reliability of your surveillance infrastructure. Given what’s at stake—whether downtime in daily operations, loss of critical evidence, or worse …


Re: Clariion CX4 – Failed Drive, how can i tell if Hotspare has kicked in

There is one KB article for your reference: emc250611. The main steps are copied here:

To check whether a hot spare is actively replacing a failed disk from the Navisphere Manager:

  1. Navigate to the LUN folders.
  2. Go to the Unowned LUNs folder and expand it by clicking the plus symbol.
  3. Select a hot spare and right-click it.
  4. Go to the properties of the hot spare.
  5. Go to the Disk tab and check the status of the hot spare.

If the hot spare is replacing the failed disk, the status will be displayed as Active.

Alternatively, select the disk under the hot spare, right-click it, and select Properties. If the hot spare is invoked, the current state will display as Engaged, and under Hot Spare Replacing the status will display as Active.

For a command-line check, you can issue getdisk -hs (for example, via naviseccli). In this example, disk 1_0_8 is down; to check whether a hot spare was invoked:

getdisk -hs

Bus 1 Enclosure 0 Disk 6
Hot Spare: 24567: YES
Hot Spare Replacing: 1_0_8

Bus 1 Enclosure 0 Disk 7
Hot Spare: NO

Bus 1 Enclosure 0 Disk 8
State: Removed

As you can see, the removed drive 1_0_8 has been replaced by hot spare 1_0_6.


Software Defined Storage Availability (Part 2): The Math Behind Availability

As we covered in our previous post, ScaleIO can easily be configured to deliver 6-9’s of availability or higher using only two replicas, saving 33% of the cost compared to other solutions while providing very high performance. In this blog we will discuss the facts of availability using math and demystify the myths behind ScaleIO’s high availability.

For data loss or data unavailability to occur in a system with two replicas of data (such as ScaleIO), there must be two concurrent failures, or a second failure must occur before the system recovers from the first failure. Therefore, one of the following four scenarios must occur:

  1. Two drive failures in a storage pool OR
  2. Two node failures in a storage pool OR
  3. A node failure followed by a drive failure OR
  4. A drive failure followed by a node failure

Let us choose two popular ScaleIO configurations and derive the availability of each.

  1. 20 x ScaleIO servers deployed on Dell EMC PowerEdge R740xd servers with 24 SSD drives each (1.92TB per SSD), using a 4 x 10GbE network. In this configuration we will assume that the rebuild time is network bound.
  2. 20 x ScaleIO servers deployed on Dell EMC PowerEdge R640 servers with 10 SSD drives each (1.92TB per SSD), using a 2 x 25GbE network. In this configuration we will assume that the rebuild time is SSD bound.

Note: ScaleIO best practices recommend a maximum of 300 drives in a storage pool; therefore, for the first configuration we will configure two storage pools with 240 drives in each pool.

To calculate the availability of a ScaleIO system we will leverage a couple of well-known academic publications:

  1. RAID: High-Performance, Reliable Secondary Storage (from UC Berkeley) and
  2. A Case for Redundant Arrays of Inexpensive Disks (RAID).

We will adjust the formulas in these papers to the ScaleIO architecture and model the different failures.

Two Drive Failures

We will use the following formula to calculate the MTBF of a ScaleIO system for a two-drive-failure scenario:

Where:

  • N = Number of drives in a system
  • G = Number of drives in a storage pool
  • M = Number of drives per server
  • K = 8,760 hours (1 year)
  • MTBFdrive = MTBF of a single drive
  • MTTRdrive = Mean Time to Repair – the repair/rebuild time of a failed drive

Note: This formula assumes that two drives that fail in the same ScaleIO SDS (server) will not cause DU/DL as the ScaleIO architecture guarantees that replicas of the same data will NEVER reside on the same physical node.
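As a rough illustration, here is a minimal Python sketch of this model, assuming the classical sequential-failure (MTTDL) form from the cited RAID papers, adjusted per the note above so that the M drives on the first failed drive’s server are excluded. The single-drive MTBF and the per-pool drive count per server used in the example are assumptions, not values from this post:

HOURS_PER_YEAR = 8760  # K

def mtbf_two_drive_failures_years(N, G, M, mtbf_drive_h, mttr_drive_h):
    # Rate of a first drive failure anywhere in the system: N / MTBF_drive.
    # Probability that a second, data-losing drive (the G - M pool drives on
    # other servers) fails within the rebuild window: (G - M) * MTTR / MTBF.
    mtbf_hours = mtbf_drive_h ** 2 / (N * (G - M) * mttr_drive_h)
    return mtbf_hours / HOURS_PER_YEAR

# Illustrative call for configuration 1 (two pools of 240 drives, so roughly
# 12 drives per server per pool), assuming a 1,000,000-hour drive MTBF and
# the ~1.5-minute network-bound rebuild derived in the next section:
print(mtbf_two_drive_failures_years(N=480, G=240, M=12,
                                    mtbf_drive_h=1_000_000,
                                    mttr_drive_h=1.5 / 60))

With these assumed inputs the result lands in the tens of thousands of years, the same order of magnitude as the “Drive After Drive” rows in the tables further below.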

Let’s consider two scenarios: in the first, the rebuild process is constrained by network bandwidth; in the second, it is constrained by drive bandwidth.

Network Bound

Here we assume that the rebuild time/performance is limited by the available network bandwidth, as in a dense configuration such as the Dell EMC PowerEdge R740xd servers with a large number of SSDs per server. In this case, the MTTR function is:

Where:

  • S – Number of servers in a ScaleIO cluster
  • Network Speed – Bandwidth in GB/s available for rebuild traffic (excluding application traffic)
  • Conservative_Factor – a factor that adds extra time to complete the rebuild (to be conservative)

Plugging the relevant values into the formula above, we get an MTTR of ~1.5 minutes for the 20 x R740 configuration with 24 SSDs @ 1.92TB and 4 x 10GbE network connections (two storage pools with 240 drives per pool). The 20 x R640 configuration with 10 SSDs @ 1.92TB and 2 x 25GbE network connections provides an MTTR of ~2 minutes. These MTTR values reflect the superiority of ScaleIO’s declustered RAID architecture, which results in very fast rebuild times. In a later post we will show how these MTTR values are critical and how they impact system availability and operational efficiency.
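As a sanity check on those numbers, here is a minimal Python sketch of a network-bound rebuild-time estimate. It assumes the failed drive’s capacity is rebuilt in parallel across all servers using the network bandwidth available for rebuild; the conservative factor is an assumed value:

def mttr_network_bound_minutes(data_tb, servers, network_gbps_per_server,
                               conservative_factor=4.0):
    data_gb = data_tb * 1000.0                                   # data to rebuild
    rebuild_gb_per_s = servers * network_gbps_per_server / 8.0   # aggregate GB/s
    return data_gb / rebuild_gb_per_s / 60.0 * conservative_factor

# Configuration 1: one 1.92TB drive, 20 servers, 4 x 10GbE each.
print(mttr_network_bound_minutes(1.92, 20, 40))   # on the order of a minute or two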

SSD Drive Bound

In this case, the rebuild time/performance is bound by the number of SSD drives and the rebuild time is a function of the number of drives available in the system. This will be the case if you deploy less dense configurations such as the 1U Dell EMC PowerEdge R640 servers. In this case, the MTTR function is:

Where:

  • G – Number of drives in a storage pool
  • Drive_Speed – Drive speed available for rebuild
  • Conservative_Factor – a factor that adds extra time to complete the rebuild (to be conservative)
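A minimal Python sketch of the drive-bound case, assuming the surviving G - M drives in the pool share the rebuild work at an assumed per-drive rebuild speed:

def mttr_drive_bound_minutes(data_tb, G, M, drive_speed_gb_s,
                             conservative_factor=4.0):
    data_gb = data_tb * 1000.0
    rebuild_gb_per_s = (G - M) * drive_speed_gb_s   # aggregate drive bandwidth
    return data_gb / rebuild_gb_per_s / 60.0 * conservative_factor

# Illustrative call for configuration 2: one 1.92TB drive, a 200-drive pool,
# 10 drives per server, and an assumed 0.2 GB/s of rebuild I/O per drive.
print(mttr_drive_bound_minutes(1.92, 200, 10, 0.2))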

System availability is calculated by dividing the time the system is available and running by the total time the system was running plus the restore time. For availability we will use the following formula:

Where:

  • RTO – Recovery Time Objective, the amount of time it takes to recover a system after a data loss event (for example, if two drives fail in a single pool) where data needs to be restored from a backup system. We will be highly conservative and consider Data Unavailability (DU) scenarios as bad as Data Loss (DL) scenarios; therefore, we will use RTO in the availability formula.

Note: the only purpose of RTO is to translate MTBF to availability.
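A minimal Python sketch of this availability calculation; the RTO value in the example is an assumption, not a figure from this post:

HOURS_PER_YEAR = 8760

def availability(mtbf_years, rto_hours):
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    return mtbf_hours / (mtbf_hours + rto_hours)

# Example: a 43,986-year MTBF (the "Drive After Drive" row for configuration 1
# below) with an assumed RTO of one day still yields better than seven nines
# for that individual failure mode.
print(availability(43_986, 24))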

Node and Device Failure

Next, let’s discuss the system’s MTBF when a node failure is followed by a drive failure. For this scenario we will use the following model:

Where:

  • M = Number of drives per node
  • G = Number of drives in the pool
  • S = Number of servers in the system
  • K = Number of hours in 1 year i.e. 8,760 hours
  • MTBFdrive = MTBF of a single drive
  • MTBFserver = MTBF of a single node
  • MTTRserver = repair/rebuild time of failed server

In a similar way, one can develop the formulas for the other failure sequences, such as a node failure after a drive failure and a second node failure after a first node failure.
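As a rough illustration of this family of models, here is a minimal Python sketch of a generic sequential-failure MTBF: a first component fails, and data is lost only if one of the exposed second components fails before the first component’s rebuild completes. The server MTBF, drive MTBF, and exposed-drive count in the example are assumptions:

HOURS_PER_YEAR = 8760

def mtbf_sequential_years(mtbf_first_h, n_first,
                          mtbf_second_h, n_second_exposed, mttr_first_h):
    first_failure_rate = n_first / mtbf_first_h           # failures per hour
    p_second_in_window = n_second_exposed * mttr_first_h / mtbf_second_h
    return 1.0 / (first_failure_rate * p_second_in_window) / HOURS_PER_YEAR

# Illustrative "drive after node" call: 20 servers with an assumed
# 200,000-hour server MTBF, ~456 exposed drives with an assumed
# 1,000,000-hour drive MTBF, and a ~30-minute node rebuild.
print(mtbf_sequential_years(200_000, 20, 1_000_000, 456, 0.5))

With these assumed inputs the result is a few thousand years, the same order of magnitude as the “Drive After Node” rows in the tables below.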

Network Bound Rebuild Process

In this case we assume that rebuild time/performance is constrained by network bandwidth. We will make similar assumptions as for drive failure. In this case, the MTTR function is:

Where:

  • M – Number of drives per server
  • S – Number of servers in a ScaleIO cluster
  • Network Speed – Bandwidth in GB/s available for rebuild traffic (excluding application traffic)
  • Conservative_Factor – a factor that adds extra time to complete the rebuild (to be conservative)

Plugging the relevant values into the formula above, we get an MTTR of ~30 minutes for the 20 x R740 configuration with 24 SSDs @ 1.92TB and 4 x 10GbE network connections (two storage pools with 240 drives per pool). The 20 x R640 configuration with 10 SSDs @ 1.92TB and 2 x 25GbE network connections provides an MTTR of ~20 minutes. During system recovery, ScaleIO rebuilds about 48TB of data for the first configuration and about 21TB for the second configuration.
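The same kind of sketch as for a single drive applies here, with the full node’s data as the rebuild set; the conservative factor is again an assumed value:

def node_mttr_network_bound_minutes(drives_per_node, drive_tb, servers,
                                    network_gbps_per_server,
                                    conservative_factor=4.0):
    data_gb = drives_per_node * drive_tb * 1000.0               # whole node
    rebuild_gb_per_s = servers * network_gbps_per_server / 8.0
    return data_gb / rebuild_gb_per_s / 60.0 * conservative_factor

# Configuration 1: 24 x 1.92TB per node (~46TB, close to the ~48TB mentioned
# above) across 20 servers with 4 x 10GbE each lands in the range of the
# ~30 minutes quoted above.
print(node_mttr_network_bound_minutes(24, 1.92, 20, 40))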

SSD Drive Bound

In this case we assume that the rebuild time/performance is SSD drive bound, and the rebuild time is a function of the number of drives available in the system. Using the same assumptions as for drive failures, the MTTR function is:

Where:

  • G – Number of drives in a storage pool
  • M – Number of drives per server
  • Drive_Speed – Drive speed available for rebuild
  • Conservative_Factor – a factor that adds extra time to complete the rebuild (to be conservative)
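And a minimal drive-bound counterpart, assuming the surviving G - M drives in the pool share the rebuild of the failed node’s M drives at an assumed per-drive speed:

def node_mttr_drive_bound_minutes(G, M, drive_tb, drive_speed_gb_s,
                                  conservative_factor=4.0):
    data_gb = M * drive_tb * 1000.0
    rebuild_gb_per_s = (G - M) * drive_speed_gb_s
    return data_gb / rebuild_gb_per_s / 60.0 * conservative_factor

# Illustrative call for configuration 2 with an assumed 0.2 GB/s per drive.
print(node_mttr_drive_bound_minutes(200, 10, 1.92, 0.2))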

Based on the formulas above, let’s calculate the availability of a ScaleIO system for the two configurations:

20 x R740, 24 SSDs @ 1.92TB w/ 4 x 10GbE Network

(Deploying 2 storage pools w/ 240 drives per pool)

Failure Scenario       Reliability (MTBF)      Availability
Drive After Drive      43,986 [Years]          0.999999955
Drive After Node       6,404 [Years]           0.999999691
Node After Drive       138,325 [Years]         0.999999985
Node After Node        38,424 [Years]          0.999999897
Overall System         4,714 [Years]           0.99999952 or 6-9’s

20 x R640, 10 SSDs @ 1.92TB w/ 2 x 25GbE:

Failure Scenario       Reliability (MTBF)      Availability
Drive After Drive      105,655 [Years]         0.999999983
Drive After Node       27,665 [Years]          0.999999937
Node After Drive       276,650 [Years]         0.999999993
Node After Node        69,163 [Years]          0.999999975
Overall System         15,702 [Years]          0.99999989 or 6-9’s
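The “Overall System” rows follow from combining the four failure sequences: treating them as independent, their failure rates add, so the overall MTBF is the harmonic combination of the per-scenario MTBFs. A short Python check using the table values above (the RTO in the availability call is an assumed value):

HOURS_PER_YEAR = 8760

def overall_mtbf_years(scenario_mtbfs_years):
    return 1.0 / sum(1.0 / m for m in scenario_mtbfs_years)

def availability(mtbf_years, rto_hours):
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    return mtbf_hours / (mtbf_hours + rto_hours)

config1 = [43_986, 6_404, 138_325, 38_424]      # R740xd table above
config2 = [105_655, 27_665, 276_650, 69_163]    # R640 table above

print(overall_mtbf_years(config1))   # ~4,714 years, matching the table
print(overall_mtbf_years(config2))   # ~15,702 years, matching the table
print(availability(overall_mtbf_years(config1), 24))   # assumed one-day RTO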

Since these calculations are complex, ScaleIO provides its customers with free online tools to build hardware configurations and obtain availability numbers that include all possible failure scenarios. We advise customers to use these tools, rather than crunch complex mathematics, to build system configurations based on desired system availability targets.

As you can see, yet again, we have shown that the ScaleIO system easily exceeds 6-9’s of availability with just two replicas of the data. Unlike with other vendors, neither additional data replicas nor erasure coding is required! So, do you have to deploy three replica copies to achieve enterprise availability? No, you do not! The myth is BUSTED.




Re: Failed hot spare lun

I have a VNX array which has a hot spare LUN.

We had a failure in a pool, at which time the hot spare LUN took over.

However, the hot spare LUN also failed, resulting in corruption.

I tried to delete the pool, but the hot spare LUN is still marked in-use and I cannot drop it to replace the drive.

I'm stuck with the hot spare disk faulted; attempts to replace it are failing, and I cannot delete the hot spare LUN or RAID group.

Stuck…
