Delivering Softphones with Virtual Apps and Desktops

This article describes a generic approach to delivering softphones and voice chat applications with Citrix Virtual Apps and Desktops (CVAD) 7.x.

1. Alternatives for Delivering Softphones

CVAD supports several alternatives for delivering softphones.

  • Control mode, where the hosted (published) softphone in XenApp or XenDesktop simply controls a physical telephone set. In this mode, no audio traffic goes through the XenApp or XenDesktop server. TCP/IP connectivity between the softphone in the virtual desktop and the phone device must exist, either directly or through a Communications Manager server (e.g. CUCM).
  • Local App Access, a XenApp and XenDesktop feature that allows an application such as a softphone to run locally on the user’s Windows device yet appear seamlessly integrated with their virtual/published desktop. This offloads all audio processing to the user device. This may not be viable if the softphone needs to share information with an app that is hosted on XenApp/XenDesktop (e.g. Outlook), depending on how they communicate.
  • HDX generic softphone support (VoIP-over-ICA).


This article focuses on generic softphone support, where an unmodified softphone is hosted on Citrix Virtual Apps and Desktops in the data center and the audio traffic goes over the Citrix ICA protocol (preferably using UDP/RTP) to the user device running the Citrix Workspace app. Generic softphone support is a feature of HDX, which also includes technologies for real-time video (see CTX124516 – How to Optimize HDX MediaStream Server-Rendered Video and Citrix Documentation – Improve video conference performance).

This approach to softphone delivery is especially valuable when:

  • An optimized solution for delivering the softphone is not available and the user is not on a Windows device where Local App Access could be used;
  • The media engine needed for optimized delivery of the softphone has not been installed on the user device or is not available for the operating system version running on the user device; in this scenario, HDX technologies provide a valuable fallback solution.

2. Generic Softphone Support

There are two aspects to softphone delivery using Citrix Virtual Apps and Desktops:

  1. How the softphone application is delivered to the virtual/published desktop
  2. How the audio is delivered to and from the user’s headset, microphone, and speakers, or USB telephone set

XenApp and XenDesktop 7.6 and higher include numerous technologies to support generic softphone delivery:

  • Optimized-for-Speech codec for fast encode of real-time audio and bandwidth efficiency
  • Low latency audio stack
  • Server-side jitter buffer to smooth out the audio when network latency fluctuates
  • Packet tagging (DSCP and WMM) for QoS
    • DSCP tagging for RTP packets (Layer 3)
    • WMM tagging for Wi-Fi
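
To make the Layer 3 tagging concrete: DSCP occupies the upper six bits of the IP header's Type of Service byte. The Python sketch below is not how HDX applies its tags (the product does that itself, driven by policy). It only illustrates, assuming a plain UDP socket on an operating system that honors `IP_TOS`, what a DSCP value looks like when converted to the byte that goes on the wire. The value 46 (Expedited Forwarding, a common choice for voice) and the function names are assumptions made for this example.

```python
import socket

# Expedited Forwarding; a common marking for voice (assumed for this example).
EF_DSCP = 46


def dscp_to_tos(dscp: int) -> int:
    """Return the TOS byte for a DSCP value (DSCP is the upper 6 bits)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0..63")
    return dscp << 2


def open_tagged_udp_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Create a UDP socket whose outgoing packets request the given DSCP.

    Some operating systems (notably Windows) may ignore IP_TOS set this way.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock
```

Routers and switches along the path still have to be configured to trust and act on the marking; setting it on the sender is only half of the QoS picture.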

The Workspace app (formerly Citrix Receiver) versions for Windows, Linux, Mac, Android and Chrome are also VoIP capable.

Citrix Workspace app for Windows offers:

  • Client-side jitter buffer – Ensures smooth audio even when network latency fluctuates
  • Echo cancellation – Allows for greater variation in the distance between microphone and speakers for workers who do not use a headset
  • Audio plug-n-play – Audio devices do not need to be plugged in before starting a session; they can be plugged in at any time
  • Audio device routing – Users can direct ringtone to speakers but the voice path to their headset
  • Multi-stream ICA – Enables flexible Quality of Service (QoS)-based routing over the network
  • ICA supports 4 TCP and 2 UDP streams; one of the UDP streams supports real-time audio over RTP
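
The jitter buffer mentioned in the list above can be pictured with a small Python sketch. It is a generic, simplified illustration of the technique and not the HDX implementation: packets carry a sequence number, the buffer holds back a fixed number of packets before playout starts, and it then releases them in order. The class name, the fixed `depth` and the policy of dropping late packets are all assumptions made for the example.

```python
import heapq
from typing import List, Optional, Tuple


class JitterBuffer:
    """Reorders packets by sequence number; playout starts after `depth` packets."""

    def __init__(self, depth: int = 3) -> None:
        self.depth = depth  # packets held back before playout starts
        self._heap: List[Tuple[int, bytes]] = []
        self._next_seq: Optional[int] = None  # None until playout has started

    def push(self, seq: int, payload: bytes) -> None:
        # Drop packets that arrive after their playout slot has passed.
        if self._next_seq is not None and seq < self._next_seq:
            return
        heapq.heappush(self._heap, (seq, payload))

    def pop(self) -> Optional[bytes]:
        """Return the next payload in order, or None if nothing can be played."""
        still_filling = self._next_seq is None and len(self._heap) < self.depth
        if still_filling or not self._heap:
            return None
        seq, payload = heapq.heappop(self._heap)
        self._next_seq = seq + 1
        return payload
```

A deeper buffer tolerates more variation in network latency at the cost of added delay, which is the trade-off any real-time audio stack has to balance.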

For a summary of XenApp and XenDesktop audio features, see the Citrix Documentation – Audio Features.

For a summary of Citrix Workspace app capabilities, see Citrix Workspace app Feature Matrix.

2.1. Delivering Softphone Applications to the Virtual Desktop

There are three methods by which a softphone can be delivered to the virtual desktop:

  • The application can be installed in the virtual desktop image.
  • The application can be streamed to the virtual desktop using Microsoft App‑V. This approach has manageability advantages because the virtual desktop image is kept uncluttered. Once streamed to the virtual desktop, the application executes in that environment just as if it had been installed in the usual manner. [Note that not all applications are compatible with App-V.]
  • In addition, Citrix App Layering can be used.

2.2. Delivering Audio to and from the User Device

HDX RealTime supports three methods of delivering audio to and from the user device:

  • The optimized Citrix Audio Virtual Channel
  • Generic USB Redirection
  • Composite USB Redirection, a hybrid mode (see 2.2.3)

The Citrix Audio Virtual Channel is generally recommended because it is designed specifically for audio transport. Generic USB Redirection is useful for audio devices with buttons and/or a display (that is, HID devices), provided the user device has a LAN or LAN-like connection back to the XenApp/XenDesktop VDA server.

2.2.1 Citrix Audio Virtual Channel

The bidirectional Client Audio Mapping Virtual Channel (CTXCAM) enables audio to be delivered very efficiently over the network.

HDX takes the audio from the user’s headset or microphone, compresses it, and sends it over ICA to the softphone application on the virtual desktop. Likewise, the audio output of the softphone is compressed and sent in the other direction to the user’s headset or speakers.

This compression is independent of the compression used by the softphone itself (such as Opus, G.729 or G.711). It is done using the HDX Optimized-for-Speech codec (“Medium Quality”). Its characteristics are ideal for voice-over-IP (VoIP). It features quick encode time (~34 msec) and it consumes only [approximately] 32 kilobits per second of network bandwidth (16 kbps in each direction). Note that this codec must be explicitly selected in the Studio console as it is not the default audio codec; the default is the HD Audio codec (“High Quality”) which is excellent for high fidelity stereophonic soundtracks but is slower to encode compared to the Optimized-for-Speech codec.

Bandwidth guidelines for audio playback and recording:

  • High quality (default)
    • Bitrate : ~100 kbps [VBR min 75 ; max 175 kbps] for playback / ~70 kbps for microphone capture
    • Number of Channels : 2 (Stereo) for playback / 1 (mono) for microphone capture
    • Frequency : 44100 Hz
    • Bit-depth : 16-bit
  • Medium quality (recommended for VoIP)
    • Bitrate : ~ 16 kbps [min 10 ; max 40 kbps] for playback / ~16 kbps for microphone capture
    • Number of Channels : 1 (Mono) for both playback and capture
    • Frequency : 16000 Hz (wideband)
    • Bit-depth : 16-bit
  • Low quality
    • Bitrate : ~ 11 kbps [min 10 ; max 20 kbps] for playback / ~11 kbps for microphone capture
    • Number of Channels : 1 (Mono) for both playback and capture
    • Frequency : 8000 Hz (narrowband)
    • Bit-depth : 16-bit


When digitizing an audio signal, the bit-rate is defined as the number of bits per unit of time required to encode the audio. It is measured in kilobits per second.

HDX uses lossy codecs for audio compression, so the perceived quality generally improves as the bitrate used for the encoding process increases.
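
As a rough illustration of how much the codecs compress, the Python sketch below computes the uncompressed (PCM) bitrate from the sample rate, bit depth and channel count listed in the guidelines above, and compares it with the approximate target bitrates for playback. The target figures are copied from the guidelines; the function name and the resulting ratios are only an illustration.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Uncompressed bitrate in kbps: samples/s x bits/sample x channels."""
    return sample_rate_hz * bit_depth * channels / 1000


# (sample rate in Hz, bit depth, playback channels, approx. target kbps)
PROFILES = {
    "High quality": (44100, 16, 2, 100),
    "Medium quality": (16000, 16, 1, 16),
    "Low quality": (8000, 16, 1, 11),
}

for name, (rate, depth, channels, target) in PROFILES.items():
    raw = pcm_bitrate_kbps(rate, depth, channels)
    print(f"{name}: raw {raw:.1f} kbps, target ~{target} kbps, "
          f"ratio ~{raw / target:.0f}:1")
```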

The target bitrates described in the section above can be modified via registry keys in the VDA (Windows 7 or 10 only), resulting in different levels of quality for audio playback (at the expense of more bandwidth).

When setting the Audio quality to “High Quality”, create this key in the VDA to modify the target bit-rate:

REG_DWORD MaxVorbisQuality

Value: 0 to 10 (default is 2)

When setting the Audio quality to “Medium Quality”, create this key in the VDA to modify the target bit-rate:

REG_DWORD MaxSpeexQuality

Value: 0 to 10 (default is 5)

Value                              0   1   2   3    4    5    6    7    8    9    10
Bitrate [kbps] for High Quality    64  80  96  112  128  160  192  224  256  320  500
Bitrate [kbps] for Medium Quality  4   6   8   10   13   16   20   24   28   34   42
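
For capacity planning, the table above can be turned into a simple lookup. The Python sketch below copies the bitrates from the table; the function names and the idea of multiplying by a session count are assumptions for illustration, and the result ignores protocol overhead and the variable-bitrate behavior of the codecs.

```python
# Target bitrates in kbps, indexed by registry value 0..10 (from the table above).
BITRATE_KBPS = {
    "MaxVorbisQuality": [64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 500],
    "MaxSpeexQuality": [4, 6, 8, 10, 13, 16, 20, 24, 28, 34, 42],
}


def target_bitrate_kbps(setting: str, value: int) -> int:
    """Look up the target bitrate for a registry setting and value."""
    if setting not in BITRATE_KBPS:
        raise KeyError(f"unknown setting: {setting}")
    if not 0 <= value <= 10:
        raise ValueError("value must be in the range 0..10")
    return BITRATE_KBPS[setting][value]


def playback_bandwidth_kbps(setting: str, value: int, sessions: int) -> int:
    """Rough aggregate playback bandwidth for a number of concurrent sessions."""
    return target_bitrate_kbps(setting, value) * sessions
```

For example, `playback_bandwidth_kbps("MaxSpeexQuality", 5, 100)` gives a first-order estimate for 100 concurrent Medium Quality sessions at the default setting.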

UDP Audio

(Note: UDP Audio is not available when using Citrix Gateway Service)

Audio over UDP provides excellent tolerance of network congestion and packet loss, and is preferred over TCP when available. XenDesktop 5.5 introduced the “Audio over UDP Real-time Transport” user policy setting. This capability is also available in XenApp 7.6 or higher.

When this policy is configured in Studio, the audio stream is pulled out of the ICA TCP connection and delivered out-of-band: Workspace app communicates directly with the VDA over ICA RTP/UDP.

For this policy to be enforced, the Audio Quality policy in Studio must be configured to use Medium Quality.


By default, the Audio virtual channel starts on a regular ICA TCP connection, and an ICA module then begins initializing the corresponding ICA RTP sessions. The VDA and Workspace app perform a handshake over UDP before they start sending data over RTP; the handshake confirms that data can flow over UDP in each direction. If the UDP handshake fails, the Audio virtual channel continues to use ICA TCP to send and receive data.

Windows Workspace app

Note that for UDP Audio to work, you must also configure GPOs on the client side. See CTX121613.

User-added image

The GPO when applied will eventually set these regkeys on the Client machine:

User-added image


If you also have unmanaged endpoints (BYOD), you will need to edit the default.ica file in Storefront, as described here.



EnableUDPThroughGateway=true [if you have a gateway, otherwise false]



These settings are not available currently when using Citrix Cloud with Workspace. You will need to push the regkeys in the screenshot above using other tools.

Linux Workspace app

See here for the online documentation.

1. In the ClientAudio section of module.ini, set EnableUDPAudio to True. By default, this is set to False, which disables UDP audio.

2. Specify the minimum and maximum port numbers for UDP audio traffic using UDPAudioPortLow and UDPAudioPortHigh respectively. By default, ports 16500 to 16509 are used.
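
The two steps above could be scripted along the following lines. This is a minimal Python sketch that works on the text of an INI file using the standard `configparser` module. It assumes that the ClientAudio section parses as ordinary `key=value` INI, which may not hold for every real module.ini, and note that `configparser` drops comments when writing. The sample text and the function name are invented for the example, so back up the real file and review the result before using anything like this.

```python
import configparser
import io

# Invented sample; a real module.ini contains many more sections and keys.
SAMPLE = """\
[ClientAudio]
EnableUDPAudio=False
UDPAudioPortLow=16500
UDPAudioPortHigh=16509
"""


def enable_udp_audio(ini_text: str, port_low: int = 16500,
                     port_high: int = 16509) -> str:
    """Return ini_text with UDP audio enabled and the port range set."""
    if port_low > port_high:
        raise ValueError("port_low must not exceed port_high")
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep the original key capitalization
    parser.read_string(ini_text)
    if not parser.has_section("ClientAudio"):
        parser.add_section("ClientAudio")
    section = parser["ClientAudio"]
    section["EnableUDPAudio"] = "True"
    section["UDPAudioPortLow"] = str(port_low)
    section["UDPAudioPortHigh"] = str(port_high)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


print(enable_udp_audio(SAMPLE))
```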

Note: these settings can also be deployed by modifying the ICA file template in Storefront as described here, in case you do not want to modify the .ini file.

Note: Linux Workspace app does not support DTLS encryption for RTP Audio, hence if a Gateway is in place the Audio virtual channel will fall back to TCP.

Resultant Set of Policies

When you set the Audio quality on the Client, the Server (or default.ica) cannot upgrade it.

The Server (or default.ica) can, however, downgrade the resulting quality.


If UDP audio is enabled but the resultant quality is not medium, audio transmission will use TCP not UDP.
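
One way to read the rules in this section is that the resultant quality is the lower of the client and server settings, and that RTP over UDP is used only when that result is Medium. The Python sketch below encodes that reading. It is an interpretation of the text, not documented product logic, and the enum, function names and the `handshake_ok` parameter are assumptions for illustration.

```python
from enum import IntEnum


class AudioQuality(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def resultant_quality(client: AudioQuality, server: AudioQuality) -> AudioQuality:
    """The server cannot upgrade the client setting, only downgrade it."""
    return min(client, server)


def audio_transport(client: AudioQuality, server: AudioQuality,
                    udp_enabled: bool, handshake_ok: bool = True) -> str:
    """Return "RTP/UDP" only when every condition for UDP audio is met."""
    medium = resultant_quality(client, server) == AudioQuality.MEDIUM
    if udp_enabled and medium and handshake_ok:
        return "RTP/UDP"
    return "TCP"
```

For instance, a client set to High with a server policy of Medium yields Medium and can use UDP, whereas High on both sides falls back to TCP even if UDP audio is enabled.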

HDX Monitor

You can run HDX Monitor in the VDA (Windows 7 or 10 VDAs only) to confirm whether Audio over RTP is in use. Values should be set to True:

User-added image

Note: Windows Server VDAs do not expose RTP Audio info via WMI to HDX Monitor.

EDT and Audio

Audio over RTP is still preferred over EDT, since EDT is a reliable protocol (despite using UDP as the underlying transport).

EDT guarantees packet delivery thanks to a custom layer of congestion and flow control, which is less optimal for VoIP.

When EDT is in use, and Audio over RTP is also configured, the Audio virtual channel is de-multiplexed from the rest of the ICA virtual channels and delivered over RTP out-of-band from the EDT transport.

EDT is still preferred over TCP for HDX Audio.
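
The preference order described in this section (RTP first, then EDT, then TCP) can be summarized in a few lines of Python. This is only a restatement of the text as code; the function and its boolean inputs are hypothetical and do not correspond to any Citrix API.

```python
def preferred_audio_transport(rtp_available: bool, edt_available: bool) -> str:
    """Pick the audio transport in the order of preference described above."""
    candidates = [
        ("RTP", rtp_available),   # unreliable and out-of-band, best for VoIP
        ("EDT", edt_available),   # reliable, but still preferred over TCP
        ("TCP", True),            # always available as the last resort
    ]
    return next(name for name, available in candidates if available)
```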

UDP Audio and ICA Encryption

When no Citrix Gateway is in place, if Workspace app detects ICA encryption other than “Basic” or “RC5 (128) bit Logon Only” in use, it will not respond to the first handshake message sent by the VDA. No UDP packets will be sent over the network. The VDA will eventually time out and will not enable RTP audio.

If a Citrix Gateway is in place, only the leg Workspace app-to-front-end virtual server can be encrypted with DTLS. The leg back-end SNIP-to-VDA will be restricted with the same limitation described above (Basic or RC5 Logon Only).

For more information on Secure ICA, see here.

UDP Ports

By default this feature uses UDP port range 16500-16509 and picks up the first available port pair. During VDA installation, it opens up the UDP port range on the VDA side. RTP is used in conjunction with RTCP.

In a Windows 10 VDA, RTP will use an even port (e.g. 16500) and RTCP will use the odd port (16501).
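
The "first available port pair" behavior can be sketched as follows. This Python example assumes that a pair is an even RTP port followed by the next odd port for RTCP, as described above for a Windows 10 VDA, and it models port availability with a plain set. How the VDA actually probes for free ports is not covered here, so the function is purely illustrative.

```python
from typing import Iterable, Optional, Tuple


def first_free_port_pair(in_use: Iterable[int], low: int = 16500,
                         high: int = 16509) -> Optional[Tuple[int, int]]:
    """Return the first (RTP, RTCP) pair of free ports in [low, high], or None."""
    busy = set(in_use)
    start = low if low % 2 == 0 else low + 1  # RTP uses an even port
    for rtp in range(start, high, 2):
        rtcp = rtp + 1  # RTCP uses the following odd port
        if rtp not in busy and rtcp not in busy:
            return rtp, rtcp
    return None
```

With the default range this allows at most five pairs, which is worth keeping in mind if a different range is configured by policy.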

User-added image

In a Windows Server VDA, Audio is handled by the Generic Virtual Channel service CtxSvcHost.exe [-g AudioSvcs] using a single multiplexed UDP port:

User-added image

On the client side, installing Citrix Workspace app for Windows does not explicitly open any UDP ports. During connection setup, Citrix Workspace app uses UDP hole punching to open the UDP port automatically. Corporate firewalls must also open the necessary port range for Audio-over-UDP to work.

Important: When Citrix Gateway is not in the path, audio data transmitted with UDP is not encrypted. If Citrix Gateway is configured to access Citrix Virtual Apps and Desktops resources, then audio traffic between the endpoint device and Citrix Gateway is secured using the DTLS protocol. See section 3.5 below.

Firewall Considerations

By default, Windows Firewall blocks inbound UDP traffic. During server/VDA installation, the default UDP port range is opened for both inbound and outbound traffic for the specific processes on the VDA. If the administrator specifies a different range for the server using a machine-level policy, that range must be opened explicitly by an administrator.


External corporate firewalls need to be explicitly opened up to handle UDP/RTP traffic.

Since UDP is a stateless protocol, there is no concept of connection tear down, and most firewalls have default timeouts for UDP traffic (~30 sec – 2 minutes). This could result in UDP port closing if no traffic is detected. Make sure your firewall has a long enough timeout value for traffic destined to the Gateway (or VDA).

Also note that some firewalls might have two UDP timeout values, udp-timeout and udp-stream-timeout (stream means the connection tracking mechanism has detected packets in both directions). Please refer to your vendor’s documentation.

On most client endpoint devices, there is normally no need to open the UDP ports explicitly. During the UDP handshake, when the client sends the first UDP packet, the firewall creates a rule for (source IP, source port, destination IP, destination port) and then allows UDP packets coming back from that destination IP and port, effectively opening the UDP port dynamically.

Some thin client devices running Windows 7 Embedded have been shown to close their UDP “hole punched” firewall port very quickly, on the order of 1-2 minutes. Because only continuously playing audio generates the UDP packets that keep the port open, RTP audio is blocked once audio has been idle for that long. If this behavior is observed, mitigate it by opening the firewall ports on the client device.
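
The interaction between hole punching and idle timeouts can be modeled with a toy stateful firewall. The Python sketch below is a simplification written for this article, not a description of any particular firewall product. It assumes that an outbound packet creates a flow entry, that return traffic is allowed only while the entry is fresh, and that any packet on the flow refreshes the timer. The class name, addresses and timings are invented.

```python
from typing import Dict, Tuple

# (source IP, source port, destination IP, destination port)
Flow = Tuple[str, int, str, int]


class StatefulUdpFirewall:
    """Toy model: outbound UDP opens a return path that expires when idle."""

    def __init__(self, idle_timeout_s: float) -> None:
        self.idle_timeout_s = idle_timeout_s
        self._last_seen: Dict[Flow, float] = {}

    def outbound(self, flow: Flow, now: float) -> None:
        # The first outbound packet "punches the hole"; later ones refresh it.
        self._last_seen[flow] = now

    def inbound_allowed(self, src_ip: str, src_port: int,
                        dst_ip: str, dst_port: int, now: float) -> bool:
        # Return traffic matches the reversed outbound flow.
        flow = (dst_ip, dst_port, src_ip, src_port)
        last = self._last_seen.get(flow)
        if last is None or now - last > self.idle_timeout_s:
            self._last_seen.pop(flow, None)
            return False
        self._last_seen[flow] = now
        return True
```

With a 60-second timeout, a client that sent its last packet at t=0 still receives audio at t=30, but a packet arriving after more than 60 seconds of silence is dropped, which mirrors the thin client behavior described above.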

To enable UDP Audio, refer to the following links:

2.2.2 Generic USB Redirection

Citrix Generic USB Redirection technology (CTXGUSB virtual channel) provides a generic means of remoting USB devices, including composite devices (audio plus HID) and isochronous USB devices. This approach is generally limited to LAN-connected users because the USB protocol tends to be sensitive to network latency and requires considerable network bandwidth. Isochronous USB redirection has been found to work very well with some softphones, providing excellent voice quality and low latency, but it is generally preferred to use the Citrix Audio Virtual Channel which is optimized for audio traffic.

The primary exception is when using an audio device with HID buttons such as a USB telephone attached to the user device that is LAN-connected to the data center. In this case, Generic USB Redirection offers the advantage of supporting buttons on the phone set or headset that control features by sending a signal back to the softphone (not an issue with buttons that work locally on the device).

Generic USB is not enabled by default and requires configuration.

2.2.3 Composite USB Redirection (hybrid mode)

In Citrix Receiver for Windows 4.7 and earlier, HDX allowed generic redirection of USB devices, where all the interfaces of a device were redirected as a single device. This resulted in sub-optimal audio performance. Administrators could instead configure the Audio virtual channel, which gave good sound quality but lost the functionality of the HID buttons.

Starting with Citrix Receiver for Windows 4.8, we now allow splitting of composite USB device redirection. A composite USB device consists of multiple interfaces, each having its own functionality. Examples of composite USB devices include HID devices that consist of audio and buttons, like a Plantronics headset.
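
Conceptually, splitting a composite device means routing each interface to the channel that suits it. The Python sketch below models that idea only. It is not Citrix configuration syntax (the real rules are set through the Group Policy Object template and the registry), and the interface numbers, kinds and route names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class UsbInterface:
    number: int
    kind: str  # "audio" or "hid" in this simplified model


def split_composite(interfaces: List[UsbInterface]) -> Dict[str, List[int]]:
    """Send audio interfaces to the audio channel and the rest to generic USB."""
    routes: Dict[str, List[int]] = {"audio_virtual_channel": [], "generic_usb": []}
    for itf in interfaces:
        if itf.kind == "audio":
            routes["audio_virtual_channel"].append(itf.number)
        else:
            routes["generic_usb"].append(itf.number)
    return routes


# Hypothetical headset: two audio interfaces plus one HID interface for buttons.
headset = [UsbInterface(0, "audio"), UsbInterface(1, "audio"), UsbInterface(3, "hid")]
print(split_composite(headset))
```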


Composite USB redirection is available both in XenDesktop and XenApp sessions. In a desktop session, split devices are displayed in the desktop viewer. In an application session, split devices are displayed in the Connection Center.

You can configure composite USB redirection using the Group Policy Object administrative template and the registry.

See here for detailed info.

3. System Configuration Recommendations

3.1 Client Hardware and Software

For optimal audio quality, Citrix recommends the latest version of Citrix Workspace app and a good quality headset with built-in acoustic echo cancellation (AEC). Bi-directional audio VoIP is currently supported by the Citrix Workspace app versions for Windows, Linux, Mac, Android and Chrome. See here for more details.

In addition, Dell Wyse offers VoIP support for ThinOS (WTOS).

3.2 CPU Considerations

Monitor CPU utilization on the VDA to determine if it is necessary to assign (at least) two virtual CPUs to each virtual machine. Real-time voice and video are data intensive and configuring two virtual CPUs reduces the thread switching latency. Therefore, it is generally recommended to configure (at least) two vCPUs in a XenDesktop VDI environment.

Note: Having two virtual CPUs does not necessarily mean doubling the number of physical CPUs, because physical CPUs can be shared across sessions.

Configuring the Citrix Audio Service with CPU priority “High” can significantly reduce audio choppiness. The command line to change the CPU priority can be found here. Because the priority is reset after a reboot, run the command as a start-up script. Do not change the priority to “Realtime”.


Note: the HDX Audio service in Windows Server is CtxSvcHost.exe, but this service is leveraged by multiple virtual channels (Smart Card, Flash, BCR, Teams, NSAP and others). Each virtual channel will have its own PID. More info here.


Citrix Gateway Protocol (CGP), which is used for the Session Reliability feature, also increases CPU consumption. Improvements in XenDesktop 5.5 greatly reduced the CPU impact of Session Reliability / Citrix Gateway Protocol (CGP). Nevertheless, on high quality network connections, this feature could be disabled to further reduce CPU consumption on the VDA.

On a powerful server, neither of the preceding steps may be necessary.

Important: If you are using a Citrix Gateway to encapsulate Audio RTP with DTLS, CGP and Session Reliability must be turned on. Session Reliability is configured both as a Studio policy and in Storefront.

3.3 LAN/WAN Configuration

Proper configuration of the network is critical for good real-time audio quality. VLANs typically need to be configured because excessive broadcast packets can introduce jitter. IPv6-enabled devices may generate a lot of broadcast packets; IPv6 can be disabled on those devices if IPv6 support is not needed. Routers need to be configured to support QoS.

3.4 Settings for WAN Connections

Voice chat can be used over both LAN and WAN connections. On a WAN connection, audio quality depends on the latency, packet loss, and jitter on the connection. When delivering softphones to users on a Wide Area Network (WAN) connection, Citrix recommends using Citrix SD-WAN between the data center and the remote office to maintain a high Quality of Service (QoS). SD-WAN supports Multi-Stream ICA, including UDP. In the case of a single TCP stream, it is also possible to distinguish the priorities of the various ICA virtual channels so that high-priority real-time audio data gets preferential treatment. Citrix SD-WAN can notably improve MOS scores.

Use Director or HDX Monitor to validate your HDX configuration.

3.5 Remote User Connections

(Note: Audio over RTP is not available when using Citrix Gateway Service)

NetScaler Gateway 11 supports DTLS to deliver UDP/RTP traffic natively (without encapsulation in ICA TCP). Traffic is encrypted using Basic ICA encryption (on by default) between the VDA and the Citrix Gateway back-end virtual server, and is encrypted with DTLS between the front-end virtual server and Workspace app for Windows (the only supported platform). CGP/Session Reliability must be enabled on the VDA and in Storefront.

Firewalls must be opened bidirectionally for UDP traffic over port 443 from Workspace app to the Gateway’s front-end virtual server.

The Gateway back-end SNIP will communicate with the VDA using RTP/UDP ports 16500-16509.


Workspace app for Windows is the only platform capable of handling RTP Audio over DTLS.


    More info can be found here:

3.6 Codec Selection and Bandwidth Consumption

Between the user device and the Virtual Delivery Agent (VDA) in the data center, Citrix recommends using the Optimized-for-Speech codec setting, also known as Medium Quality audio.

Between the VDA platform and the IP-PBX, the softphone uses whatever codec is configured or negotiated. For example:

  • G.711 provides very good voice quality but has a bandwidth requirement of 80 to 100 kilobits per second per call (depending on network Layer 2 overhead).
  • G.729 provides good voice quality and has a low bandwidth requirement of 30 to 40 kilobits per second per call (depending on network Layer 2 overhead).
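
The dependence on Layer 2 overhead can be seen with a short calculation. The Python sketch below assumes 20 ms packetization, 40 bytes of IPv4, UDP and RTP headers per packet, and 18 bytes of Ethernet framing; these are common textbook assumptions rather than values taken from this article. It uses the nominal codec rates of 64 kbps for G.711 and 8 kbps for G.729, and it estimates one direction of a call.

```python
def call_bandwidth_kbps(codec_kbps: float, packet_ms: float = 20.0,
                        l3_overhead_bytes: int = 40,
                        l2_overhead_bytes: int = 18) -> float:
    """Estimate on-the-wire bandwidth for one direction of a VoIP call."""
    packets_per_second = 1000.0 / packet_ms
    payload_bytes = codec_kbps * 1000 / 8 / packets_per_second
    total_bytes = payload_bytes + l3_overhead_bytes + l2_overhead_bytes
    return total_bytes * 8 * packets_per_second / 1000


for name, rate in (("G.711", 64), ("G.729", 8)):
    print(f"{name}: ~{call_bandwidth_kbps(rate):.1f} kbps per direction")
```

Because the header overhead is fixed per packet, it dominates for low-bitrate codecs: most of the G.729 figure is headers rather than voice payload, which is why different Layer 2 technologies shift the totals noticeably.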
