Understanding Workspace Environment Management (WEM) System Optimization

The WEM System Optimization feature is a group of settings designed to dramatically lower resource usage on a VDA on which the WEM Agent is installed.

These are machine-based settings that will apply to all user sessions.




Managing Servers with Different Hardware Configurations

Sets of VMs may have been configured with different hardware configurations. For instance, some machines may have 4 CPU cores and 8 GB RAM, while others have 2 CPU cores and 4 GB RAM. You may determine that each server set requires a different set of WEM System Optimization settings. Because a machine can only belong to one WEM ConfigSet, administrators must consider whether to create multiple ConfigSets to accommodate the different optimization profiles.


WEM System Optimization Settings

User-added image


Fast Logoff

A purely visual option: it ends the HDX connection to the remote session, giving the user the impression that the session has closed immediately. However, the session itself continues to progress through the logoff phases on the VDA.


CPU Management

CPU Priority:

You can statically define the CPU priority for a process. Every instance of that process (Notepad, for example) launched on the VDA starts with the defined CPU priority. The choices are:

  • Idle
  • Below Normal
  • Normal
  • Above Normal
  • High
  • Realtime *

* https://stackoverflow.com/questions/1663993/what-is-the-realtime-process-priority-setting-for

CPU Affinity:

You can statically define how many CPU cores a process will use. Every instance of that process launched on the VDA is restricted to the defined number of cores.

Process Clamping:

Process clamping allows you to prevent a process from using more CPU percentage than the specified value. A process in the Process Clamping list can use CPU up to the configured percentage, but will not go higher. The setting limits the CPU percentage no matter which CPU cores the process uses.

Note: The clamping percentage is global, not per core (that is, 10% on a quad-core CPU is 10%, not 10% of one core).
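A minimal Python illustration of that arithmetic; the core count and clamp value are examples only, not WEM defaults:

logical_processors = 4        # quad-core example from the note above
clamp_percent_global = 10     # Process Clamping value configured in WEM

# The clamp is measured against total CPU capacity (all cores together = 100%),
# so the same allowance is proportionally larger when expressed against one core.
equivalent_of_one_core = clamp_percent_global * logical_processors
print(f"{clamp_percent_global}% global clamp is up to "
      f"{equivalent_of_one_core}% of a single core's capacity")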


Generally, Process Clamping is not a recommended solution for keeping the CPU usage of a troublesome process artificially low. It’s a brute-force approach and computationally expensive. The better approach is to combine CPU Spikes Protection with static Limit CPU / Core Usage, CPU Priority, and CPU Affinity values for such processes.

CPU Management Settings:

CPU Spikes Protection:

CPU Spikes Protection is not the same as Process Clamping. Process Clamping will prevent a process from exceeding a set CPU percentage usage value. Spikes Protection manages the process when it exceeds the CPU Usage Limit (%) value.

CPU Spikes Protection is not designed to reduce overall CPU usage. CPU Spikes Protection is designed to reduce the impact on user experience by processes that consume an excessive percentage of CPU Usage.

If a process exceeds the CPU Usage Limit (%) value for longer than the period defined by the Limit Sample Time (s) value, the process is relegated to Low priority for the period defined by the Idle Priority Time (s) value. The CPU Usage Limit (%) value is global across all logical processors.

The total number of logical processors is determined by the number of CPUs, the number of cores per CPU, and whether Hyper-Threading is enabled. The easiest way to determine the total number of logical processors in a machine is Windows Task Manager (2 logical processors shown in the image):

User-added image
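If you prefer to check programmatically rather than through Task Manager, the count can be read with a minimal Python sketch using only the standard library:

import os

# Logical processors = CPUs x cores per CPU x Hyper-Threading threads per core.
print("Logical processors:", os.cpu_count())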

To better understand CPU Spikes Protection, let’s follow a practical scenario:

Users commonly work with a web app that uses Internet Explorer. An administrator has noticed that iexplore.exe processes on the VDAs consume a lot of CPU time and that overall responsiveness in user sessions is suffering. Many other user processes are running, and overall CPU usage is in the 90 percent range.

To improve responsiveness, the administrator sets the CPU Usage Limit value to 50% and an Idle Priority Time of 180 seconds. For any given user session, when a single iexplore.exe process instance reaches 50% CPU usage, its CPU priority is immediately lowered to Low for 180 seconds. During this time iexplore.exe gets less CPU time because of its low position in the CPU queue, reducing its impact on overall session responsiveness. Other user processes that haven’t reached 50% retain a higher CPU priority and continue to consume CPU time; although overall CPU usage continues to show above 90%, session responsiveness for that user is greatly improved.

In this scenario, the machine has 4 logical processors. If the processes’ CPU usage is spread equally across all logical processors, each will show 12.5% usage for that process instance.

If there are two iexplore.exe process instances in a session, their respective CPU usage percentages are not added together to trigger Spikes Protection. Spikes Protection settings apply to each individual process instance.
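The trigger behaviour described above can be summarised in a short sketch. This is an illustration of the logic only, not WEM’s implementation: sample_usage() and set_priority() are hypothetical stand-ins, and the sample time is an assumed value while the other two values mirror the scenario above.

import time

CPU_USAGE_LIMIT = 50       # CPU Usage Limit (%), measured across all logical processors
LIMIT_SAMPLE_TIME = 30     # Limit Sample Time (s) - assumed value for illustration
IDLE_PRIORITY_TIME = 180   # Idle Priority Time (s)

def sample_usage(pid):
    """Hypothetical stand-in: total CPU % for one process instance."""
    return 0.0

def set_priority(pid, priority):
    """Hypothetical stand-in: change the process's CPU priority class."""
    print(f"pid {pid}: priority -> {priority}")

def spikes_protection(pid):
    over_limit_since = None
    while True:
        if sample_usage(pid) >= CPU_USAGE_LIMIT:
            over_limit_since = over_limit_since or time.time()
            # Sustained breach: relegate to Low for the Idle Priority Time, then restore.
            if time.time() - over_limit_since >= LIMIT_SAMPLE_TIME:
                set_priority(pid, "Low")
                time.sleep(IDLE_PRIORITY_TIME)
                set_priority(pid, "previous priority")
                over_limit_since = None
        else:
            over_limit_since = None
        time.sleep(1)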

User-centric CPU Optimization (process tracking on the WEM Agent):

As stated previously, all WEM System Optimization settings are machine-based and settings configured for a particular ConfigSet will apply to all users launching sessions from the VDA.

The WEM Agent records the history of every process on the machine that has triggered Spikes Protection. It records the number of times that the process has triggered Spikes Protection, and it records the user for which the trigger occurred.

So if a process triggers CPU Spikes Protection in User A’s session, the event is recorded for User A only. If User B starts the same process, then WEM Process Optimization behavior is determined only by process triggers in User B’s session. On each VDA, the Spikes Protection triggers for each user (recorded by user SID) are stored in the agent’s local database, and refreshing the cache does not affect this stored history.

Limit CPU / Core Usage:

When a process has exceeded the CPU Usage Limit value (that is, Spikes Protection for the process has been triggered), in addition to setting the CPU priority to Low, WEM can also limit the number of CPU cores that the process uses if a CPU / Core Usage Limit value is set. The limit is in effect for the duration of the Idle Priority Time.

Enable Intelligent CPU Optimization:

When Enable Intelligent CPU Optimization is enabled, all processes that the user launches in their session start at a CPU priority of High. This makes sense, as the user has purposefully launched the process, so we want it to be responsive.

If a process triggers Spikes Protection, it is relegated to Low priority for 180 seconds (if the default setting is used). But if it triggers Spikes Protection a certain number of times, the process will run at the next lower CPU priority the next time it is launched.

So a process initially launches at High priority; once it exceeds a certain number of triggers, it launches at Above Normal priority the next time. If it continues to trigger Spikes Protection, it launches at the next lower priority each time, until eventually it launches at the lowest CPU priority.

The behavior of Enable Intelligent CPU Optimization is overridden if a static CPU Priority value has been set for a process. If Enable Intelligent CPU Optimization is enabled and a process’s CPU Priority value has been set to Below Normal, then the process will launch at Below Normal CPU priority instead of the default High priority.

If Enable Intelligent CPU Optimization is enabled and a process’s CPU Priority value has been statically set to High, then the process will launch at High. If the process triggers Spikes Protection, it will be relegated to Low priority for 180 seconds (if the default setting is used), but will then return to High priority afterwards.

Note: The Enable CPU Spikes Protection box must be ticked for Enable Intelligent CPU Optimization to work.
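The launch-priority ladder can be pictured as stepping down a list each time a process accumulates enough Spikes Protection triggers. A minimal sketch of that idea; TRIGGERS_PER_DEMOTION is an assumed threshold for illustration, not a documented WEM value:

PRIORITY_LADDER = ["High", "Above Normal", "Normal", "Below Normal", "Idle"]
TRIGGERS_PER_DEMOTION = 5   # assumed threshold, for illustration only

def launch_priority(trigger_count, static_priority=None):
    """CPU priority a new instance of the process would start with."""
    if static_priority is not None:
        # A statically configured CPU Priority overrides Intelligent CPU Optimization.
        return static_priority
    steps = min(trigger_count // TRIGGERS_PER_DEMOTION, len(PRIORITY_LADDER) - 1)
    return PRIORITY_LADDER[steps]

print(launch_priority(0))                    # High
print(launch_priority(7))                    # Above Normal
print(launch_priority(40))                   # Idle (lowest)
print(launch_priority(40, "Below Normal"))   # static setting wins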


Memory Management

Working Set Optimization:

WEM determines how much RAM a running process is currently using and also determines the least amount of RAM the process requires, without losing stability. The difference between the two values is considered by WEM to be excess RAM. The process’s RAM usage is calculated over time, the duration of which is configured using the Idle Sample Time (min) WEM setting. The default value is 120 minutes.

Let’s look at a typical scenario when WEM Memory Management has been enabled:

A user opens Internet Explorer, navigates to YouTube, and plays some videos. Internet Explorer will use as much RAM as it needs. In the background, and over the sampling period, WEM determines the amount of RAM Internet Explorer has used and also determines the least amount of RAM required, without losing stability.

Then the user finishes with Internet Explorer and minimizes it to the taskbar. When the process’s CPU usage drops below the value set by the Idle State Limit (percent) setting (default is 1%), WEM forces the process to release the excess RAM (as previously calculated). The RAM is released by writing it to the pagefile.

When the user restores Internet Explorer from the Task Bar, it will initially run in its optimized state but can still go on to consume additional RAM as needed.

When considering how this affects multiple processes over multiple user sessions, the freed RAM becomes available to other processes, increasing user density by supporting a greater number of users on the same server.

Idle State Limit (percent):

The value set here is the percentage of CPU usage under which a process is considered to be idle. The default is 1% CPU usage. Remember that when a process is considered idle, WEM forces it to shed its excess RAM. So be careful not to set this value too high; otherwise a process that is being actively used may be mistaken for an idle process, resulting in its memory being released. It is not advised to set this value higher than 5%.
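A rough sketch of the idle check, assuming the third-party psutil package is available; trim_working_set() is a hypothetical stand-in, since WEM performs the actual working set release internally:

import psutil

IDLE_STATE_LIMIT = 1.0   # percent CPU below which a process is treated as idle

def trim_working_set(proc):
    """Hypothetical stand-in for WEM forcing the process to shed excess RAM."""
    print(f"Would trim working set of {proc.info['name']} (pid {proc.pid})")

for proc in psutil.process_iter(["name"]):
    try:
        # Short sampling interval here; WEM samples over the Idle Sample Time.
        if proc.cpu_percent(interval=0.5) < IDLE_STATE_LIMIT:
            trim_working_set(proc)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue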


I/O Management

These settings allow you to optimize the I/O priority of specific processes, so that processes which are contending for disk and network I/O access do not cause performance bottlenecks. For example, you can use I/O Management settings to throttle back a disk-bandwidth-hungry application.

The process priority you set here establishes the “base priority” for all of the threads in the process. The actual, or “current,” priority of a thread may be higher (but is never lower than the base). In general, Windows gives access to threads of higher priority before threads of lower priority.

I/O Priority Settings:

Enable Process I/O Priority

When selected, this option enables manual setting of process I/O priority. Process I/O priorities you set take effect when the agent receives the new settings and the process is next restarted.

Add Process I/O Priority

Process Name: The process executable name without the extension. For example, for Windows Explorer (explorer.exe) type “explorer”.

I/O Priority: The “base” priority of all threads in the process. The higher the I/O priority of a process, the sooner its threads get I/O access. Choose from High, Normal, Low, Very Low.

Enable Intelligent I/O Optimization:

This adopts exactly the same principles as Enable Intelligent CPU Optimization, but for I/O instead of CPU.

Note: The Enable CPU Spikes Protection box must be ticked for Enable Intelligent I/O Optimization to work.


Exclude specified processes:

By default, WEM CPU Management excludes all of the most common Citrix and Windows core service processes. This is because they keep the environment running and need to make their own decisions about how much CPU time and priority they need. WEM administrators can, however, add processes they want to exclude from Spikes Protection to the list. Typically, antivirus processes would be excluded. In that case, to stop antivirus scanning from monopolizing disk I/O in the session, administrators would also set a static I/O Priority of Low for the antivirus processes.


Notes:

  1. When configuring, enter the process name exactly as it appears in Windows Task Manager.
  2. Process names are not case-sensitive.
  3. You don’t enter “.exe” after the process name. So for instance, enter “notepad” rather than “notepad.exe”.
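A small sketch of those matching rules (case-insensitive, no extension); normalize_process_name() is just a hypothetical helper for illustration:

def normalize_process_name(name):
    """Apply the conventions above: case-insensitive match, no '.exe' extension."""
    name = name.strip().lower()
    if name.endswith(".exe"):
        name = name[:-len(".exe")]
    return name

excluded = {normalize_process_name(n) for n in ("Notepad", "explorer.exe")}
print("notepad.exe excluded?", normalize_process_name("NOTEPAD.EXE") in excluded)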

Related:

SEP 14.x uses over 30% CPU


We noticed the Symantec Service Framework uses more than 30% CPU. Is this normal?

Any reason it uses a lot of resources even though no scanning is in progress?

Hope the community can advise.


Related:

How To Troubleshoot High Packet or Management CPU Issue on Citrix ADC

CPU is a finite resource. Like many resources, there are limits to a CPU’s capacity. The NetScaler appliance has two kinds of CPUs in general: the Management CPU and the Packet CPU.

The Management CPU is responsible for processing all management traffic on the appliance, and the Packet CPU(s) are responsible for handling all data traffic (for example TCP, SSL, and so on).

When diagnosing a complaint involving high CPU, start by gathering the following fundamental facts:

  1. CPUs impacted: nsppe (one or all) & management.
  2. Approximate time stamp/duration.

The following command outputs are essential for troubleshooting high CPU issues:

  • Output of top command: Gives the CPU utilization percentage by the processes running on the NetScaler.
  • Output of stat system memory command: Gives the memory utilization percentage, which can also contribute to CPU utilization.
  • Output of stat system cpu command: This gives the stats about the current CPU utilization in total on the appliance.

Sample output of the stat cpu command:

> stat cpu
CPU statistics
ID      Usage
1          29

The above output indicates that there is only 1 CPU (used for both management and data traffic) and that its utilization is 29%.

The CPU ID is 1.

There are also appliances with multiple cores (nCore), where more than a single core is allocated to the appliance; in that case, multiple CPU IDs appear in the “stat system cpu” output.

*The high CPU seen when running the “top” command does not impact the performance of the appliance, nor does it mean that the NetScaler is running at high CPU or consuming all of the CPU. The NetScaler kernel runs on top of BSD, and that is what is being observed; although it appears to be using the full amount of the CPU, it actually is not.

Follow the steps below to further understand the CPU usage:

  1. Check the following counters to understand CPU usage.

    CLASSIC:

    master_cpu_use

    cc_appcpu_use filter=cpu(0)

    (If AppFW or CMP is configured, then looking at slave_cpu_use also makes sense for classic)

    nCORE:

    (For an 8 Core system)

    mgmt_cpu_use (CPU0 – nscollect runs here)

    master_cpu_use (average of cpu(1) thru cpu(7))

    cc_cpu_use filter=cpu(1)

    cc_cpu_use filter=cpu(2)

    cc_cpu_use filter=cpu(3)

    cc_cpu_use filter=cpu(4)

    cc_cpu_use filter=cpu(5)

    cc_cpu_use filter=cpu(6)

    cc_cpu_use filter=cpu(7)

  2. How to look for CPU use for a particular CPU?

    Use the nsconmsg command and search for cc_cpu_use and grep for the CPU you are interested in.

    The output will look like the following:

    Index rtime totalcount-val delta rate/sec symbol-name&device-no
    320 0 209 15 2 cc_cpu_use cpu(8)
    364 0 205 -6 0 cc_cpu_use cpu(8)
    375 0 222 17 2 cc_cpu_use cpu(8)
    386 0 212 -10 -1 cc_cpu_use cpu(8)
    430 0 216 6 0 cc_cpu_use cpu(8)
    440 0 201 -15 -2 cc_cpu_use cpu(8)
    450 0 208 7 1 cc_cpu_use cpu(8)
    461 0 202 -6 0 cc_cpu_use cpu(8)
    471 0 209 7 1 cc_cpu_use cpu(8)
    482 0 238 29 4 cc_cpu_use cpu(8)
    492 0 257 19 2 cc_cpu_use cpu(8)
  • Look at the totalcount (third) column and divide by 10 to get the CPU percentage. For example, in the last line above, 257 implies that 257/10 = 25.7% CPU is used by cpu(8). (A small parsing sketch follows this list.)

    Run the following commands to investigate the nsconmsg counters for CPU issues:

    nsconmsg -K newnslog -g cpu_use -s totalcount=600 -d current
    nsconmsg -K newnslog -d current | grep cc_cpu_use
  • Look at the traffic, memory, and CPU in conjunction. You may be hitting platform limits if there is sustained high CPU usage. Try to determine whether the CPU has gone up because of traffic and, if so, whether it is genuine traffic or some form of attack.
  • We can further check the profiler output to understand what is consuming the CPU.

    For details on the profiler output and logs, refer to the following article:

    https://support.citrix.com/article/CTX212480

  • We can further use the CPU counters mentioned in the following article for more details:

    https://support.citrix.com/article/CTX133887
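The divide-by-ten arithmetic above can also be scripted. A minimal Python sketch that reads nsconmsg output from standard input and prints the percentage per CPU; pipe in output such as "nsconmsg -K newnslog -d current | grep cc_cpu_use":

import sys

# Expects lines such as:  492 0 257 19 2 cc_cpu_use cpu(8)
for line in sys.stdin:
    fields = line.split()
    if "cc_cpu_use" in fields:
        totalcount = int(fields[2])   # third column: totalcount-val
        device = fields[-1]           # e.g. cpu(8)
        print(f"{device}: {totalcount / 10:.1f}%")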


Profiling FAQs

1. What is Constant profiling?

This refers to running the CPU profiler at all times, as soon as the NetScaler device comes up. At boot time, the profiler is invoked and keeps running. Any time the CPU associated with any packet engine (PE) exceeds 90% utilization, the profiler captures data into a set of files.

2. Why is this needed?

This was necessitated by issues seen at some customer sites and in internal tests. With customer issues, it is hard to go back and ask the customer to run the profiler when the issue recurs. Hence the need for an always-running profiler that can show the functions triggering high CPU. With this feature, the profiler is always running and data is captured when high CPU usage occurs.

3. Which releases/builds contain this feature?

TOT (Crete) 44.2+

9.3 – all builds

9.2 52.x +

This feature applies to nCore builds only.

4. How do we know the profiler is already running?

Run the ps command to check if nsproflog and nsprofmon are running. The number of nsprofmon processes should be the same as the number of PEs running.

root@nc1# ps -ax | grep nspro
36683 p0 S+ 0:00.00 grep nspro
79468 p2- I 0:00.01 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79496 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79498 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79499 p2- I 0:00.00 /bin/sh /netscaler/nsproflog.sh cpuuse=800 start
79502 p2- S 33:46.15 /netscaler/nsprofmon -s cpu=3 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v
79503 p2- S 33:48.03 /netscaler/nsprofmon -s cpu=2 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v
79504 p2- S 32:20.63 /netscaler/nsprofmon -s cpu=1 -ys cpuuse=800 -ys profmode=cpuuse -O -k /v

5. Where is the profiler data?

The profiled data is collected in the /var/nsproflog directory. Here is a sample listing of the files in that folder. At any point in time, the currently active files are newproflog_cpu_<penum>.out. Once the data in these files exceeds 10 MB in size, they are archived into a tar file and compressed. The rollover mechanism is similar to that used for newnslog files.

newproflog.0.tar.gz    newproflog.5.tar.gz    newproflog.old.tar.gz
newproflog.1.tar.gz    newproflog.6.tar.gz    newproflog_cpu_0.out
newproflog.2.tar.gz    newproflog.7.tar.gz    nsproflog.nextfile
newproflog.3.tar.gz    newproflog.8.tar.gz    nsproflog_options
newproflog.4.tar.gz    newproflog.9.tar.gz    ppe_cores.txt

The current data is always captured in newproflog_cpu_<ppe number>.out. Once the profiler is stopped, the newproflog_cpu_* files will be archived into newproflog.(value in nsproflog.nextfile-1).tar.gz.
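For scripting, the name of the most recent archive can be derived from that counter. A minimal sketch, assuming nsproflog.nextfile holds a plain integer:

# Derive the most recent archive name from the nsproflog.nextfile counter.
with open("/var/nsproflog/nsproflog.nextfile") as f:
    next_index = int(f.read().strip())

latest_archive = f"/var/nsproflog/newproflog.{next_index - 1}.tar.gz"
print("Most recent profiler archive:", latest_archive)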

6. What is nsprofmon and what’s nsproflog.sh?

nsprofmon is the binary that interacts with the PE, retrieves the profiler records, and writes them into files. It has a myriad of options that are hard to remember. The wrapper script nsproflog.sh is easier to use and remember. Going forward, it is recommended to use the wrapper script if the task is limited to collecting CPU usage data.

7. Should I use nsprofmon or nsproflog.sh?

In earlier releases (9.0 and earlier), nsprofmon was heavily used internally and by the support groups. Some internal scripts used by devtest refer to nsprofmon. It is recommended to use nsproflog.sh if the task is limited to collecting CPU usage data.

8. Will the existing scripts be affected?

Existing scripts are affected only if they try to invoke the profiler. See the next question.

9. What if I want to start the profiler with a different set of parameters?

There can be only one instance of the profiler running at any time. If the profiler is already running (invoked at boot time by constant profiling) and is invoked again, it flags an error and exits.

root@nc1# nsproflog.sh cpuuse=900 start
nCore Profiling
Another instance of profiler is already running.
If you want to run the profiler at a different CPU threshold, please stop the current profiler using
# nsproflog.sh stop
... and invoke again with the intended CPU threshold. Please see nsproflog.sh -h for the exact usage.


Similarly, nsprofmon is also modified to check if another instance is running. If it is, it exits flagging an error.

If the profiler needs to be run again with a different CPU usage threshold (for example, 80%), the running instance needs to be stopped and the profiler invoked again:

root@nc1# nsproflog.sh stop
nCore Profiling
Stopping all profiler processes
Removing buffer for -s cpu=1
Removing profile buffer on cpu 1 ... Done.
Saved profiler capture data in newproflog.5.tar.gz
Setting minimum lost CPU time for NETIO to 0 microsecond ... Done.
Stopping mgmt profiler process
root@nc1# nsproflog.sh cpuuse=800 start

10. How do I view the profiler data?

In /var/nsproflog, unzip and untar the desired tar archive. Each file in the archive corresponds to one PE.

Caution: When you unzip and untar the older files in place, the files from the archive will overwrite the current ones, because the names stored inside the tar archive are the same as the files the currently running profiler keeps writing to. To avoid this, unzip and untar into a temporary directory.
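One way to honour that caution is to script the extraction into a throw-away directory. A minimal Python sketch; the archive name is only an example taken from the listing above:

import tarfile
import tempfile

archive = "/var/nsproflog/newproflog.5.tar.gz"   # example archive from the listing above

# Extract into a disposable directory so the live capture files are not overwritten.
workdir = tempfile.mkdtemp(prefix="nsproflog_")
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(path=workdir)
print("Extracted", archive, "to", workdir)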

The simplest way to see the profiled data is

# nsproflog.sh kernel=/netscaler/nsppe display=newproflog_cpu_<ppe number>.out

11. How do we collect this data for analysis?

The showtech script has been modified to collect the profiler data. When customer issues arrive, /var/nsproflog can be checked to see if the profiler has captured any data.

12. Anything else that I need to know?

Collecting traces and collecting profiler data are mutually exclusive. When nstrace.sh is run to collect traces, the profiler is automatically stopped and is restarted when nstrace.sh exits. Profiler data is therefore not available for the period during which traces are being collected.

13. What commands get executed when profiler is started?

Initialization:

For each CPU, the following commands are executed initially:

nsapimgr -c
nsapimgr -ys cpuuse=900
nsprofmon -s cpu=<cpuid> -ys profbuf=128 -ys profmode=cpuuse

Capturing:

For each CPU, the following are executed:

nsapimgr -c
nsprofmon -s cpu=<cpuid> -ys cpuuse=900 -ys profmode=cpuuse -O -k /var/nsproflog/newproflog_cpu_<cpuid>.out -s logsize=10485760 -ye capture

After the above, the nsprofmon processes keep running until any one of the capture buffers is full.

nsproflog.sh waits for any of the above child processes to exit.

Stopping:

Kill all nsprofmon processes (killall -9 nsprofmon)

For each CPU, the following commands are executed:

nsprofmon -s cpu=<cpuid> -yS profbuf

Profiler capture files are archived:

nsapimgr -ys lctnetio=0

Related:


NetScaler CPU Profiling

1. Profiler Scripts

nsproflog.sh – This script is used to start/stop NetScaler Profiler.

2. Profiler directory path

/var/nsproflog – All profiler-related captured data and scripts reside in this directory.

3. Constant Profiling

On the NetScaler, the profiler is invoked at boot time and keeps running. Any time the CPU associated with any packet engine (PE) exceeds 90% utilization, the profiler captures data into a set of files named newproflog_cpu_<cpu-id>.out.

4. Help Usage

root@ns# /netscaler/nsproflog.sh -h
nCore Profiling
nsproflog - utility to start/stop NetScaler profiler to capture data and to display the profiled data
usage: nsproflog.sh [-h] [cpu=<cpu-id>] [cpuuse=<cpu_utilization_in_percentage*10> | lctnetio=<time_in_microseconds> | lctidle=<time_in_microseconds> | lctbsd=<time_in_microseconds> | lcttimer=<time_in_microseconds> | lcttimerexec=<time_in_microseconds> | lctoutnetio=<time_in_microseconds> | time=<time_in_seconds>] [loop=<count>] [hitperc=<value_in_percentage>] [display=<capture_file_path>] [kernel=<nsppe_file_path>] [start | stop]
 -h - print this message - exclusive option

Options used for starting the profiler:

  • start – start the capture
  • cpu – cpu-id on which profiler needs to capture data, default: on all cpus
  • cpuuse – threshold value (in CPU percentage * 10); when CPU utilization exceeds this value, the profiler starts capturing data in newproflog_cpu_<cpu-id>.out
  • lct* – helps find Lost CPU Time (in microseconds), when CPU cycles are spent for longer durations in functions other than packet processing
  • time – time (in seconds) to capture the profiler data before restarting a new capture
  • loop – number of iterations of the profiler captures
  • LCT has the following options:

  • lctidle – Amount of time spent in idle function
  • lctnetio – Amount of time spent in netio
  • lcttimer – Amount of time the HA timer is not called
  • lcttimerexec – Amount of time spent executing NetScaler timeout functions, e.g. pe_dotimeout
  • lctbsd – Amount of time spent in freebsd
  • lctoutnetio – Amount of time spent since netio is called again


Options used for displaying the profiler data:

  • hitperc – hit percentage threshold for displaying functions with Hitratio (Number of hits for the function in percentage) above the threshold, default: 1%
  • display – display profiled data captured for specific cpu-id from capture file e.g newproflog_cpu_<cpu-id>.out
  • kernelhits – display hits symbols for kernel profile data captured for specific cpu-id from capture file e.g newproflog_cpu_<cpu-id>.out
  • ppehits – display hits symbols for PPE profile data captured for specific cpu-id from capture file e.g newproflog_cpu_<cpu-id>.out
  • aggrhits – display aggregated hits symbol for combined kernel and PPE data captured for specific cpu-id from capture file e.g newproflog_cpu_<cpu-id>.out
  • kernel – NetScaler nsppe file path, default: /netscaler/nsppe


Options used for stopping the profiler:

  • cpu – cpu-id on which profiler needs to be stopped, default: on all cpus
  • stop – stop the capture and generate a .tar.gz file for the captured outputs

Examples:

To start the profiler with a threshold above 70% CPU utilization to capture data on all the CPUs:

nsproflog.sh cpuuse=700 start

To start the profiler and capture when lost CPU time inside idle functions exceeds 2 milliseconds:

nsproflog.sh lctidle=2000 start

To stop the profiler and generate the .tar.gz of all captured data:

nsproflog.sh stop

To display captured data for all functions with Hitratio > 1%:

nsproflog.sh display=/var/nsproflog/newproflog.0/newproflog_cpu_1.out

To display captured data for all functions with Hitratio > 0%:

nsproflog.sh hitperc=0 display=/var/nsproflog/newproflog.0/newproflog_cpu_1.out kernel=/netscaler/nsppe

Note: If another instance of the profiler is already running, stop the current profiler before starting a new instance with a different CPU threshold.
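If the threshold changes often, the stop-then-start sequence can be wrapped in a small helper. A minimal Python sketch that simply shells out to nsproflog.sh; the script path and threshold value are assumptions to adapt:

import subprocess

NSPROFLOG = "/netscaler/nsproflog.sh"   # assumed location of the wrapper script

def restart_profiler(cpuuse):
    """Stop any running profiler, then start a new one with the given threshold."""
    # cpuuse is expressed as percentage * 10, e.g. 800 for 80%.
    subprocess.run([NSPROFLOG, "stop"], check=False)
    # The start command keeps running, so launch it without waiting for it to exit.
    return subprocess.Popen([NSPROFLOG, f"cpuuse={cpuuse}", "start"])

restart_profiler(800)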


5. To start the profiler with no CPU threshold to capture data on all the CPUs

root@ns# /netscaler/nsproflog.sh start &
[1] 65065
root@ns# nCore Profiling
Setting (512 KB) of profile buffer for cpu 3 ... Done.
Setting (512 KB) of profile buffer for cpu 2 ... Done.
Setting (512 KB) of profile buffer for cpu 1 ... Done.
Collecting profile data for cpu 3
Collecting profile data for cpu 2
Capturing profile data for 10 seconds...
Collecting profile data for cpu 1
Capturing profile data for 10 seconds...
Please wait for profiler to capture data
Capturing profile data for 10 seconds...
root@ns#
root@ns#
root@ns# Saved profiler capture data in newproflog.9.tar.gz
Collecting profile data for cpu 3
Collecting profile data for cpu 2
Capturing profile data for 10 seconds...
Collecting profile data for cpu 1
Capturing profile data for 10 seconds...
Please wait for profiler to capture data
Capturing profile data for 10 seconds...
root@ns# cd /var/nsproflog
root@ns# pwd
/var/nsproflog
root@ns# ls -ll
total 9356
-rw-r--r-- 1 root wheel 109423 Sep 24 22:37 newproflog.0.tar.gz
-rw-r--r-- 1 root wheel 156529 Sep 24 22:38 newproflog.1.tar.gz
-rw-r--r-- 1 root wheel 64410 Sep 24 22:38 newproflog.2.tar.gz
-rw-r--r-- 1 root wheel 111448 Sep 24 22:38 newproflog.3.tar.gz
-rw-r--r-- 1 root wheel 157538 Sep 24 22:38 newproflog.4.tar.gz
-rw-r--r-- 1 root wheel 65603 Sep 24 22:38 newproflog.5.tar.gz
-rw-r--r-- 1 root wheel 112944 Sep 24 22:38 newproflog.6.tar.gz
-rw-r--r-- 1 root wheel 158081 Sep 24 22:39 newproflog.7.tar.gz
-rw-r--r-- 1 root wheel 44169 Sep 24 22:39 newproflog.8.tar.gz
-rw-r--r-- 1 root wheel 48806 Sep 25 22:19 newproflog.9.tar.gz
-rw-r--r-- 1 root wheel 339 Sep 16 23:16 newproflog.old.tar.gz
-rw-r--r-- 1 root wheel 208896 Sep 25 22:19 newproflog_cpu_1.out
-rw-r--r-- 1 root wheel 208896 Sep 25 22:19 newproflog_cpu_2.out
-rw-r--r-- 1 root wheel 208896 Sep 25 22:19 newproflog_cpu_3.out
-rw-r--r-- 1 root wheel 6559889 Sep 18 21:43 newproflog_mgmtcpu
-rw-r--r-- 1 root wheel 202630 Sep 18 05:58 newproflog_mgmtcpu.0.gz
-rw-r--r-- 1 root wheel 3 Sep 25 22:19 nsproflog.nextfile
-rw-r--r-- 1 root wheel 309 Sep 25 22:19 nsproflog_args
-rw-r--r-- 1 root wheel 1 Sep 25 22:18 nsproflog_options
-rw-r--r-- 1 root wheel 6 Sep 25 22:18 ppe_cores.txt


6. To stop the profiler on all the CPUs

root@ns# /netscaler/nsproflog.sh stop
nCore Profiling
Stopping all profiler processes
Killed
Killed
Killed
Removing buffer for -s cpu=3
Removing profile buffer on cpu 3 ... Done.
Removing buffer for -s cpu=2
Removing profile buffer on cpu 2 ... Done.
Removing buffer for -s cpu=1
Removing profile buffer on cpu 1 ... Done.
Saved profiler capture data in newproflog.0.tar.gz
Stopping mgmt profiler process
[1]+ Killed: 9 /netscaler/nsproflog.sh start
root@ns#


7. To display the profiled data for cpu 1 (no CPU threshold) for all functions whose Hitratio% is greater than the default threshold hitperc=1

root@ns# tar -xzvf newproflog.9.tar.gz
newproflog.9/
newproflog.9/newproflog_cpu_1.out
newproflog.9/newproflog_cpu_2.out
newproflog.9/newproflog_cpu_3.out
newproflog.9/nsproflog_args
root@ns# cd newproflog.9
root@ns# ls
newproflog_cpu_1.out newproflog_cpu_2.out newproflog_cpu_3.out nsproflog_args
root@ns#
root@ns# /netscaler/nsproflog.sh display=newproflog_cpu_1.out
nCore Profiling
Displaying the profiler command-line arguments used during start of capture
/netscaler/nsproflog.sh
/netscaler/nsprofmon -s cpu=3 -O -k /var/nsproflog/newproflog_cpu_3.out -T 10 -ye capture
/netscaler/nsprofmon -s cpu=2 -O -k /var/nsproflog/newproflog_cpu_2.out -T 10 -ye capture
/netscaler/nsprofmon -s cpu=1 -O -k /var/nsproflog/newproflog_cpu_1.out -T 10 -ye capture
Displaying the profile capture statistics for proc with HitRatio > 1%
NetScaler NS11.0: Build 13.6.nc, Date: Sep 18 2013, 13:54:47
==============================================================================
Index HitRatio Hits TotalHit% Length Symbol name
==============================================================================
1 50.358% 5550 50.358% 1904 packet_engine**
2 15.380% 1695 65.738% 32 pe_idle_readmicrosec**
3 9.037% 996 74.775% 112 nsmcmx_is_pending_messages**
4 8.956% 987 83.731% 80 vc_idle_poll**
5 7.041% 776 90.772% 96 vmpe_intf_loop_rx_any**
6 6.143% 677 96.915% 256 vmpe_intf_e1k_sw_rss_tx_any**
7 1.370% 151 98.285% 64 vmpe_intf_e1k_rx_any**
==============================================================================
8 98.285% 11021
==============================================================================
** - Idle Symbols
Displaying the summary of proc hits.................................
==============================================================================
PID PROCNAME PROCHIT PROCHIT%
==============================================================================
1326 NSPPE-00 11021 100.00
==============================================================================

8. Once all data is collected, revert the profiler settings, i.e. leave it running at the 90% threshold. If this step is skipped, continuous profiling will not happen until the next reboot.

nohup /usr/bin/bash /netscaler/nsproflog.sh cpuuse=900 start &

Verify the following; note that the last line will repeat for each packet CPU.

root@ns# ps -aux | grep -i prof
root 2946 0.0 0.0 1532 984 ?? Ss 12:27PM 0:00.00 /netscaler/nsprofmgmt 90.0
root 2920 0.0 0.1 5132 2464 0 S 12:27PM 0:00.02 /usr/bin/bash /netscaler/nsproflog.sh cpuuse=900 start
root 2957 0.0 0.1 46564 4376 0 R 12:27PM 0:00.01 /netscaler/nsprofmon -s cpu=1 -ys cpuuse=900 -ys profmode=cpuuse -O -k /var/nsproflog/newproflog_cpu_1.out -s logsize=10485760 -ye capture

Related:

QRadar – Supervise system with QID

Hello all,

Once more I’m having an issue with monitoring…

I’m looking to keep an eye on CPU usage (also RAM, HDD, etc. but my example will be based on CPU).

So I created 3 rules based on these QIDs:

– 38750073 SAR Sentinel threshold crossed
– 4254950 SYSINFO:CPU Alerts when CPU usage hits a preconfigured threshold
– 3250045 Syslog and Webtrends Critical CPU utilization has surpassed the alarm threshold

After editing these rules, I ran a stress test (the “stress” command on CentOS and “while true; do a=1; done”). I am monitoring the system in VMware, where I can see the CPU at 100%.

I’m logging the event by sending an SNMP trap (this config is OK, I tried it with other rules) but nothing is sent…

Do you know which QIDs would work and/or where I should find what I’m looking for?

Regards,

Related:

How to Use the XenServer Xentop Utility

To run the xentop utility, open an SSH console to the XenServer host or go to the Console tab in XenCenter.

Run the xentop command in the console. The console displays information about this server in a table.

User-added image

Available Parameters

You can use the following parameters to configure the output for the xentop command:

  • --delay=SECONDS – Set the number of seconds between updates
  • -n – Output the VIF network data
  • -x – Output the VBD block device data
  • -r – Repeat the table header before each domain
  • -v – Output the vCPU data
  • -i – Number of iterations (updates) to display before xentop exits
  • -f – Output the full domain name instead of a truncated name

Use the -h parameter to see more available parameters for the xentop command.

Columns Description

The xentop utility displays the following columns in the console:

CPU(sec) – Prints domain CPU usage in seconds

CPU(%) – Prints cpu percentage statistic

VCPUS – Prints number of virtual CPUs

NETS – Prints number of virtual networks

MEM – Prints current memory

MAXMEM(k) – Prints maximum domain memory statistic in KB

MAXMEM(%) – Prints memory percentage statistic, ratio of current domain memory to total node memory

NETTX – Prints number of total network tx bytes statistic/1024

NETRX – Prints number of total network rx bytes statistic/1024

VBDS – Prints number of virtual block devices

VBD OO – Prints number of total VBD OO requests. This shows the number of times that the VBD has encountered an out of requests error. When that occurs, I/O requests for the VBD are delayed.

The following show the number of total VBD requests:

  • VBD_RD number of read requests
  • VBD_WR number of write requests

Possible virtual machine states (a small decoding sketch follows this list):

  • d – domain is dying
  • s – domain shutting down
  • b – blocked domain
  • c – domain crashed
  • p – domain paused
  • r – domain is actively running on one of the CPUs
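When scripting around xentop output, the state letters above can be expanded with a simple lookup. A minimal Python sketch; decode_state() is only an illustration, not part of xentop:

XENTOP_STATES = {
    "d": "dying",
    "s": "shutting down",
    "b": "blocked",
    "c": "crashed",
    "p": "paused",
    "r": "running",
}

def decode_state(state_field):
    """Translate a xentop STATE field such as '--b---' into readable words."""
    flags = [XENTOP_STATES[ch] for ch in state_field if ch in XENTOP_STATES]
    return ", ".join(flags) or "no flags set"

print(decode_state("--b---"))   # blocked
print(decode_state("-----r"))   # running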

Related: