Cache file size of “Cache on device RAM with overflow on hard disk” is growing faster than “Cache on device hard disk”

  • Cache Type = “Cache on device hard disk”

The block reservation of “Cache on device hard disk” is only 4 KB, and it is written as random I/O.

  • Cache Type = “Cache on device RAM with overflow on hard disk”

Since the “overflow on hard disk” cache is in VHDX format, the block reservation is 2 MB and is laid out for sequential I/O. This provides better performance than the legacy local write cache, but it consumes more file-system space.

Hence, if heavy IOPS occur on the target device and the RAM cache runs out of space, the “overflow on hard disk” cache file is written to, and it may grow very fast.
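As a rough illustration of why the growth rates differ (illustrative numbers only, based on the block sizes above): 1,000 small random writes of 4 KB each add at most about 4 MB to the legacy .vdiskcache file, because each write reserves only a 4 KB block. With the VHDX-based overflow file, each new region touched reserves a full 2 MB block, so in the worst case the same 1,000 scattered writes could reserve up to about 2 GB of file space, even though far less data was actually written.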

Related:


Server HDD activity high, crippling

I do not need a solution (just sharing information)

Our small organization’s single Server 2012 file server has been running SEPM since 2014.

I’ve been here since 2017.

Since I’ve been here, the server has been sluggish.

It’s a VM with 3TB disk space on 2 volumes and 6GB dedicated RAM.

The only other VMs are a very small linux VM and a small Win7 VM used only for remote login.

I’ve played around with stripping out unnecessary apps and even moved the paging file, with only marginal success.

Today it was particularly lethargic, and I noticed in Resource Monitor that almost all of the disk activity (which was quite high) was related to SEPM.

A google search turned up an issue where someone had “Live Update Administrator” and SEPM both running on the same server, and that was causing excessive disk activity.

That post was from 2010. Our services are not named the same.

I was wondering if “Live Update” is the same as “Live Update Administrator”, as we have both “Live Update” and “Symantec Endpoint Manager” services running, as well as a few others that begin with “Symantec…”

I wanted to stop the “Live Update” service to see what happened but I can’t. Looks like I’ll have to uninstall it.

It shows up as a separate installed app in “add and remove…”

I want to make sure from someone here before I do that, though.

Thanks-

KK


Related:

Avamar Client for Windows: Avamar backup fails with “avtar Error : Out of memory for cache file” on Windows clients

Article Number: 524280 Article Version: 3 Article Type: Break Fix



Avamar Plug-in for Oracle, Avamar Client for Windows, Avamar Client for Windows 7.2.101-31



In this scenario we have the same issue presented in KB 495969; however, the solution does not apply due to an environment issue on the Windows client.

  • KB 495969 – Avamar backup fails with “Not Enough Space” and “Out of Memory for cache file”

The issue can affect any plug-in; in this case the error was presented in the following manner:

  • For FS backups:

avtar Info <8650>: Opening hash cache file 'C:\Program Files\avs\var\p_cache.dat'
avtar Error <18866>: Out of memory for cache file 'C:\Program Files\avs\var\p_cache.dat' size 805306912
avtar FATAL <5351>: MAIN: Unhandled internal exception Unix exception Not enough space

  • For VSS backups:

avtar Info <8650>: Opening hash cache file 'C:\Program Files\avs\var\p_cache.dat'
avtar Error <18866>: Out of memory for cache file 'C:\Program Files\avs\var\p_cache.dat' size 1610613280
avtar FATAL <5351>: MAIN: Unhandled internal exception Unix exception Not enough space

  • For Oracle backup:

avtar Info <8650>: Opening hash cache file 'C:\Program Files\avs\var\clientlogs\oracle-prefix-1_cache.dat'
avtar Error <18866>: Out of memory for cache file 'C:\Program Files\avs\var\clientlogs\oracle-prefix-1_cache.dat' size 100663840
avtar FATAL <5351>: MAIN: Unhandled internal exception Unix exception Not enough space

or this variant:

avtar Info <8650>: Opening hash cache file 'C:\Program Files\avs\var\clientlogs\oracle-prefix-1_cache.dat'
avtar Error <18864>: Out of restricted memory for cache file 'C:\Program Files\avs\var\clientlogs\oracle-prefix-1_cache.dat' size 100663840
avtar FATAL <5351>: MAIN: Unhandled internal exception Unix exception Not enough space
avoracle Error <7934>: Snapup of <oracle-db> aborted due to rman terminated abnormally - check the logs

  • With the RMAN log reporting this:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup plus archivelog command at 06/14/2018 22:17:40
RMAN-03009: failure of backup command on c0 channel at 06/14/2018 22:17:15
ORA-04030: out of process memory when trying to allocate 1049112 bytes (KSFQ heap,KSFQ Buffers)
Recovery Manager complete.

Initially it was thought that the cache file could not grow in size due to an incorrect “hashcachemax” value.

The client had plenty of free RAM (48 GB total), so the flag’s value was increased from -16 (3 GB maximum file size) to -8 (6 GB maximum file size).
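The negative hashcachemax values express the cap as a fraction of physical RAM, which is consistent with the numbers above: 48 GB / 16 = 3 GB maximum cache file size, and 48 GB / 8 = 6 GB.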

But the issue persisted, and disk space was not a problem either; there were plenty of GB of free space.

Further investigation with a test binary from the engineering team led to the finding that the Windows OS was not releasing enough unused, contiguous memory to allocate/load the entire hash cache file into memory for the backup operation.

A test binary that allocated the memory in smaller pieces was also tried, to see whether we could reach the point where the OS would allow the full p_cache.dat file to be loaded into memory, but that did not help either; the operating system was still not allowing the file to be loaded into memory for some reason.

The root cause is hidden somewhere in the OS; however, in this case Microsoft was not engaged for further investigation on their side.

Instead, a workaround was found by making the cache file smaller; see the details in the resolution section below.

To work around this issue, the hash cache file was set to a smaller size so that the OS would not have problems allocating it in memory.

In this case it was noticed that the OS also had problems allocating smaller sizes such as 200+ MB, so we decided to resize p_cache.dat to just 100 MB using the following flag:

--hashcachemax=100

This way the hash cache file would never grow beyond 100MB and would overwrite the old entries.

After adding that flag, it is required to recycle the cache file by renaming or deleting p_cache.dat (renaming is the preferred option).

After the first backup, which as expected takes longer than usual (to rebuild the cache file), the issue should be resolved.
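As a hedged sketch of what the workaround looks like on a Windows client (the avtar.cmd flag-file name is an assumption based on a default install under the var directory shown in the logs above; verify it on the affected client):

Add the following line to C:\Program Files\avs\var\avtar.cmd:

--hashcachemax=100

Then recycle the existing cache file, preferably by renaming it, for example from an elevated command prompt:

ren "C:\Program Files\avs\var\p_cache.dat" p_cache.dat.old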

  • The demand-paging cache is not recommended in this scenario since the backups are directed to GSAN storage, so the monolithic paging cache was used.
  • Demand-paging was designed to benefit backups being sent to Data Domain storage.

Related:

Understanding Write Cache in Provisioning Services Server

This article provides information about write cache usage in a Citrix Provisioning, formerly Provisioning Services (PVS), Server.

Write Cache in Provisioning Services Server

In PVS, the term “write cache” is used to describe all the cache modes. The write cache includes data written by the target device. If data is written to the PVS vDisk in a caching mode, the data is not written back to the base vDisk. Instead, it is written to a write cache file in one of the locations described below.

When the vDisk mode is private/maintenance mode, all data is written back to the vDisk file on the PVS server. When the target device is booted in standard (shared) mode, the write cache information is checked to determine the cache location; regardless of the cache type, the data written to the write cache is deleted on boot, so each time a target reboots or starts up it has a clean cache containing nothing from previous sessions.

If the PVS target is set to use Cache on device RAM with overflow on hard disk or Cache on device hard disk, and the PVS target software either does not find an appropriate hard disk partition or the partition is not formatted with NTFS, it will fail over to Cache on server. The PVS target software will, by default, redirect the system page file to the same disk as the write cache, so pagefile.sys allocates space on the cache drive unless it is manually redirected to a separate volume.

For RAM cache without a local disk, you should consider setting the system page file to zero because all writes, including system page file writes, will go to the RAM cache unless redirected manually. PVS does not redirect the page file in the case of RAM cache.



Cache on device Hard Disk

Requirements

  • Local HD in every device using the vDisk.
  • The local HD must contain a basic volume pre-formatted with a Windows NTFS file system with at least 512MB of free space.

The cache on local HD is stored in a file called .vdiskcache on a secondary local hard drive. It gets created as an invisible file in the root folder of the secondary local HD. The cache file size grows, as needed, but never gets larger than the original vDisk, and frequently not larger than the free space on the original vDisk. It is slower than RAM cache or RAM Cache with overflow to local hard disk, but faster than server cache and works in an HA environment. Citrix recommends that you do not use this cache type because of incompatibilities with Microsoft Windows ASLR which could cause intermittent crashes and stability issues. This cache is being replaced by RAM Cache with overflow to the hard drive.

Cache in device RAM

Requirement

  • An appropriate amount of physical memory on the machine.

The cache is stored in client RAM. The maximum size of the cache is fixed by a setting in the vDisk properties screen. RAM cache is faster than other cache types and works in an HA environment. The RAM is allocated at boot and never changes. The RAM allocated can’t be used by the OS. If the workload has exhausted the RAM cache size, the system may become unusable and even crash. It is important to pre-calculate workload requirements and set the appropriate RAM size. Cache in device RAM does not require a local hard drive.
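For example (illustrative numbers only): a target device with 8 GB of RAM and a 2 GB RAM cache configured in the vDisk properties leaves roughly 6 GB visible to Windows and its applications, and once the workload has written about 2 GB to the cache there is nowhere left for further writes to go, which is when the instability described above can occur.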

Cache on device RAM with overflow on Hard Disk

Requirement

  • Provisioning Service 7.1 hotfix 2 or later.
  • Local HD in every target device using the vDisk.
  • The local HD must contain a basic volume pre-formatted with a Windows NTFS file system with at least 512 MB of free space. By default, Citrix sets this to 6 GB but recommends 10 GB or larger depending on workload.
  • The default RAM cache size is 64 MB; Citrix recommends at least 256 MB for a desktop OS and 1 GB for a server OS if RAM cache is being used.
  • If you decide not to use RAM cache, you can set it to 0 so that only the local hard disk is used for caching.

Cache on device RAM with overflow on hard disk is the newest of the write cache types. Citrix recommends this cache type for PVS; it combines the performance of RAM with the stability of the hard disk cache. The cache uses non-paged pool memory for the best performance. When RAM utilization reaches its threshold, the oldest RAM cache data is written to the local hard drive. The local hard disk cache uses a file it creates called vdiskdif.vhdx.
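As a rough sizing sketch using the recommendations above (illustrative only): 100 desktop targets at 256 MB of RAM cache each consume about 25 GB of extra RAM across the hosts, and at 10 GB of write-cache disk each they need roughly 1 TB of local disk for the vdiskdif.vhdx files, plus whatever the redirected pagefile.sys requires.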

Things to note about this cache type:

  • This write cache type is only available for Windows 7/2008 R2 and later.
  • This cache type addresses interoperability issues with Microsoft Windows ASLR.



Cache on Server

Requirements

  • Enough space allocated to wherever the server cache will be stored.

Server cache is stored in a file on the server, or on a share, SAN, or other location. The file size grows as needed, but never gets larger than the original vDisk, and frequently not larger than the free space on the original vDisk. It is slower than RAM cache because all reads/writes have to go to the server and be read from a file. The cache is deleted when the device reboots; that is, on every boot the device reverts to the base image, and changes remain only during a single boot session. Server cache works in an HA environment if all server cache locations resolve to the same physical storage location. This cache type is not recommended for a production environment.

Additional Resources

Selecting the Write Cache Destination for Standard vDisk Images

Turbo Charging your IOPS with the new PVS Cache in RAM with Disk Overflow Feature

Related:

Win7 MCS Machines BSOD With Error: 0x0000007E

When creating Machine Catalogs with the MCS I/O feature:

1. Make sure the “Memory allocated to cache (MB)” and the “Disk cache size (GB)” boxes are populated when running the Machine Catalog Setup wizard.

Note: If you do not want to use the feature, clear both boxes; temporary data will then not be cached.


2. After the virtual machine has been created, log on to the VM and check that an extra uninitialized disk exists in Disk Management. Do not attempt to format or initialize it.


3. If there is no extra cache drive on the VDAs, the catalog needs to be recreated.
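If the catalog is created with the PowerShell SDK instead of the Studio wizard, the same two values appear to map to the write-back cache parameters of New-ProvScheme; treat this as a hedged sketch, and the names and paths below as placeholders to verify against your environment and SDK version:

New-ProvScheme -ProvisioningSchemeName "Win7-MCSIO" `
  -HostingUnitName "HostingUnit1" `
  -IdentityPoolName "Win7-MCSIO" `
  -MasterImageVM "XDHyp:\HostingUnits\HostingUnit1\Win7Master.vm\Base.snapshot" `
  -UseWriteBackCache `
  -WriteBackCacheMemorySize 256 `
  -WriteBackCacheDiskSize 10

Here -WriteBackCacheMemorySize is in MB and -WriteBackCacheDiskSize is in GB, mirroring the “Memory allocated to cache (MB)” and “Disk cache size (GB)” boxes in the wizard.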

Related:

ECS: Partial GC status in ECS 3.0.

Article Number: 491554 Article Version: 8 Article Type: Break Fix



ECS Appliance, ECS Appliance Hardware, ECS Appliance Software with Encryption, ECS Appliance Software without Encryption, ECS Software

Partial Garbage Collection (GC) is not enabled in ECS 3.0

Partial Repo GC is not available in ECS 3.0

What is Partial GC?

This is working as designed.

N/A

Partial Repo Garbage Collection (GC) is not enabled as part of ECS 3.0. The decision was made to leave repo Partial GC disabled, by default, until the next release, ECS 3.1.

Note: partial Btree GC is available as of 3.0 HF1 with the general patch installed. It reclaims chunks with less than 5% valid data.

ECS 3.1 is expected around late August 2017. It will allow chunks to be reclaimed when 2/3 of the data is garbage. This applies to repo GC only.

How ECS data is stored:

ECS data is stored in 128MB chunks. When a chunk is first created, it is 100% utilized, but as deletions occur or new versions of an object are written, the chunk becomes less than 100% utilized. The Garbage Collection facility will not recycle a chunk until it consists of 100% deleted or obsoleted data.

What is Partial Garbage Collection (Partial GC):

Partial GC is the process that identifies underutilized chunks and moves the remaining allocated objects to other chunks, so that the partially allocated chunks can be reclaimed by the Garbage Collection facility and their 128MB of space converted to available capacity.
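As a worked example against the ECS 3.1 threshold mentioned above: a 128MB chunk in which about 85MB (roughly 2/3) of the data has been deleted or superseded still holds around 43MB of live objects; partial GC copies that 43MB into other chunks, after which the entire 128MB chunk can be recycled as free capacity.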

This procedure may also be recognized as Compaction. ECS (like most object storage systems) has a Garbage Collection facility; calling this aspect of the ECS software Partial Garbage Collection made sense to the designers, but it may equally be called Compaction.

Related:


Re: How to check amount of data written on a preallocated device.

I haven’t seen what the output looks like for preallocated devices in V3, but try:

symcfg list -sid xx -pools -thin -v -detail

Then look at the section “other thin devices with allocations in this pool”. For each device it lists “Total Tracks”, “pool Allocated tracks” and “pool used tracks”.

BTW, for VMAX 3, is there a particular reason you are pre-allocating? Certainly prior to V3 there was a little latency overhead when allocating tracks on first write to newly allocated space, but this is much improved in V3, so it should not be a consideration now.
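To translate the track counts into capacity, multiply by the track size; assuming the 128 KB track size used on VMAX3 (verify this for your array), 81,920 “pool used tracks” would correspond to 81,920 x 128 KB = 10 GB of data actually written to the device.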

Related:


7021211: Memory, I/O and DefaultTasksMax related considerations for SLES for SAP servers with huge memory

Some general guidelines about using pagecache_limit and optimizing some of the I/O related settings:

If, on the server in question, you are *not* simultaneously mixing a heavy file I/O workload with a memory-intensive application workload, then this setting (pagecache_limit) will probably cause more harm than good. However, most SAP environments have both high-I/O and memory-intensive workloads.

Ideally, vm.pagecache_limit_mb should be zero until such time that pagecache is seen to exhaust memory. If it does exhaust memory, then trial-and-error tuning must be used to find values that work for the specific server/workload in question.

As regards the type of settings that have both a fixed value and a ‘ratio’ setting option, keep in mind that ratio settings will be more and more inaccurate as the amount of memory in the server grows. Therefore, specific ‘byte’ settings should be used as opposed to ‘ratio’ type settings. The ‘ratio’ settings can allow too much accumulation of dirty memory, which has been proven to lead to processing stalls during heavy fsync or sync write loads. Setting dirty_bytes to a reasonable value (which depends on the storage performance) leads to much less unexpected behavior.

Setting, say, a 4 GB pagecache limit on a 142 GB machine is asking for trouble, especially when you consider that this would be much smaller than a default dirty ratio limit (which is by default 40% of available pages).

If the pagecache_limit is used, it should always be set to a value well above the ‘dirty’ limit, be it a fixed value or a percentage.

The thing is that there are no universal ‘correct’ values for these settings. You are always balancing throughput with sync latency. If we had code in the kernel so that it would auto-tune automatically based on the amount of RAM in the server, it would be very prone to regressions because it depends on server-specific loading. So, necessarily, it falls to the server admins to come up with the best values for these settings (via trial and error).

*If* we know for a fact that the server does encounter issues with pagecache_limit set to 0 (not active), then choose a pagecache_limit that is suitable in relation to how much memory is in the server.

Let’s assume that you have a server with 1 TB of RAM; these are *suggested* values which could be used as a starting point:

vm.pagecache_limit_mb = 20972              # ~20 GB – different values could be tried, from say 20 GB to 64 GB
vm.pagecache_limit_ignore_dirty = 1        # see the section below on this variable to decide what it should be set to
vm.dirty_ratio = 0
vm.dirty_bytes = 629145600                 # could be reduced or increased based on actual hardware performance, but keep vm.dirty_background_bytes at approximately 50% of this setting
vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 314572800      # set this value to approximately 50% of vm.dirty_bytes


NOTE: If it is decided to try setting pagecache_limit to 0 (not active), then it is still a good idea to test different values for dirty_bytes and dirty_background_bytes in an I/O-intensive environment to arrive at the best performance.
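A minimal sketch of how such values are typically applied on SLES: add them to /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload with "sysctl -p", or set them on the fly with "sysctl -w". The numbers are just the starting points suggested above:

sysctl -w vm.pagecache_limit_mb=20972
sysctl -w vm.pagecache_limit_ignore_dirty=1
sysctl -w vm.dirty_ratio=0
sysctl -w vm.dirty_bytes=629145600
sysctl -w vm.dirty_background_ratio=0
sysctl -w vm.dirty_background_bytes=314572800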

———————————————————————

How pagecache_limit works:

—————————————-

The heart of this patch is a function called shrink_page_cache(). It is called from balance_pgdat (which is the worker for kswapd) if the pagecache is above the limit. The function is also called in __alloc_pages_slowpath.

shrink_page_cache() calculates the number of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target), then shrinks the pagecache (using the kernel LRUs).

shrink_page_cache does several passes:

– Just reclaiming from inactive pagecache memory. This is fast, but it might not find enough free pages; if that happens, the second pass will happen.

– In the second pass, pages from the active list will also be considered.

– The third pass will only happen if pagecache_limit_ignore_dirty is not 1. In that case, the third pass is a repetition of the second pass, but this time we allow pages to be written out.

In all passes, only unmapped pages will be considered.


How it changes memory management:

——————————————————

If the pagecache_limit_mb is set to zero (default), nothing changes.

If set to a positive value, there will be three different operating modes:

(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normal.

(2) However, as soon as someone consumes those free pages, we’ll start freeing pagecache. As those pages are returned to the free page pool, freeing a few pages from pagecache will return us to state (1). If, however, someone consumes these free pages quickly, we’ll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.

(3) Once we are at or below the low watermark, pagecache_limit_mb, the pages in the page cache will be governed by normal paging memory management decisions; if it starts growing above the limit (corrected by the free pages), we’ll free some up again.

This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application’s page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it should otherwise favor the application’s memory.


pagecache_limit_ignore_dirty:

——————————————

The default for this setting is 1; this means that we don’t consider dirty memory to be part of the limited pagecache, as we cannot easily free up dirty memory (we’d need to do writes for this). By setting this to 0, we actually consider dirty (unmapped) memory to be freeable and do a third pass in shrink_page_cache() where we schedule the pages for write-out. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered non-freeable.

From SAP on the subject:

If there are a lot of local writes and it is OK to throttle them by limiting the writeback caching, we recommend that you set the value to 0. If writing mainly happens to NFS filesystems, the default of 1 should be left untouched. A value of 2 would be a middle ground, not limiting local writeback caching as much, but potentially resulting in some paging.
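As a quick reference, the SAP guidance above maps to the following settings (a sketch of the three cases, not an official recommendation):

vm.pagecache_limit_ignore_dirty = 0    # heavy local writes, OK to throttle writeback caching
vm.pagecache_limit_ignore_dirty = 1    # default; writes go mainly to NFS filesystems
vm.pagecache_limit_ignore_dirty = 2    # middle ground; less throttling of local writeback, may cause some paging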

Related: