7021211: Memory, I/O and DefaultTasksMax related considerations for SLES for SAP servers with huge memory

Some general guidelines about using pagecache_limit and optimizing some of the I/O related settings:

If, on the server in question, you are *not* simultaneously mixing a heavy file I/O workload while running a memory intensive application workload, then this setting (pagecache_limit) will probably cause more harm than good. However, in most SAP environments, there are both high I/O and memory intensive workloads.

Ideally, vm.pagecache_limit_mb should be zero until such time that pagecache is seen to exhaust memory. If it does exhaust memory, then trial-and-error tuning must be used to find values that work for the specific server/workload in question.

As regards the type of settings that have both a fixed value and a 'ratio' setting option, keep in mind that ratio settings become more and more inaccurate as the amount of memory in the server grows. Therefore, specific 'byte' settings should be used as opposed to 'ratio' type settings. The 'ratio' settings can allow too much accumulation of dirty memory, which has been proven to lead to processing stalls during heavy fsync or sync write loads. Setting dirty_bytes to a reasonable value (which depends on the storage performance) leads to much less unexpected behavior.
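
To illustrate why ratio settings become coarse on large machines, the following sketch converts a percentage into an absolute amount of dirty memory for a few RAM sizes. The function name and the sample RAM sizes are illustrative assumptions, and the calculation uses total RAM as a stand-in for the available memory that the ratio is really applied to, so the results are approximations.

```python
GIB = 1024 ** 3

def ratio_to_bytes(memory_bytes: int, ratio_percent: int) -> int:
    """Approximate amount of dirty memory a 'ratio' setting permits."""
    return memory_bytes * ratio_percent // 100

if __name__ == "__main__":
    # Hypothetical RAM sizes, chosen only to show how the limit scales.
    for ram_gib in (16, 142, 1024):
        limit = ratio_to_bytes(ram_gib * GIB, 40)
        step = ratio_to_bytes(ram_gib * GIB, 1)
        print(f"{ram_gib:>5} GiB RAM: 40% is about {limit / GIB:.1f} GiB, "
              f"and one percentage point is about {step / GIB:.2f} GiB")
```

On a 1 TiB server a single percentage point is already more than 10 GiB, which is why a byte value gives far finer control than a ratio.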

Setting, say, a 4 GB pagecache limit on a 142 GB machine is asking for trouble, especially when you consider that this would be much smaller than the default dirty ratio limit (which is by default 40% of available pages).

If the pagecache_limit is used, it should always be set to a value well above the 'dirty' limit, be it a fixed value or a percentage.
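
The rule above can be turned into a quick sanity check. This is a minimal sketch: the function names are hypothetical, and the headroom factor of 2 is an assumption, since the text only says "well above" without giving a number. Total RAM again stands in for available memory when a ratio is used.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

def dirty_limit_bytes(dirty_bytes: int, dirty_ratio: int, memory_bytes: int) -> int:
    """Effective dirty limit: the byte value if set, else the ratio of memory."""
    if dirty_bytes > 0:
        return dirty_bytes
    return memory_bytes * dirty_ratio // 100

def pagecache_limit_is_sane(pagecache_limit_mb: int, dirty_limit: int,
                            headroom: float = 2.0) -> bool:
    """True if the limit is disabled (0) or comfortably above the dirty limit.

    headroom is an assumed safety factor, not a documented value.
    """
    if pagecache_limit_mb == 0:
        return True
    return pagecache_limit_mb * MIB >= headroom * dirty_limit

if __name__ == "__main__":
    # 4 GB limit on a 142 GB machine with a 40% dirty ratio.
    print(pagecache_limit_is_sane(4096, dirty_limit_bytes(0, 40, 142 * GIB)))
    # 20972 MB limit with dirty_bytes = 629145600 on a 1 TB machine.
    print(pagecache_limit_is_sane(20972, dirty_limit_bytes(629145600, 0, 1024 * GIB)))
```

The first case fails the check because 40% of 142 GB is far larger than 4 GB; the second passes because the limit is many times the 600 MiB dirty limit.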

The thing is that there are no universally 'correct' values for these settings. You are always balancing throughput against sync latency. If the kernel had code to auto-tune these values based on the amount of RAM in the server, it would be very prone to regressions, because the right values depend on server-specific load. So, necessarily, it falls to the server admins to come up with the best values for these settings (via trial and error).

*If* we know for a fact that the server does encounter issues with pagecache_limit set to 0 (not active), then choose a pagecache_limit that is suitable in relation to how much memory is in the server.

Let's assume that you have a server with 1 TB of RAM. These are *suggested* values which could be used as a starting point:

pagecache_limit_mb = 20972 # 20 GB; different values between roughly 20 GB and 64 GB could be tried

pagecache_limit_ignore_dirty = 1 # see the section on this variable below to decide what it should be set to

vm.dirty_ratio = 0

vm.dirty_bytes = 629145600 # This could be reduced or increased based on actual hardware performance, but keep vm.dirty_background_bytes at approximately 50% of this setting

vm.dirty_background_ratio = 0

vm.dirty_background_bytes = 314572800 # Set this value to approximately 50% of vm.dirty_bytes
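
When experimenting with different values, it can help to derive the dependent settings rather than editing each one by hand. The sketch below is one possible way to do that: the helper names are hypothetical, and the vm. prefix on both pagecache keys is an assumption based on the vm.pagecache_limit_mb spelling used earlier in this article. It only produces text and does not change any running system.

```python
MIB = 1024 ** 2

def suggested_settings(pagecache_limit_mb: int, dirty_mib: int,
                       ignore_dirty: int = 1) -> dict:
    """Build a settings dict with the background limit at 50% of dirty_bytes."""
    dirty_bytes = dirty_mib * MIB
    return {
        "vm.pagecache_limit_mb": pagecache_limit_mb,
        "vm.pagecache_limit_ignore_dirty": ignore_dirty,
        "vm.dirty_ratio": 0,              # ratio disabled in favor of bytes
        "vm.dirty_bytes": dirty_bytes,
        "vm.dirty_background_ratio": 0,   # ratio disabled in favor of bytes
        "vm.dirty_background_bytes": dirty_bytes // 2,
    }

def render(settings: dict) -> str:
    """Format the settings as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())

if __name__ == "__main__":
    # 600 MiB of dirty memory reproduces the starting values listed above.
    print(render(suggested_settings(20972, 600)))
```

Changing only the dirty_mib argument keeps the 50% relationship intact while testing against the actual storage performance.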


NOTE: If it is decided to try setting pagecache_limit to 0 (not active), then it's still a good idea to test different values for dirty_bytes and dirty_background_bytes in an I/O intensive environment to arrive at the best performance.

———————————————————————

How pagecache_limit works:

—————————————-

The heart of this patch is a function called shrink_page_cache(). It is called from balance_pgdat() (which is the worker for kswapd) if the pagecache is above the limit. The function is also called in __alloc_pages_slowpath().

shrink_page_cache() calculates the number of pages the cache is over its limit. It reduces this number by a factor (so you have to call it several times to get down to the target), then shrinks the pagecache (using the kernel LRUs).

shrink_page_cache does several passes:

– The first pass just reclaims from inactive pagecache memory. This is fast, but it might not find enough free pages; if that happens, the second pass will happen.

– In the second pass, pages from the active list will also be considered.

– The third pass will only happen if pagecache_limit_ignore_dirty is not 1. In that case, the third pass is a repetition of the second pass, but this time we allow pages to be written out.

In all passes, only unmapped pages will be considered.
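
The pass structure described above can be sketched as a toy model. This is not the kernel code: the CacheState class, its field names, and the page counts are invented for illustration, all pages are assumed to be unmapped, and the step that reduces the target by a factor is left out.

```python
from dataclasses import dataclass

@dataclass
class CacheState:
    """Toy counters for unmapped pagecache pages (hypothetical model)."""
    inactive_clean: int
    active_clean: int
    dirty: int

def shrink_passes(state: CacheState, target: int, ignore_dirty: int = 1):
    """Reclaim up to `target` pages; return (pages_reclaimed, passes_run)."""
    # Pass 1: only inactive, clean pagecache.
    take = min(state.inactive_clean, target)
    state.inactive_clean -= take
    reclaimed = take
    if reclaimed >= target:
        return reclaimed, 1

    # Pass 2: pages from the active list are considered as well.
    take = min(state.active_clean, target - reclaimed)
    state.active_clean -= take
    reclaimed += take
    if reclaimed >= target or ignore_dirty == 1:
        return reclaimed, 2

    # Pass 3: like pass 2, but dirty pages may be written out and freed.
    take = min(state.dirty, target - reclaimed)
    state.dirty -= take
    reclaimed += take
    return reclaimed, 3

if __name__ == "__main__":
    print(shrink_passes(CacheState(100, 50, 200), 300, ignore_dirty=1))
    print(shrink_passes(CacheState(100, 50, 200), 300, ignore_dirty=0))
```

With the default setting of 1, the model stops after the second pass even when the target is not met; with 0, the third pass makes up the difference from dirty pages.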


How it changes memory management:

——————————————————

If pagecache_limit_mb is set to zero (default), nothing changes.

If set to a positive value, there will be three different operating modes:

(1) If we still have plenty of free pages, the pagecache limit will NOT be enforced. Memory management decisions are taken as normal.

(2) However, as soon as someone consumes those free pages, we'll start freeing pagecache. Because those pages are returned to the free page pool, freeing a few pages from pagecache will return us to state (1). If, however, someone consumes these free pages quickly, we'll continue freeing up pages from the pagecache until we reach pagecache_limit_mb.

(3) Once we are at or below the low watermark, pagecache_limit_mb, the pages in the page cache will be governed by normal paging memory management decisions; if it starts growing above the limit (corrected by the free pages), we'll free some up again.
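
One way to read the three modes is that the configured limit is raised by the amount of free memory, so the limit only bites once free pages run short. The sketch below encodes that reading. It is an interpretation of the description above, not the kernel implementation; the function name and the free_weight parameter (how strongly free memory raises the limit) are assumptions.

```python
def pagecache_excess_mb(pagecache_mb: int, free_mb: int, limit_mb: int,
                        free_weight: int = 1) -> int:
    """Pagecache (in MB) above the limit after correcting for free memory.

    Returns 0 when the feature is disabled or nothing needs to be freed.
    free_weight is a hypothetical tuning factor.
    """
    if limit_mb == 0:
        return 0  # default: the feature is off and nothing changes
    effective_limit = limit_mb + free_weight * free_mb
    return max(0, pagecache_mb - effective_limit)

if __name__ == "__main__":
    limit = 20972
    print(pagecache_excess_mb(50000, 40000, limit))  # mode 1: plenty free
    print(pagecache_excess_mb(50000, 1000, limit))   # mode 2: free memory low
    print(pagecache_excess_mb(15000, 0, limit))      # mode 3: below the limit
```

In this model, mode (2) is the only one that returns a non-zero amount to free, and that amount shrinks to zero as the pagecache approaches pagecache_limit_mb.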

This feature is useful for machines that have large workloads, carefully sized to eat most of the memory. Depending on the application's page access pattern, the kernel may too easily swap the application memory out in favor of pagecache. This can happen even for low values of swappiness. With this feature, the admin can tell the kernel that only a certain amount of pagecache is really considered useful and that it otherwise should favor the application's memory.


pagecache_limit_ignore_dirty:

——————————————

The default for this setting is 1; this means that we don't consider dirty memory to be part of the limited pagecache, as we cannot easily free up dirty memory (we'd need to do writes for this). By setting this to 0, we actually consider dirty (unmapped) memory to be freeable and do a third pass in shrink_page_cache() where we schedule the pages for write-out. Values larger than 1 are also possible and result in a fraction of the dirty pages being considered non-freeable.

From SAP on the subject:

If there are a lot of local writes and it is OK to throttle them by limiting the writeback caching, we recommend that you set the value to 0. If writing mainly happens to NFS filesystems, the default 1 should be left untouched. A value of 2 would be a middle ground, not limiting local write back caching as much, but potentially resulting in some paging.
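
The effect of the three kinds of value (0, 1, and larger than 1) can be sketched as follows. The assumption here, which the text does not state explicitly, is that a value N larger than 1 treats 1/N of the dirty pages as non-freeable; the function name and page counts are illustrative only.

```python
def limited_pagecache_pages(unmapped_pagecache: int, dirty: int,
                            ignore_dirty: int) -> int:
    """Pages counted against the pagecache limit in this simplified model."""
    if ignore_dirty == 0:
        # Dirty pages are treated as freeable, so all of them count.
        return unmapped_pagecache
    # Assumed rule: dirty / ignore_dirty pages are excluded as non-freeable.
    return max(0, unmapped_pagecache - dirty // ignore_dirty)

if __name__ == "__main__":
    for value in (0, 1, 2):
        counted = limited_pagecache_pages(1000, 400, value)
        print(f"ignore_dirty={value}: {counted} pages count against the limit")
```

Under this model a value of 2 sits between the two extremes, which matches the "middle ground" described above.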

Related:

7022353: OOM Killer Invoked after SLE12 Upgrade

After upgrading from any SLES11 to any SLES12 version, the system became unstable and was running out of memory. Messages about the out of memory killer being invoked were observed.

2017-11-08T18:05:25.713556-02:00 db2p121 kernel: lsof invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0

Running sysctl -a | grep swappiness shows:

vm.swappiness=0
