Adding Lustre Storage to the HPC Equation

For organizations that need extreme scalability in high-performance computing systems, Lustre is often the file system of choice, for a lot of good reasons. When it comes to high-performance computing applications, there is basically no such thing as too much data storage. Who doesn’t need more storage? Everywhere you look, HPC applications are ballooning in size. A few examples: AccuWeather, the world’s largest source of weather forecasts and warnings, responds to more than 30 billion data requests daily.[1] The wave of medical data washing over the global healthcare industry is expected to swell to 2,314 …

Do We Have a Containerized Solution for CWP?

Currently, when we deploy CWP in our AWS environment, it spins up EC2 instances for threat scanning. Even though no files have arrived in S3 for a long time, the servers keep running. Is there a way to containerize the solution so that EC2 instances are spun up only when files land in S3, scan them for threats, and then shut down?

Easy, Economical Cloud DR to AWS with RecoverPoint for Virtual Machines

The most recent RecoverPoint for Virtual Machines release, v5.2.1, adds the capability to protect VMs directly to AWS S3 object storage, using proprietary snap-based replication, with an RPO that can be measured in minutes. This blog recaps the capabilities that Cloud DR 18.4 unlocks for RecoverPoint for Virtual Machines. RecoverPoint for Virtual Machines works with Cloud DR to protect your VMs by replicating them to the AWS cloud. Replicated data is compressed, encrypted, and stored as incremental snapshots on Amazon S3 object storage. You can set parameters around the snap-replication policies for reliable and repeatable Disaster …

Open Enterprise Server 2018 SP1 – Now Available!

Nick Scholz

Micro Focus is pleased to announce the availability of Open Enterprise Server 2018 Service Pack 1, which includes plenty of exciting features and enhancements. Open Enterprise Server 2018 SP1 Highlights include: CIFS Performance/Specification Compliance – Folder Redirection allows users to redirect the path of a known folder to NSS AD network file share. Users can then …

The post Open Enterprise Server 2018 SP1 – Now Available! appeared first on Cool Solutions. Nick Scholz

7022780: Get the most out of Novell Open Enterprise Server 2015

  • Maximum Cached Sub-directories Per Volume: 1024000

    This is settable by executing “novcifs -k SDIRCACHE=1024000” as root.
  • Maximum Cached Files Per Subdirectory: 102400

    This is settable by executing “novcifs -k DIRCACHE=102400” as root.
  • Maximum Cached Files Per Volume: 2048000

    This is settable by executing “novcifs -k FILECACHE=2048000” as root.

If, however, you are experiencing inaccessible CIFS shares, CIFS that stops listening or communicating, an unresponsive or hanging novell-cifs daemon, or other CIFS or novell-cifs daemon related issues, please check TID 7008956 “Troubleshooting and Debugging CIFS on Open Enterprise Server”.

During the lifespan of OES2015, a memory leak was addressed in the NMAS code used by novell-cifsd.

The fixed code on disk is part of the oes2015sp1-July-2017-Scheduled-Maintenance patch, or later.

More details on how to check and, when required, update the code in eDirectory can be found in TID 7022690 “ndsd memory leak caused by novell-cifs nmas authentication method on OES2015.1”.



NAMCD (LUM):

As several services rely on the Novell Authentication Module for their authentication to eDirectory, it is recommended to tune it so that it preferably uses the local server if that server has a local replica, or otherwise a server on the local subnet that has a replica of the needed tree partitions.

By default the “preferred-server” is set to the first Novell Open Enterprise Server that was installed in the eDirectory tree, so there is a good chance that this parameter requires adjustment.

To get the current namcd configuration execute as root:

namconfig get

If the preferred-server value does not point to the local IP address (which is preferred even when the server does not have a local replica), or to the IP address of a server on the same physical subnet, it can be changed with the following sequence of commands as root:

namconfig set preferred-server=[local ip address (*)]

namconfig -k

namconfig cache_refresh

(*) When in doubt, run ‘ndsconfig get’ and use the IP address listed in n4u.server.interfaces, without the @524.
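The footnote above amounts to stripping the port suffix from the interface value. A minimal Python sketch of that step follows. The function names are hypothetical, and it assumes that n4u.server.interfaces may hold a comma-separated list of address@port entries, of which the first is used:

```python
def interface_to_ip(interface: str) -> str:
    """Return the address part of an 'address@port' entry, e.g. '10.0.0.5@524'."""
    address, _sep, _port = interface.strip().partition("@")
    if not address:
        raise ValueError(f"no address found in {interface!r}")
    return address


def preferred_server_command(interfaces_value: str) -> str:
    """Build the namconfig command line for the first listed interface.

    Assumption: the value may be a comma-separated list; only the first
    entry is used here.
    """
    first_interface = interfaces_value.split(",")[0]
    return "namconfig set preferred-server=" + interface_to_ip(first_interface)


if __name__ == "__main__":
    # Example value only; use the output of 'ndsconfig get' on the real server.
    print(preferred_server_command("10.0.0.5@524"))
```

The sketch only prints the command, so the result can be reviewed before it is run as root.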

Alternatively, if the server does not have a local replica, it is also possible to add one or more alternative LDAP servers.

Adding a single alternative LDAP server is possible by executing this command as root:

namconfig set alternative-ldap-server-list=[ds-server01]

Adding more than one alternative LDAP server is possible using a comma-separated list, by executing the following command as root:

namconfig set alternative-ldap-server-list=[ds-server01],[ds-server02],[ds-server03]
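When the server list is kept in a script or inventory, the comma-separated form can be generated rather than typed. The following Python sketch does that. The function name is hypothetical and the validation rules (no empty names, no spaces, no embedded commas) are an assumption about what makes a safe list, not documented namconfig behaviour:

```python
def alternative_ldap_command(servers):
    """Build the namconfig command line for one or more alternative LDAP servers."""
    cleaned = [name.strip() for name in servers]
    if not cleaned:
        raise ValueError("at least one server is required")
    for name in cleaned:
        # Assumed sanity checks: a name must not be empty or break the list syntax.
        if not name or "," in name or any(ch.isspace() for ch in name):
            raise ValueError(f"invalid server name: {name!r}")
    return "namconfig set alternative-ldap-server-list=" + ",".join(cleaned)


if __name__ == "__main__":
    # Placeholder host names taken from the examples above.
    print(alternative_ldap_command(["ds-server01", "ds-server02", "ds-server03"]))
```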

Using alternative LDAP servers can also make LUM more scalable.

It is also recommended to turn persistent search off and cache-only on.

This can be accomplished by executing these commands as root:

namconfig set persistent-search=no

namconfig set cache-only=yes


Because namconfig cache_refresh includes a restart of namcd, there is no need to restart the daemon separately after a sequence of changes that ends with it.

For the other changes and tunings to take effect, it is recommended to restart namcd.

NSS:

Although Novell Storage Services must be tuned to the requirements of the environment in which it is used, these are some tunings worth considering:

– Increase the NSS IDCacheSize to 128K. This can be accomplished by executing as root:

nsscon /idcachesize=131072

– Disable the Access Time by executing the following line as root:

nsscon /noatime=[volume name]

– From OES2SP2 onwards the Unplugalways parameter has a default value set to “off”. If this is not the case, disable it by executing as root:

nsscon /nounplugalways

More details can be found in “UnplugAlways Command for the Read Queue” in “Cache Management Commands” of the “NSS Commands” section in the OES2015 Documentation.


To make these tunings persistent, they must be added to /etc/opt/novell/nss/nssstart.cfg, as they are not there by default. Take care not to make typos in this file, as they can cause novell-nss to fail to start.

Make sure that all unmarked entries in this file start with “/” and not “nss /”.
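Since a typo in this file can keep novell-nss from starting, a quick check before restarting can help. The following Python sketch flags entries that do not start with “/”. It assumes that “unmarked” means not commented out and that comment lines begin with “#”. Both are assumptions, and the function name is hypothetical:

```python
def lint_nssstart(text):
    """Return (line number, message) pairs for suspicious nssstart.cfg entries."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and lines starting with '#' are not entries.
        if not line or line.startswith("#"):
            continue
        if line.lower().startswith("nss "):
            problems.append((lineno, "entry starts with 'nss ', expected '/'"))
        elif not line.startswith("/"):
            problems.append((lineno, "entry does not start with '/'"))
    return problems


if __name__ == "__main__":
    sample = "/idcachesize=131072\nnss /nounplugalways\n"
    for lineno, message in lint_nssstart(sample):
        print(f"line {lineno}: {message}")
```

The check looks only at the leading “/”. It does not validate the switch names themselves.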

Using nssmu, launched as root, increase the “read ahead blocks” for all nss volumes from the default value of 16 to 64.

If over time this appears to be insufficient, this can be increased to 128. It is not really recommended to go beyond this value, as it may have a negative impact on the performance.

This change is activated “on the fly”, and does not require a server restart.

Further information regarding tuning NSS performance can be found in “Tuning NSS Performance” in the “NSS File System Administration Guide for Linux” of the OES2015 Documentation.

During the lifespan of an NSS volume it is recommended to preserve at least 10% to 20% free space. Purgeable disk space does not count as true free disk space.

Once a volume drops below these thresholds and insufficient true free disk space is available for the system to write data, a background process starts calculating which of the oldest deleted data can be purged to make space for the new data to be written to disk. When there are large amounts of purgeable data, nearly no true free space, and large amounts of data being written, this calculation process causes continuous I/O, performance degradation and other unwanted phenomena, up to data corruption or loss.

For this reason it is recommended for NSS volumes hosting services with a high usage of temporary files to mark either the folders used for temporary storage or the volume as “purge immediate”, basically disabling salvage for those directories or the volume.

When a volume passes these thresholds it is recommended to either expand the pool and volume or delete obsolete data. As a temporary measure, the volume can also be purged manually.

This can be accomplished by executing the following command as root:

ncpcon purge volume [volume name]

However, as storage has grown into the terabytes (TB) over time, keeping 20% free space might be more than is needed.

In case the NSS Pool size is in the TBs or larger, it might be worthwhile to consider lowering the PoolHighWaterMark and PoolLowWaterMark that are used for these large NSS Pools.

More details on this can be found in the “Salvage and Purge Commands” section of the OES2015 documentation.

After adjusting the PoolHighWaterMark and PoolLowWaterMark the behavior of the server will be the same when dropping below the set WaterMarks.
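The 10% to 20% free-space guideline above can be turned into a simple check. The Python sketch below classifies a volume by its percentage of free space. The function names and the "warning"/"critical" labels are hypothetical, and it is an assumption that the free space reported by the operating system is a usable stand-in here, since purgeable space may not be reflected in that figure the way NSS accounts for it:

```python
import shutil


def free_space_status(total_bytes, free_bytes, warn_pct=20.0, critical_pct=10.0):
    """Return (free percentage, status) for the given byte counts."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_pct = 100.0 * free_bytes / total_bytes
    if free_pct < critical_pct:
        status = "critical"
    elif free_pct < warn_pct:
        status = "warning"
    else:
        status = "ok"
    return free_pct, status


def check_path(path, warn_pct=20.0, critical_pct=10.0):
    """Classify the file system that holds 'path' using OS-reported figures."""
    usage = shutil.disk_usage(path)
    return free_space_status(usage.total, usage.free, warn_pct, critical_pct)


if __name__ == "__main__":
    # Example path only; point this at the mount point of the volume to check.
    pct, status = check_path("/")
    print(f"{pct:.1f}% free: {status}")
```

For very large pools the two percentages can be passed in lower, in line with the remark about lowering the watermarks.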

SMS (tsafs / backup)

In case you are suffering from slow backups or declining backup speed, the documentation for optimizing SMS on OES2015 can be found in “Optimizing SMS” in the “Storage Management Services Administration Guide for Linux” of the OES2015 Documentation.

NCS:

In a cluster environment, where a single OES2015 server might be hosting several NSS, NSSforAD, NCP and CIFS volumes at once, it is recommended to use lower cache values compared to a standalone OES2015 server.

The OES2015 cluster node must be able to handle the cache and memory requests when it is hosting all cluster resources at once.

All the previous steps should also address most issues seen with OES2015 servers running Novell Cluster Services and their cluster-enabled resources.

Nonetheless, if you are suffering from random split brains without a physical cause (LAN or SAN outage), it is advisable to investigate whether there are time jumps (backward and/or forward) caused by the CPU. Clock jumps can occur on both physical and virtual CPUs.

In some cases where Novell Cluster Services does not start properly at boot, it is recommended to alter /etc/sysconfig/boot so that it reads RUN_PARALLEL="no".

This allows the Novell services to start in their proper sequence, and is the default for OES2015.




PKI (Certificate Server):

During the installation of any OES server, a couple of default Server Certificates are generated.

These are by default valid for 2 years. After this period, when the certificates expire, the server and all servers that use it as LDAP source will no longer be able to access the required eDirectory information, and all services that rely on it will fail.

Therefore, when some or all services of a particular server fail, it is recommended to validate the Default Server Certificates as one of the troubleshooting steps. If the failing server is using a different server as preferred_server, or LDAP source, that particular server’s Default Certificate Material should be checked too.

The LDAP servers that the server might be using are listed under /etc/sysconfig/novell/ldap_servers/, though it is recommended to also check each service to determine which LDAP server it is using, as they may differ.

A method of validating the Default Certificates of a server is via iManager.

This can be accomplished under “Novell Certificate Access”, using the “Server Certificates” task.

If one or more certificates are deemed invalid, the next step would be to repair these.

This task can be accomplished under “Novell Certificate Server”, the “Repair Default Certificates” task.

More details on this can be found in TID 7000075 “SSL Certificates expire after two years, affecting OES Services”.
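Because the default certificates expire after two years, it can help to track how many days remain. The Python sketch below computes this from a certificate's expiry timestamp. It assumes the timestamp has already been read from the certificate by other means and is written in the OpenSSL-style "notAfter" text form. The function name is hypothetical, and this is not a replacement for the iManager validation described above:

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return the number of days (possibly negative) until 'not_after'.

    'not_after' uses the OpenSSL text form, e.g. 'Jan  5 09:34:43 2018 GMT'.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400.0


if __name__ == "__main__":
    # Example timestamp only; substitute the value from the real certificate.
    remaining = days_until_expiry("Jan  5 09:34:43 2018 GMT")
    print("expired" if remaining < 0 else f"{remaining:.0f} days left")
```

A negative result means the certificate has already expired.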


In case the LDAP server being used is a Novell NetWare server, these tasks can also be accomplished using pkidiag.nlm.


After the certificates have been replaced or repaired, it is recommended to at least restart ndsd (rcndsd restart) so that the server uses the new certificate for its LDAPS communication, and to run namconfig -k to update the certificate that namcd is using.


A more sustainable option might be to enable “Self-Provisioning” for the Certificate Authority (CA) of the tree.

This feature is described in the Novell Documentation.

Enabling this for the CA of the tree enables the feature for all servers in the tree that use this CA.

As soon as the server or its ndsd is restarted, it will be aware that “Server Self Provisioning” is enabled.

With this feature enabled, the certificates that are expired or about to be expired are extended automatically when executing either:

ndstrace

unload pkiserver

load pkiserver

or

rcndsd restart


Anti-Virus scanners:

When an Anti-Virus suite is installed, try to avoid the usage of “on access” scanning, as this can create a severe overhead.

The Anti-Virus suite, whether running locally or remotely, should be configured to exclude system-critical directories from scanning, such as the ._NETWARE directory of all NCP-exported filesystems (both NSS and POSIX), /_admin and the Linux system directories.

In case Novell Cluster Services is installed, /admin should also be excluded from being scanned by the anti-virus.

In case Novell GroupWise is installed, all repository and queue directories should be excluded from being scanned by the anti-virus as well. Novell GroupWise stores its information in an encrypted way, so scanning the physical files stored on disk is unnecessary and can even cause severe problems like file locking or even corruption.

There are several Anti-Virus suites that can scan inside the Novell GroupWise mail storage and / or incoming e-mail.

Upgrading OE on Unity that is configured for file services only

it depends on what you understand as “disrupt connectivity”

a reboot – no matter how fast – will always be kind of disruptive on the lower levels

and a client will have to at least re-establish the TCP connection

The question is more how much of that is visible to the client OS and application

NFS clients using default hard mounts will just see a pause in I/O but no error to applications

The OS and protocol stack will of course re-establish the connection, recover locks, ….

for CIFS clients it depends on the application and OS

Windows itself will automatically reconnect

cluster-aware applications that retry internally should be ok

simple applications like copying files via explorer.exe can stop and show a “Try again” dialog

For those applications that really require transparent failover – like SharePoint or Hyper-V over SMB shares – you can enable SMB CA (Continuous Availability) per share

then they will also just pause and resume I/O similar to NFS

See the NAS white paper and Microsoft details about CA in SMB3

Why don't you just try it?

all an upgrade is doing is an SP reboot – which you can easily do even from the GUI

If you don't want to use your hardware Unity, the Unity VSA will show the same behaviour

Nutanix AFS (Nutanix Files) might not function properly with the ELM

This information is very preliminary and has not been rigorously tested.

AFS appears to use DFS namespace redirection to point you to individual nodes in the AFS cluster where your data is actually held. The ELM does not support DFS redirection, so when the STATUS_PATH_NOT_COVERED comes back from the initial node we reached, we fail the attempt instead of moving to the requested server. If randomly you happen to connect to the node where your data is, there is no redirection and no error.

Unfortunately, there does not appear to be a workaround except to point the ELM to a specific node in the AFS cluster instead of the main cluster address. This node probably has to be the AFS “leader” node.

Networker 9.1 DDOS 5.7.x > RHEL Clients 8.x 9.x>Stale Sessions

Hi

We have a NetWorker server on Windows, a DD9500 running DDOS 5.7.x, and RHEL clients 8.x and 9.x that do not cancel sessions on the client side or on the Data Domain when a workflow action timeout occurs on a client due to a stale NFS mount.

The issue is that 350+ RHEL clients all use the same NFS mount from one NFS server. When the NFS server crashed, the mount points on all the RHEL NFS clients went stale and the backups ran into the timeout, but the sessions were not cancelled on the Data Domain, so the maximum session limit on the DD was reached.

The only way to detect the actual session count is netstat -an in SE mode and lsof -i TCP:20149 in BASH mode.
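For anyone who wants to tally those sessions per client, here is a rough Python sketch that counts established connections on a given local port from saved netstat -an output. It assumes the common Linux-style column layout (protocol, two queue columns, local address, foreign address, state), which may not match the output on every system. The function name is hypothetical:

```python
from collections import Counter


def count_sessions(netstat_output, port, state="ESTABLISHED"):
    """Count connections per remote host whose local port is 'port'."""
    per_peer = Counter()
    suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, conn_state = fields[3], fields[4], fields[5]
        if conn_state == state and local.endswith(suffix):
            per_peer[remote.rsplit(":", 1)[0]] += 1
    return per_peer


if __name__ == "__main__":
    import sys

    # Usage: pipe saved 'netstat -an' output into this script.
    counts = count_sessions(sys.stdin.read(), 20149)
    for peer, sessions in counts.most_common():
        print(peer, sessions)
    print("total", sum(counts.values()))
```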

Anyone else see this on RHEL clients? Currently have support investigating but wanted to get a feel for this in the community

Re: Re: unity migration vdm,usermapper,multiprotocol questions

castleknock wrote:

This differs from VNX behaviour as secmap created a local UID reference to ‘hide’ the lack of a unix account rather than simply deny SMB access. Is this a correct read? If so, it explains the lack of any references to secmap import during VNX migration.

the difference isn't in secmap

secmap is not a mapping method – it's merely a cache so that we don't have to make repeated calls to an external mapping source, which can take time

The difference is with usermapper

usermapper was only ever meant as a mapping method for CIFS-only file systems, but on VNX/Celerra this wasn't enforced.

The manuals told you clearly to disable usermapper if you are doing multi-protocol, but many customers didn't do that – either because they didn't know or out of convenience

So they were using a config where some users were mapped through AD/NIS/ntxmap and the ones that couldn't be mapped got a uid from usermapper

In Unity we improved this:

usermapper is per NAS server – and not globally per data mover

by default usermapper is disabled for multi-protocol NAS servers

instead we added options for a default Unix/Windows user that gets used if AD/NIS/ntxmap are unable to map the user – which didn't exist in VNX/Celerra

So if you use the default on a multi-protocol NAS server and we cannot map a user then access is denied

You can then either:

– make sure this user is covered by the mapping sources

– configure the default Unix user

– enable automatic user mapping (usermapper)

this is explained in detail with flowcharts in the multi-protocol manual that I mentioned

keep in mind though that just enabling usermapper like on VNX is convenient but it also makes changes and troubleshooting more difficult

This is because secmap entries never expire or get updated

For example, if a user connects to a NAS server before you have configured their account in the AD/NIS/ntxmap mappings, they will get a UID from usermapper

Then if the admin later adds the account to AD/NIS/ntxmap, this account will still use the uid from usermapper on this NAS server, but on a new NAS server it will get the uid from the mapping source

Also, since usermapper is now per NAS server, the same user will get different uids on different NAS servers
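To make the behaviour above concrete, here is a toy Python model of it. This is not Unity code. The class, the fallback names and the uid ranges are all hypothetical, the default user and usermapper are treated as alternative fallbacks, and it is an assumption that every resolved mapping lands in the secmap cache:

```python
class NasServer:
    """Toy model of user mapping on one multi-protocol NAS server."""

    def __init__(self, fallback="deny", default_unix_uid=None, first_auto_uid=100000):
        if fallback not in ("deny", "default", "usermapper"):
            raise ValueError(f"unknown fallback: {fallback!r}")
        self.fallback = fallback
        self.default_unix_uid = default_unix_uid
        self.secmap = {}                       # cache: entries never expire or update
        self._next_auto_uid = first_auto_uid   # hypothetical usermapper uid range

    def resolve(self, user, mapping_source):
        """Return a uid for 'user', or None when access would be denied."""
        if user in self.secmap:                # a cached entry always wins
            return self.secmap[user]
        uid = mapping_source.get(user)         # stands in for AD/NIS/ntxmap
        if uid is None and self.fallback == "default":
            uid = self.default_unix_uid
        elif uid is None and self.fallback == "usermapper":
            uid = self._next_auto_uid
            self._next_auto_uid += 1
        if uid is not None:
            self.secmap[user] = uid
        return uid


if __name__ == "__main__":
    source = {}
    old_nas = NasServer("usermapper", first_auto_uid=100000)
    print(old_nas.resolve("alice", source))    # uid handed out by usermapper
    source["alice"] = 1234                     # admin adds the account later
    print(old_nas.resolve("alice", source))    # still the cached usermapper uid
    new_nas = NasServer("usermapper", first_auto_uid=200000)
    print(new_nas.resolve("alice", source))    # uid from the mapping source
```

The last three calls replay the example: the old NAS server keeps the usermapper uid, while a new one picks up the uid from the mapping source.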

bottom line – if you want full multi-protocol then use a deterministic mapping method and not usermapper
