Dell EMC Isilon Enhancements Embrace Cloud, Support Kubernetes and Reduce Storage Footprint

What a month this has been for Dell EMC Isilon! Not only are we announcing some pretty powerful innovations, we also won a Technology & Engineering Emmy® Award this month. Isilon and OneFS have made a powerful impact on the media and entertainment industry and we have been able to empower organizations to take control of their unstructured data and drive change. According to Gartner, “By 2024 enterprises will triple their unstructured data stored as file or object storage from what they have in 2019.”* As data continues to grow at this unrelenting pace, it is … READ MORE

Related:

Isilon Gen6: Addressing Generation 6 Battery Backup Unit (BBU) Test Failures[2]

Article Number: 518165 Article Version: 7 Article Type: Break Fix



Isilon Gen6, Isilon H400, Isilon H500, Isilon H600, Isilon A100, Isilon A2000, Isilon F800

Gen6 nodes may report spurious Battery Backup Unit (BBU) failures similar to the following:

Battery Test Failure: Replace the battery backup unit in chassis <serial number> slot <number> as soon as possible.

Issues were identified with both the OneFS battery test code and the battery charge controller (bcc) firmware that can cause these spurious errors to be reported.

The underlying causes of most spurious battery test failures have been resolved in OneFS 8.1.0.4 and newer and in Node Firmware Package 10.1.6 and newer (DEbcc/EPbcc v00.71). Please upgrade to these software versions, in that order, as soon as possible. The following steps are required to perform these upgrades and resolve this issue:

Step 1: Check the BBU logs for a "Persistent fault" message, which indicates a test failure state that cannot be cleared in the field. Run the following command on the affected node:

# isi_hwmon -b | grep "Battery 1 Status"


If the battery reports a Persistent Fault condition, gather and upload logs using the isi_gather_info command, then contact EMC Isilon Technical Support and reference this KB.

Step 2: Clear the erroneous battery test result by running the following commands:

# isi services isi_hwmon disable

# mv /var/log/nvram.xml /var/log/nvram.xml.old

Step 3: Clear the battery test alert and unset the node read-only state so the upgrade can proceed:

– Check ‘isi event events list’ to get the event ID for the HW_INFINITY_BATTERY_BACKUP_FAULT event. Then run the following commands:

# isi event modify <eventid> --resolved true

# /usr/bin/isi_hwtools/isi_read_only --unset=system-status-not-good

Step 4: Upgrade OneFS to 8.1.0.4 or later

Instructions for upgrading OneFS can be found in the OneFS Release Notes on the support.emc.com web site.

Step 5: Update node firmware using Node Firmware Package 10.1.6 or later

Instructions for upgrading node firmware can be found in the Node Firmware Package Release Notes on the support.emc.com web site.

Once the system is upgraded, no further spurious battery replacement alerts should occur.

If a OneFS upgrade to 8.1.0.4 or newer is not an option at this time, or if the system generates further battery failure alerts after upgrading, please contact EMC Isilon Technical Support for assistance and reference this KB.

Related:

Isilon OneFS: Node compatibility class create fails when not all drives are HEALTHY

Article Number: 504582 Article Version: 3 Article Type: Break Fix



Isilon OneFS 8.1, Isilon OneFS 8.0, Isilon OneFS

Creating a node compatibility class fails if not all drives are HEALTHY and causes the process isi_smartpools_d to fail to start. That results in the event:

Process isi_smartpools_d of service isi_smartpools_d has failed to restart after multiple attempts

The output of 'isi status -p' will contain the following:

Diskpool status temporarily unavailable.

The following error is logged in /var/log/messages:

2017-09-08T11:30:59-06:00 <1.4> for-isi-b-1 isi_smartpools_d[5415]: Exception: : Traceback (most recent call last):
  File "/usr/bin/isi_smartpools_d", line 287, in <module>
    main()
  File "/usr/bin/isi_smartpools_d", line 80, in main
    run_as_daemon()
  File "/usr/bin/isi_smartpools_d", line 89, in run_as_daemon
    run_uncaught()
  File "/usr/bin/isi_smartpools_d", line 118, in run_uncaught
    conform_diskpool_db_to_drive_purpose()
  File "/usr/bin/isi_smartpools_d", line 163, in conform_diskpool_db_to_drive_purpose
    needs_write = dp_cfg.conform_provisioning_to_node_types(fp_cfg)
  File "/usr/local/lib/python2.6/site-packages/isi/smartpools/diskpools.py", line 1200, in conform_provisioning_to_node_types
  File "/usr/local/lib/python2.6/site-packages/isi/smartpools/diskpools.py", line 1335, in conform_diskpools_to_storage_units
  File "/usr/local/lib/python2.6/site-packages/isi/smartpools/diskpools.py", line 1094, in drive_to_storage_unit
AssertionError

A missing drive will cause the disk pool database to fail to update as OneFS is unable to allocate that bay to a disk pool.

  • Replace any drives in bays in REPLACE status and make sure all bays in the cluster are HEALTHY. Once all drives are HEALTHY the node compatibility class can be created successfully.
  • After creating the node compatibility class make sure the ‘Diskpool status temporarily unavailable’ message is no longer in the output of:

# isi status -p

  • Verify storagepool health and compatible nodes are now in the correct pools by running:

# isi storagepool health -v

Related:

Isilon OneFS: Cannot enable ESRS when legacy ESRS setup is enabled

Article Number: 504579 Article Version: 4 Article Type: Break Fix



Isilon OneFS 8.1.0, Isilon OneFS 8.1

ESRS shows as NOT enabled in WebUI but shows Enabled in CLI

The WebUI shows an error that the version of ESRS is not supported.

The WebUI shows the error about the ESRS version not being supported because the customer has the "legacy" configuration set up rather than the new configuration.

dsgsc1-1# isi remotesupport connectemc view

Enabled: Yes

Primary Esrs Gateway: 10.64.xxx.xxx

Secondary Esrs Gateway: –

Use SMTP Failover: No

Email Customer On Failure: No

Gateway Access Pools: subnet0:pool0

dsgsc1-1# isi esrs view

Enabled: No

Primary ESRS Gateway: 10.64.xxx.xxx

Secondary ESRS Gateway: –

Alert on Disconnect: Yes

Gateway Access Pools: subnet0.pool0, subnet1.pool0, subnet1.SyncIQ-pool

Gateway Connectivity Check Period: 3600

License Usage Intelligence Reporting Period: 86400

Gateway Connectivity Status: Disconnected

The customer first needs to set up and license the new configuration, then disable the legacy configuration and confirm that the error goes away.

See KBs https://support.emc.com/kb/511053 and https://support.emc.com/kb/511087 for further troubleshooting information.

For an installation of a new Isilon Gen6, first set up the new configuration:

isi esrs modify --enabled 1

EMC username and password are required to enable ESRS

Then disable the legacy configuration:

isi remotesupport connectemc modify --enabled=false

Related:


Isilon: Error Invalid User name and Password, trying to Enable ESRS on Gen6 OneFS v8.1.0.0 (User Correctable)

Article Number: 504577 Article Version: 3 Article Type: Break Fix



Isilon OneFS, Isilon OneFS 8.1

ESRS shows as NOT enabled in WebUI but shows Enabled in CLI

The WebUI shows an error that the version of ESRS is not supported.

The WebUI shows the error about the ESRS version not being supported because the customer has the "legacy" configuration set up rather than the new configuration.

dsgsc1-1# isi remotesupport connectemc view

Enabled: Yes

Primary Esrs Gateway: 10.64.xxx.xxx

Secondary Esrs Gateway: –

Use SMTP Failover: No

Email Customer On Failure: No

Gateway Access Pools: subnet0:pool0

dsgsc1-1# isi esrs view

Enabled: No

Primary ESRS Gateway: 10.64.xxx.xxx

Secondary ESRS Gateway: –

Alert on Disconnect: Yes

Gateway Access Pools: subnet0.pool0, subnet1.pool0, subnet1.SyncIQ-pool

Gateway Connectivity Check Period: 3600

License Usage Intelligence Reporting Period: 86400

Gateway Connectivity Status: Disconnected

The customer first needs to set up the new configuration, then disable the legacy configuration and confirm that the error is gone.

For an installation of a new Isilon Gen6, set up the new configuration:

isi esrs modify --enabled 1

EMC username and password are required to enable ESRS

Related:

OneFS Shadow Stores

The recent series of articles on SmartDedupe have generated several questions from the field around shadow stores. So this seemed like an ideal topic to explore in a bit more depth over the course of the next couple of articles.

A shadow store is a class of system file that contains blocks which can be referenced by different files, thereby providing a mechanism that allows multiple files to share common data. Shadow stores were first introduced in OneFS 7.0, initially supporting Isilon file clones, and indeed there are many overlaps between cloning and deduplicating files. As we will see, a variant of the shadow store is also used as a container for file packing in OneFS SFSE (Small File Storage Efficiency), often used in archive workflows such as healthcare’s PACS.

Architecturally, each shadow store can contain up to 256 blocks, with each block able to be referenced by 32,000 files. If this reference limit is exceeded, a new shadow store is created. Additionally, shadow stores do not reference other shadow stores. All blocks within a shadow store must be either sparse or point at an actual data block. And snapshots of shadow stores are not allowed, since shadow stores have no hard links.
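
To put those limits in perspective, here is a quick back-of-the-envelope calculation (a rough sketch only, assuming the 8KB OneFS block size referenced later in this article):

# Rough sizing sketch for a single shadow store, assuming 8KB OneFS blocks.
BLOCK_SIZE_KB = 8
MAX_BLOCKS_PER_STORE = 256      # blocks per shadow store
MAX_REFS_PER_BLOCK = 32000      # files that can reference each block

unique_data_kb = BLOCK_SIZE_KB * MAX_BLOCKS_PER_STORE
print("Unique data per shadow store: %d KB (%.1f MB)" % (unique_data_kb, unique_data_kb / 1024.0))
print("Maximum block references per store: %d" % (MAX_BLOCKS_PER_STORE * MAX_REFS_PER_BLOCK))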

Shadow stores contain the physical addresses and protection for data blocks, just like normal file data. However, a fundamental difference between a shadow store and a regular file is that the former doesn’t contain all the metadata typically associated with traditional file inodes. In particular, time-based attributes (creation time, modification time, etc.) are explicitly not maintained.

Consider the shadow store information for a regular, undeduped file (file.orig):

# isi get -DDD file.orig | grep -i shadow

* Shadow refs: 0

zero=36 shadow=0 ditto=0 prealloc=0 block=28

A second copy of this file (file.dup) is created and then deduplicated:

# isi get -DDD file.* | grep -i shadow

* Shadow refs: 28

zero=36 shadow=28 ditto=0 prealloc=0 block=0

* Shadow refs: 28

zero=36 shadow=28 ditto=0 prealloc=0 block=0

As we can see, the block count of the original file has now become zero and the shadow count for both the original file and its copy is incremented to '28'. Additionally, if another file copy is added and deduplicated, the same shadow store info and count is reported for all three files. It’s worth noting that even if the duplicate file(s) are removed, the original file will still retain the shadow store layout.

Each shadow store has a unique identifier called a shadow inode number (SIN). But, before we get into more detail, here’s a table of useful terms and their descriptions:

  • Inode: Data structure that keeps track of all data and metadata (attributes, metatree blocks, etc.) for files and directories in OneFS.
  • LIN: Logical Inode Number; uniquely identifies each regular file in the filesystem.
  • LBN: Logical Block Number; identifies the block offset for each block in a file.
  • IFM Tree or Metatree: Encapsulates the on-disk and in-memory format of the inode. File data blocks are indexed by LBN in the IFM B-tree, or file metatree. This B-tree stores protection group (PG) records keyed by the first LBN. To retrieve the record for a particular LBN, the first key before the requested LBN is read. The retrieved record may or may not contain actual data block pointers.
  • IDI: Isi Data Integrity checksum. IDI checkcodes help avoid data integrity issues which can occur when hardware returns the wrong data, for example. IDI is therefore focused on the path to and from the drive, and checkcodes are implemented per OneFS block.
  • Protection Group (PG): Encompasses the data and redundancy associated with a particular region of file data. The file data space is broken up into sections of 16 x 8KB blocks called stripe units. These correspond to the N in N+M notation; there are N+M stripe units in a protection group.
  • Protection Group Record: Record containing block addresses for a data stripe. There are five types of PG records: sparse, ditto, classic, shadow, and mixed. The IFM B-tree uses the B-tree flag bits, the record size, and an inline field to identify the five types of records.
  • BSIN: Base Shadow Store, containing cloned or deduped data.
  • CSIN: Container Shadow Store, containing packed data (containerized files).
  • SIN: Shadow Inode Number; a LIN for a shadow store, containing blocks that are referenced by different files.
  • Shadow Extent: Contains a Shadow Inode Number (SIN), an offset, and a count. Shadow extents are not included in the FEC calculation since protection is provided by the shadow store.

Blocks in a shadow store are identified with a SIN and LBN (logical block number).

# isi get -DD /ifs/data/file.dup | fgrep -A 4 -i "protection group"

PROTECTION GROUPS

lbn 0: 4+2/2

4000:0001:0067:0009@0#64

0,0,0:8192#32

A SIN is essentially a LIN that is dedicated to a shadow store file, and SINs are allocated from a subset of the LIN range. Just as every standard file is uniquely identified by a LIN, every shadow store is uniquely identified by a SIN. It is easy to tell if you are dealing with a shadow store because the SIN will begin with 4000. For example, in the output above:

4000:0001:0067:0009

Correspondingly, in the protection group (PG) they are represented as:

  • SIN
  • Block size
  • LBN
  • Run
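
As an illustration of that notation, here is a minimal Python parsing sketch. It assumes the shadow extent in the output above reads as SIN@offset#run (for example 4000:0001:0067:0009@0#64); the field names are interpretive rather than an official format specification.

# Hypothetical parser for a shadow extent as printed by 'isi get -DD',
# assuming the fields read as SIN@offset#run.
def parse_shadow_extent(extent):
    sin, rest = extent.split("@", 1)
    offset, run = rest.split("#", 1)
    return {"sin": sin, "offset": int(offset), "run": int(run),
            "is_shadow": sin.startswith("4000")}

print(parse_shadow_extent("4000:0001:0067:0009@0#64"))
# {'sin': '4000:0001:0067:0009', 'offset': 0, 'run': 64, 'is_shadow': True}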

The referencing protection group will not contain valid IDI data (the IDI data is kept with the shadow store file itself). FEC parity, if required, will be computed assuming a zero block.

When a file references data in a shadow store, it contains meta-tree records that point to the shadow store. This meta-tree record contains a shadow reference, which comprises a SIN and LBN pair that uniquely identifies a block in a shadow store.

A set of extension blocks within the shadow store holds the reference count for each shadow store data block. The reference count for a block is adjusted each time a reference is created or deleted from any other file to that block. If a shadow store block’s reference count drops to zero, it is marked as deleted, and the ShadowStoreDelete job, which runs periodically, deallocates the block.
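
Conceptually, the reference counting behaves along these lines. This is a simplified model for illustration only, not OneFS code; the real bookkeeping lives in the shadow store's extension blocks and the ShadowStoreDelete job.

# Toy model of shadow store block reference counting.
class ShadowStoreBlock(object):
    def __init__(self):
        self.refcount = 0
        self.deleted = False

    def add_reference(self):          # a file starts pointing at this block
        self.refcount += 1

    def drop_reference(self):         # a referencing file is removed or rewritten
        self.refcount -= 1
        if self.refcount == 0:
            self.deleted = True       # marked deleted; a periodic job reclaims it

block = ShadowStoreBlock()
block.add_reference()                 # original file
block.add_reference()                 # deduped copy
block.drop_reference()
block.drop_reference()
print(block.deleted)                  # True - eligible for deallocation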

Be aware that shadow stores are not directly exposed in the filesystem namespace. However, shadow stores and relevant statistics can be viewed using the ‘isi dedupe stats’, ‘isi_sstore list’ and ‘isi_sstore stats’ command line utilities.

Cloning

In OneFS, files can easily be cloned using the 'cp -c' command line utility. Shadow store(s) are created during the file cloning process, where the ownership of the data blocks is transferred from the source to the shadow store.

shadow_store_1.png



In some instances, data may be copied directly from the source to the newly created shadow stores. Cloning uses logical references to the shadow store: the source file’s protection group(s) are moved to a shadow store, and the PG is then referenced by both the source file and the destination clone file. After cloning, both the source and the destination data blocks refer to an offset in a shadow store.

Dedupe

As we have seen in the recent blog articles, shadow stores are also used for SmartDedupe. The principal difference with dedupe, as compared to cloning, is the process by which duplicate blocks are detected.

shadow_store_2.png

The deduplication job also has to spend more effort to ensure that contiguous file blocks are generally stored in adjacent blocks in the shadow store. If not, both read and degraded read performance may be impacted.

Small File Storage Efficiency

A class of specialized shadow stores is also used as containers for storage efficiency, allowing small files to be packed into larger structures that can be FEC protected.

shadow_store_3.png

These shadow stores differ from regular shadow stores in that they are deployed as single-reference stores. Additionally, container shadow stores are also optimized to isolate fragmentation, support tiering, and live in a separate subset of ID space from regular shadow stores.

SIN Cache

OneFS provides a SIN cache, which helps facilitate shadow store allocations. It provides a mechanism to create a shadow store on demand when required, and then cache that shadow store in memory on the local node so that it can be shared with subsequent allocators. The SIN cache segregates stores by disk pool, protection policy and whether or not the store is a container.

Related:

OneFS SmartDedupe: Performance

Deduplication is a compromise: In order to gain increased levels of storage efficiency, additional cluster resources (CPU, memory and disk IO) are utilized to find and execute the sharing of common data blocks.

Another important performance impact consideration with dedupe is the potential for data fragmentation. After deduplication, files that previously enjoyed contiguous on-disk layout will often have chunks spread across less optimal file system regions. This can lead to slightly increased latencies when accessing these files directly from disk, rather than from cache. To help reduce this risk, SmartDedupe will not share blocks across node pools or data tiers, and will not attempt to deduplicate files smaller than 32KB in size. On the other end of the spectrum, the largest contiguous region that will be matched is 4MB.

Because deduplication is a data efficiency product rather than a performance-enhancing tool, in most cases the consideration will be around managing cluster impact. This applies both to client data access performance, since, by design, multiple files will be sharing common data blocks, and to dedupe job execution, since additional cluster resources are consumed to detect and share commonality.

The first deduplication job run will often take a substantial amount of time to run, since it must scan all files under the specified directories to generate the initial index and then create the appropriate shadow stores. However, deduplication job performance will typically improve significantly on the second and subsequent job runs (incrementals), once the initial index and the bulk of the shadow stores have already been created.

If incremental deduplication jobs do take a long time to complete, this is most likely indicative of a data set with a high rate of change. If a deduplication job is paused or interrupted, it will automatically resume the scanning process from where it left off. The SmartDedupe job is a long running process that involves multiple job phases that are run iteratively. In its default, low impact configuration, SmartDedupe typically processes around 1TB or so of data per day, per node.
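
As a rough planning aid, that per-node rate can be turned into a simple runtime estimate. The sketch below is illustrative only; actual throughput depends on the impact policy, data layout and overall cluster load.

# Back-of-the-envelope estimate for an initial SmartDedupe pass.
def estimate_days(dataset_tb, node_count, tb_per_node_per_day=1.0):
    return dataset_tb / (node_count * tb_per_node_per_day)

# e.g. 120TB of candidate data on a 6-node cluster at the default low-impact rate
print("~%.1f days" % estimate_days(120, 6))   # ~20.0 days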

Deduplication can significantly increase the storage efficiency of data. However, the actual space savings will vary depending on the specific attributes of the data itself. As mentioned above, the deduplication assessment job can be run to help predict the likely space savings that deduplication would provide on a given data set.

For example, virtual machines files often contain duplicate data, much of which is rarely modified. Deduplicating similar OS type virtual machine images (VMware VMDK files, etc, that have been block-aligned) can significantly decrease the amount of storage space consumed. However, as noted previously, the potential for performance degradation as a result of block sharing and fragmentation should be carefully considered first.

Isilon SmartDedupe does not deduplicate across files that have different protection settings. For example, if two files share blocks, but file1 is parity protected at +2:1, and file2 has its protection set at +3, SmartDedupe will not attempt to deduplicate them. This ensures that all files and their constituent blocks are protected as configured. Additionally, SmartDedupe won’t deduplicate files that are stored on different node pools. For example, if file1 and file2 are stored on tier 1 and tier 2 respectively, and tier1 and tier2 are both protected at 2:1, OneFS won’t deduplicate them. This helps guard against performance asynchronicity, where some of a file’s blocks could live on a different tier, or class of storage, than others.
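
Those constraints can be summarized as a simple eligibility test. The sketch below is illustrative only (the attribute names are hypothetical, not an OneFS API), and it also folds in the 32KB minimum file size mentioned earlier.

# Illustrative check of whether two files are SmartDedupe candidates.
# The attribute names here are hypothetical, not an OneFS API.
def can_dedupe(file1, file2, min_size_kb=32):
    if file1["protection"] != file2["protection"]:
        return False                      # different protection policies
    if file1["node_pool"] != file2["node_pool"]:
        return False                      # different tiers / node pools
    if min(file1["size_kb"], file2["size_kb"]) < min_size_kb:
        return False                      # too small to be worth sharing
    return True

f1 = {"protection": "+2d:1n", "node_pool": "tier1", "size_kb": 4096}
f2 = {"protection": "+3n",    "node_pool": "tier1", "size_kb": 4096}
print(can_dedupe(f1, f2))   # False - protection levels differ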

OneFS 8.0.1 introduced performance resource management, which provides statistics for the resources used by jobs – both cluster-wide and per-node. This information is provided via the ‘isi statistics workload’ CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.

For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:



# isi statistics workload --limit 5 --format=top

last update: 2019-01-23T16:45:25 (s)ort: default

CPU Reads Writes L2 L3 Node SystemName JobType

  1. 1.4s 9.1k 0.0 3.5k 497.0 2 Job: 237 IntegrityScan[0]
  2. 1.2s 85.7 714.7 4.9k 0.0 1 Job: 238 Dedupe[0]
  3. 1.2s 9.5k 0.0 3.5k 48.5 1 Job: 237 IntegrityScan[0]
  4. 1.2s 7.4k 541.3 4.9k 0.0 3 Job: 238 Dedupe[0]
  5. 1.1s 7.9k 0.0 3.5k 41.6 2 Job: 237 IntegrityScan[0]

From the output, we can see that two job engine jobs are in progress: Dedupe (job ID 238), which runs at low impact and priority level 4, is contending with IntegrityScan (job ID 237), which runs by default at medium impact and priority level 1.

The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 & L3 cache hits. Unlike the output from the ‘top’ command, this makes it easier to diagnose individual job resource issues, etc.

Below are some examples of typical space reclamation levels that have been achieved with SmartDedupe.

Be aware that these dedupe space savings values are provided solely as rough guidance. Since no two data sets are alike (unless they’re replicated), actual results can vary considerably from these examples.

  • Virtual Machine Data: 35%
  • Home Directories / File Shares: 25%
  • Email Archive: 20%
  • Engineering Source Code: 15%
  • Media Files: 10%

SmartDedupe is included as a core component of Isilon OneFS but requires a valid product license key in order to activate. This license key can be purchased through your Isilon account team. An unlicensed cluster will show a SmartDedupe warning until a valid product license has been purchased and applied to the cluster.

License keys can be easily added via the ‘Activate License’ section of the OneFS WebUI, accessed by navigating via Cluster Management > Licensing.

For optimal cluster performance, observing the following SmartDedupe best practices is recommended.

  • Deduplication is most effective when applied to data sets with a low rate of change – for example, archived data.
  • Enable SmartDedupe to run at subdirectory level(s) below /ifs.
  • Avoid adding more than ten subdirectory paths to the SmartDedupe configuration policy.
  • SmartDedupe is ideal for home directories, departmental file shares and warm and cold archive data sets.
  • Run SmartDedupe against a smaller sample data set first to evaluate performance impact versus space efficiency.
  • Schedule deduplication to run during the cluster’s low usage hours – i.e. overnight, weekends, etc.
  • After the initial dedupe job has completed, schedule incremental dedupe jobs to run every two weeks or so, depending on the size and rate of change of the dataset.
  • Always run SmartDedupe with the default ‘low’ impact Job Engine policy.
  • Run the dedupe assessment job on a single root directory at a time. If multiple directory paths are assessed in the same job, you will not be able to determine which directory should be deduplicated.
  • When replicating deduplicated data, to avoid running out of space on the target it is important to verify that the logical data size (i.e. the amount of storage space saved plus the actual storage space consumed) does not exceed the total available space on the target cluster (see the sketch after this list).
  • Run a deduplication job on an appropriate data set prior to enabling a snapshots schedule.
  • Where possible, perform any snapshot restores (reverts) before running a deduplication job. And run a dedupe job directly after restoring a prior snapshot version.
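
A minimal sketch of the pre-replication space check mentioned in the list above is shown below. The figures and the helper are purely illustrative placeholders; in practice the saved and consumed values come from the cluster's dedupe and storagepool reporting.

# Sketch: confirm the logical (re-inflated) size of a deduped dataset
# fits on the replication target. Figures are placeholders.
def fits_on_target(space_saved_tb, space_consumed_tb, target_free_tb):
    logical_size_tb = space_saved_tb + space_consumed_tb
    return logical_size_tb <= target_free_tb, logical_size_tb

ok, logical = fits_on_target(space_saved_tb=40, space_consumed_tb=160, target_free_tb=180)
print("logical size: %dTB, fits: %s" % (logical, ok))   # logical size: 200TB, fits: False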

With dedupe, there’s always trade-off between cluster resource consumption (CPU, memory, disk), the potential for data fragmentation and the benefit of increased space efficiency. Therefore, SmartDedupe is not ideally suited for heavily trafficked data, or high performance workloads.

  • Depending on an application’s I/O profile and the effect of deduplication on the data layout, read and write performance and overall space savings can vary considerably.
  • SmartDedupe will not permit block sharing across different hardware types or node pools to reduce the risk of performance asymmetry.
  • SmartDedupe will not share blocks across files with different protection policies applied.
  • OneFS metadata, including the deduplication index, is not deduplicated.
  • Deduplication is a long running process that involves multiple job phases that are run iteratively.
  • Dedupe job performance will typically improve significantly on the second and subsequent job runs, once the initial index and the bulk of the shadow stores have already been created.
  • SmartDedupe will not deduplicate the data stored in a snapshot. However, snapshots can certainly be created of deduplicated data.
  • If deduplication is enabled on a cluster that already has a significant amount of data stored in snapshots, it will take time before the snapshot data is affected by deduplication. Newly created snapshots will contain deduplicated data, but older snapshots will not.

SmartDedupe is just one of several components of OneFS that enable Isilon to deliver a very high level of raw disk utilization. Another major storage efficiency attribute is the way that Isilon natively manages data protection in the file system. Because OneFS protects data at the file level using software-based erasure coding, this translates to raw disk space utilization levels in the 85% range or higher. SmartDedupe serves to further extend this storage efficiency headroom, bringing an even more compelling and demonstrable TCO advantage to primary file-based storage.

Related:

OneFS: 7.2.1.0 – 7.2.1.5: Cannot perform a snapshot restore of MS Office file from Previous tab. Error: Write-Protected

Article Number: 503048 Article Version: 3 Article Type: Break Fix



Isilon OneFS 7.2, Isilon NL-Series, Isilon 108NL, Isilon NL400

OneFS: 7.2.1.0 – 7.2.1.5: When attempting to restore an MS Office file (e.g. Excel or Word) from a snapshot using the 'Previous' tab, the error Write-Protected is returned. You are unable to restore from the Isilon snapshot as a result.


Cannot perform a snapshot restore of MS-office file from Previous Tab. Error: Write-Protected

This issue is addressed in OneFS version 7.2.1.6. The bug number is 187005.

We recommend you upgrade to 7.2.1.6 to avoid this bug in the future.

Workaround

Currently the only workaround we can find is to log into an Isilon node directly as the root user and copy the files/folders from the .snapshot directory into the folder of your choice. You will need to ensure the snapshot folder is visible, as set in your snapshot settings.

For example:

Folder to restore is: /ifs/.snapshot/DataDailyBackup_35_2017-07-25-_22-30/data/FolderA

Isilon1-1# cd /ifs/.snapshot/DataDailyBackup_35_2017-07-25-_22-30/data

Isilon1-1# pwd

/ifs/.snapshot/DataDailyBackup_35_2017-07-25-_22-30/data

njsgehisl01-1 # cp -Rp FolderA /ifs/data

FolderA folder has been copied from .snapshot to the current location.

Related:

OneFS and Synchronous Writes

The last article, on multi-threaded I/O, generated several questions about synchronous writes in OneFS, so this seemed like a useful topic to kick off the New Year and explore in a bit more detail.

OneFS natively provides a caching mechanism for synchronous writes – or writes that require a stable write acknowledgement to be returned to a client. This functionality is known as the Endurant Cache, or EC.

The EC operates in conjunction with the OneFS write cache, or coalescer, to ingest, protect and aggregate small, synchronous NFS writes. The incoming write blocks are staged to NVRAM, ensuring the integrity of the write, even during the unlikely event of a node’s power loss. Furthermore, EC also creates multiple mirrored copies of the data, further guaranteeing protection from single node and, if desired, multiple node failures.

EC improves the latency associated with synchronous writes by reducing the time to acknowledgement back to the client. This process removes the Read-Modify-Write (R-M-W) operations from the acknowledgement latency path, while also leveraging the coalescer to optimize writes to disk. EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file. Plus, the design of EC ensures that the cached writes do not impact snapshot performance.

The endurant cache uses write logging to combine and protect small writes at random offsets into 8KB linear writes. To achieve this, the writes go to special mirrored files, or ‘Logstores’. The response to a stable write request can be sent once the data is committed to the logstore. Logstores can be written to by several threads from the same node, and are highly optimized to enable low-latency concurrent writes.
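
Conceptually, the log-writer behaves something like the following sketch, which pads small random-offset writes into fixed 8KB records and appends them to each logstore mirror before acknowledging the client. This is a simplified model for illustration, not the OneFS implementation.

# Toy model of endurant cache write logging: small synchronous writes are
# packed into 8KB records and appended to every logstore mirror before
# the ACK is returned to the client.
LOG_BLOCK = 8192

def log_stable_write(offset, data, mirrors):
    record = (offset, data + b"\x00" * (LOG_BLOCK - len(data)))   # pad to 8KB
    for logstore in mirrors:          # mirrored copies protect against node loss
        logstore.append(record)
    return "ACK"                      # stable write acknowledged to the client

mirror_a, mirror_b = [], []
print(log_stable_write(12288, b"4KB-of-app-data", [mirror_a, mirror_b]))
print(len(mirror_a), len(mirror_b))   # 1 1 - one 8KB record per mirror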

Note that if a write uses the EC, the coalescer must also be used. If the coalescer is disabled on a file, but EC is enabled, the coalescer will still be active with all data backed by the EC.

So what exactly does an endurant cache write sequence look like?

Say an NFS client wishes to write a file to an Isilon cluster over NFS with the O_SYNC flag set, requiring a confirmed or synchronous write acknowledgement. Here is the sequence of events that occur to facilitate a stable write.

1) A client, connected to node 3, begins the write process sending protocol level blocks.



ec_1.png



4KB is the optimal block size for the endurant cache.

2) The NFS client’s writes are temporarily stored in the write coalescer portion of node 3’s RAM. The write coalescer aggregates uncommitted blocks so that OneFS can, ideally, write out full protection groups where possible, reducing latency over protocols that allow "unstable" writes. Writing to RAM has far less latency than writing directly to disk.

3) Once in the write coalescer, the endurant cache log-writer process writes mirrored copies of the data blocks in parallel to the EC Log Files.



ec_2.png

The protection level of the mirrored EC log files is the same as that of the data being written by the NFS client.

4) When the data copies are received into the EC Log Files, a stable write exists and a write acknowledgement (ACK) is returned to the NFS client confirming the stable write has occurred.



ec_3.png



The client assumes the write is completed and can close the write session.

5) The write coalescer then processes the file just like a non-EC write at this point. The write coalescer fills and is routinely flushed as required, writing asynchronously via the block allocation manager (BAM) and the BAM safe write (BSW) path processes.

6) The file is split into 128K data stripe units (DSUs), parity protection (FEC) is calculated and FEC stripe units (FSUs) are created.



ec_4.png

7) The layout and write plan is then determined, and the stripe units are written to their corresponding nodes’ L2 Cache and NVRAM. The EC logfiles are cleared from NVRAM at this point. OneFS uses a Fast Invalid Path process to de-allocate the EC Log Files from NVRAM.



ec_5.png

8) Stripe Units are then flushed to physical disk.

9) Once written to physical disk, the data stripe unit (DSU) and FEC stripe unit (FSU) copies created during the write are cleared from NVRAM but remain in L2 cache until flushed to make room for more recently accessed data.



ec_6.png

As far as protection goes, the number of logfile mirrors created by EC is always one more than the on-disk protection level of the file. For example:

  • +1n: 2 mirrored copies
  • 2x: 3 mirrored copies
  • +2n: 3 mirrored copies
  • +2d:1n: 3 mirrored copies
  • +3n: 4 mirrored copies
  • +3d:1n: 4 mirrored copies
  • +4n: 5 mirrored copies
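
That "protection level plus one" rule can be captured directly in code. The sketch below simply encodes the mapping above; it is not an exhaustive list of OneFS protection policies.

# Number of EC logfile mirrors = on-disk protection level + 1 (from the mapping above).
EC_MIRRORS = {
    "+1n": 2, "2x": 3, "+2n": 3, "+2d:1n": 3,
    "+3n": 4, "+3d:1n": 4, "+4n": 5,
}

def ec_mirror_count(protection):
    return EC_MIRRORS[protection]

print(ec_mirror_count("+2d:1n"))   # 3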

The EC mirrors are only used if the initiator node is lost. In the unlikely event that this occurs, the participant nodes replay their EC journals and complete the writes.

If the write is an EC candidate, the data remains in the coalescer, an EC write is constructed, and the appropriate coalescer region is marked as EC. The EC write is a write into a logstore (hidden mirrored file) and the data is placed into the journal.

Assuming the journal is sufficiently empty, the write is held there (cached) and only flushed to disk when the journal is full, thereby saving additional disk activity.

An optimal workload for EC involves small-block synchronous, sequential writes – something like an audit or redo log, for example. In that case, the coalescer will accumulate a full protection group’s worth of data and be able to perform an efficient FEC write.

The happy medium is a synchronous small block type load, particularly where the I/O rate is low and the client is latency-sensitive. In this case, the latency will be reduced and, if the I/O rate is low enough, it won’t create serious pressure.

The undesirable scenario is when the cluster is already spindle-bound and the workload is such that it generates a lot of journal pressure. In this case, EC is just going to aggravate things.

So how exactly do you configure the endurant cache?

Although on by default, setting the efs.bam.ec.mode sysctl to value ‘1’ will enable the Endurant Cache:

# isi_for_array -s isi_sysctl_cluster efs.bam.ec.mode=1

EC can also be enabled & disabled per directory:

# isi set -c [on|off|endurant_all|coal_only] <directory_name>

To enable the coalescer but switch off EC, run:

# isi set -c coal_only

And to disable the endurant cache completely:

# isi_for_array -s isi_sysctl_cluster efs.bam.ec.mode=0

A return value of zero on each node from the following command will verify that EC is disabled across the cluster:

# isi_for_array -s sysctl efs.bam.ec.stats.write_blocks
efs.bam.ec.stats.write_blocks: 0

If the output to this command is incrementing, EC is delivering stable writes.
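
To watch that counter over time, something like the following sketch can be run from one of the cluster's nodes. It samples the same sysctl twice via the command shown above and reports whether the per-node counters moved; the output parsing is a best-effort assumption about the "node: name: value" line format.

# Sketch: sample the EC write_blocks counter twice and report any increase.
# Assumes it is run as root on a cluster node; output parsing is best-effort.
import subprocess, time

CMD = ["isi_for_array", "-s", "sysctl", "efs.bam.ec.stats.write_blocks"]

def sample():
    out = subprocess.check_output(CMD).decode()
    counts = {}
    for line in out.splitlines():
        if "write_blocks" in line:
            node = line.split(":", 1)[0].strip()
            counts[node] = int(line.rsplit(":", 1)[1].strip())
    return counts

before = sample()
time.sleep(30)
after = sample()
for node in sorted(after):
    delta = after[node] - before.get(node, 0)
    print("%s: +%d blocks %s" % (node, delta, "(EC active)" if delta else ""))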

As mentioned previously, EC applies to stable writes. Namely:

  • Writes with O_SYNC and/or O_DIRECT flags set
  • Files on synchronous NFS mounts

When it comes to analyzing any performance issues involving EC workloads, consider the following:

  • What changed with the workload?
  • If upgrading OneFS, did the prior version also have EC enabled?
  • If the workload has moved to new cluster hardware:
  • Does the performance issue occur during periods of high CPU utilization?
  • Which part of the workload is creating a deluge of stable writes?
  • Was there a large change in spindle or node count?
  • Has the OneFS protection level changed?
  • Is the SSD strategy the same?

Disabling EC is typically done cluster-wide and this can adversely impact certain workflow elements. If the EC load is localized to a subset of the files being written, an alternative way to reduce the EC heat might be to disable the coalescer buffers for some particular target directories, which would be a more targeted adjustment. This can be configured via the 'isi set -c off' command.

One of the more likely causes of performance degradation is from applications aggressively flushing over-writes and, as a result, generating a flurry of ‘commit’ operations. This can generate heavy read/modify/write (r-m-w) cycles, inflating the average disk queue depth, and resulting in significantly slower random reads. The isi statistics protocol CLI command output will indicate whether the ‘commit’ rate is high.

It’s worth noting that synchronous writes do not require using the NFS ‘sync’ mount option. Any programmer who is concerned with write persistence can simply specify an O_FSYNC or O_DIRECT flag on the open() operation to force synchronous write semantics for that file handle. With Linux, writes using O_DIRECT will be separately accounted for in the Linux ‘mountstats’ output. Although it’s almost exclusively associated with NFS, the EC code is actually protocol-agnostic. If writes are synchronous (write-through) and are either misaligned or smaller than 8KB, they have the potential to trigger EC, regardless of the protocol.

The endurant cache can provide a significant latency benefit for small (e.g. 4K), random synchronous writes, albeit at the cost of some additional work for the system.

However, it’s worth bearing the following caveats in mind:

  • EC is not intended for more general purpose I/O.
  • There is a finite amount of EC available. As load increases, EC can potentially ‘fall behind’ and end up being a bottleneck.
  • Endurant Cache does not improve read performance, since it’s strictly part of the write process.
  • EC will not increase performance of asynchronous writes – only synchronous writes.

Related:

Re: Questions about Isilon

Hi Bros,

I am learning about Isilon storage, but a few questions are confusing me. Can anyone help verify the questions below?

1. From which cache are blocks copied to disk, L2 or the Endurant Cache?

2. What is the sector size of an Isilon 8 TB drive? 4096 or 512 bytes?

3. In the Dell EMC practice test, there is a question as below:

You are designing an Isilon 5-node cluster configuration for a customer who wants to store user home directories on the new cluster. Each user has 100 small files of 4KB each, 100 medium sized files of 80KB each, and 100 larger files of 512KB each.

How much space is occupied by the medium-sized files’ data blocks on an N+1n cluster? What about the small and large files, and how do you calculate it?

4. What is the best case performance for Backup Accelerator when backing up to LTO-4 or LTO-5 drives? 2.6TB/hr?

5. On an 18-node Isilon cluster with the default protection setting, what is the actual layout of a 128 KB file? N+2n?

6. An Isilon cluster has 15 X410 nodes and 4 NL410 nodes. How many job engine directors does the cluster have? 19?





Thanks all in advance.


