OneFS Shadow Stores

The recent series of articles on SmartDedupe have generated several questions from the field around shadow stores. So this seemed like an ideal topic to explore in a bit more depth over the course of the next couple of articles.

A shadow store is a class of system file that contains blocks which can be referenced by different files – thereby providing a mechanism that allows multiple files to share common data. Shadow stores were first introduced in OneFS 7.0, initially supporting Isilon file clones, and indeed there are many overlaps between cloning and deduplicating files. As we will see, a variant of the shadow store is also used as a container for file packing in OneFS SFSE (Small File Storage Efficiency), often used in archive workflows such as healthcare’s PACS.

Architecturally, each shadow store can contain up to 256 blocks, with each block able to be referenced by 32,000 files. If this reference limit is exceeded, a new shadow store is created. Additionally, shadow stores do not reference other shadow stores. All blocks within a shadow store must be either sparse or point at an actual data block. And snapshots of shadow stores are not allowed, since shadow stores have no hard links.
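To put those limits in rough perspective (a back-of-the-envelope calculation, assuming the 8KB OneFS block size discussed later in this article): 256 blocks x 8KB is 2MB of physical data per shadow store, and with each block referenceable by up to 32,000 files, a single fully referenced shadow store can stand behind on the order of 2MB x 32,000 = 64GB of logical file data.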

Shadow stores contain the physical addresses and protection for data blocks, just like normal file data. However, a fundamental difference between a shadow store and a regular file is that the former doesn’t contain all the metadata typically associated with traditional file inodes. In particular, time-based attributes (creation time, modification time, etc.) are explicitly not maintained.

Consider the shadow store information for a regular, undeduped file (file.orig):

# isi get -DDD file.orig | grep -i shadow

* Shadow refs: 0

zero=36 shadow=0 ditto=0 prealloc=0 block=28

A second copy of this file (file.dup) is then created and then deduplicated:

# isi get -DDD file.* | grep -i shadow

* Shadow refs: 28

zero=36 shadow=28 ditto=0 prealloc=0 block=0

* Shadow refs: 28

zero=36 shadow=28 ditto=0 prealloc=0 block=0

As we can see, the block count of the original file has now become zero and the shadow count for both the original file and its copy is incremented to ‘28’. Additionally, if another file copy is added and deduplicated, the same shadow store info and count is reported for all three files. It’s worth noting that even if the duplicate file(s) are removed, the original file will still retain the shadow store layout.

Each shadow store has a unique identifier called a shadow inode number (SIN). But, before we get into more detail, here’s a table of useful terms and their descriptions:

Element

Description

Inode

Data structure that keeps track of all data and metadata (attributes, metatree blocks, etc.) for files and directories in OneFS

LIN

Logical Inode Number uniquely identifies each regular file in the filesystem.

LBN

Logical Block Number identifies the block offset for each block in a file

IFM Tree or Metatree

Encapsulates the on-disk and in-memory format of the inode. File data blocks are indexed by LBN in the IFM B-tree, or file metatree. This B-tree stores protection group (PG) records keyed by the first LBN. To retrieve the record for a particular LBN, the first key before the requested LBN is read. The retrieved record may or may not contain actual data block pointers.

IDI

Isi Data Integrity checksum. IDI checkcodes help avoid data integrity issues which can occur when hardware provides the wrong data, for example. Hence IDI is focused on the path to and from the drive and checkcodes are implemented per OneFS block.

Protection Group (PG)

A protection group encompasses the data and redundancy associated with a particular region of file data. The file data space is broken up into sections of 16 x 8KB blocks called stripe units. These correspond to the N in N+M notation; there are N+M stripe units in a protection group.

Protection Group Record

Record containing block addresses for a data stripe. There are five types of PG records: sparse, ditto, classic, shadow, and mixed. The IFM B-tree uses the B-tree flag bits, the record size, and an inline field to identify the five types of records.

BSIN

Base Shadow Store, containing cloned or deduped data

CSIN

Container Shadow Store, containing packed data (containers of files).

SIN

Shadow Inode Number is a LIN for a Shadow Store, containing blocks that are referenced by different files; refers to a Shadow Store

Shadow Extent

Shadow extents contain a Shadow Inode Number (SIN), an offset, and a count.

Shadow extents are not included in the FEC calculation since protection is provided by the shadow store.

Blocks in a shadow store are identified with a SIN and LBN (logical block number).

# isi get -DD /ifs/data/file.dup | fgrep -A 4 -i "protection group"

PROTECTION GROUPS

lbn 0: 4+2/2

4000:0001:0067:0009@0#64

0,0,0:8192#32

A SIN is essentially a LIN that is dedicated to a shadow store file, and SINs are allocated from a subset of the LIN range. Just as every standard file is uniquely identified by a LIN, every shadow store is uniquely identified by a SIN. It is easy to tell if you are dealing with a shadow store because the SIN will begin with 4000. For example, in the output above:

4000:0001:0067:0009

Correspondingly, in the protection group (PG) they are represented as:

  • SIN
  • Block size
  • LBN
  • Run

The referencing protection group will not contain valid IDI data (this is with the file itself). FEC parity, if required, will be computed assuming a zero block.

When a file references data in a shadow store, it contains meta-tree records that point to the shadow store. This meta-tree record contains a shadow reference, which comprises a SIN and LBN pair that uniquely identifies a block in a shadow store.

A set of extension blocks within the shadow store holds the reference count for each shadow store data block. The reference count for a block is adjusted each time a reference is created or deleted from any other file to that block. If a shadow store block’s reference count drops to zero, it is marked as deleted, and the ShadowStoreDelete job, which runs periodically, deallocates the block.

Be aware that shadow stores are not directly exposed in the filesystem namespace. However, shadow stores and relevant statistics can be viewed using the ‘isi dedupe stats’, ‘isi_sstore list’ and ‘isi_sstore stats’ command line utilities.
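For example, the following commands (output omitted here, since the exact format varies by OneFS release) provide a quick view of cluster-wide dedupe savings and of the shadow stores themselves:

# isi dedupe stats

# isi_sstore list

# isi_sstore stats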

Cloning

In OneFS, files can easily be cloned using the ‘cp -c’ command line utility. Shadow store(s) are created during the file cloning process, where the ownership of the data blocks is transferred from the source to the shadow store.

shadow_store_1.png



In some instances, data may be copied directly from the source to the newly created shadow stores. Cloning uses logical references to shadow stores: the source file’s protection group(s) are moved to a shadow store, and the PG is then referenced by both the source file and the destination clone file. After cloning a file, both the source and the destination data blocks refer to an offset in a shadow store.
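As a quick illustrative exercise (the paths are hypothetical), a file can be cloned and its shadow references then checked with the same ‘isi get’ syntax used in the dedupe example earlier:

# cp -c /ifs/data/file.orig /ifs/data/file.clone

# isi get -DDD /ifs/data/file.clone | grep -i shadow

If the clone succeeded, both the source and the clone should report a non-zero ‘Shadow refs’ count.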

Dedupe

As we have seen in the recent blog articles, shadow stores are also used for SmartDedupe. The principal difference with dedupe, as compared to cloning, is the process by which duplicate blocks are detected.

shadow_store_2.png

The deduplication job also has to spend more effort to ensure that contiguous file blocks are generally stored in adjacent blocks in the shadow store. If not, both read and degraded read performance may be impacted.

Small File Storage Efficiency

A class of specialized shadow stores is also used as containers for storage efficiency, allowing packing of small files into larger structures that can be FEC protected.

shadow_store_3.png

These shadow stores differ from regular shadow stores in that they are deployed as single-reference stores. Additionally, container shadow stores are also optimized to isolate fragmentation, support tiering, and live in a separate subset of ID space from regular shadow stores.

SIN Cache

OneFS provides a SIN cache, which helps facilitate shadow store allocations. It provides a mechanism to create a shadow store on demand when required, and then cache that shadow store in memory on the local node so that it can be shared with subsequent allocators. The SIN cache segregates stores by disk pool, protection policy and whether or not the store is a container.


File Count Per Directory

Got asked the following question from the field recently:

“I have a customer with hundreds of thousands of files per directory in a small number of directories on their cluster. What’s the least impactful command to count the number of files per directory?”



Unfortunately, there’s no command currently available that will provide that count instantaneously. Something will have to perform a treewalk to gather these statistics. That said, there are a couple of approaches to this, each with its pros and cons:

  • If the cluster has a SmartQuotas license, an advisory directory quota can be configured on the directories whose file counts they want to track (see the example after this list). As mentioned, the first job run will require walking the directory tree, but fast, low impact reports will be available after this first pass.

  • Another approach is using traditional UNIX commands, either from the OneFS CLI or, less desirably, from a UNIX client NFS session.
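A minimal sketch of the SmartQuotas approach above might look like the following. The directory path is a placeholder and the exact arguments vary by OneFS release, so treat this as illustrative rather than definitive:

# isi quota quotas create /ifs/path/to/directory directory

# isi quota quotas list

Once the initial QuotaScan job has completed its treewalk, the quota’s accounting (including file counts) can be read back on demand without walking the directory again.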



The two following commands will both take time to run:

# ls -f /path/to/directory | wc -l

# find /path/to/directory -type f | wc -l

It’s worth noting that when counting files with ls, it will probably yield faster results if the ‘-l’ flag is omitted and the ‘-f’ flag used instead. This is because ‘-l’ resolves UIDs and GIDs to display users/groups, which creates more work, thereby slowing the listing. In contrast, ‘-f’ allows the ‘ls’ command to avoid sorting the output. This should be faster, and reduce memory consumption when listing extremely large numbers of files.

That said, there really is no quick way to walk a file system and count the files – especially since both ‘ls’ and ‘find’ are single threaded commands. Running either of these in the background with output redirected to a file is probably the best approach.
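For instance (the paths are placeholders), the treewalk can be detached from the session and the results examined later:

# nohup find /path/to/directory -type f > /ifs/data/filelist.out 2> /dev/null &

# wc -l /ifs/data/filelist.out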

Depending on the arguments for either the ‘ls’ or ‘find’ command, you can gather a comprehensive set of context info and metadata on a single pass.

# find /path/to/scan -ls > output.file

It will take quite a while for the command to complete, but once you have the output stashed in a file you can pull all sorts of useful data from it.
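For example, assuming the standard ‘find -ls’ column layout (file mode in field 3, size in bytes in field 7), simple awk one-liners can mine the stashed output:

# awk '$3 ~ /^-/' output.file | wc -l

# awk '$3 ~ /^d/' output.file | wc -l

# awk '{sum += $7} END {print sum}' output.file

The first counts regular files, the second counts directories, and the third totals the reported file sizes.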

Assuming a latency of around 20ms per file, it will take about 33 minutes to parse a directory containing 100,000 files. This estimate is conservative, but there are typically multiple protocol operations that need to be done to each file, and they do add up since ‘ls’ is not multi-threaded.

  • If possible, ensure the directories of interest are stored on a file pool that has at least one of the metadata mirrors on SSD (metadata-read).



  • Windows Explorer can also enumerate the files in a directory tree surprisingly quickly. All you get is a file count, but it can work pretty well.

  • If the directory you wish to know the file count for just happens to be /ifs, you can run the LinCount job, which will tell you how many LINs there are in the file system.

The LinCount job (relatively) quickly scans the file system and returns the total count of LINs (logical inodes). The LIN count is equivalent to the total file and directory count on a cluster. The job runs by default on LOW priority, and is the fastest method of determining object count on OneFS – assuming no other treewalk job has already run to completion.



To kick off the LinCount job, the following command can be run from the OneFS command line interface (CLI):



# isi job start lincount



The output from this will be along the lines of “Added job [52]”.



Note that the number in square brackets is the job ID.



To view results, run the following from the CLI:



# isi job reports view [job ID]



For example:



# isi job reports view 52

LinCount[52] phase 1 (2018-09-17T09:33:33)

——————————————

Elapsed time 1 seconds

Errors 0

Job mode LinCount

LINs traversed 1722

SINs traversed 0



The “LINs traversed” metric indicates that 1722 files and directories were found.



Be aware that the LinCount job output will also include snapshot revisions of LINs in its count.



Alternatively, if another treewalk job has run against the directory you wish to know the count for, you might be in luck.



Some other considerations regarding the scenario presented in the original question:



Hundreds of thousands is an extremely large number of files to store in one directory. To reduce the directory enumeration time, where possible divide the files up into multiple subdirectories.



When it comes to NFS, the behavior is going to partially depend on whether the client is doing READDIRPLUS operations vs READDIR. READDIRPLUS is useful if the client is going to need the metadata. However, if all you’re trying to do is list the filenames, it actually makes that operation much slower.



If you only read the filenames in the directory, and you don’t attempt to stat any associated metadata, then this requires a relatively small amount of I/O to pull the names from the meta-tree, and should be fairly fast.



If this has already been done recently, some or all of the blocks are likely to already be in L2 cache. As such, a subsequent operation won’t need to read from hard disk and will be substantially faster.



NFS is more complicated regarding what it will and won’t cache on the client side, particularly with the attribute cache and the timeouts that are associated with it.



Here are the options from fastest to slowest (a rough timing comparison follows this list):

  • If NFS is using READDIR, as opposed to READDIRPLUS, and the ‘ls’ command is invoked with the appropriate arguments to prevent it polling metadata or sorting the output, execution will be relatively swift.

  • If ‘ls’ polls the metadata (or if NFS uses READDIRPLUS) but doesn’t sort the results, output will start appearing fairly quickly, but the listing will take longer to complete overall.

  • If ‘ls’ sorts the output, nothing will be displayed until the command has read everything and sorted it, then the output will be returned in a deluge at the end.
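A rough way to compare these behaviors (the path is a placeholder, and absolute timings will depend heavily on cache state and protocol) is simply to time the variants side by side:

# time ls -f /path/to/directory | wc -l

# time ls -1 /path/to/directory | wc -l

# time ls -l /path/to/directory | wc -l

The first avoids both sorting and per-file attribute lookups, the second sorts but does not stat each entry, and the third does both.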


OneFS IntegrityScan

Under normal conditions, OneFS typically relies on checksums, identity fields, and magic numbers to verify file system health and correctness. Within OneFS, system and data integrity can be subdivided into four distinct phases:

integrityscan_2.png

Here’s what each of these phases entails:

Phase

Description

Detection

The act of scanning the file system and detecting data block instances that are not what OneFS expects to see at that logical point. Internally, OneFS stores a checksum or IDI (Isi data integrity) for every allocated block under /ifs.

Enumeration

Enumeration involves notifying the cluster administrator of any file system damage uncovered in the detection phase. For example, logging to the /var/log/idi.log file.

Isolation

Isolation is the act of cauterizing the file system, ensuring that any damage identified during the detection phase does not spread beyond the file(s) that are already affected. This typically involves removing all references to the file(s) from the file system.

Repair

Repairing any damage discovered and removing the damaged file(s) from OneFS. Typically a DSR (Dynamic Sector Repair) is all that is required to rebuild a block that fails IDI.

Focused on the detection phase, the primary OneFS tool for uncovering system integrity issues is IntegrityScan. This job is run across the cluster to discover instances of damaged files and provide an estimate of the spread of the damage.

Unlike traditional ‘fsck’ style file system integrity checking tools (including OneFS’ isi_cpr utility), IntegrityScan is explicitly designed to run while the cluster is fully operational – thereby removing the need for any downtime. It does this by systematically reading every block and verifying its associated checksum. In the event that IntegrityScan detects a checksum mismatch, it generates an alert, logs the error to the IDI logs (/var/log/idi.log), and provides a full report upon job completion.

IntegrityScan is typically run manually if the integrity of the file system is ever in doubt. By default, the job runs at an impact level of ‘Medium’ and a priority of ‘1’ and accesses the file system via a LIN scan. Although IntegrityScan itself may take several hours or days to complete, the file system is online and completely available during this time. Additionally, like all phases of the OneFS job engine, IntegrityScan can be re-prioritized, paused or stopped, depending on its impact to cluster operations. Along with Collect and MultiScan, IntegrityScan is part of the job engine’s Marking exclusion set.
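For instance, a running instance can be paused and later resumed, or have its priority and impact policy adjusted mid-flight. The job ID placeholder and the ‘--priority’/‘--policy’ flags below are assumptions based on the OneFS 8.x ‘isi job jobs’ syntax, so treat this as a sketch:

# isi job jobs pause <job-id>

# isi job jobs resume <job-id>

# isi job jobs modify <job-id> --priority 3 --policy low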

integrityscan_1.png

OneFS can only accommodate a single marking job at any point in time. However, since the file system is fully journalled, IntegrityScan is only needed in exceptional situations. There are two principal use cases for IntegrityScan:

  • Identifying and repairing corruption on a production cluster. Certain forms of corruption may be suggestive of a bug, in which case IntegrityScan can be used to determine the scope of the corruption and the likelihood of spreading. It can also fix some forms of corruption.

  • Repairing a file system after a lost journal. This use case is much like traditional fsck. This scenario should be treated with care as it is not guaranteed that IntegrityScan fixes everything. This is a use case that will require additional product changes to make feasible.

IntegrityScan can be initiated manually, on demand. The following CLI syntax will kick off a manual job run:

# isi job start integrityscan

Started job [283]

# isi job list

ID Type State Impact Pri Phase Running Time

————————————————————

283 IntegrityScan Running Medium 1 1/2 1s

————————————————————

Total: 1

With LIN scan jobs, even though the metadata is of variable size, the job engine can fairly accurately predict how much effort will be required to scan all LINs. The IntegrityScan job’s progress can be tracked via a CLI command, as follows:

# isi job jobs view 283

ID: 283

Type: IntegrityScan

State: Running

Impact: Medium

Policy: MEDIUM

Pri: 1

Phase: 1/2

Start Time: 2018-09-05T22:20:58

Running Time: 31s

Participants: 1, 2, 3

Progress: Processed 947 LINs and approx. 7464 MB: 867 files, 80 directories; 0 errors

LIN & SIN Estimate based on LIN & SIN count of 3410 done on Sep 5 22:00:10 2018 (LIN) and Sep 5 22:00:10 2018 (SIN)

LIN & SIN Based Estimate: 1m 12s Remaining (27% Complete)

Block Based Estimate: 10m 47s Remaining (4% Complete)

Waiting on job ID: –

Description:

The LIN (logical inode) statistics above include both files and directories.

Be aware that the estimated LIN percentage can occasionally be misleading/anomalous. If concerned, verify that the stated total LIN count is roughly in line with the file count for the cluster’s dataset. Even if the LIN count is in doubt, the estimated block progress metric should always be accurate and meaningful. If the job is in its early stages and no estimation can be given (yet), isi job will instead report its progress as ‘Started’. Note that all progress is reported per phase.



A job’s resource usage can be traced from the CLI as such:



# isi job statistics view

Job ID: 283

Phase: 1

CPU Avg.: 30.27%

Memory Avg.

Virtual: 302.27M

Physical: 24.04M

I/O

Ops: 2223069

Bytes: 16.959G

Finally, upon completion, the IntegrityScan job report, detailing both job stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 283

IntegrityScan[283] phase 1 (2018-09-05T22:34:56)

————————————————

Elapsed time 838 seconds (13m58s)

Working time 838 seconds (13m58s)

Errors 0

LINs traversed 3417

LINs processed 3417

SINs traversed 0

SINs processed 0

Files seen 3000

Directories seen 415

Total bytes 178641757184 bytes (166.373G)

IntegrityScan[283] phase 2 (2018-09-05T22:34:56)

————————————————

Elapsed time 0 seconds

Working time 0 seconds

Errors 0

LINs traversed 0

LINs processed 0

SINs traversed 0

SINs processed 0

Files seen 0

Directories seen 0

Total bytes 0 bytes

In addition to the IntegrityScan job, OneFS also contains an ‘isi_iscan_report’ utility. This is a tool to collate the errors from the IDI log files (/var/log/idi.log) generated on different nodes. It generates a report file which can be used as input to the ‘isi_iscan_query’ tool. Additionally, it reports the number of errors seen for each file containing IDI errors. At the end of the run, a report file can be found at /ifs/.ifsvar/idi/tjob.<pid>/log.repo.

The associated ‘isi_iscan_query’ utility can then be used to parse the log.repo report file and filter by node, time range, or block address (baddr). The syntax for the isi_iscan_query tool is:

/usr/sbin/isi_iscan_query filename [FILTER FIELD] [VALUE]

FILTER FIELD:

node <logical node number> e.g. 1, 2, 3, …

timerange <start time> <end time> e.g. 2018-09-05T17:38:02Z 2018-09-06T17:38:56Z

baddr <block address> e.g. 2,1,185114624:8192
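For example, assuming a report generated under a hypothetical /ifs/.ifsvar/idi/tjob.1234/ directory (the tjob directory name is illustrative only), the report could be filtered to a single node or to a time window as follows:

# /usr/sbin/isi_iscan_query /ifs/.ifsvar/idi/tjob.1234/log.repo node 2

# /usr/sbin/isi_iscan_query /ifs/.ifsvar/idi/tjob.1234/log.repo timerange 2018-09-05T17:38:02Z 2018-09-06T17:38:56Z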


OneFS FlexProtect

As we’ve seen previously, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. These scans are typically complex sequences of operations which may take many hours to run, so they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

The file system maintenance jobs which are critical to the function of OneFS include:

FS Maintenance Job

Description

AutoBalance

Restores node and drive free space balance

Collect

Reclaims leaked blocks

FlexProtect

Replaces the traditional RAID rebuild process

MediaScan

Scrub disks for media-level errors

MultiScan

Run AutoBalance and Collect jobs concurrently



The FlexProtect job is responsible for maintaining the appropriate protection level of data across the cluster. For example, it ensures that a file which is configured to be protected at +2n is actually protected at that level. Given this, FlexProtect is arguably the most critical of the OneFS maintenance jobs because it represents the Mean-Time-To-Repair (MTTR) of the cluster. Any failure or delay has a direct impact on the reliability of OneFS.



As such, the primary purpose of FlexProtect is to repair nodes and drives which need to be removed from the cluster. In the case of a cluster group change, for example the addition or subtraction of a node or drive, OneFS automatically informs the job engine, which responds by starting a FlexProtect job. Any drives and/or nodes to be removed are marked with OneFS’ ‘restripe_from’ capability. The job engine coordinator notices that the group change includes a newly-smart-failed device and then initiates a FlexProtect job in response.
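When in doubt as to whether such a group change has been picked up, a quick way to check (using commands already shown elsewhere in these articles) is to look at the cluster and job state:

# isi status

# isi job jobs list

A smartfailed drive or node should show up in the status output, with a corresponding FlexProtect (or FlexProtectLin) entry in the job list while the repair is underway.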

FlexProtect falls within the job engine’s restriping exclusion set and, similar to AutoBalance, comes in two flavors: FlexProtect and FlexProtectLin.



flexprotect_1.png



FlexProtectLin is run by default when there is a copy of file system metadata available on solid state drive (SSD) storage. FlexProtectLin typically offers significant runtime improvements over its conventional disk based counterpart.

Run automatically after a drive or node removal or failure, FlexProtect locates any unprotected files on the cluster and repairs them as rapidly as possible. The FlexProtect job runs by default with an impact level of ‘medium’ and a priority level of ‘1’, and includes six distinct job phases.

The regular version of FlexProtect has the following phases:

Job Phase

Description

Drive Scan

Job engine scans the disks for inodes needing repair. If an inode needs repair, the job engine sets the LIN’s ‘needs repair’ flag for use in the next phase.

LIN Verify

This phase scans the OneFS LIN tree to address the drive scan limitations.

LIN Re-verify

The prior repair phases can miss protection group and metatree transfers. FlexProtect may have already repaired the destination of a transfer, but not the source. If a LIN is being restriped during a metatree transfer, it is added to a persistent queue, and this phase processes that queue.

Repair

LINs with the ‘needs repair’ flag set are passed to the restriper for repair. This phase needs to progress quickly and the job engine workers perform parallel execution across the cluster.

Check

This phase ensures that all LINs were repaired by the previous phases as expected.

Device Removal

The successfully repaired nodes and drives that were marked ‘restripe from’ at the beginning of phase 1 are removed from the cluster in this phase. Any additional nodes and drives which were subsequently failed remain in the cluster, with the expectation that a new FlexProtect job will handle them shortly.

The FlexProtect job executes in userspace and generally repairs any components marked with the ‘restripe from’ bit as rapidly as possible. Within OneFS, a LIN Tree reference is placed inside the inode, a logical block. A B-Tree describes the mapping between a logical offset and the physical data blocks:

flexprotect_2.png

In order for FlexProtect to avoid the overhead of having to traverse the whole way from the LIN Tree reference -> LIN Tree -> B-Tree -> Logical Offset -> Data block, it leverages the OneFS construct known as the ‘Width Device List’ (WDL). The WDL enables FlexProtect to perform fast drive scanning of inodes because the inode contents are sufficient to determine the need for restripe. The WDL keeps a list of the drives in use by a particular file, and is stored as an attribute within an inode, and thus protected by mirroring. There are two WDL attributes in OneFS, one for data and one for metadata. The WDL is primarily used by FlexProtect to determine whether an inode references a degraded node or drive. New or replaced drives are automatically added to the WDL as part of new allocations.

As mentioned previously, the FlexProtect job has two distinct variants. In the FlexProtectLin version of the job the Disk Scan and LIN Verify phases are redundant and therefore removed, while keeping the other phases identical. FlexProtectLin is preferred when at least one metadata mirror is stored on SSD, providing substantial job performance benefits.

In addition to automatic job execution after a drive or node removal or failure, FlexProtect can also be initiated on demand. The following CLI syntax will kick off a manual job run:

# isi job start flexprotect

Started job [274]

# isi job list

ID Type State Impact Pri Phase Running Time

———————————————————-

274 FlexProtect Running Medium 1 1/6 4s

———————————————————-

Total: 1

The FlexProtect job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 274

ID: 274

Type: FlexProtect

State: Succeeded

Impact: Medium

Policy: MEDIUM

Pri: 1

Phase: 6/6

Start Time: 2018-09-04T17:13:38

Running Time: 17s

Participants: 1, 2, 3

Progress: No work needed

Waiting on job ID: –

Description: {“nodes”: “{}”, “drives”: “{}”}

Upon completion, the FlexProtect job report, detailing all six stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 274

FlexProtect[274] phase 1 (2018-09-04T17:13:44)

———————————————-

Elapsed time 6 seconds

Working time 6 seconds

Errors 0

Drives 33

LINs 250

Size 363108486755 bytes (338.171G)

ECCs 0

FlexProtect[274] phase 2 (2018-09-04T17:13:55)

———————————————-

Elapsed time 11 seconds

Working time 11 seconds

Errors 0

LINs 33

Zombies 0

FlexProtect[274] phase 3 (2018-09-04T17:13:55)

———————————————-

Elapsed time 0 seconds

Working time 0 seconds

Errors 0

LINs 0

Zombies 0

FlexProtect[274] phase 4 (2018-09-04T17:13:55)

———————————————-

Elapsed time 0 seconds

Working time 0 seconds

Errors 0

LINs 0

Zombies 0

FlexProtect[274] phase 5 (2018-09-04T17:13:55)

———————————————-

Elapsed time 0 seconds

Working time 0 seconds

Errors 0

Drives 0

LINs 0

Size 0 bytes

ECCs 0

FlexProtect[274] phase 6 (2018-09-04T17:13:55)

———————————————-

Elapsed time 0 seconds

Working time 0 seconds

Errors 0

Nodes marked gone {}

Drives marked gone {}

While a FlexProtect job is running, the following command will detail which LINs the job engine workers are currently accessing:

# sysctl efs.bam.busy_vnodes | grep isi_job_d

vnode 0xfffff802938d18c0 (lin 0) is fd 11 of pid 2850: isi_job_d

vnode 0xfffff80294817460 (lin 1:0002:0008) is fd 12 of pid 2850: isi_job_d

vnode 0xfffff80294af3000 (lin 1:0002:001a) is fd 20 of pid 2850: isi_job_d

vnode 0xfffff8029c7c7af0 (lin 1:0002:001b) is fd 17 of pid 2850: isi_job_d

vnode 0xfffff802b280dd20 (lin 1:0002:000a) is fd 14 of pid 2850: isi_job_d

Using the ‘isi get -L’ command, a LIN address can be translated to show the actual file name and its path. For example:

# isi get -L 1:0002:0008

A valid path for LIN 0x100020008 is /ifs/.ifsvar/run/isi_job_d.lock


OneFS MediaScan

As we’ve seen previously, OneFS utilizes file system scans to perform such tasks as detecting and repairing drive errors, reclaiming freed blocks, etc. These scans are typically complex sequences of operations which may take many hours to run, so they are implemented via syscalls and coordinated by the Job Engine. These jobs are generally intended to run as minimally disruptive background tasks in the cluster, using spare or reserved capacity.

The file system maintenance jobs which are critical to the function of OneFS are:

FS Maintenance Job

Description

AutoBalance

Restores node and drive free space balance

Collect

Reclaims leaked blocks

FlexProtect

Replaces the traditional RAID rebuild process

MediaScan

Scrub disks for media-level errors

MultiScan

Run AutoBalance and Collect jobs concurrently

MediaScan’s role within the file system protection framework is to periodically check for and resolve drive bit errors across the cluster. This proactive data integrity approach helps guard against a phenomenon known as ‘bit rot’, and the resulting specter of hardware induced silent data corruption.

The MediaScan job reads all of OneFS’ allocated blocks in order to trigger any latent drive sector errors in a process known as ‘disk scrubbing’. Drive sector errors may occur due to physical effects which, over time, could negatively affect the protection of the file system. Periodic disk scrubbing helps ensure that sector errors do not accumulate and lead to data integrity issues.

Sector errors are a relatively common drive fault. They are sometimes referred to as ‘ECCs’ since drives have internal error correcting codes associated with sectors. A failure of these codes to correct the contents of the sector generates an error on a read of the sector.

ECCs have a wide variety of causes. There may be a permanent problem, such as physical damage to the platter, or a more transient problem, such as the head not being located properly when the sector was read. For transient problems, the drive has the ability to retry automatically. However, such retries can be time consuming and prevent further processing.

OneFS typically has the redundancy available to overwrite the bad sector with the proper contents. This is called Dynamic Sector Repair (DSR). It is preferable for the file system to perform DSR than to wait for the drive to retry and possibly disrupt other operations. When supported by the particular drive model, a retry time threshold is also set so that disruption is minimized and the file system can attempt to use its redundancy.

In addition, MediaScan maintains a list of sectors to avoid after an error has been detected. Sectors are added to the list upon the first error. Subsequent I/Os consult this list and, if a match is found, immediately return an error without actually sending the request to the drive, minimizing further issues.

If the file system can successfully write over a sector, it is removed from the list. The assumption is that the drive will reallocate the sector on write. If the file system can’t reconstruct the block, it may be necessary to retry the I/O since there is no other way to access the data. The kernel’s ECC list must be cleared. This is done at the end of the MediaScan job run, but occasionally must also be done manually to access a particular block.

The drive’s own error-correction mechanism can handle some bit rot. When it fails, the error is reported to the MediaScan job. In order for the file system to repair the sector, the owner must be located. The owning structure in the file system has the redundancy that can be used to write over the bad sector, for example an alternate mirror of a block.

Most of the logic in MediaScan handles searching for the owner of the bad sector; the process can be very different depending on the type of structure, but is usually quite expensive. As such, it is often referred to as the ‘haystack’ search, since nearly every inode may be inspected to find the owner. MediaScan works by directly accessing the underlying cylinder groups and disk blocks via a linear drive scan and has more job phases than most job engine jobs for two main reasons:

  • First, significant effort is made to avoid the expense of the haystack search.
  • Second, every effort is made to try all means possible before alerting the administrator.

Here are the eight phases of MediaScan:

Phase #

Phase Name

Description

1

Drive Scan

Scans each drive using the ifs_find_ecc() system call, which issues I/O for all allocated blocks and inodes.

2

Random Drive Scan

Find additional “marginal” ECCs that would not have been detected by the previous phase.

3

Inode Scan

Inode ECCs can be located more quickly from the LIN tree, so this phase scans the LIN tree to determine the (LIN, snapshot ID) referencing any inode ECCs.

4

Inode Repair

Repairs inode ECCs with known (LIN, snapshot ID) owners, plus any LIN tree block ECCs where the owner is the LIN tree itself.

5

Inode Verify

Verifies that any ECCs not fixed in the previous phase still exist. First, it checks whether the block has been freed. Then it clears the ECC list and retries the I/O to verify that the sector is still failing.

6

Block Repair

Drives are scanned and compared against the list of ECCs. When ECCs are found, the (LIN, snapshot ID) is returned and the restripe repairs ECCs in those files. This phase is often referred to as the “haystack search”.

7

Block Verify

Once all file system repair attempts have completed, ECCs are again verified by clearing the ECC list and reissuing I/O.

8

Alert

Any remaining ECCs after repair and verify represent a danger of data loss. This phase logs the errors at the syslog ERR level.

MediaScan falls within the job engine’s restriping exclusion set, and is run as a low-impact, low-priority background process. It is executed automatically by default at 12am on the first Saturday of each month, although this can be reconfigured if desired.
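To check or change that schedule, the job engine’s per-job-type configuration can be queried and modified from the CLI. The ‘isi job types’ syntax below is an assumption based on recent OneFS releases, and the schedule string grammar should be verified against the CLI reference for the cluster’s version:

# isi job types view mediascan

# isi job types modify mediascan --schedule '<new schedule>'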

In addition to scheduled job execution, MediaScan can also be initiated on demand. The following CLI syntax will kick off a manual job run:



# isi job jobs start mediascan

Started job [251]

# isi job jobs list

ID Type State Impact Pri Phase Running Time

——————————————————–

251 MediaScan Running Low 8 1/8 1s

——————————————————–

Total: 1

The MediaScan job’s progress can be tracked via a CLI command as follows:

# isi job jobs view 251

ID: 251

Type: MediaScan

State: Running

Impact: Low

Policy: LOW

Pri: 8

Phase: 1/8

Start Time: 2018-08-30T22:16:23

Running Time: 1m 30s

Participants: 1, 2, 3

Progress: Found 0 ECCs on 2 drives; last completed: 2:0; 0 errors

Waiting on job ID: –

Description:

A job’s resource usage can be traced from the CLI as such:

# isi job statistics view

Job ID: 251

Phase: 1

CPU Avg.: 0.21%

Memory Avg.

Virtual: 318.41M

Physical: 28.92M

I/O

Ops: 391

Bytes: 3.05M

Finally, upon completion, the MediaScan job report, detailing all eight stages, can be viewed by using the following CLI command with the job ID as the argument:

# isi job reports view 251


Re: Best way to consolidate two Isilon shares into one

I second @dynamox’s suggestion of just ‘mv’. But although the permissions will be retained on the 6 files themselves, keep in mind that the permissions on the directories /ifs/data/share1 and /ifs/data/share2 may be different, and might need to be changed for the users to get access. The greatest part about the mv is that you’re just updating the LINs (Logical INodes), so it’ll be almost instantaneous.

Also BTW check the SMB Share ACLs between the two, to ensure that you have merged together the groups from the 2 different shares into the target share.

You could even cheat for a while if you want, and just change the path that ‘\isilonsczoneshare2’ points at, pointing it to the directory for share1 instead; that lessens user impact quite a bit.

But in reality here we’re talking about 6 files so I don’t imagine there will be a ton of impact.

~Chris


OneFS Multi-writer

The last blog article took a look at stable writes and the endurant cache.

https://community.emc.com/community/products/isilon/blog/2018/05/07/endurant-cache

In this post we casually mentioned that “EC is also tightly coupled with OneFS’ multi-threaded I/O (Multi-writer) process, to support concurrent writes from multiple client writer threads to the same file.” Turns out, this generated a couple of questions from a couple of astute blog readers…

So what is Multi-writer?

Basically, it’s a mechanism allowing OneFS to provide more granular write locking by sub-dividing the file into separate regions and granting exclusive data write locks to these individual ranges, as opposed to the entire file. This allows multiple clients, or write threads, attached to a node to simultaneously write to different regions of the same file.



multi-writer_1.png

Concurrent writes to a single file need more than just supporting data locks for ranges. Each writer also needs to update a file’s metadata attributes such as timestamps, block count, etc.

A mechanism for managing inode consistency is also needed, since OneFS is based on the concept of a single inode lock per file. In addition to the standard shared read and exclusive write locks, OneFS also provides the following locking primitives, via journal deltas, to allow multiple threads to simultaneously read or write a file’s metadata attributes:

  • Exclusive: A thread can read or modify any field in the inode. When the transaction is committed, the entire inode block is written to disk, along with any extended attribute blocks.
  • Shared: A thread can read, but not modify, any inode field.
  • DeltaWrite: A thread can modify any inode fields which support deltawrites. These operations are sent to the journal as a set of deltas when the transaction is committed.
  • DeltaRead: A thread can read any field which cannot be modified by inode deltas.

These locks allow separate threads to have a Shared lock on the same LIN, or for different threads to have a DeltaWrite lock on the same LIN. However, it is not possible for one thread to have a Shared lock and another to have a DeltaWrite. This is because the Shared thread cannot perform a coherent read of a field which is in the process of being modified by the DeltaWrite thread.



The DeltaRead lock is compatible with both the Shared and DeltaWrite lock. Typically the filesystem will attempt to take a DeltaRead lock for a read operation, and a DeltaWrite lock for a write, since this allows maximum concurrency, as all these locks are compatible.

Here’s what the write lock compatibility matrix looks like:

multi-writer_2.png

Data Reprotection

OneFS protects data by writing file blocks (restriping) across multiple drives on different nodes. The Job Engine defines a ‘restripe set’ comprising jobs which involve file system management, protection and on-disk layout. The restripe set contains the following jobs:



  • AutoBalance & AutoBalanceLin
  • FlexProtect & FlexProtectLin
  • MediaScan
  • MultiScan
  • SetProtectPlus
  • SmartPools
  • Upgrade

Multi-writer for restripe, introduced in OneFS 8.0, allows multiple restripe worker threads to operate on a single file concurrently. This in turn improves read/write performance during file re-protection operations, plus helps reduce the window of risk (MTTDL) during drive smartfails, etc. This is particularly true for workflows consisting of large files, while one of the above restripe jobs is running. Typically, the larger the files on the cluster, the more benefit multi-writer for restripe will offer.



With multi-writer for restripe, an exclusive lock is no longer required on the LIN during the actual restripe of data. Instead, OneFS tries to use a delta write lock to update the cursors used to track which parts of the file need restriping. This means that a client application or program should be able to continue to write to the file while the restripe operation is underway. An exclusive lock is only required for a very short period of time while a file is set up to be restriped. A file will have fixed widths for each restripe lock, and the number of range locks will depend on the quantity of threads and nodes which are actively restriping a single file.


Isi Get & Set

One of the lesser publicized but highly versatile tools in OneFS is the ‘isi get’ command line utility. It can often prove invaluable for generating a vast array of useful information about OneFS filesystem objects. In its most basic form, the command outputs this following information:

  • Protection policy
  • Protection level
  • Layout strategy
  • Write caching strategy
  • File name

For example:

# isi get /ifs/data/file2.txt

POLICY LEVEL PERFORMANCE COAL FILE

default 4+2/2 concurrency on file2.txt

Here’s what each of these categories represents:

POLICY: Indicates the requested protection for the object, in this case a text file. This policy field is displayed in one of three colors:

Requested Protection Policy

Description

Green

Fully protected

Yellow

Degraded protection under a mirroring policy

Red

Under-protection using FEC parity protection



LEVEL: Displays the current actual on-disk protection of the object. This can be either FEC parity protection or mirroring. For example:

Protection Level

Description

+1n

Tolerate failure of 1 drive OR 1 node (Not Recommended)

+2d:1n

Tolerate failure of 2 drives OR 1 node

+2n

Tolerate failure of 2 drives OR 2 nodes

+3d:1n

Tolerate failure of 3 drives OR 1 node

+3d:1n1d

Tolerate failure of 3 drives OR 1 node AND 1 drive

+3n

Tolerate failure of 3 drives or 3 nodes

+4d:1n

Tolerate failure of 4 drives or 1 node

+4d:2n

Tolerate failure of 4 drives or 2 nodes

+4n

Tolerate failure of 4 nodes

2x to 8x

Mirrored over 2 to 8 nodes, depending on configuration



PERFORMANCE: Indicates the on-disk layout strategy, for example:

Concurrency – Optimizes for the current load on the cluster, featuring many simultaneous clients; recommended for mixed workloads. On-disk layout: stripes data across the minimum number of drives required to achieve the configured data protection level. Caching: moderate prefetching.

Streaming – Optimizes for streaming of a single file, for example fast reading by a single client. On-disk layout: stripes data across a larger number of drives. Caching: aggressive prefetching.

Random – Optimizes for unpredictable access to a file; performs almost no cache prefetching. On-disk layout: stripes data across the minimum number of drives required to achieve the configured data protection level. Caching: little to no prefetching.



COAL: Indicates whether the Coalescer, OneFS’s NVRAM based write cache, is enabled. The coalescer provides failure-safe buffering to ensure that writes are efficient and read-modify-write operations avoided.

The isi get command also provides a number of additional options to generate more detailed information output. As such, the basic command syntax for isi get is as follows:

isi get {{[-a] [-d] [-g] [-s] [{-D | -DD | -DDC}] [-R] <path>}

| {[-g] [-s] [{-D | -DD | -DDC}] [-R] -L <lin>}}

Here’s the description for the various flags and options available for the command:

Command Option

Description

-a

Displays the hidden “.” and “..” entries of each directory.

-d

Displays the attributes of a directory instead of the contents.

-g

Displays detailed information, including snapshot governance lists.

-s

Displays the protection status using words instead of colors.

-D

Displays more detailed information.

-DD

Includes information about protection groups and security descriptor owners and groups.

-DDC

Includes cyclic redundancy check (CRC) information.

-L <LIN>

Displays information about the specified file or directory. Specify as a file or directory LIN.

-R

Displays information about the subdirectories and files of the specified directories.

The following command shows the detailed properties of a directory, /ifs/data (note that the output has been truncated slightly to aid readability):



# isi get -D data (1)

POLICY W LEVEL PERFORMANCE COAL ENCODING FILE IADDRS

default 4x/2 concurrency (2) on N/A ./ <1,36,268734976:512>, <1,37,67406848:512>, <2,37,269256704:512>, <3,37,336369152:512> ct: 1459203780 rt: 0

*************************************************

* IFS inode: [ 1,36,268734976:512, 1,37,67406848:512, 2,37,269256704:512, 3,37,336369152:512 ] (3)

*************************************************

* Inode Version: 6

* Dir Version: 2

* Inode Revision: 6

* Inode Mirror Count: 4

* Recovered Flag: 0

* Restripe State: 0

* Link Count: 3

* Size: 54

* Mode: 040777

* Flags: 0xe0

* Stubbed: False

* Physical Blocks: 0

* LIN: 1:0000:0004 (4)

* Logical Size: None

* Shadow refs: 0

* Do not dedupe: 0

* Last Modified: 1461091982.785802190

* Last Inode Change: 1461091982.785802190

* Create Time: 1459203780.720209076

* Rename Time: 0

* Write Caching: Enabled (5)

* Parent Lin 2

* Parent Hash: 763857

* Snapshot IDs: None

* Last Paint ID: 47

* Domain IDs: None

* LIN needs repair: False

* Manually Manage:

* Access False

* Protection True

* Protection Policy: default

* Target Protection: 4x

* Disk pools: policy any pool group ID -> data target x410_136tb_1.6tb-ssd_256gb:32(32), metadata target x410_136tb_1.6tb-ssd_256gb:32(32) (6)

* SSD Strategy: metadata-write (7)

* SSD Status: complete

* Layout drive count: 0

* Access pattern: 0

* Data Width Device List:

* Meta Width Device List:

*

* File Data (78 bytes):

* Metatree Depth: 1

* Dynamic Attributes (40 bytes):

ATTRIBUTE OFFSET SIZE

New file attribute 0 23

Isilon flags v2 23 3

Disk pool policy ID 26 5

Last snapshot paint time 31 9

*************************************************

* NEW FILE ATTRIBUTES (8)

* Access attributes: active

* Write Cache: on

* Access Pattern: concurrency

* At_r: 0

* Protection attributes: active

* Protection Policy: default

* Disk pools: policy any pool group ID

* SSD Strategy: metadata-write

*

*************************************************

Here is what the numbered callouts in the output indicate:

(1) The OneFS command to display the file system properties of a directory or file.

(2) The directory’s data access pattern is set to concurrency.

(3) Inode on-disk locations.

(4) Primary LIN.

(5) Write caching (Coalescer) is turned on.

(6) Indicates the disk pools that the data and metadata are targeted to.

(7) The SSD strategy is set to metadata-write.

(8) Files that are added to the directory are governed by these settings, most of which can be changed by applying a file pool policy to the directory.

From the WebUI, a subset of the ‘isi get –D’ output is also available from the OneFS File Explorer. This can be accessed by browsing to File System > File System Explorer and clicking on ‘View Property Details’ for the file system object of interest.



A question that is frequently asked is how to find where a file’s inodes live on the cluster. The ‘isi get -D’ command output makes this fairly straightforward to answer. Take the file /ifs/data/file1, for example:

# isi get -D /ifs/data/file1 | grep -i "IFS inode"

* IFS inode: [ 1,9,8388971520:512, 2,9,2934243840:512, 3,8,9568206336:512 ]



This shows the three inode locations for the file in the *,*,*:512 notation. Let’s take the first of these:



1,9,8388971520:512



From this, we can deduce the following:



  • The inode is on node 1, drive 9 (logical drive number).
  • The inode’s block address on that drive is 8388971520.
  • It’s an inode block that’s 512 bytes in size (Note: OneFS data blocks are 8kB in size).



Another example of where isi get can be useful is in mapping between a file system object’s pathname and its LIN (logical inode number). This might be for translating a LIN returned by an audit logfile or job engine report into a valid filename, or finding an open file from vnodes output, etc.



For example, say you wish to know which configuration file is being used by the cluster’s DNS service:



First, inspect the busy_vnodes output and filter for DNS:



# sysctl efs.bam.busy_vnodes | grep -i dns

vnode 0xfffff8031f28baa0 (lin 1:0066:0007) is fd 19 of pid 4812: isi_dnsiq_d

This, among other things, provides the LIN of the file that the isi_dnsiq_d process has open. The output can be further refined to just the LIN address as such:



# sysctl efs.bam.busy_vnodes | grep -i dns | awk '{print $4}' | sed 's/)//'

1:0066:0007



This LIN address can then be fed into ‘isi get’ using the ‘-L’ flag, and a valid name and path for the file will be output:



# isi get -L `sysctl efs.bam.busy_vnodes | grep -i dns | grep -v "(lin 0)" | awk '{print $4}' | sed 's/)//'`

A valid path for LIN 0x100660007 is /ifs/.ifsvar/modules/flexnet/flx_config.xml



This confirms that the XML configuration file in use by isi_dnsiq_d is flx_config.xml.



So, to recap, the ‘isi get’ command provides information about an individual or set of file system objects.

OneFS also provides the complementary ‘isi set’ command, which allows configuration of OneFS-specific file attributes. This command works similarly to the UNIX ‘chmod’ command, but on OneFS-centric attributes, such as protection, caching, encoding, etc. As with isi get, files can be specified by path or LIN. Here are some examples of the command in action.



For example, the following syntax will recursively configure a protection policy of +2d:1n on /ifs/data/testdir1 and its contents:



# isi set -R -p +2:1 /ifs/data/testdir1



To enable write caching coalescer on testdir1 and its contents, run:



# isi set -R -c on /ifs/data/testdir1



With the addition of the -n flag, no changes will actually be made. Instead, the list of files and directories that would have write caching enabled is returned:



# isi set -R -n -c on /ifs/data/testdir2



The following command will configure ISO-8859-1 filename encoding on testdir3 and contents:



# isi set -R -e ISO-8859-1 /ifs/data/testdir3



To configure streaming layout on the file ‘test1’, run:



# isi set -l streaming test1



The following syntax will set a metadata-write SSD strategy on testdir1 and its contents:



# isi set -R -s metadata-write /ifs/data/testdir1



To perform a file restripe operation on file2:



# isi set -r file2



To configure write caching on file3 via its LIN address, rather than its file name:



# isi get -DD file3 | grep -i "LIN:" | awk '{print $3}'

1:0054:00f6

# isi set -c on -L 1:0054:00f6
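A simple way to confirm that any of these settings took effect (the file path is hypothetical) is to re-run the basic ‘isi get’ afterwards and check the relevant column:

# isi set -l streaming /ifs/data/test1

# isi get /ifs/data/test1

The PERFORMANCE column in the resulting output should now read ‘streaming’ rather than ‘concurrency’.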

The following table describes in more detail the various flags and options available for the isi set command:

Command Option

Description

-f

Suppresses warnings on failures to change a file.

-F

Includes the /ifs/.ifsvar directory content and any of its subdirectories. Without -F, the /ifs/.ifsvar directory content and any of its subdirectories are skipped. This setting allows the specification of potentially dangerous, unsupported protection policies.

-L

Specifies file arguments by LIN instead of path.

-n

Displays the list of files that would be changed without taking any action.

-v

Displays each file as it is reached.

-r

Runs a restripe.

-R

Sets protection recursively on files.

-p <policy>

Specifies protection policies in the following forms: +M Where M is the number of node failures that can be tolerated without loss of data.

+M must be a number from 1 through 4.

+D:M Where D indicates the number of drive failures and M indicates number of node failures that can be tolerated without loss of data. D must be a number from 1 through 4 and M must be any value that divides into D evenly. For example, +2:2 and +4:2 are valid, but +1:2 and +3:2 are not.

Nx Where N is the number of independent mirrored copies of the data that will be stored. N must be a number, with 1 through 8 being valid choices.

-w <width>

Specifies the number of nodes across which a file is striped. Typically, w = N + M, but width can also mean the total of the number of nodes that are used. You can set a maximum width policy of 32, but the actual protection is still subject to the limitations on N and M.

-c {on | off}

Specifies whether write-caching (coalescing) is enabled.

-g <restripe goal>

Specifies the restripe goal. The following values are valid:

  • repair
  • reprotect
  • rebalance
  • retune

-e <encoding>

Specifies the encoding of the filename.

-d <@r drives>

Specifies the minimum number of drives that the file is spread across.

-a <value>

Specifies the file access pattern optimization setting. Ie. default, streaming, random, custom.

-l <value>

Specifies the file layout optimization setting. This is equivalent to setting both the -a and -d flags. Values are concurrency, streaming, or random

–diskpool <id | name>

Sets the preferred diskpool for a file.

-A {on | off}

Specifies whether file access and protections settings should be managed manually.

-P {on | off}

Specifies whether the file inherits values from the applicable file pool policy.

-s <value>

Sets the SSD strategy for a file. The following values are valid:

avoid: Writes all associated file data and metadata to HDDs only. The data and metadata of the file are stored so that SSD storage is avoided, unless doing so would result in an out-of-space condition.

metadata: Writes both file data and metadata to HDDs. One mirror of the metadata for the file is on SSD storage if possible, but the strategy for data is to avoid SSD storage.

metadata-write: Writes file data to HDDs and metadata to SSDs, when available. All copies of metadata for the file are on SSD storage if possible, and the strategy for data is to avoid SSD storage.

data: Uses SSD node pools for both data and metadata. Both the metadata for the file and user data (one copy if using mirrored protection, all blocks if FEC) are on SSD storage if possible.

<file> {<path> | <lin>} Specifies a file by path or LIN.
