ISSUE: Policy state pending or failed for Android devices

3. If the XMS servers are load balanced by NetScaler, make sure both MDM load balancers are configured for SSL session persistence (that is, Persistence set to SSLSESSION).
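If the NetScaler is managed from the CLI rather than the GUI, the persistence setting can be checked and set along the following lines (a sketch only; the virtual server name is a placeholder for your MDM load-balancing vservers):

show lb vserver <MDM_LB_vserver>
set lb vserver <MDM_LB_vserver> -persistenceType SSLSESSION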


4. Make sure the device has a stable data (network) connection.

5. Verify whether any Android connection scheduling policies are configured on XMS. If a scheduling policy is configured, policy updates occur according to that schedule.


Please refer to the following link for more details on the scheduling policy:

https://docs.citrix.com/en-us/xenmobile/10-3/xmob-device-policy-wrapper/xmob-device-policy-connection-scheduling.html

6. If you are using GCM (Google Cloud Messaging) or FCM (Firebase Cloud Messaging) instead of the scheduling policy above, policy updates reach the devices immediately. If policies are not updated immediately even after enabling GCM/FCM, make sure firewall port 443 is open from the XMS server to android.apis.google.com and google.com (a quick connectivity check is sketched below). Please refer to the following link for more details on FCM/GCM:

https://docs.citrix.com/en-us/xenmobile/10-4/provision-devices/google-cloud-messaging.html
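A quick way to confirm that outbound port 443 is reachable from the XMS server to these hosts is a TLS handshake test. This is a minimal sketch, to be run from the XMS command line or a machine on the same network segment, substituting the exact GCM/FCM hostnames used in your deployment:

openssl s_client -connect android.apis.google.com:443 < /dev/null
openssl s_client -connect google.com:443 < /dev/null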

7. If you have configured neither an Android scheduling policy nor GCM/FCM, Android Secure Hub falls back to its default internal heartbeat mechanism, which makes the device check in every 6 hours for policy updates.

8. If the XMS servers are deployed in a clustered environment, try shutting down all the nodes except one and verify whether the issue still exists. Run this test on each of the other nodes as well to identify the faulty node, then replace or shut down that node.

9. If the issue still exists, you can collect Android device logs by enabling USB debugging on the device and turning on verbose Secure Hub logging with the command below; the output can then be captured with adb logcat.

adb shell setprop log.tag.SecureHub DEBUG
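For reference, a minimal end-to-end capture might look like the following sketch (assuming USB debugging is enabled, adb is installed on the workstation, and the SecureHub log tag shown above; adjust the tag if your Secure Hub version logs under a different name):

adb devices                                # confirm the device is detected
adb shell setprop log.tag.SecureHub DEBUG  # raise Secure Hub logging to DEBUG
adb logcat -d -v time > securehub.log      # dump the current log buffer to a file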

Please refer to the following link for more information:

https://www.citrix.com/blogs/2015/06/02/mobility-experts-how-to-collect-android-device-logs-for-troubleshooting-xenmobile-issues/

Related:

Isilon Gen6: Addressing Generation 6 Battery Backup Unit (BBU) Test Failures

Article Number: 518165 Article Version: 7 Article Type: Break Fix



Isilon Gen6,Isilon H400,Isilon H500,Isilon H600,Isilon A100,Isilon A2000,Isilon F800

Gen6 nodes may report spurious Battery Backup Unit (BBU) failures similar to the following:

Battery Test Failure: Replace the battery backup unit in chassis <serial number> slot <number> as soon as possible.

Issues were identified in both the OneFS battery test code and the battery charge controller (bcc) firmware that can cause these spurious errors to be reported.

The underlying causes of most spurious battery test failures have been resolved in OneFS 8.1.0.4 and newer and in Node Firmware Package 10.1.6 and newer (DEbcc/EPbcc v 00.71); to resolve this issue, please upgrade to these software versions, in that order, as soon as possible. To perform these upgrades and resolve this issue, the following steps are required:

Step 1: Check the BBU logs for a "Persistent fault" message. This indicates a test failure state that cannot be cleared in the field. Run the following command on the affected node:

# isi_hwmon -b | grep "Battery 1 Status"


If the battery reports a Persistent Fault condition, gather and upload logs using the isi_gather_info command, then contact EMC Isilon Technical Support and reference this KB.

Step 2: Clear the erroneous battery test result by running the following commands:

# isi services isi_hwmon disable

# mv /var/log/nvram.xml /var/log/nvram.xml.old

Step 3: Clear the battery test alert and unset the node read-only state so the upgrade can proceed:

– Check 'isi event events list' to get the event ID for the HW_INFINITY_BATTERY_BACKUP_FAULT event. Then run the following commands:

# isi event modify <eventid> --resolved true

# /usr/bin/isi_hwtools/isi_read_only --unset=system-status-not-good
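For convenience, Steps 1 through 3 can be strung together on the affected node roughly as follows. This is only a sketch of the commands already listed above; it assumes the battery does not report a persistent fault, and the event ID still has to be read manually from the event list output:

# Step 1: check for a persistent fault (if one is reported, stop and contact support)
isi_hwmon -b | grep "Battery 1 Status"

# Step 2: clear the erroneous battery test result
isi services isi_hwmon disable
mv /var/log/nvram.xml /var/log/nvram.xml.old

# Step 3: resolve the alert and clear the read-only state
isi event events list    # note the event ID of the HW_INFINITY_BATTERY_BACKUP_FAULT event
isi event modify <eventid> --resolved true
/usr/bin/isi_hwtools/isi_read_only --unset=system-status-not-good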

Step 4: Upgrade OneFS to 8.1.0.4 or later

Instructions for upgrading OneFS can be found in the OneFS Release Notes on the support.emc.com web site.

Step 5: Update node firmware using Node Firmware Package 10.1.6 or later

Instructions for upgrading node firmware can be found in the Node Firmware Package Release Notes on the support.emc.com web site.

Once the system is upgraded, no further spurious battery replacement alerts should occur.

If a OneFS upgrade to 8.1.0.4 or newer is not an option at this time, or if the system generates further battery failure alerts after upgrading, please contact EMC Isilon Technical Support for assistance and reference this KB.

Related:

  • No Related Posts

VxRail: PTAgent upgrade failure, ESXi error "Can not delete non-empty group: dellptagent"

Article Number: 516314 Article Version: 6 Article Type: Break Fix



VxRail 460 and 470 Nodes,VxRail E Series Nodes,VxRail P Series Nodes,VxRail S Series Nodes,VxRail V Series Nodes,VxRail Software 4.0,VxRail Software 4.5

The VxRail upgrade process fails when upgrading PTAgent from an older version (1.4 and below) to a newer version (1.6 and above).

Error message

[LiveInstallationError]

Error in running ['/etc/init.d/DellPTAgent', 'start', 'upgrade']:

Return code: 1

Output: ERROR: ld.so: object '/lib/libMallocArenaFix.so' from LD_PRELOAD cannot be preloaded: ignored.

ERROR: ld.so: object '/lib/libMallocArenaFix.so' from LD_PRELOAD cannot be preloaded: ignored.

ERROR: ld.so: object '/lib/libMallocArenaFix.so' from LD_PRELOAD cannot be preloaded: ignored.

Errors:

Can not delete non-empty group: dellptagent

It is not safe to continue. Please reboot the host immediately to discard the unfinished update.

Please refer to the log file for more details.

Dell ptAgent upgrade failed on target: <hostname> failed due to Bad script return code:1

PTAgent cannot be removed without ESXi requiring a reboot. Earlier versions of PTAgent (lower than 1.6) had a problem handling process signals, so ESXi is unable to stop the agent no matter what signal is sent or what method is used to kill the process. Rebooting ESXi is required to kill the defunct process so the upgrade can proceed.

PTAgent 1.6 (and above) has this issue fixed, but upgrading from 1.4 to 1.6 cannot be completed without human intervention once the issue is encountered.


Impacted VxRail versions (Dell platform only):

  • 4.0.x: VxRail 4.0.310 and below
  • 4.5.x: VxRail 4.5.101 and below

This issue is fixed in recent VxRail releases, but upgrades from earlier VxRail releases are greatly impacted. Customers are strongly advised to contact Dell EMC Technical Support to upgrade to PTAgent 1.7-4, which is included in the VxRail releases below:

  • VxRail 4.0.500 for customers who stay on vSphere 6.0
  • VxRail 4.5.211 or above for customers who choose vSphere 6.5

Manual workaround if experiencing the PTAgent upgrade failure

  • Enter maintenance mode and reboot the host mentioned in the error message.
  • Wait until the host is available and shows the proper state in vCenter, then click the Retry button in VxRail Manager to retry the upgrade (an equivalent ESXi shell sketch follows below).
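If you have shell (SSH) access to the affected host, the same workaround can also be driven from the ESXi command line. The following is only a rough sketch, under the assumption that VMs have already been evacuated and your normal maintenance-mode/vSAN procedure is followed; host-specific details are omitted:

vim-cmd hostsvc/maintenance_mode_enter        # place the host in maintenance mode
reboot                                        # discard the unfinished update and clear the defunct PTAgent process
# after the host is back up and healthy in vCenter:
esxcli software vib list | grep -i ptagent    # confirm which PTAgent VIB is currently installed

Then click the Retry button in VxRail Manager as described above.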

Related:

  • No Related Posts

ECS: One node will not power on in an ECS Gen1 or Gen2 system.

Article Number: 504631 Article Version: 3 Article Type: Break Fix



ECS Appliance,ECS Appliance Hardware,Elastic Cloud Storage

This KB article addresses when only one node will not power-on in an ECS Gen1 or Gen2 system.

One node will not power on in an ECS Gen1 or Gen2 system.

Bad blade server or bad chassis.

N/A

For ECS Gen1 and Gen2 systems, there are redundant Power Supply Units (PSUs) which supply power to a chassis and up to 4 blade servers in the chassis.

Based on this, if 1 node out of 4 will not power on, the issue can’t be the PSUs because the other nodes in the same chassis are powered on.

The issue has to be the blade server or the chassis itself.

Using an example where node 4 will not power on, one can swap the blade server from the node 3 position to the node 4 position and vice versa.

If the issue stays with the slot where node 4 resides, the issue is the chassis. If the issue follows the blade server, then the blade server is at issue.

Note: This sort of troubleshooting can only be done at install time before the OS and ECS software is loaded on the system.

Related:

  • No Related Posts

Nutanix AFS (Nutanix Files) might not function properly with the ELM

This information is very preliminary and has not been rigorously tested.

AFS appears to use DFS namespace redirection to point you to the individual nodes in the AFS cluster where your data is actually held. The ELM does not support DFS redirection, so when STATUS_PATH_NOT_COVERED comes back from the initial node we reached, we fail the attempt instead of moving to the requested server. If you happen to connect to the node where your data is, there is no redirection and no error.

Unfortunately, there does not appear to be a workaround except to point the ELM to a specific node in the AFS cluster instead of the main cluster address. This node probably has to be the AFS “leader” node.

Related:

  • No Related Posts

Storage Node Network connectivity to Data Domain best practices

I am looking for some advice on best practices for connecting NetWorker storage nodes in an environment where clients have backup IPs in several different VLANs. So basically our storage nodes will contact NDMP clients over their backup network at layer 2 on different VLANs and need to send the backup data to Data Domain on a separate VLAN.

To depict this, here is how we are currently backing up:

NDMPClient1-Backup-vlan1———->Storage Node-Backup-Vlan1( Vlan5)———->DataDomain over Vlan5

NDMPClient2-Backup-vlan2———->Storage Node-Backup-Vlan2( Vlan5)———->DataDomain over Vlan5

NDMPClient3-Backup-vlan3 ———->Storage Node-Backup-Vlan3( Vlan5)———->DataDomain over Vlan5

NDMPClient4-Backup-vlan4 ———->Storage Node-Backup-Vlan4( Vlan5)———->DataDomain over Vlan5

So for every NDMP client backup VLAN, we defined an interface on the storage nodes in the same VLAN.

And for storage node to Data Domain connectivity, we have a separate backup VLAN at layer 2.

Since this is a three-way NDMP backup, the traffic flows from clients to storage nodes on one network and from storage nodes to Data Domain over a different path.

Is this a good model, or is there another model we can adopt to get better backup/restore performance?

Thanks in advance

Related:

  • No Related Posts

How Microsoft Service Witness Protocol Works in OneFS

The Service Witness Protocol (SWP) is a remote procedure call (RPC)-based protocol. In a highly available cluster environment, SWP is used to monitor the states of resources such as servers and NICs, and to proactively notify registered clients when the monitored resource states change.

This blog will talk about how SWP is implemented on OneFS.

In OneFS, SWP is used to notify SMB clients when a node is down or rebooted or when NICs are unavailable. The Witness server in OneFS therefore needs to monitor the states of nodes and NICs and the assignment of IP addresses to the interfaces of each pool. This information is provided by SmartConnect/FlexNet and the OneFS Group Management Protocol (GMP).

The OneFS GMP is used to create and maintain a group of synchronized nodes. GMP distributes a variety of state information about nodes and drives, from identifiers to usage statistics, so the Witness service can get node states from GMP notifications.

As for the IP address information in each pool, SmartConnect/FlexNet provides the following to support the SWP protocol in OneFS:

  1. Locate the FlexNet IP pool for a given pool member's IP address. The Witness server can determine which IP pool an address belongs to and get the other pool members' information from a given IP address.
  2. Get the SmartConnect zone name and alias names for the FlexNet IP pool obtained in the previous step.
  3. Witness can subscribe to changes to the FlexNet IP pool when the following changes occur:
    • Witness is notified when an IP address is added to an active pool member or removed from a pool member.
    • Witness is notified when a NIC goes from DOWN to UP or from UP to DOWN, so the Witness knows whether an interface is available.
    • Witness is notified when an IP address is moved from one interface to another.
    • Witness is notified when an IP address is about to be removed from the pool or moved from one interface to another by an admin or a re-balance process.

The figure below shows the process of Witness selection and failover.


  1. An SMB CA-capable client connects to a OneFS cluster SMB CA share through the SmartConnect FQDN on Node 1.
  2. The client finds that CA is enabled and starts the Witness registration process by sending a GetInterfaceList request to Node 1 (a CLI check for the CA setting is sketched after these steps).
  3. Node 1 returns a list of available Witness interface IP addresses to which the client can connect.
  4. The client selects an interface IP address from the list (in this example, Node 2 is selected as the Witness server). The client then sends a RegisterEx request to Node 2, but this request fails because OneFS does not support that operation. RegisterEx is a new operation introduced in SWP version 2; OneFS only supports SWP version 1.
  5. The client sends a Register request to Node 2 to register for resource state change notifications for NetName and IPAddress (in this example, the NetName is the SmartConnect FQDN and the IPAddress is the IP of Node 1).
  6. The Witness server (Node 2) processes the request and returns a context handle that identifies the client on the server.
  7. The client sends an AsyncNotify request to Node 2 to receive asynchronous notifications of cluster node and node interface state changes.
  8. Assume Node 1 goes down unexpectedly. The Witness server (Node 2) becomes aware that Node 1 is down and sends an AsyncNotify response to notify the client that the server is down.
  9. The SMB CA feature forces the client to reconnect to the OneFS cluster using the SmartConnect FQDN. In this example, the SMB connection successfully fails over to Node 3.
  10. The client sends the context handle in an UnRegister request to unregister for notifications from the Witness server (Node 2).
  11. The Witness server processes the request by removing the entry and no longer notifies the client about resource state changes.
  12. Steps 12-17: The client repeats the registration process, similar to steps 2-7.
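Step 2 above depends on the share having been created with continuous availability enabled. A quick way to confirm this from the OneFS CLI is to inspect the share definition; this is only a sketch, and the exact field label in the output can vary slightly between OneFS releases:

isi smb shares list                  # list the SMB shares in the current access zone
isi smb shares view <sharename>      # look for the continuously-available setting in the output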

Related:

  • No Related Posts

Isilon: No response from isi_stats_d accompanied by performance issues

Article Number: 498278 Article Version: 3 Article Type: Break Fix



Isilon

Performance issues reported by the client, up to and including data unavailability. Some troubleshooting methods impede a quick time to resolution by focusing on data collection of symptoms rather than exposing, during a live engagement, a "common cause" characterized by all or most of the following pattern:

Slow response and timeouts for admin WebUI and CLI commands, especially commands requiring node statistics, with many errors per minute in messages, job, and celog (OneFS v7.x) or tardis (OneFS v8.x), found by running:

isi_for_array -s "tail /var/log/messages"

displays messages such as

“Failed to send worker count stats”

“No response from isi_stats_d after 5 secs”

“Error while getting response from isi_celog_coalescer” (OneFSv7x), or “Unable to get response from…”

Hangdumps, especially if they occur multiple times per hour, identified live in messages on any node and displayed with:

grep -A1 <today, as in 2017-05-11> "LOCK TIMEOUT AT" /var/log/messages

isi_hangdump: Initiating hangdump…

Intermittent high CPU load displayed as 1, 5, and 10 minute CPU load averages with:

isi_for_array -s uptime

High memory utilization on one or more services and on one or more nodes displayed by

isi_for_array -s "ps -auwx"

Service timeouts and Stack traces indicating memory exhaustion and service or “Swatchdog” timeouts displayed by

isi_for_array -s "grep <today as in 2017-05-11> /var/log/messages | grep -A3 -B1 Stack"

The above complex symptoms indicate node resource exhaustion. This can be caused by long wait times, locking, and an unbalanced workflow exceeding one or more nodes' capabilities, including one of several known causes:

– Uptime bugs

– IB hardware or switch failures

– BMC/CMC gen5 controller unresponsive

– SyncIQ policy when-source-modified configured on an active path

– SyncIQ job degrades performance after upgrade to 8.x with default SyncIQ Performance Rules and limited replication IP pool

– isi commands locking drive_purposing.lock

This KB recommends quickly identifying or eliminating these known performance disruptors before proceeding with more detailed symptom troubleshooting.

Depending on the workflow, the timeouts, memory exhaustion, and stack traces may occur on one service more than another, such as lwio for SMB. Before implementing troubleshooting guides (collecting lwio cores, etc.) for a particular service, when a pattern similar to the above is present, run the following commands and record the outcome in a case comment to indicate or eliminate a "common cause" (the commands are also collected into a single sketch after this list).

uname -a

Interpret: Susceptible to 248 or 497 day uptime bugs: OneFS v7.1.0.0-7.1.0.6, 7.1.1.0-7.1.1.5, 7.2.0.0-7.2.0.3, and 7.2.1.0

Susceptible to drive_purposing.lock condition OneFS v7.1.1.0-7.1.1.9, 7.2.1.0-7.2.1.2 or below, and 8.0.0.0

isi_for_array -s uptime

Interpret: Uptime at or about 248 or 497 days and uname -a indicates a susceptible version? This indicates the uptime bug.

isi status

Interpret: If the command runs slowly, or statistics time out or display n/a n/a n/a, isi_stats_d is not communicating on one or more nodes.

If uname indicates susceptibility to drive_purposing.lock, close multiple WebUI instances and run isi_for_array "killall isi_stats_d"

isi_for_array -s "tail /var/log/ethmixer.log"

Interpret: Many state changes and ports registering as down (changes not reporting "is alive") indicate an IB hardware issue: cable, card, or switch.

A lack of IB errors reported in the ethmixer log while stat and service failures remain intermittent suggests refocusing on SyncIQ or the job engine.

/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware

Interpret: If the nodes are Gen5 (X or NL 210/410, HD400) and OneFS is below v8.0.0.4, the cluster is susceptible. No version output means the controller is unresponsive.

A responding controller does not fully eliminate this as a cause, since isi_stats_d and dependent services can fail while the controller is unresponsive and then not recover after the controller restarts.

isi sync policies list -v | grep Schedule

Interpret: If any policy shows "when-source-modified", note the policy <name> and disable it: isi sync policies modify <name> --enabled false

isi sync rule list (for uname indicating 8.x only)

Interpret: If the output is blank, no Performance Rules exist and SyncIQ will run with 8.x defaults.

isi sync policies list -v | grep -A1 "Source Subnet"

Interpret: If blank, there are no IP pool restrictions. If subnet and pool restrictions are listed along with default Sync Performance Rules, the nodes participating in the limited replication pool may be overtaxed during SyncIQ jobs.
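For convenience, the quick checks above can be run as one triage pass from any node. This is only a sketch that bundles the commands already listed in this article; it is read-only and is meant to surface the "common cause" indicators described here:

# OneFS version and per-node uptime (248/497-day uptime bugs, drive_purposing.lock susceptibility)
uname -a
isi_for_array -s uptime

# Node and statistics health, recent errors
isi status
isi_for_array -s "tail /var/log/messages"

# InfiniBand state changes
isi_for_array -s "tail /var/log/ethmixer.log"

# Gen5 BMC responsiveness (no output means the controller is unresponsive)
/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware

# SyncIQ configuration: when-source-modified policies, Performance Rules, pool restrictions
isi sync policies list -v | grep Schedule
isi sync rule list
isi sync policies list -v | grep -A1 "Source Subnet"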

More troubleshooting and indicator details on each “common cause”:

Uptime bugs

Indicators: More likely to include one or more nodes offline, recently restarted, or split from the cluster, and full-cluster DU rather than intermittent symptoms.

Check uptime and code version for susceptibility at or above 248 or 497 days:

QuickTip: On any node CLI run

uname -a (OneFS versions susceptible to the 248 or 497 day bugs: 7.2.1.0, 7.2.0.0 – 7.2.0.3, 7.1.1.0 – 7.1.1.5, and 7.1.0.0 – 7.1.0.6)

isi_for_array -s uptime (nodes at or about 248 or 497 days?)

For more information on uptime bug troubleshooting, and on identifying it as the cause of performance and nodes-offline behavior:

ETA 209918: Isilon OneFS: Nodes run for more than 248.5 consecutive days may restart without warning which may lead to potential data unavailability https://support.emc.com/kb/301837

ETA 202452: Isilon OneFS: Nodes that have run for 497 consecutive days may restart without warning https://support.emc.com/kb/301837

ETA 491747: Isilon OneFS: Gen5 nodes containing Mellanox ConnectX-3 adapters reboot after 248.5 consecutive days https://support.emc.com/kb/491747

Infiniband (IB) hardware or switch failure

Indicators: Less likely to be intermittent than one or more nodes offline or split. If all nodes, or all of one IB channel (e.g., ib0), are unable to ping, then the issue is likely a failed or unpowered IB switch.

QuickTip

isi status takes a long time to run and displays one or more nodes with throughput/drive stats as n/a n/a n/a

/var/log/ethmixer.log shows many state changes and ports registering as down (changes not displaying status "is alive")

ifconfig shows IP addresses for ib0 and ib1 – ping another node's IB address; failure indicates an IB cable, card, or switch issue.

For more information on IB errors and troubleshooting them: https://support.emc.com/kb/30183

BMC/CMC unresponsive

Indicators: Job, replication, celog, and gconfig errors and unaccountably long job run times when the OneFS version is older than 8.0.0.4 and the BMC firmware is older than version 1.25. If attempting to query the BMC for the firmware version on any node gets no output, an unresponsive BMC/CMC is confirmed and is likely a contributor to the performance issues.

QuickTip

/usr/bin/isi_hwtools/isi_ipmicmc -d -V -a bmc | grep firmware

Note: A responding controller does not eliminate this as a contributing issue. Patches and versions prior to OneFS 8.0.0.4 added features to restart unresponsive management controllers. However the impact on stats and other dependent services such as jobs, celog and replication can persist following the controller restart. For more information on BMC errors and troubleshooting: https://support.emc.com/kb/466373

Sync policy with Schedule when-source-modified exists in an active folder

Indicators: More likely associated with intermittent hangdumps and stacks tracking workflow peaks; symptoms quiet down off-hours.

QuickTip

isi sync policies list -v | grep Schedule

If "when-source-modified" is found on one or more policies, disable that policy while troubleshooting:

isi sync policies modify <name> --enabled false

If there are many policies with that attribute, the quickest relief can come from disabling the isi_migrate service:

isi services -a isi_migrate disable

Continue with services troubleshooting and repair, restart node services such as isi_stats_d and isi_stats_hist_d, and change the sync policy to use a schedule rather than when-source-modified before re-enabling the Sync policy or service:

isi services -a isi_migrate enable

Sync job degrades performance after upgrade to 8.x

Indicators: After upgrading to 8.x without adjusting Performance Rules, a cluster without all nodes on the network, or with an administratively limited replication IP pool, can overtax resources on the few participating nodes when SyncIQ jobs run.

QuickTip

isi sync rule list displays nothing (Performance Rules running as default)

isi sync policies list -v | grep -A1 "Source Subnet" (if blank, there are no IP pool restrictions; otherwise it will list the subnet and pool)

If there are no Performance Rules and the replication IP pool is restricted, disable the isi_migrate service until SyncIQ Performance Rules are added or the number of nodes participating in replication is increased, to reduce the imbalance of Sync workers on the participating nodes:

isi services -a isi_migrate disable

isi commands locking drive_purposing.lock

Indicators: Verification of the issue can be made offline from logs with a hangdump review, but the key indicators during a live engagement are:

a) most likely to occur when running multiple isi_statistics or WebUI commands at the same time from different nodes, and

b) the cluster is at OneFS versions 7.1.1.0-7.1.1.9, 7.2.1.0-7.2.1.2, or 8.0.0.0.

QuickTip

If the above indicators are present, stop any statistics commands and WebUIs running on all nodes and restart the isi_stats_d daemon with:

isi_for_array killall isi_stats_d

Related:

  • No Related Posts

Isilon OneFS: How to smartfail out a node pool

Article Number: 504175 Article Version: 3 Article Type: Break Fix



Isilon OneFS,Isilon OneFS 7.1,Isilon OneFS 7.2,Isilon OneFS 8.0,Isilon OneFS 8.1

Here are the steps to properly smartfail out a node pool that is no longer needed in the cluster.

1 – Move off the majority of the data through File Pool Policies

Either through the CLI or through the WebUI, edit the File Pool Policies to point data from the pool being decommissioned to another pool in the cluster. For assistance on how best to configure this, please reference the Administration Guide for your OneFS version.

Once the File Pool Policies have been changed, start a Smartpools job in order to apply the changes that were made. If the File Pool Policies were configured correctly, this should move the majority of the data.

Note: It is normal for there to still be some space utilized on the Node Pool (generally under 5%, but it can be more). This is fine and won’t cause any issues.

2 – Ensure Global Spillover is enabled so the last bit of data on the nodes is allowed to move to other node pools

CLI:

# isi storagepool settings view

WebUI:

File System -> Storage Pools -> SmartPool Settings

If it isn't enabled, enable it before proceeding.

3 – Start the Smartfail process
Smartfail 1 node at a time with the following command:
OneFS 7.x

# isi devices -a smartfail -d <node LNN>

OneFS 8.x

# isi devices node smartfail --node-lnn=<node LNN>


When the Smartfail process completes (it is handled by a FlexProtect job), move on to the next node (a quick way to monitor the job is sketched at the end of this step).

Smartfail them 1 at a time until you have 2 nodes remaining. Once you are at this point, start the Smartfail process on both nodes. To have Node Pool quorum, you need at least 51% of the devices online. If you Smartfail just 1 node, then the Node Pool no longer has quorum and will be unable to complete the Smartfail process. By putting both nodes in a Smartfail status, the data will be striped to the other node pools.
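To tell when the FlexProtect job for the current smartfail has finished before moving on to the next node, the job engine and node status can be watched from any node. This is a sketch for OneFS 8.x; on OneFS 7.x the equivalent job command is isi job status:

isi job jobs list      # look for a running FlexProtect job
isi status             # overall node state, including smartfail progress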

Related:

  • No Related Posts

DFSIO testing with Isilon F800

I get a lot of requests when it comes to performance on Isilon, and then there is the topic of "Fan-in Ratio". I had the opportunity this summer to do a Proof-of-Concept with our F800 node type, so I thought I would use the opportunity to answer these questions and hopefully provide some guidance and reference points.

The Cork POC lab environment was configured with 8x Isilon F800 nodes running OneFS 8.1.1.1, and we used eight Dell R740 servers. Each server had:

768 GB of RAM

2 x Intel(R) Xeon(R) Gold 6148 CPU @ 2.4GHz 20 Core

2 SSD 480GB ( RAID-1)

CentOS Linux release 7.5.1804

The backend network between compute nodes and Isilon is 40Gbps with Jumbo Frames set (MTU=9162) for the NICs and the switch ports.


Diagram 1 – Architecture

CDH Config

CDH 5.14.2 was configured to run in an Access Zone on Isilon; service accounts were created in the Isilon Local provider and locally in the client /etc/passwd files. All tests were run using a basic test user with no special privileges. I did a little experimenting with the container size because I wanted as much parallelism as possible.

My YARN math: since I had 2 x 20-core CPUs per box and 8 boxes, that was 320 total physical cores, and with hyper-threading 640 vcores for CDH to allocate as 40 containers per server. For each container I allocated 8 GB, which used 320 GB per server out of the 768 available.
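Spelled out as quick shell arithmetic (the 40 containers per server and 8 GB per container are simply the values I chose for this POC, not CDH defaults):

echo $(( 2 * 20 * 8 ))      # 320 physical cores (2 sockets x 20 cores x 8 servers)
echo $(( 2 * 20 * 8 * 2 ))  # 640 vcores with hyper-threading
echo $(( 40 * 8 ))          # 320 GB of container memory per server (40 containers x 8 GB each)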


Diagram 2 – A little picture of the Yarn Resource Manager as it was running.

The first set of tests was run with different numbers of records per run to get a good feel for what the runs would look like and how long they would take, but each was done with a file size per container of 1000 MB.

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -write -nrFiles 10000 -fileSize 1000
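The read passes referenced later in this post would be invoked the same way; a sketch of the corresponding commands (TestDFSIO's -read mode reuses the data set produced by the write pass, and -clean removes it afterwards):

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -read -nrFiles 10000 -fileSize 1000

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -clean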


Diagram 3 – Isilon Network Throughput Initial Tests.


Diagram 4 – Isilon CPU Initial Tests


Diagram 5 – Isilon File Ops Initial Tests

The above graphs show 4 different runs for the number of files (1K=1000, 2K=2000, etc.). The last chart below shows the load balanced across the Isilon nodes. It's good to see an even distribution of the workload, because this is important in Hadoop workflows where many different clients will be attaching to the cluster. Isilon engineering added this DataNode load balancing last year with the 8.0.1 code family.


Diagram 6 – Active HDFS Connections per Node Initial Tests

The detailed results are posted at the links below from my Grafana dashboards.

Snapshots from Grafana

Datanode

https://snapshot.raintank.io/dashboard/snapshot/kgE3jqWQGjWGN9xw51yeA9spxhsfxSax

CDH

https://snapshot.raintank.io/dashboard/snapshot/Gwt93ncKT7mYU2Ru3XOmapotkHRpn26G

HDFS

https://snapshot.raintank.io/dashboard/snapshot/IlhwuTTDXfuCKx80l5onzL8x758tn12O

This set of runs pretty much shows the parallelism and throughput numbers that Isilon is capable of. The F800, being an all-flash array, provides the most bandwidth of any of our systems, and the numbers here (~14 GB/s writes and 16 GB/s reads) are very consistent with the specs we publish for our nodes based on NFSv3 streaming / large-block workflows. So I feel confident in stating these numbers as we size solutions for Hadoop. The next set of tests is designed to show how we can approach "Fan-in" ratio calculations.

Fan-in Testing Results

Based on the previous set of results, I am going to use that 10K-file, 1 GB test and run it against a decreasing number of nodes, i.e. run 1 = 8 nodes down to run 8 = 1 node. The number of compute nodes will be 8 in each case, configured exactly as in the runs above. To reduce the number of Isilon nodes, all I did was remove the external NIC from the network pool assigned to the Access Zone (a sketch of the command is shown below). Pretty easy and non-disruptive to Hadoop.
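For reference, removing (and later restoring) a node's external interface from the pool can be done with isi network pools modify. This is a hedged sketch only: the pool ID and interface name below are made-up placeholders, and option names can differ between OneFS 8.x releases, so check isi network pools modify --help on your cluster first:

isi network pools modify groupnet0.subnet0.hadoop-pool --remove-ifaces 8:40gige-1   # drop node 8's external NIC from the pool
isi network pools modify groupnet0.subnet0.hadoop-pool --add-ifaces 8:40gige-1      # add it back after the run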


Diagram 7 – Summary Results of Network Throughput for Fan-in tests.

The above graphs are pasted from Grafana and represent the Isilon network throughput chart for each run. The scale is GB/s, not Gb/s. Since we did separate runs of READS and WRITES, we tried to capture both runs on the same chart, but it wasn't always possible, so my apologies if they are confusing. The first takeaway is the steady throughput even though the number of connections is reduced. The jobs elongate (of course), but it's very predictable and balanced. A good way to look at the results is to just take the bare run time from the DFSIO results and plot it as a function of the number of nodes.


Diagram 8 – Trend of Execution Times for Fan-in Tests.

What this tells us is that for the 4-8 node counts, the limiting factor is the amount of throughput the servers can generate, since there really isn't an appreciable amount of additional performance for those runs. But once we get to 3 nodes, things definitely start to change. So in the following graphs, let's focus on what happens in the 1-, 2-, and 3-node runs; this is where the bottlenecks will be. Remember, in each case it's the same job running on the same 8 Dell servers. The first graph below is the active HDFS connections broken out per node, with the leftmost graph showing the 3-node run. One thing that is interesting: we effectively have the same number of active connections per node; however, we do see a bit of deterioration in the single-node test.


Diagram 9 – Active HDFS Connections for 3, 2 and Single Node Fan-in Tests.

The graph below shows the network throughput numbers. The 3-node test shows about 10 GB/s for reads and 8.5 GB/s for writes. This drops to 7.5 and 4.5 GB/s for reads and 6 and 3.2 GB/s for writes in the 2-node and 1-node tests respectively. Here we notice that the single-node test shows a pretty steady network, with none of the choppiness seen in the connections graph. Between the two perspectives, I would say that we are operating at the actual network limit for that one node. The client end of the transaction is represented by the connections chart above, so the choppiness of that chart can be interpreted as: while a few of the clients are getting data from the single Isilon node, some of the connections are waiting for their turn. Then, looking at the chart below, we don't see any similar gaps in either the read or the write throughput numbers, which is Isilon pushing as much through the network as the NIC card will permit. This is a good thing, since no jobs are failing; they are just waiting for their data. This will allow Hadoop teams to fill up their queues and not worry that Isilon will choke on the overall workload.


Diagram 10 – Network Throughput for 3, 2 and Single Node Fan-in Tests.

The distribution of the IO across the disk subsystem is shown in the graphs below. One thing to note: even though I reduced the network connectivity to the compute cluster, the Isilon cluster is still using all 8 nodes to stripe the data, and you will also notice how evenly balanced the disk IO is across the cluster, which itself is very beneficial to Hadoop workflows.


Diagram 11 – Disk Throughput for 3, 2 and Single Node Fan-in Tests.

The last piece of the disk subsystem is the IO scheduler, shown below. This confirms the previous observation that the network interface is the bottleneck. For the 3-node test (featured on the left side of the left graph), the data throughput to the disks pretty much matches the network bandwidth, and we see that the queue length is substantial, meaning the network is feeding the disk subsystem pretty steadily. As we remove nodes (the 2-node and single-node runs), the disk queues are reduced as well, showing that the bottleneck really is the network in this setup. The right graph is the disk IO scheduler latency; you can see that everything is even and balanced in spite of the seemingly long disk queues. This is another feature that Hadoop workflows will like.


Diagram 12 – Disk IO Scheduler Metrics for 3, 2 and Single Node Fan-in Tests.

Conclusions

It's pretty straightforward: as a scale-out architecture for Hadoop, Isilon performs extremely well at high 8:1 fan-in ratios. DFSIO is definitely a batch process, and you can see in the last chart that reducing the Isilon nodes just prolongs the job; nothing breaks, nothing is overloaded, it just keeps running. We've shown that the bottleneck is the network, and it really occurs somewhere in the 8:1 or 8:2 fan-in range. We did not discover any real negatives, just that overall application execution times are simply elongated. So the takeaway is, when sizing the server portion of the solution, choose the right amount of CPU and network throughput to get the job done in the right amount of time. Meet the business requirements and SLAs; there's no need to oversize for these workflows. Using this case as an example, with a hypothetical SLA of 1 hour for the job to run to completion, one could choose the fan-in ratio of 8:1 because the longest-running job actually finishes in less than 3500 seconds (see Diagram 8). This is how you right-size your Hadoop environment and architect the solution to meet the business requirement, made possible by separating storage and compute using Isilon OneFS for HDFS storage.

Related:

  • No Related Posts