DFSIO testing with Isilon F800

I get a lot of questions about performance on Isilon, and the topic of “Fan-in Ratio” comes up again and again. This summer I had the opportunity to do a Proof-of-Concept with our F800 node type, so I thought I would use it to answer these questions and hopefully provide some guidance and reference points.

The Cork POC lab environment was configured with 8 Isilon F800 nodes running OneFS 8.1.1.1 and 8 Dell R740 compute servers. Each server had:

768 GB of RAM

2 x Intel Xeon Gold 6148 CPUs @ 2.4 GHz (20 cores each)

2 x 480 GB SSDs (RAID-1)

CentOS Linux release 7.5.1804

The backend network between the compute nodes and Isilon is 40 Gbps, with jumbo frames (MTU=9162) set on the NICs and the switch ports.
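A quick way to confirm that jumbo frames are working end to end is a do-not-fragment ping sized just under the MTU. This is only a sanity-check sketch; the interface name and target IP below are placeholders for your own NIC and an Isilon front-end address.

# Check the configured MTU on the 40GbE interface (interface name is a placeholder)
ip link show dev ens3f0 | grep mtu
# 9162-byte MTU minus 20 bytes of IP header and 8 bytes of ICMP header = 9134-byte payload
ping -M do -s 9134 -c 3 10.0.0.10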


Diagram 1 – Architecture

CDH Config

CDH 5.14.2 was configured to run against an Access Zone on Isilon. Service accounts were created in the Isilon local provider and locally in the clients' /etc/passwd files, and all tests were run as a basic test user with no special privileges. I did a little experimenting with the container size because I wanted as much parallelism as possible.
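For reference, creating a matching account on both sides looks roughly like the following. This is a sketch using the OneFS isi auth CLI and plain useradd; the zone name, user name, and UID are illustrative, not the actual values from the POC.

# On Isilon: create the account in the Access Zone's local provider
isi auth users create testuser1 --uid 1050 --zone cdh-zone --provider local
# On each compute node: create the same account with a matching UID
useradd -u 1050 testuser1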

My YARN math: each server had 2 x 20-core CPUs, so 8 servers gave me 320 physical cores, or 640 vcores with hyper-threading, for CDH to allocate as 40 containers per server. Each container was given 8 GB, which used 320 GB of the 768 GB available per server.
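Here is that arithmetic spelled out. The YARN property names in the comments are the standard per-NodeManager settings I would expect to back these numbers, not values copied from the POC configuration.

echo $(( 8 * 2 * 20 ))      # 320 physical cores across the 8 servers
echo $(( 8 * 2 * 20 * 2 ))  # 640 vcores with hyper-threading (yarn.nodemanager.resource.cpu-vcores = 80 per node)
echo $(( 40 * 8 ))          # 320 GB per server for 40 x 8 GB containers (yarn.nodemanager.resource.memory-mb ~= 327680)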


Diagram 2 – A little picture of the YARN Resource Manager as it was running.

The first set of tests was run with a different number of files per run to get a good feel for what the runs would look like and how long they would take, but each was done with a file size of 1000 MB per container.

yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -write -nrFiles 10000 -fileSize 1000
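The matching read and cleanup passes use the standard TestDFSIO options; the invocations below mirror the write run above and are shown for completeness rather than copied from the original job history.

# Read back the same file set to measure read throughput
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -read -nrFiles 10000 -fileSize 1000
# Remove the TestDFSIO working data between runs
yarn jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.6.0-cdh5.14.2-tests.jar TestDFSIO -clean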


Diagram 3 – Isilon Network Throughput Initial Tests.


Diagram 4 – Isilon CPU Initial Tests


Diagram 5 – Isilon File Ops Initial Tests

The above graphs show 4 different runs varying the number of files (1K = 1000, 2K = 2000, and so on). The chart below shows the load balanced across the Isilon nodes. It's good to see an even distribution of the workload, because this is important in Hadoop workflows where many different clients will be attaching to the cluster. Isilon engineering added this data-node load balancing last year with the 8.0.1 code family.


Diagram 6 – Active HDFS Connections per Node Initial Tests

The detailed results from my Grafana dashboards are posted at the links below.

Snapshots from Grafana

Datanode

https://snapshot.raintank.io/dashboard/snapshot/kgE3jqWQGjWGN9xw51yeA9spxhsfxSax

CDH

https://snapshot.raintank.io/dashboard/snapshot/Gwt93ncKT7mYU2Ru3XOmapotkHRpn26G

HDFS

https://snapshot.raintank.io/dashboard/snapshot/IlhwuTTDXfuCKx80l5onzL8x758tn12O

This set of runs pretty much shows the parallelism and throughput numbers that Isilon is capable of. The F800, being an all-flash array, provides the most bandwidth of any of our systems, and the numbers here (~14 GB/s writes and ~16 GB/s reads) are very consistent with the specs we publish for our nodes based on NFS v3 streaming / large-block workflows. So I feel confident in stating these numbers as we size solutions for Hadoop. The next set of tests is designed to show how we can approach “Fan-in” ratio calculations.

Fan-in Testing Results

Based on the previous set of results, I am going to use that 10K-file, 1 GB test and run it against a decreasing number of Isilon nodes, i.e. run 1 = 8 nodes down to run 8 = 1 node. The number of compute nodes will be 8 in each case, configured exactly as in the runs above. To reduce the number of Isilon nodes, all I did was remove the external NIC from the network pool assigned to the Access Zone. Pretty easy and non-disruptive to Hadoop.
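In case it is useful as a reference, pulling a node's front-end interface out of the pool looks roughly like this in the OneFS 8.x CLI. The pool ID and interface name are placeholders for whatever your groupnet/subnet/pool and NIC layout actually are.

# Take node 8's 40GbE interface out of the pool used by the Hadoop Access Zone
isi network pools modify groupnet0.subnet0.pool0 --remove-ifaces 8:40gige-1
# Put it back after the run
isi network pools modify groupnet0.subnet0.pool0 --add-ifaces 8:40gige-1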


Diagram 7 – Summary Results of Network Throughput for Fan-in tests.

The above graphs are pasted from Grafana and represent the Isilon network throughput chart for each run. Note that the scale is GB/s, not Gb/s. Since we did separate runs of reads and writes, we tried to capture both runs on the same chart, but it wasn't always possible, so my apologies if they are confusing. The first takeaway is the steady throughput, even as the number of connections is reduced. The jobs elongate (of course), but it's very predictable and balanced. A good way to look at the results is to take the bare run time from the DFSIO results and plot it as a function of the number of nodes.
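TestDFSIO appends a summary of each run to a local results file, so collecting the run times for that plot can be as simple as the grep below. This assumes the default TestDFSIO_results.log in the working directory; use the -resFile option if you wrote the results somewhere else.

# Pull the execution time of every recorded run out of the local results file
grep "Test exec time sec" TestDFSIO_results.log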


Diagram 8 – Trend of Execution Times for Fan-in Tests.

What this tells us is that for the 4- to 8-node runs, the limiting factor is the throughput the 8 compute servers can drive, since there really isn't an appreciable amount of additional performance across those runs. But once we get to 3 nodes, things definitely start to change, so in the following graphs let's focus on what happens in the 1-, 2- and 3-node runs; this is where the bottlenecks will be. Remember, in each case it's the same job running on the same 8 Dell servers. The first graph below is the active HDFS connections broken out per node, with the leftmost panel showing the 3-node run. One interesting thing is that we effectively have the same number of active connections per node, although we do see a bit of deterioration in the single-node test.


Diagram 9 – Active HDFS Connections for 3, 2 and Single Node Fan-in Tests.

The graph below shows the network throughput numbers. The 3-node test delivers about 10 GB/s for reads and 8.5 GB/s for writes; this drops to 7.5 and 4.5 GB/s for reads and 6 and 3.2 GB/s for writes in the 2-node and 1-node tests respectively. Notice that the single-node test shows a pretty steady network line, with none of the choppiness of the connections graph. Between the two perspectives, I would say we are operating at the actual network limit of that one node. The client end of the transaction is represented by the connections chart above, so its choppiness can be read as follows: while some of the clients are getting data from the single Isilon node, other connections are waiting their turn. Looking at the chart below, we don't see any similar gaps in either the read or the write throughput, which is Isilon pushing as much through the network as the NIC will permit. This is a good thing, since no jobs fail; they just wait for their data. It allows Hadoop teams to fill up their queues and not worry that Isilon will choke on the overall workload.
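A back-of-the-envelope check supports the NIC-limit interpretation. This assumes a single 40GbE front-end interface per Isilon node, which is how I would read the lab setup.

echo "scale=2; 40 / 8" | bc         # 5.00 GB/s theoretical line rate per 40GbE NIC
echo "scale=2; 4.5 / 5 * 100" | bc  # 90.00 – the 1-node read test runs at ~90% of line rate
echo "scale=2; 16 / 8" | bc         # 2.00 GB/s per node for the 8-node read test, well below line rate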


Diagram 10 – Network Throughput for 3, 2 and Single Node Fan-in Tests.

The distribution of the IO across the disk subsystem is shown in the graphs below. One thing to note is that even though I reduced the network connectivity to the compute cluster, the Isilon cluster is still using all 8 nodes to stripe the data. You will also notice how evenly balanced the disk IO is across the cluster, which is itself very beneficial to Hadoop workflows.


Diagram 11 – Disk Throughput for 3, 2 and Single Node Fan-in Tests.

The last piece of the disk subsystem is the IO scheduler, shown below, and it confirms the earlier observation that the network interface is the bottleneck. For the 3-node test (the feature on the left side of the left graph), the data throughput to the disks pretty much matches the network bandwidth, and the queue length is substantial, meaning the network is feeding the disk subsystem steadily. As we remove nodes (the 2-node and single-node runs), the disk queues shrink as well, showing that the bottleneck really is the network in this setup. The right graph is the disk IO scheduler latency; you can see that everything is even and balanced in spite of the seemingly long disk queues. This is another behavior that Hadoop workflows will like.


Diagram 12 – Disk IO Scheduler Metrics for 3, 2 and Single Node Fan-in Tests.

Conclusions

It's pretty straightforward: as a scale-out architecture for Hadoop, Isilon performs extremely well even at high fan-in ratios like 8:1. DFSIO is definitely a batch workload, and you can see in the last charts that reducing the number of Isilon nodes just prolongs the job; nothing breaks, nothing is overloaded, it just keeps running. We've shown that the bottleneck is the network, and it really appears somewhere in the 8:2 to 8:1 fan-in range. We did not discover any real negatives, just that overall application execution times are elongated.

So the takeaway when sizing the server portion of the solution is to choose the right amount of CPU and network throughput to get the job done in the right amount of time. Meet the business requirements and SLAs; there's no need to oversize for these workflows. Using this case as an example, with a hypothetical SLA of one hour for the job to run to completion, one could choose a fan-in ratio of 8:1, because even the longest-running job finishes in less than 3,500 seconds (see Diagram 8). This is how you right-size your Hadoop environment and architect a solution that meets the business requirement, made possible by separating storage and compute using Isilon OneFS for HDFS storage.
