Adding Lustre Storage to the HPC Equation

For organizations that need extreme scalability in high-performance computing systems, Lustre is often the file system of choice — for a lot of good reasons. When it comes to high-performance computing applications, there is basically no such thing as too much data storage. Who doesn’t need more storage? Everywhere you look, HPC applications are ballooning in size. A few examples: AccuWeather, the world’s largest source of weather forecasts and warnings, responds to more than 30 billion data requests daily.[1] The wave of medical data washing over the global healthcare industry is expected to swell to 2,314 … READ MORE

Related:

Drive Your Business Today and Innovate for Tomorrow Using the PowerEdge Server Platform

Who says you can’t have your cake and eat it too? We all know that the key ingredients of a great cake — flour, oil and sugar — must be combined in the right proportions. Similarly, automation, security and scalability are the building blocks of server infrastructure. This foundation enables you to get the best results for your core business applications and prepares you to take on more demanding, complex workloads. Dell EMC PowerEdge servers help you develop the perfect recipe for success, so you can drive business today while you prepare for tomorrow’s new applications and business models. From deployments … READ MORE

Related:

Scaling NMT with Intel® Xeon® Scalable Processors

With continuous research interest in the field of machine translation, and with efficient neural network architecture designs improving translation quality, there is a great need to improve time to solution. Training a well-performing Neural Machine Translation (NMT) model still takes days to weeks, depending on the hardware, the size of the training corpus and the model architecture.



Intel® Xeon® Scalable processors provide an incredible leap in scalability, and over 90% of the Top500 supercomputers run on Intel. In this article we show some of the training considerations and their effectiveness when scaling an NMT model using Intel® Xeon® Scalable processors.



An NMT model reads a source sentence in one language and passes it to an encoder, which builds an intermediate representation; a decoder then processes that intermediate representation to produce the translated target sentence in another language.

Figure 1: Encoder-decoder architecture



The figure above shows an encoder-decoder architecture. The English source sentence, “Hello! How are you?”, is read and processed by the architecture to produce a translated German sentence, “Hallo! Wie geht sind Sie?”. Traditionally, Recurrent Neural Networks (RNNs) were used in encoders and decoders, but other neural network architectures such as Convolutional Neural Networks (CNNs) and attention-mechanism-based models are also used.



Architecture and environment

The Transformer model is one of the most interesting architectures in the field of NMT. It is built with variants of the attention mechanism in the encoder-decoder portion, thereby replacing the traditional RNNs in the architecture. This model was able to achieve state-of-the-art results in English-German and English-French translation tasks.




Figure 2: Multi-head attention block



The above figure shows the multi-head attention block used in the Transformer model. At a high level, scaled dot-product attention can be thought of as retrieving the relevant information, the values (V), based on a query (Q) and keys (K); multi-head attention can be thought of as several attention layers running in parallel, each capturing a distinct aspect of the input.
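
To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the building block inside each head. The shapes and random inputs are illustrative placeholders, not the dimensions used in the transformer (big) configuration.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ V                                   # weighted sum of values

# Toy shapes: 4 query positions, 6 key/value positions, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = rng.random((4, 8)), rng.random((6, 8)), rng.random((6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)       # (4, 8)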



We use TensorFlow’s official implementation of the Transformer model, and we have added Horovod to perform distributed training. The WMT English-German parallel corpus, with 4.5M sentence pairs, was used to train the model.
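
As a rough illustration of what adding Horovod involves, the sketch below shows how a TF 1.x optimizer is typically wrapped for distributed training; the exact integration in the official Transformer implementation may differ, and the learning rate value is only a placeholder.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one worker per MPI rank

# Wrap the base optimizer so gradients are averaged across all workers.
# Scaling the learning rate by the number of workers is a common convention.
opt = tf.train.AdamOptimizer(learning_rate=0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

# Broadcast the initial variables from rank 0 so every worker starts
# from the same state.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

Each MPI rank runs a copy of the training script (launched, for example, with mpirun), and Horovod averages gradients across ranks at every step.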



The tests described in this article were performed in house on the Zenith supercomputer in the Dell EMC HPC and AI Innovation Lab. Zenith is a Dell EMC PowerEdge C6420-based cluster consisting of 388 dual-socket nodes powered by Intel® Xeon® Gold 6148 Scalable processors and interconnected with an Intel® Omni-Path fabric.



System Information

CPU Model: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Operating System: Red Hat Enterprise Linux Server release 7.4 (Maipo)
TensorFlow Version: 1.10.1 with Intel® MKL
Horovod Version: 0.15.0
MPI: Open MPI 3.1.2

Note: We used a specific Horovod branch to handle sparse gradients, which is now part of the main branch in their GitHub repository.



Weak scaling, environment variables and TF configurations

When training on CPUs, weak scaling, environment variables and TF configurations play a vital role in improving the throughput of a deep learning model. Setting them optimally can yield additional performance gains.



Below are our suggestions, based on empirical tests running 4 processes per node for the transformer (big) model on 50 Zenith nodes. We found that setting these variables in all our experiments improved throughput, with OMP_NUM_THREADS modified based on the number of processes per node.



Environment Variables:

export OMP_NUM_THREADS=10

export KMP_BLOCKTIME=0

export KMP_AFFINITY=granularity=fine,verbose,compact,1,0



TF Configurations:

intra_op_parallelism_threads=$OMP_NUM_THREADS

inter_op_parallelism_threads=1
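
In TF 1.x these thread-pool settings are usually passed through a session configuration. A minimal sketch, assuming the OMP_NUM_THREADS value exported above, might look like this:

import os
import tensorflow as tf

# Mirror the exported OMP_NUM_THREADS value in the TF session config.
config = tf.ConfigProto(
    intra_op_parallelism_threads=int(os.environ.get("OMP_NUM_THREADS", "10")),
    inter_op_parallelism_threads=1)

with tf.Session(config=config) as sess:
    pass  # build and run the training graph here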



Experimenting with weak scaling options allows us to find the optimal number of processes to run per node such that the model fits in memory and performance does not deteriorate. For some reason, TensorFlow creates an extra thread; hence, to avoid oversubscription it is better to set OMP_NUM_THREADS to 9, 19 or 39 when training with 4, 2 or 1 process per node respectively. Although we did not see this affect throughput in our experiments, it may affect performance in a very large-scale setup.
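
A small helper, assuming 40 physical cores per Zenith node (two 20-core sockets) and reserving one core per process for TensorFlow’s extra thread, reproduces those suggested values:

def omp_threads(processes_per_node, cores_per_node=40, reserve_per_process=1):
    # Threads per process, leaving one core per process free for
    # TensorFlow's extra thread to avoid oversubscription.
    return cores_per_node // processes_per_node - reserve_per_process

for ppn in (4, 2, 1):
    print(ppn, "processes/node ->", omp_threads(ppn), "OMP threads")  # 9, 19, 39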



Performance can be improved through threading: set OMP_NUM_THREADS such that the product of its value and the number of MPI ranks per node equals the number of available CPU cores per node. The KMP_AFFINITY environment variable controls how OpenMP threads are bound to physical processing units, while KMP_BLOCKTIME sets the time in milliseconds that a thread should wait after completing a parallel region before sleeping. TF configurations such as intra_op_parallelism_threads and inter_op_parallelism_threads adjust the thread pools, thereby optimizing CPU performance.




Figure 3: Effect of environment variables



The above results show that there’s a 1.67x improvement when environment variables are set correctly.



Faster distributed training

Training a large neural network architecture can be time consuming, even for rapid prototyping or hyperparameter tuning. Distributed training and open source frameworks like Horovod allow a model to be trained using multiple workers. In a previous blog we showed the effectiveness of training an AI radiologist with distributed deep learning on Intel® Xeon® Scalable processors. Here, we show how distributed training improves the performance of the machine translation task.






Figure 4: Scaling Performance



The above chart shows the throughput of the transformer (big) model when trained on 1 to 100 Zenith nodes. We see near-linear scaling as the number of nodes increases. Based on our tests, which include setting the correct environment variables and the optimal number of processes, we see a 79x improvement on 100 Zenith nodes with 2 processes per node compared to the throughput on a single node with 4 processes.



Translation Quality

An NMT model’s translation quality is measured in terms of its BLEU (Bi-Lingual Evaluation Understudy) score, a measure that compares the machine-translated output against human reference translations.
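
As a toy illustration of how a sentence-level BLEU score can be computed, the snippet below uses NLTK; the library choice and the reference sentence are assumptions for illustration only, not part of the benchmark setup.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference and candidate token sequences.
reference = "Hallo ! Wie geht es Ihnen ?".split()
candidate = "Hallo ! Wie geht sind Sie ?".split()

# BLEU measures n-gram overlap between the candidate translation and one
# or more references; smoothing avoids zero scores on short sentences.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"Sentence-level BLEU: {score:.3f}")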



In a previous blog post we explained some of the challenges of large-batch training of deep learning models. Here, we experimented with a large global batch size of 402k tokens to determine the model’s performance on the English-to-German task. Most of the hyperparameters were set the same as for the transformer (big) model; the model was trained on 50 Zenith nodes with 4 processes per node and a local batch size of 2010 tokens. The learning rate grows linearly to 0.001 over 4000 steps and then follows an inverse square root decay.
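
A simple sketch of that learning rate schedule, using the peak rate and warmup steps quoted above (the exact scaling in the Transformer code may differ slightly):

def learning_rate(step, peak_lr=0.001, warmup_steps=4000):
    # Linear warmup to peak_lr over warmup_steps, then inverse
    # square-root decay.
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

for s in (1000, 4000, 16000, 26000):
    print(f"step {s:6d}: lr = {learning_rate(s):.6f}")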



                                        Case-Insensitive BLEU    Case-Sensitive BLEU
TensorFlow Official Benchmark Results   28.9                     -
Our results                             29.15                    28.56

Note: The case-sensitive score is not reported in the TensorFlow Official Benchmark.



The above table shows our results on the test set (newstest2014) after training the model for around 2.7 days (26,000 steps). We see a clear improvement in translation quality compared to the results posted in the TensorFlow Official Benchmarks.



Conclusion

In this post we showed how to effectively train an NMT system using Intel® Xeon® Scalable processors, along with some best practices for setting environment variables and the corresponding scaling performance. Based on our experiments, and by following other research work on NMT to understand important aspects of scaling an NMT system, we were able to obtain better translation quality and speed up the training process. With growing research interest in the field of neural machine translation, we expect to see many interesting and improved NMT models in the future.

Related:

Designing for Large Datasets

Received a couple of recent inquiries around how to best accommodate big, unstructured datasets and varied workloads, so it seemed like an interesting topic for a blog article. Essentially, when it comes to designing and scaling large Isilon clusters for large quantities and growth rates of data, there are some key tenets to bear in mind. These include:



  • Strive for simplicity
  • Plan ahead
  • Just because you can doesn’t necessarily mean you should



Distributed systems tend to be complex by definition, and this is amplified at scale. OneFS does a good job of simplifying cluster administration and management, but a solid architectural design and growth plan are crucial. Because of its single, massive volume and namespace, Isilon is viewed by many as a sort of ‘storage Swiss army knife’. Left unchecked, this approach can result in unnecessary complexities as a cluster scales. As such, decision making that favors simplicity is key.



Despite OneFS’ extensibility, allowing an Isilon system to simply grow organically into a large cluster often results in various levels of technical debt. In the worst case, some issues may have grown so large that it becomes impossible to correct the underlying cause. This is particularly true in instances where a small cluster is initially purchased for an archive or low-performance workload, with a bias towards cost-optimized storage. As administrators realize how simple and versatile their clustered storage environment is, more applications and workflows are migrated to Isilon. This kind of haphazard growth, such as morphing from a low-powered, near-line platform into something larger and more performant, can lead to all manner of scaling challenges. The resulting compromises, workarounds, and avoidable fixes can usually be prevented by starting out with a scalable architecture, workflow and expansion plan.



Beginning the process with a defined architecture, sizing and expansion plan is key. What do you anticipate the cluster, workloads, and client access levels will look like in six months, one year, three years, or five years? How will you accommodate the following as the cluster scales?



  • Contiguous rack space for expansion
  • Sufficient power & cooling
  • Network infrastructure
  • Backend switch capacity
  • Availability SLAs
  • Serviceability and spares plan
  • Backup and DR plans
  • Mixed protocols
  • Security, access control, authentication services, and audit
  • Regulatory compliance and security mandates
  • Multi-tenancy and separation
  • Bandwidth segregation – client I/O, replication, etc.
  • Application and workflow expansion



There are really two distinct paths to pursue when initially designing an Isilon clustered storage architecture for a large and/or rapidly growing environment – particularly one that includes a performance workload element to it. These are:

  • Single Large Cluster
  • Storage Pod Architecture



A single large, or extra-large, cluster is often deployed to support a wide variety of workloads and their requisite protocols and performance profiles – from primary to archive – within a single, scalable volume and namespace. This approach, referred to as a ‘data lake architecture’, usually involves more than one style of node.

Isilon can support up to fifty separate tenants in a single cluster, each with their own subnet, routing, DNS, and security infrastructure. OneFS provides the ability to separate data layout with SmartPools, export- and share-level segregation, granular authentication and access control with Access Zones, and network partitioning with SmartConnect, subnets, and VLANs.

Furthermore, analytics workloads can easily be run against the datasets in a single location, without the need for additional storage or for data replication and migration.



Figure: Data lake architecture



For the right combination of workloads, the data lake architecture has many favorable efficiencies of scale and centralized administration.

Another use case for large clusters is a single-workflow deployment, for example as the content repository for the asset management layer of a content delivery workflow. This is a considerably more predictable, and hence simpler to architect, environment than the data lake.

Often, as in the case of a MAM for streaming playout, a single node type is deployed. The I/O profile is typically heavily biased towards streaming reads and metadata reads, with a smaller portion of writes for ingest.

There are trade-offs to be aware of as cluster size increases into the extra-large cluster scale. The larger the node count, the more components are involved, which increases the likelihood of a hardware failure. When the infrastructure becomes large and complex enough, there’s more often than not a drive failing or a node in an otherwise degraded state. At this point, the cluster can be in a state of flux such that composition, or group, changes and drive rebuilds/data re-protects will occur frequently enough that they can start to significantly impact the workflow.

Higher levels of protection are required for large clusters, which has a direct impact on capacity utilization. Also, cluster maintenance becomes harder to schedule since many workflows, often with varying availability SLAs, need to be accommodated.

Additional administrative shortcomings to consider when planning an extra-large cluster: InsightIQ only supports monitoring clusters of up to eighty nodes, and the OneFS Cluster Event Log (CELOG) and some of the cluster WebUI and CLI tools can prove challenging at extra-large cluster scale.

That said, there can be wisdom in architecting a clustered NAS environment into smaller buckets, thereby managing risk for the business rather than putting all the eggs in one basket. When contemplating the merits of an extra-large cluster, also consider:

  • Performance management
  • Risk management
  • Accurate workflow sizing
  • Complexity management



A more practical approach for more demanding, HPC, and high-IOPS workloads often lies with the Storage Pod architecture. Here, design considerations for new clusters revolve around multiple (typically up to 40-node) homogeneous clusters, with each cluster itself acting as a fault domain – in contrast to the monolithic extra-large cluster described above.



Pod clusters can easily be tailored to the individual demands of workloads as necessary. Optimizations per storage pod can include size of SSDs, drive protection levels, data services, availability SLAs, etc. In addition, smaller clusters greatly reduce the frequency and impact of drive failures and their subsequent rebuild operations. This, coupled with the ability to more easily schedule maintenance, manage smaller datasets, simplify DR processes, etc, can all help alleviate the administrative overhead for a cluster.



A Pod infrastructure can be architected per application, workload, similar I/O type (i.e. streaming reads), project, tenant (i.e. business unit), availability SLA, etc. This pod approach has been successfully adopted by a number of large Isilon customers in industries such as semiconductor, automotive, life sciences, and others with demanding performance workloads.



This Pod architecture model can also fit well for global organizations, where a cluster is deployed per region or availability zone. An extra-large cluster architecture can be usefully deployed in conjunction with Pod clusters to act as a centralized disaster recovery target, utilizing a hub and spoke replication topology. Since the centralized DR cluster will be handling only predictable levels of replication traffic, it can be architected using capacity-biased nodes.



Figure: Storage Pod architecture



Before embarking upon either a data lake or Pod architectural design, it is important to undertake a thorough analysis of the workloads and applications that the cluster(s) will be supporting.

Despite the flexibility offered by the data lake concept, not all unstructured data workloads or applications are suitable for a large Isilon cluster. Each application or workload that is under consideration for deployment or migration to a cluster should be evaluated carefully. Workload analysis involves reviewing the ecosystem of an application for its suitability. This requires an understanding of the configuration and limitations of the infrastructure, how clients see it, where data lives within it, and the application or use cases in order to determine:

  • How does the application work?
  • How do users interact with the application?
  • What is the network topology?
  • What are the workload-specific metrics for networking protocols, drive I/O, and CPU & memory usage?

More information on how to perform workload analysis can be found in this recent blog article.

Related:

Querying multiple ATP appliances at once

I need a solution

Hi,

We’ve 3 seperate ATP box (because of scalability issues told by our partner and our network architecture) but this became a problem for us (We approx. have 26.000 clients)

For example, when I want to query an IOC or event, I need to do the same thing for each ATP box and check the results for each box, etc.

Is there any way to consolidate them so that I can query all the ATP appliances from a single screen and see the resulting events in the same UI?

Regards,


Related:

Community Webinar – Expand your small Filr deployment to a Bigger & Smarter deployment


We have a great Filr Community Webinar coming up on October 31 at 11:00 am EDT, 4:00 pm CET! This webinar, “Expand your small Filr deployment to a Bigger & Smarter deployment”, will show you how your organization can benefit from expanding your Filr deployment to a large, clustered environment that offers improved performance, scalability, and resilience. …



Related:


What to Expect from Oracle Autonomous Transaction Processing

Today Larry Ellison announced the general availability of Oracle Autonomous Transaction Processing Cloud Service, the newest member of the Oracle Autonomous Database family, combining the flexibility of cloud with the power of machine learning to deliver data management as a service.

Traditionally, creating a database management system required a team of experts to custom build and manually maintain a complex hardware and software stack. With each system being unique, this approach led to poor economies of scale and a lack of the agility typically needed to give the business a competitive edge.

Try Autonomous Transaction Processing—Sign up for the trial

Autonomous Transaction Processing enables businesses to safely run a complex mix of high-performance transactions, reporting, and batch processing using the most secure, available, performant, and proven platform – Oracle Database on Exadata in the cloud. Unlike manually managed transaction processing databases, Autonomous Transaction Processing provides instant, elastic compute and storage, so only the required resources are provisioned at any given time, greatly decreasing runtime costs.

But What Does the Autonomous in Autonomous Transaction Processing Really Mean?

Self-Driving

Autonomous Transaction Processing is a self-driving database, meaning it eliminates the human labor needed to provision, secure, update, monitor, backup, and troubleshoot a database. This reduction in database maintenance tasks reduces costs and frees scarce administrator resources to work on higher-value tasks.

When an Autonomous Transaction Processing database is requested, an Oracle Real Application Clusters (RAC) database is automatically provisioned on Exadata Cloud Infrastructure. This high-availability configuration automatically benefits from many of the performance-enhancing Exadata features such as smart flash cache, Exafusion communication over a super-fast InfiniBand network, and automatic storage indexes.

In addition, when it comes time to update Autonomous Transaction Processing, patches are applied in a rolling fashion across the nodes of the cluster, eliminating unnecessary down time. Oracle automatically applies all clusterware, OS, VM, hypervisor, and firmware patches as well.

In Autonomous Transaction Processing the user does not get OS login privileges or SYSDBA privileges, so even if you want to do the maintenance tasks yourself, you cannot. It is like a car with the hood welded shut so you cannot change the oil or add coolant or perform any other maintenance yourself.

Many customers want to move to the cloud because of the elasticity it can offer. The ability to scale both in terms of compute and storage only when needed, allows people to truly pay per use. Autonomous Transaction Processing not only allows you to scale compute and storage resources, but it also allows you to do it independently online (no application downtime required).

Self-Securing

Autonomous Transaction Processing is also self-securing, as it protects itself from both external attacks and malicious internal users. Security patches are automatically applied every quarter. This is much sooner than most manually operated databases, narrowing an unnecessary window of vulnerability. Patching can also occur off-cycle if a zero-day exploit is discovered. Again, these patches are applied in a rolling fashion across the nodes of the cluster, avoiding application downtime.

But patching is just part of the picture. Autonomous Transaction Processing also protects itself with always-on encryption. This means data is encrypted at rest but also during any communication with the database. Customers control their own encryption keys to further improve security.

Autonomous Transaction Processing also secures itself from Oracle cloud administrators using Oracle Database Vault. Database Vault uniquely allows Oracle’s cloud administrators to do their jobs but prevents them from being able to see any customer data stored in Autonomous Transaction Processing.

Finally, customers are not given access to either the operating system or the SYSDBA privilege to prevent security breaches from malicious internal users or from stolen administrator credentials via a phishing attack.

Self-Repairing

Autonomous Transaction Processing automatically recovers from any failures without downtime. The service is deployed on our Exadata cloud infrastructure, which has redundancy built-in at every level of the hardware configuration to protect against any server, storage, or network failures.

Autonomous Transaction Processing automatically backs up the database nightly and gives the ability to restore the database from any of the backups in the archive. It also has the ability to rewind data to a point in time in the past to back out any user errors using Oracle’s unique Flashback Database capabilities.

Since users don’t have access to the OS, Oracle is on the hook to diagnose any problems that may occur. Machine learning is used to detect and diagnose any anomalies. If the database detects an impending error, it gathers statistics and feeds them to AI diagnostics to determine the root cause. If it’s a known issue, the fix is quickly applied. If it’s a new issue a service request will be automatically opened with Oracle support.

How Does Autonomous Transaction Processing Differ from the Autonomous Data Warehouse?

Up until now, all of the functionality I have described is shared between both Autonomous Data Warehouse and Autonomous Transaction Processing. Where the two services differ is actually inside the database itself. Although both services use Oracle Database 18c, they have been optimized differently to support two very different but complementary workloads. The primary goal of the Autonomous Data Warehouse is to achieve fast complex analytics, while Autonomous Transaction Processing has been designed to efficiently execute a high volume of simple transactions.

Configuration

The differences in the two services begin with how we configure them. In Autonomous Data Warehouse, the majority of the memory is allocated to the PGA to allow parallel joins and complex aggregations to occur in-memory, rather than spilling to disk. While on Autonomous Transaction Processing, the majority of the memory is allocated to the SGA to ensure the critical working set can be cached to avoid IO.

Data Formats

We also store the data differently in each service. In the Autonomous Data Warehouse, data is stored in a columnar format as that’s the best format for analytics processing. While in Autonomous Transaction Processing, data is stored in a row format. The row format is ideal for transaction processing, as it allows quick access and updates to all of the columns in an individual record since all of the data for a given record is stored together in-memory and on-storage.

Statistics Gathering

Regardless of which type of autonomous database service you use, optimizer statistics will be automatically maintained. On the Autonomous Data Warehouse, statistics (including histograms) are automatically maintained as part of all bulk-load activities. With Autonomous Transaction Processing, data is added using more traditional insert statements, so statistics are automatically gathered when the volume of data changes significantly enough to make a difference to the statistics.

Query Optimization

Queries executed on the Autonomous Data Warehouse are automatically parallelized, as they tend to access large volumes of data in order to answer the business question, while indexes are used on Autonomous Transaction Processing to access only the specific rows of interest. We also use RDMA on Autonomous Transaction Processing to provide low-response-time direct access to data stored in-memory on other servers in the cluster.

Resource Management

Both Autonomous Data Warehouse and Autonomous Transaction Processing offer multiple database “services” to make it easy for users to control the priority and parallelism used by each session. The services predefine three priority levels: Low, Medium, and High. Users can simply choose the best priority for each aspect of their workload. For each database service you have the ability to define the criteria of a runaway SQL statement; any SQL statement that exceeds these parameters, either in terms of elapsed time or IO, will be automatically terminated. On Autonomous Data Warehouse, only one service (LOW) automatically runs SQL statements serially, while on Autonomous Transaction Processing, only one service (PARALLEL) automatically runs SQL statements with parallel execution. You can also use the Medium priority service by default, which allows the Low priority service to be used for requests such as reporting and batch to prevent them from interfering with mainstream transaction processing. The High priority level can be used for more important users or actions.

Can I Use Autonomous Transaction Processing to Develop New Applications?

Autonomous Transaction Processing is the ideal platform for new application development. Developers no longer have to wait on others to provision hardware, install software, and create a database for them. With Autonomous Transaction Processing, developers can easily deploy an Oracle database in a matter of minutes, without worrying about manual tuning or capacity planning.

Autonomous Transaction Processing also has the most advanced SQL and PL/SQL support accelerating developer productivity by minimizing the amount of application code required to implement complex business logic. It also has a complete set of integrated Machine Learning algorithms, simplifying the development of applications that perform real-time predictions such as personalized shopping recommendations, customer churn rates, and fraud detection.

Where Can I Get More Information and Get My Hands on Autonomous Transaction Processing?

The first place to visit is the Autonomous Transaction Processing Documentation. There you will find details on exactly what you can expect from the service.

We also have a great program that lets you get started with Oracle Cloud with $300 in free credits, which last much longer than you would expect since the trial service has very low pricing. Using your credits (which will probably last you around 30 days, depending on how you configure Autonomous Transaction Processing), you will be able to get valuable hands-on time to try loading some of your own workloads. Below is a quick video to help you get started.

Related:

Workload Analysis

Despite the flexibility offered by an Isilon cluster and the data lake concept, not all unstructured data workloads or applications are suitable contenders. When planning migrations or deployments on Isilon, carefully evaluate each application or workload that is under consideration. Workload analysis typically involves comprehensive investigation of an application’s ecosystem, which requires an understanding of the configuration and limitations of the infrastructure, how it’s presented to clients, where data lives within it, and the application or use case(s).

How does the application work?

First, determine whether an application or workflow has a unique or unbalanced dataset, plus other core characteristics. For example, does it include any of the following?

  • Wide and deeply nested directory trees with few files per directory
  • Shallow nested directories with large numbers of files per directory
  • Massive file counts (billions)
  • Large files (>TB in size)
  • Flat files (VMDKs, databases, etc)

Next, discover how the application utilizes the stored data. For example:

  • Are there heavy metadata reads and writes to the dataset?
  • Are there moderate metadata writes and heavy metadata reads?
  • Are heavy or light data reads and/or writes required?
  • Are the data reads and writes more or less random or sequential?
  • Are the writes creates or appends?

If the application or workload is latency-sensitive, are there expected timeouts for data requests built into it, and can they be tuned? Are there other behaviors of this application, such as FindFirstFile or FindNextFile crawls doing repetitive work, that are not well suited to NAS use and need further investigation?



If an application can benefit from caching, how much of the unique dataset will be read once and then reread, and over what periods of time (hourly, daily, weekly, and so on)? This can help in appropriately sizing a cluster’s L2 and L3 cache requirements.

When to analyze workloads?

Best practice is to analyze performance any time a workload performance profile changes, and particularly when disruption has occurred. For example, perform investigation and measurement when:

  • Migrating to or from a new pool.
  • An application is upgraded.
  • New functionality is enabled.

How do users interact with the application?

Exactly how users interact with an application can be tough to profile. However, determining the following helps provide an understanding of the required performance levels:

  • If users will interact with the application through direct requests and responses from a flat data structure.
  • If the application uses efficient parallelized requests, or inefficient serialized requests.

An example of the latter would be an electronic design application that requires 50,000 objects to be loaded from storage before an image can be rendered on a user’s workstation.

What does the network topology look like?

Maps and visualizations can help identify and resolve many issues, so it is worth comprehensively cataloging the network topology in its entirety.

  • Conduct a network performance study using a CLI tool such as ‘iperf’.
  • For a LAN, inventory network hardware models, link speeds and feeds, MTUs, layer 2 & 3 routing, and anticipated latencies.
  • For a WAN, itemize providers, topologies, SLAs (rate limits and guarantees), direct versus indirect pathways, and perform an ‘iperf’ analysis.



Review the application or workload’s change control process. Do network and storage teams have clear lines of communication and responsibility? Analyze the application logfiles, review prior issues and trouble tickets, and interview the administrators.

What is the application’s performance profile?

Storage stack performance can be determined by measuring ‘normal’ performance levels over time, and determining how to recognize when it deviates from this baseline. Each cluster has a unique configuration, dataset and workload, and so will yield a unique result from its ecosystem. This effect is amplified at large cluster scale, highlighting the need for diligent and detailed monitoring.
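
One simple way to formalize such a baseline, sketched below under the assumption that periodic samples of a metric (for example, per-protocol average latency from ‘isi statistics’) are already being collected, is to flag samples that fall several standard deviations from the historical mean:

from statistics import mean, stdev

def deviates_from_baseline(history, current, threshold=3.0):
    # Flag a sample that deviates from the historical baseline by more
    # than `threshold` standard deviations; the threshold is illustrative.
    mu, sigma = mean(history), stdev(history)
    return sigma > 0 and abs(current - mu) > threshold * sigma

# Hypothetical latency samples (ms) collected over time.
baseline = [1.8, 2.1, 1.9, 2.0, 2.2, 1.9, 2.0]
print(deviates_from_baseline(baseline, 2.1))   # False - within normal range
print(deviates_from_baseline(baseline, 9.5))   # True  - investigate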



What are the application’s I/O requirements?

Investigate the I/O data rates per node, per node pool, and per network pool, as applicable. Measure and understand the I/O protocol read and write rates and mix as a whole in the workload.

  • Use the ‘isi statistics protocol --nodes all --top --orderby=timeavg’ CLI command to display performance statistics by protocol.
  • Consider the per-protocol breakouts of the client requests – particularly read, write, getattr, setattr, open, close, create, and delete operations.

What does disk latency look like?

The disk subsystem is invariably the primary bottleneck in the storage stack. Investigating and measuring the impact of non-cached workflow and transfer rates (xfers) to disks will help lead to an understanding of how a unique dataset will deliver a particular result.

  • At a minimum, workflows can be profiled fairly simply by using the ‘isi statistics’ CLI command to understand how an application drives the workload to and from disk.
  • The ‘isi statistics drive –-nodes all –-top –-long –-orderby timeavg’ CLI command is useful for display performance statistics by drive, ordered by OpsIn and OpsOut values. Be aware that this is not measuring software transfer commands to storage, as opposed to physical disk IOPS. Disks manage their own physical ordering of these requests, which OneFS does not see or measure in the form of physical I/O.
  • Unfortunately, there are no available disk transfer rate counters that can determine how much is too much to cause performance degradation. OneFS does not report statistics from this level of storage stack depth.
  • Monitoring the ‘isi statistics drive’ CLI command’s Busy%, Queued, TimeInAQ, and TimeAvg counters will help to determine whether storage is being overloaded (within the context of performance requirements).

How much CPU utilization?

OneFS contains an array of useful tools for monitoring and measuring CPU utilization. A good example is the ‘isi statistics system --nodes --top’ command, which provides per-node CPU performance statistics across the cluster.

When analyzing CPU utilization, bear in mind the following attributes:

  • CPU utilization is of limited use for general profiling of a workload, but is a significant consideration when sizing against existing equipment.
  • Processor load is the result of a workload, not a leading indicator.
  • Writes typically have a far higher impact on CPU load than reads.



For large cluster architectures that are intended to support multiple workflows in a data lake scenario, a similar investigation should be performed for all the applications and workloads involved and the aggregate results examined. Contact your account team for assistance with analyzing your workload results.

Related:

Performance of Scylla running on IBM Power Systems

Scylla is an open source NoSQL database that is compatible with Apache Cassandra. Scylla offers several key advantages over Cassandra such as scalability and better performance. Scylla was tested with IBM POWER8 processor-based servers and superior throughput was achieved with both database read and write operations. This article describes the tests that were done with IBM POWER processor-based servers, the performance results, and the value of IBM POWER processor-based servers for Scylla.

Related:

Feedback on SQL database Cluster

I need a solution

Hi,

We are re-thinking our SEP management infrastructure and we are considering setting up a SQL database Cluster as the backend SEP site database.

We have noticed that the document “Symantec™ Endpoint Protection 14 Sizing and Scalability Best Practices White Paper” mentioned using a database cluster, but we cannot find any more information about it.

Does anyone have any feedback on using a SQL database cluster with SEPM, any limitations or problems?

Thanks in advance !


Related: