Dealing with Uncertainty in Machine Learning and AI

Picture1.jpg

Explaining the relationship between machine learning and artificial intelligence is one of the most challenging tasks that I encounter when talking to people new to these topics. I don’t pretend to have the definitive answer, but I have developed a story that seems to get enough affirmative head nods that I want to share it here.

The diagram above has appeared in many introductory books and articles that I’ve seen. I have reproduced it here to highlight the challenge of talking about “subsets” of abstract concepts, none of which have widely accepted definitions. So, what does this graphic mean or imply? How is deep learning a subset of artificial intelligence? These are the questions I’m going to try to answer in the rest of this article by telling you a story I use for briefings on artificial intelligence.



Since so many people have read about and studied examples of using deep learning for image classification, that is my starting point. I am not, however, going to talk about cats and dogs, so please hang with me for a bit longer. I’m going to use an example of facial recognition. My scenario is that there is a secure area in a building that only four people (Angela, Chris, Lucy and Marie) are permitted to enter. We want to use facial recognition to determine if someone attempting to gain access should be allowed in. You and I can easily look at a picture and say whether it is someone we know. But how does a deep learning model do that, and how could we use the result of the model to create an artificial intelligence application?



I frequently use the picture below to discuss the use of deep neural networks for model training in supervised classification. When looking at the network, consider that the goal of all machine learning and deep learning is to transform input data into some meaningful output. For facial recognition, the input data is a representation of the pixel intensity and color or grey scale value from a picture, and the output is the probability that the picture is Angela, Chris, Lucy or Marie. That means we are going to have to train the network using recent photos of these four people.



A highly stylized neural network representation



network.jpg

The picture above is a crude simplification of how a modern convolutional neural network (ConvNet) used for image recognition would be constructed; however, it is useful for highlighting many of the important elements of what we mean by transforming raw data into meaningful outputs. For example, each line or edge drawn between the neurons of each layer represents a weight (parameter) that must be calculated during training. These weights are the primary mechanism used to transform the input data into something useful. Because this picture only includes 5 layers with fewer than 10 nodes per layer, it is easy to visualize how fully connected layers can quickly increase the number of weights that must be computed. The ConvNets in widespread use today typically have from 16 to 200 or more layers, although not all fully connected for the deeper designs, and can have tens of millions to hundreds of millions of weights or more.
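To make the growth in weights concrete, here is a minimal sketch that counts the edges between consecutive fully connected layers. The layer widths are hypothetical (the picture does not state exact sizes), and bias terms are left out for simplicity.

```python
def count_dense_weights(layer_sizes):
    """Count the edge weights between consecutive fully connected layers (biases excluded)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical widths for a 5-layer network with fewer than 10 nodes per layer and 4 outputs
small_net = [8, 9, 9, 9, 4]
small_weights = count_dense_weights(small_net)

# Hypothetical wider network: a 64x64 grey-scale image flattened to 4096 inputs
wide_net = [4096, 1024, 1024, 1024, 4]
wide_weights = count_dense_weights(wide_net)

print(small_weights, wide_weights)
```

Even this modest widening moves the count from a few hundred weights to several million, which is why the number of weights grows so quickly with fully connected layers.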



We need that many weights to “meaningfully” transform the input data since the image is broken down into many small regions of pixels (typically 3×3 or 5×5) before getting ingested by the input layer. The numerical representation of the pixel values is then transformed by the weights so that the output of the transformation indicates whether this region of pixels adds to the evidence that this is a picture of Angela or reduces the likelihood that it is Angela. If Angela has black hair and the network does not detect many regions of solid black color, then there will not be much evidence that this picture is Angela.



Finally, I want to tie everything discussed so far to an explanation of the output layer. In the picture above, there are 4 neurons in the output layer, and that is why I set up my facial recognition story to have 4 people that I am trying to recognize. During training I have a set of pictures that have been labeled with the correct name. One way to look at how I might do that is like this:



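Table 1 – Representation of labeled training data

| File     | IsAngela | IsChris | IsLucy | IsMarie |
|----------|----------|---------|--------|---------|
| Picture1 | 1        | 0       | 0      | 0       |
| Picture2 | 0        | 0       | 0      | 1       |
| PictureN | 0        | 0       | 1      | 0       |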

The goal during training is to come up with a single set of weights that will transform the data from every picture in the training data set into a set of four values (a vector) for each picture, where the values match the labels assigned above as closely as possible. For Picture1 the first value is 1 and the other three are zeros, and for Picture2 the first three values are zero and the fourth value is 1. For Picture1, we are telling the model that we are 100% sure (probability = 1) that this is a picture of Angela and certain that it is not Chris, Lucy, or Marie (probability = 0). The training process tries to find a set of weights that will transform the pixel data for Picture1 into the vector (1,0,0,0) and Picture2 into the vector (0,0,0,1), and so on for the entire data set.
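As a rough illustration of “matching the labels as closely as possible”, the sketch below builds the label vector for Picture1 and scores two candidate outputs against it. Cross-entropy is used here as the measure of closeness; that is an assumption, since the article does not name a loss function, and the candidate outputs are made-up values.

```python
import math

def one_hot(index, size):
    """Build a label vector with a 1 at the labeled person's position."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def cross_entropy(label, predicted, eps=1e-12):
    """Smaller values mean the predicted vector is closer to the label."""
    return -sum(t * math.log(p + eps) for t, p in zip(label, predicted))

label_picture1 = one_hot(0, 4)            # (1, 0, 0, 0): Angela
close_guess = [0.90, 0.05, 0.03, 0.02]    # hypothetical output of a well-trained model
poor_guess = [0.25, 0.25, 0.25, 0.25]     # hypothetical output of an uninformative model

loss_close = cross_entropy(label_picture1, close_guess)
loss_poor = cross_entropy(label_picture1, poor_guess)
print(loss_close, loss_poor)
```

Training adjusts the weights to push this kind of score down across the whole data set, not just for one picture.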

Of course, no deep learning training algorithm can do that perfectly because of variations in the data, so we try to get as close as possible for each input image. The process of testing a model with known data or processing new unlabeled images is called inferencing. When we pass in unlabeled data we get back a list of four probabilities that reflect the evidence in the data that the image is one of the four known people. For example, we might get something back like (.5, .25, .15, .1). For most classification algorithms the set of probabilities will add to 1. What does this result tell us?
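One common way an output layer produces probabilities that add to 1 is the softmax function. The sketch below assumes softmax is in use (the article does not say which function the network applies) and feeds it hypothetical raw scores for the four people.

```python
import math

def softmax(scores):
    """Convert raw output-layer scores into probabilities that sum to 1."""
    top = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

names = ["Angela", "Chris", "Lucy", "Marie"]
raw_scores = [2.0, 1.3, 0.8, 0.4]                  # hypothetical output-layer values
probs = softmax(raw_scores)
best = names[probs.index(max(probs))]
print(best, [round(p, 2) for p in probs])
```

The largest raw score always maps to the largest probability, so the ranking of the candidates is preserved while the values become comparable as probabilities.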

Our model says we are most confident that the unlabeled picture is Angela since that is the outcome with the highest probability, but it also tells us that we can only be 50% sure that it is not one of the other three people. What does it mean if we get an inference result back that is (.25, .25, .25, .25)? This result tells us the model can’t do better than a random process like picking a number between 1 and 4. This picture could be any one of our known people, or it could be a picture of a truck. The model provides us no information. How intelligent is that? This is where the connection with artificial intelligence gets interesting.

What we would like to achieve is getting back inference predictions where one value is very close to 1 and all the others are very close to zero. Then we have high confidence that the person requesting access to a restricted area is one of our authorized employees. That is rarely the case, so we must deal with uncertainty in the applications that use our trained machine learning models. If the area that we are securing is the executive dining room, then perhaps we want to open the door even if we are only 50% sure that the person requesting access is one of our known people. If the application is securing access to sensitive computer and communication equipment, then perhaps we want to set a threshold of 90% certainty before we unlock the door. The important point is that machine learning alone is usually not sufficient to build an intelligent application. The fear that machines will get smarter than people, and therefore be able to make “better” decisions, concerns a future that is still a long way off, maybe a very long way…
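The application logic wrapped around the model can be as simple as a threshold check. The following is a minimal sketch; the function name and the return values are hypothetical, and the thresholds are the 50% and 90% figures from the examples above.

```python
def access_decision(probabilities, names, threshold):
    """Unlock only if the most likely identity clears the application's threshold."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] >= threshold:
        return ("unlock", names[best_index])
    return ("deny", None)

names = ["Angela", "Chris", "Lucy", "Marie"]
prediction = [0.5, 0.25, 0.15, 0.1]

dining_room = access_decision(prediction, names, threshold=0.5)
server_room = access_decision(prediction, names, threshold=0.9)
print(dining_room, server_room)
```

The same model output leads to different actions depending on what is being protected, which is the part of the application that machine learning does not supply.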

Phil Hummel

@GotDisk


Challenges of Large-batch Training of Deep Learning Models

Introduction

The process of training a deep neural network is akin to finding the minimum of a function in a very high-dimensional space. Deep neural networks are usually trained using stochastic gradient descent (or one of its variants). A small batch (usually 16-512), randomly sampled from the training set, is used to approximate the gradients of the loss function (the optimization objective) with respect to the weights. The computed gradient is essentially an average of the gradients for each data-point in the batch. The natural way to parallelize the training across multiple nodes/workers is to increase the batch size and have each node compute the gradients on a different chunk of the batch. Distributed deep learning differs from traditional HPC workloads where scaling out only affects how the computation is distributed but not the outcome.
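The idea of splitting one batch across workers can be sketched in a few lines. The example below uses a toy one-parameter model with squared error and made-up data points; it only illustrates that averaging the per-worker gradients over equal-sized chunks reproduces the gradient of the whole batch.

```python
def point_gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def batch_gradient(w, batch):
    """Average the per-point gradients over a batch of (x, y) pairs."""
    return sum(point_gradient(w, x, y) for x, y in batch) / len(batch)

w = 0.5
batch = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
         (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]

# Each of the 4 workers computes the gradient on its own chunk of the batch
num_workers = 4
chunk = len(batch) // num_workers
worker_grads = [batch_gradient(w, batch[i * chunk:(i + 1) * chunk])
                for i in range(num_workers)]

distributed_grad = sum(worker_grads) / num_workers
single_node_grad = batch_gradient(w, batch)
print(distributed_grad, single_node_grad)
```

What changes the outcome in practice is not this averaging step but the larger overall batch that more workers imply.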



Challenges of large-batch training



It has been consistently observed that the use of large batches leads to poor generalization performance, meaning that models trained with large batches perform poorly on test data. One of the primary reasons for this is that large batches tend to converge to sharp minima of the training function, which tend to generalize less well. Small batches tend to favor flat minima that result in better generalization [1]. The stochasticity afforded by small batches encourages the weights to escape the basins of attraction of sharp minima. Also, models trained with small batches are shown to converge farther away from the starting point. Large batches tend to be attracted to the minimum closest to the starting point and lack the explorative properties of small batches.

The number of gradient updates per pass over the data is reduced when using large batches. This is sometimes compensated for by scaling the learning rate with the batch size, but simply using a higher learning rate can destabilize the training. Another approach is to just train the model longer, but this can lead to overfitting. Thus, there’s much more to distributed training than just scaling out to multiple nodes.



sharp_vs_flat.png

An illustration showing how sharp minima lead to poor generalization. The sharp minimum of the training function corresponds to a maximum of the testing function which hurts the model’s performance on test data. [1]





How can we make large batches work?



There has been a lot of interesting research recently in making large-batch training more feasible. The training time for ImageNet has now been reduced from weeks to minutes by using batches as large as 32K without sacrificing accuracy. The following methods are known to alleviate some of the problems described above:

  1. Scaling the learning rate [2]

    The learning rate is multiplied by k when the batch size is multiplied by k. However, this rule does not hold in the first few epochs of the training, since the weights are changing rapidly. This can be alleviated by using a warm-up phase. The idea is to start with a small value of the learning rate and gradually ramp up to the linearly scaled value.

  2. Layer-wise adaptive rate scaling [3]

    A different learning rate is used for each layer. A global learning rate is chosen and it is scaled for each layer by the ratio of the Euclidean norm of the weights to the Euclidean norm of the gradients for that layer.

  3. Using regular SGD with momentum rather than Adam

    Adam is known to make convergence faster and more stable. It is usually the default optimizer choice when training deep models. However, Adam seems to settle into less optimal minima, especially when using large batches. Using regular SGD with momentum, although noisier than Adam, has shown improved generalization.

  4. Topologies also make a difference

    In a previous blog post, my colleague Luke showed how using VGG16 instead of DenseNet121 considerably sped up the training for a model that identified thoracic pathologies from chest x-rays while improving area under ROC in multiple categories. Shallow models are usually easier to train, especially when using large batches.
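The first two methods in the list above reduce to short formulas, sketched below. The base learning rate, batch sizes, and warm-up length are illustrative values rather than recommendations, and the layer-wise function keeps only the norm ratio described above; the trust coefficient and weight-decay terms used in [3] are omitted as a simplification.

```python
import math

def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: multiply the learning rate by k when the batch grows by k."""
    return base_lr * batch / base_batch

def warmup_lr(epoch, target_lr, start_lr, warmup_epochs):
    """Ramp linearly from start_lr to target_lr over the warm-up epochs, then hold."""
    if epoch >= warmup_epochs:
        return target_lr
    return start_lr + (target_lr - start_lr) * epoch / warmup_epochs

def layer_lr(global_lr, weights, grads, eps=1e-9):
    """Layer-wise rate: global rate times ||weights|| / ||gradients|| for one layer."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    return global_lr * w_norm / (g_norm + eps)

target = scaled_lr(base_lr=0.1, base_batch=256, batch=8192)   # k = 32
schedule = [warmup_lr(e, target, start_lr=0.1, warmup_epochs=5) for e in range(7)]
example_layer_lr = layer_lr(0.01, weights=[3.0, 4.0], grads=[0.6, 0.8])
print(target, schedule, example_layer_lr)
```

In this sketch the schedule starts at the small-batch rate and reaches the scaled rate after five epochs, which is the behavior the warm-up phase is meant to provide.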

Conclusion

Large-batch distributed training can significantly reduce training time but it comes with its own challenges. Improving generalization when using large batches is an active area of research, and as new methods are developed, the time to train a model will keep going down.







  1. On large-batch training for deep learning: Generalization gap and sharp minima. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. 2016. arXiv preprint arXiv:1609.04836.
  2. Accurate, large minibatch SGD: Training ImageNet in 1 hour. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. arXiv preprint arXiv:1706.02677.
  3. Large batch training of convolutional networks. Yang You, Igor Gitman, and Boris Ginsburg. 2017. arXiv preprint arXiv:1708.03888.



Training an AI Radiologist with Distributed Deep Learning

The potential of neural networks to transform healthcare is evident. From image classification to dictation and translation, neural networks are achieving or exceeding human capabilities. And they are only getting better at these tasks as the quantity of data increases.

But there’s another way in which neural networks can potentially transform the healthcare industry: Knowledge can be replicated at virtually no cost. Take radiology as an example: To train 100 radiologists, you must teach each individual person the skills necessary to identify diseases in x-ray images of patients’ bodies. To make 100 AI-enabled radiologist assistants, you take the neural network model you trained to read x-ray images and load it into 100 different devices.

The hurdle is training the model. It takes a large amount of cleaned, curated, labeled data to train an image classification model. Once you’ve prepared the training data, it can take days, weeks, or even months to train a neural network. Even once you’ve trained a neural network model, it might not be smart enough to perform the desired task. So, you try again. And again. Eventually, you will train a model that passes the test and can be used out in the world.

neural-network-workflow.png

Workflow for Developing Neural Network Models

In this post, I’m going to talk about how to reduce the time spent in the Train/Test/Tune cycle by speeding up the training portion with distributed deep learning, using a test case we developed in Dell EMC’s HPC and AI Innovation Lab to classify pathologies in chest x-ray images. Through a combination of distributed deep learning, optimizer selection, and neural network topology selection, we were able to not only speed the process of training models from days to minutes, we were also able to improve the classification accuracy significantly.

We began by surveying the landscape of AI projects in healthcare, and Andrew Ng’s group at Stanford University provided our starting point. CheXNet was a project to demonstrate a neural network’s ability to accurately classify cases of pneumonia in chest x-ray images.

The dataset that Stanford used was ChestXray14, which was developed and made available by the United States’ National Institutes of Health (NIH). The dataset contains over 120,000 images of frontal chest x-rays, each potentially labeled with one or more of fourteen different thoracic pathologies. The data set is very unbalanced, with more than half of the data set images having no listed pathologies.

Stanford decided to use DenseNet, a neural network topology which had just been announced as the Best Paper at the 2017 Conference on Computer Vision and Pattern Recognition (CVPR), to solve the problem. The DenseNet topology is a deep network of repeating blocks of convolutions linked with dense skip connections. Blocks end with a batch normalization, followed by some additional convolution and pooling to link the blocks. At the end of the network, a fully connected layer is used to perform the classification.

densenet.jpg

An Illustration of the DenseNet Topology (source: Kaggle)

Stanford’s team used a DenseNet topology with the layer weights pretrained on ImageNet and replaced the original ImageNet classification layer with a new fully connected layer of 14 neurons, one for each pathology in the ChestXray14 dataset.

Building CheXNet in Keras

It sounds like it would be difficult to set up. Thankfully, Keras (provided with TensorFlow) provides a simple, straightforward way of taking standard neural network topologies and bolting on new classification layers.

    from tensorflow import keras
    from keras.applications import DenseNet121

    # Load DenseNet121 with ImageNet weights and without its classification layer
    orig_net = DenseNet121(include_top=False, weights='imagenet', input_shape=(256, 256, 3))

Importing the base DenseNet Topology using Keras

In this code snippet, we are importing the original DenseNet neural network (DenseNet121) and removing the classification layer with the include_top=False argument. We also automatically import the pretrained ImageNet weights and set the image size to 256×256, with 3 channels (red, green, blue).

With the original network imported, we can begin to construct the classification layer. If you look at the illustration of DenseNet above, you will notice that the classification layer is preceded by a pooling layer. We can add this pooling layer back to the new network with a single Keras function call, and we can call the resulting topology the neural network’s filters, or the part of the neural network which extracts all the key features used for classification.

    from keras.layers import GlobalAveragePooling2D

    # Pool the final feature maps into one feature vector per image
    filters = GlobalAveragePooling2D()(orig_net.output)

Finalizing the Network Feature Filters with a Pooling Layer

The next task is to define the classification layer. The ChestXray14 dataset has 14 labeled pathologies, so we have one neuron for each label. We also activate each neuron with the sigmoid activation function, and use the output of the feature filter portion of our network as the input to the classifiers.

    from keras.layers import Dense

    # One sigmoid-activated neuron per ChestXray14 pathology
    classifiers = Dense(14, activation='sigmoid', bias_initializer='ones')(filters)

Defining the Classification Layer

The choice of sigmoid as an activation function is due to the multi-label nature of the data set. For problems where only one label ever applies to a given image (e.g., dog, cat, sandwich), a softmax activation would be preferable. In the case of ChestXray14, images can show signs of multiple pathologies, and the model should rightfully identify high probabilities for multiple classifications when appropriate.
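The difference between the two activations can be seen with a few numbers. This is a minimal sketch in plain Python rather than Keras; the three pathology names are labels from ChestXray14, but the raw scores are hypothetical.

```python
import math

def sigmoid(z):
    """Score one label independently of all the others."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Make the labels compete for a total probability of 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for three of the fourteen labels
logits = {"Effusion": 2.5, "Atelectasis": 2.0, "Hernia": -3.0}

independent = {name: sigmoid(z) for name, z in logits.items()}
competing = dict(zip(logits, softmax(list(logits.values()))))
print(independent)
print(competing)
```

With sigmoid, both Effusion and Atelectasis can be reported with high probability at once; with softmax, a strong score for one necessarily pulls the other down.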

Finally, we can put the feature filters and the classifiers together to create a single, trainable model.

    from keras.models import Model

    # Join the pretrained feature extractor and the new classification layer
    chexnet = Model(inputs=orig_net.inputs, outputs=classifiers)

The Final CheXNet Model Configuration

With the final model configuration in place, the model can then be compiled and trained.

To produce better models sooner, we need to accelerate the Train/Test/Tune cycle. Because testing and tuning are mostly sequential, training is the best place to look for potential optimization.

How exactly do we speed up the training process? In Accelerating Insights with Distributed Deep Learning, Michael Bennett and I discuss the three ways in which deep learning can be accelerated by distributing work and parallelizing the process:

  • Parameter server models such as in Caffe or distributed TensorFlow,
  • Ring-AllReduce approaches such as Uber’s Horovod, and
  • Hybrid approaches for Hadoop/Spark environments such as Intel BigDL.



Which approach you pick depends on your deep learning framework of choice and the compute environment that you will be using. For the tests described here we performed the training in house on the Zenith supercomputer in the Dell EMC HPC & AI Innovation Lab. The ring-allreduce approach enabled by Uber’s Horovod framework made the most sense for taking advantage of a system tuned for HPC workloads, and which takes advantage of Intel Omni-Path (OPA) networking for fast inter-node communication. The ring-allreduce approach would also be appropriate for solutions such as the Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA.

ring-allreduce.png

The MPI-RingAllreduce Approach to Distributed Deep Learning



Horovod is an MPI-based framework for performing reduction operations between identical copies of the otherwise sequential training script. Because it is MPI-based, you will need to be sure that an MPI compiler (mpicc) is available in the working environment before installing Horovod.
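To show what a ring-allreduce does, here is a small single-process simulation. It is a teaching sketch only, not Horovod’s implementation: the function name and data layout are hypothetical, and real implementations exchange the chunks between processes over the network instead of between Python lists.

```python
def ring_allreduce(worker_data):
    """Simulate ring-allreduce; worker_data[r][c] is chunk c (a list of floats) on worker r."""
    n = len(worker_data)
    data = [[list(chunk) for chunk in worker] for worker in worker_data]  # work on copies

    # Phase 1, scatter-reduce: after n-1 steps worker r holds the full sum of chunk (r+1) % n
    for step in range(n - 1):
        sends = [((r - step) % n, list(data[r][(r - step) % n])) for r in range(n)]
        for sender, (c, payload) in enumerate(sends):
            receiver = (sender + 1) % n
            data[receiver][c] = [a + b for a, b in zip(data[receiver][c], payload)]

    # Phase 2, allgather: pass the completed chunks around the ring until everyone has them all
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, list(data[r][(r + 1 - step) % n])) for r in range(n)]
        for sender, (c, payload) in enumerate(sends):
            data[(sender + 1) % n][c] = payload

    return data

# Three workers, each holding a 6-element gradient split into three 2-element chunks
grads = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],
    [[100.0, 200.0], [300.0, 400.0], [500.0, 600.0]],
]
reduced = ring_allreduce(grads)
print(reduced[0])
```

Each worker only ever talks to its neighbor in the ring and sends one chunk per step, so no single worker has to receive the full gradient from everyone else.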

Adding Horovod to a Keras-defined Model

Adding Horovod to any Keras-defined neural network model only requires a few code modifications:

  1. Initializing the MPI environment,
  2. Broadcasting initial random weights or checkpoint weights to all workers,
  3. Wrapping the optimizer function to enable multi-node gradient summation,
  4. Averaging metrics among workers, and
  5. Limiting checkpoint writing to a single worker.

Horovod also provides helper functions and callbacks for optional capabilities that are useful when performing distributed deep learning, such as learning-rate warmup/decay and metric averaging.

Initializing the MPI Environment

Initializing the MPI environment in Horovod only requires calling the init method:

    import horovod.keras as hvd

    hvd.init()

This will ensure that the MPI_Init function is called, setting up the communications structure and assigning ranks to all workers.

Broadcasting Weights

Broadcasting the neuron weights is done using a callback to the Model.fit Keras method. In fact, many of Horovod’s features are implemented as callbacks to Model.fit, so it’s worthwhile to define a callback list object for holding all the callbacks.

    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

You’ll notice that the BroadcastGlobalVariablesCallback takes a single argument that’s been set to 0. This is the root worker, which will be responsible for reading checkpoint files or generating new initial weights, broadcasting weights at the beginning of the training run, and writing checkpoint files periodically so that work is not lost if a training job fails or terminates.

Wrapping the Optimizer Function

The optimizer function must be wrapped so that it can aggregate error information from all workers before executing. Horovod’s DistributedOptimizer function can wrap any optimizer which inherits Keras’ base Optimizer class, including SGD, Adam, Adadelta, Adagrad, and others.

    import keras.optimizers

    # Wrap a standard Keras optimizer so that it aggregates across workers
    opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0))

The distributed optimizer will now use an allreduce collective to aggregate gradient information from training batches onto all workers, rather than collecting it only on the root worker. This allows the workers to independently update their models rather than waiting for the root to re-broadcast updated weights before beginning the next training batch.

Averaging Metrics

Between steps, error metrics need to be averaged to calculate the global loss. Horovod provides another callback, MetricAverageCallback, to do this.

    callbacks = [
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
        hvd.callbacks.MetricAverageCallback(),
    ]



This will ensure that optimizations are performed on the global metrics, not the metrics local to each worker.

Writing Checkpoints from a Single Worker

When using distributed deep learning, it’s important that only one worker write checkpoint files to ensure that multiple workers writing to the same file does not produce a race condition, which could lead to checkpoint corruption.

Checkpoint writing in Keras is enabled by another callback to Model.fit. However, we only want to call this callback from one worker instead of all workers. By convention, we use worker 0 for this task, but technically we could use any worker for this task. The one good thing about worker 0 is that even if you decide to run your distributed deep learning job with only 1 worker, that worker will be worker 0.



    callbacks = [ ... ]

    # Only the root worker (rank 0) writes checkpoint files
    if hvd.rank() == 0:
        callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

Once a neural network can be trained in a distributed fashion across multiple workers, the Train/Test/Tune cycle can be sped up dramatically.

The figure below shows exactly how dramatically. The three tests shown are the training speed of the Keras DenseNet model on a single Zenith node without distributed deep learning (far left), the Keras DenseNet model with distributed deep learning on 32 Zenith nodes (64 MPI processes, 2 MPI processes per node, center), and a Keras VGG16 version using distributed deep learning on 64 Zenith nodes (128 MPI processes, 2 MPI processes per node, far right). By using 32 nodes instead of a single node, distributed deep learning was able to provide a 47x improvement in training speed, taking the training time for 10 epochs on the ChestXray14 data set from 2 days (50 hours) to less than 2 hours!
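The figures quoted above can be checked with a line or two of arithmetic. The inputs below are taken from the text; the per-node ratio is derived here and is not a number the article reports.

```python
baseline_hours = 50.0   # single-node DenseNet, 10 epochs (from the text)
speedup = 47.0          # reported improvement with distributed training
nodes = 32              # Zenith nodes used for the distributed DenseNet run

distributed_hours = baseline_hours / speedup
speedup_per_node = speedup / nodes
print(round(distributed_hours, 2), round(speedup_per_node, 2))
```

The result is a little over one hour, consistent with “less than 2 hours”. A per-node ratio above 1 suggests the two-process-per-node configuration also used each node more effectively than the single-node baseline did, though that reading is an inference rather than something the text states.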

vgg_throughput.png

Performance comparisons of Keras models with distributed deep learning using Horovod



The VGG variant, trained on 64 Zenith nodes (128 MPI processes), completed in less than an hour the same number of epochs that the single-node DenseNet version needed to train, although it required more epochs to converge. It was, however, able to converge to a higher-quality solution. This VGG-based model outperformed the baseline, single-node model in 4 of 14 conditions, and was able to achieve nearly 90% accuracy in classifying emphysema.

vgg_accuracy.jpg

Accuracy comparison of baseline single-node DenseNet model vs VGG variant with distributed deep learning



Conclusion

In this post we’ve shown you how to accelerate the Train/Test/Tune cycle when developing neural network-based models by speeding up the training phase with distributed deep learning. We walked through the process of transforming a Keras-based model to take advantage of multiple nodes using the Horovod framework, and showed how these few simple code changes, coupled with some additional compute infrastructure, can reduce the time needed to train a model from days to minutes, allowing more time for the testing and tuning pieces of the cycle. More time for tuning means higher-quality models, which means better outcomes for patients, customers, or whoever will benefit from the deployment of your model.


Lucas A. Wilson, Ph.D. is the Lead Data Scientist in Dell EMC’s HPC & AI Engineering group. (Twitter: @lucasawilson)


AML Report option not available

Hi, 

I’m looking to generate the Advanced Machine Learning report referenced here: https://support.symantec.com/en_US/article.HOWTO125816.html

I’ve followed the steps of Scheduled Reports > Add > Computer Status, but there is no option for Advanced Machine Learning (Static) Content Distribution.

We’re currently on 14.0.1 as suggested. And I confirmed all of the required AML settings are enabled in our environment. Am I missing something or has this report option been removed? Any help would be appreciated. 

Thanks.


Our Customer Journey from Big Data to AI



Analytics – A journey to AI

Artificial intelligence (AI) has been around in concept since the 1950s, when Arthur L. Samuel created a learning algorithm that allowed a machine to beat the local state checkers champion. Yet it took the largest supercomputer at that time and all its compute power to run that single algorithm to teach the machine how to play. The barrier to entry was so high that it was out of reach for most businesses and research facilities, so it never took off. Fast forward to today: the game has changed; the cost of compute …




Model Compatibility using Intel BigDL

Deep learning has exploded across both the popular and business media landscapes. Current and upcoming technology capable of powering the calculations required by deep learning algorithms has enabled a rapid transition from new theories to new applications. One of the current supporting technologies that is expanding at an increasing rate is faster and more use-case-specific hardware accelerators for deep learning, such as GPUs with tensor cores and FPGAs hosted inside servers. Another foundational deep learning technology that has advanced very rapidly is the software that enables implementations of complex deep learning networks. New frameworks, tools and applications are entering the landscape quickly to accomplish this, some compatible with existing infrastructure and others that require workflow overhauls.



As organizations begin to develop more complex strategies for incorporating deep learning, they are likely to start to leverage multiple frameworks and application stacks for specific use cases and to compare performance and accuracy. But training models is time consuming and ties up expensive compute resources. In addition, adjustments and tuning can vary between frameworks, creating a large number of framework knobs and levers to remember how to operate. What if there were a framework that could just consume these models right out of the box?



BigDL is a distributed deep learning framework with native Spark integration, allowing it to leverage Spark during model training, prediction, and tuning. One of the things that I really like about Intel BigDL is how easy it is to work with models built and/or trained in TensorFlow, Caffe and Torch. This rich interoperability support for deep learning models allows BigDL applications to leverage the plethora of models that currently exist with little or no additional effort. Here are just a few ways this might be used in your applications:



  • Efficient Scale Out – Using BigDL you can scale out a model that was trained on a single node or workstation and leverage it at scale with Apache Spark. This can be useful for training on a large distributed dataset that already exists in your HDFS environment or for performing inferencing such as prediction and classification on a very large and often changing dataset.



  • Transfer Learning – Load a pretrained model with weights and then freeze some layers, append new layers and train / retrain layers. Transfer learning can improve accuracy or reduce training time by allowing you to start with a model that is used to do one thing, such as classify different objects, and use it to accelerate development of a model to classify something else, such as specific car models.

  • High Performance on CPU – GPUs get all of the hype when it comes to deep learning. By leveraging Intel MKL and multithreaded Spark tasks, you can achieve better CPU-driven performance with BigDL than you would see with TensorFlow, Caffe or Torch when using Xeon processors.

  • Dataset Access – Designed to run in Hadoop, BigDL can compute where your data already exists. This can save time and effort since data does not need to be transferred to a separate GPU environment to be used with the deep learning model. This means that your entire pipeline, from ingest to model training and inference, can all happen in one environment, Hadoop.



Real Data + Real Problem



Recently I had a chance to take advantage of the model portability feature of BigDL. After learning of an internal project here at Dell EMC, leveraging deep learning and telemetry data to predict component failures, my team decided we wanted to take our Ready Solution for AI – Machine Learning with Hadoop and see how it did with the problem.



The team conducting the project for our support organization was using TensorFlow with GPU accelerators to train an LSTM model. The dataset was sensor readings from internal components collected at 15-minute intervals, showing all kinds of metrics like temperature, fan speeds, runtimes, faults, etc.



Initially my team wanted to focus on testing out two use cases for BigDL:



  • Using BigDL model portability to perform inference using the existing TensorFlow model
  • Implementing an LSTM model in BigDL and training it with this dataset



As always, there were some preprocessing and data cleaning steps that had to happen before we could get to modeling and inference. Luckily for us, though, we received the clean output of those steps from our support team, so we could get started quickly. We received the data in the form of multiple CSV files, already balanced with records of devices that did fail and those that did not. We got over 200,000 rows of data that looked something like this:



device_id,timestamp,model,firmware,sensor1,sensor2,sensor3,sensorN,label

String,string,string,string,float,float,float,float,int
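To make the layout above concrete, here is a minimal sketch of reading rows in that shape into typed records using only the Python standard library. The two sample rows, the `STRING_COLUMNS` tuple and the `parse_rows` helper are made up for illustration; they are not taken from the real dataset or from our pipeline.

```python
import csv
import io

# Hypothetical rows that follow the header shown above (values are invented).
RAW = """device_id,timestamp,model,firmware,sensor1,sensor2,sensor3,sensorN,label
dev-001,2018-06-01T00:00:00,modelA,1.0.2,41.5,7200.0,0.0,3.2,0
dev-002,2018-06-01T00:15:00,modelB,2.1.0,58.9,9100.0,1.0,4.8,1
"""

STRING_COLUMNS = ("device_id", "timestamp", "model", "firmware")


def parse_rows(text):
    """Return a (metadata, features, label) tuple for each CSV row."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        metadata = {name: row[name] for name in STRING_COLUMNS}
        # Every column that is neither metadata nor the label is a float reading.
        features = [float(value) for name, value in row.items()
                    if name not in STRING_COLUMNS and name != "label"]
        rows.append((metadata, features, int(row["label"])))
    return rows


parsed = parse_rows(RAW)
```

Each tuple keeps the string columns apart from the numeric sensor readings, which is the split a model needs: floats as features, the int as the label.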



Converting the data to a tfrecord format used by Tensorflow was being done with Python and pandas dataframes. Moving this process to be done in Spark is another area we knew we wanted to dig in to, but to start we wanted to focus on our above mentioned goals. When we started the pipeline looked like this:



From Tensorflow to BigDL

For BigDL, instead of creating tfrecords we needed to end up with an RDD of Sample(s). Each Sample is one record of your dataset in the form of feature, label. Feature and label are each one or more tensors, and we create the Sample from ndarrays. Looking at the current pipeline, we were able to simply take the objects created before writing to tfrecord and instead write a function that took these arrays and formed our RDD of Sample for BigDL.



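from bigdl.util.common import Sample

# sc is the active SparkContext.
def convert_to(x, y):
    # Pair each feature array with its label, distribute the pairs as an RDD,
    # and turn every pair into a BigDL Sample.
    record = zip(x, y)
    record_rdd = sc.parallelize(record)
    sample_rdd = record_rdd.map(lambda r: Sample.from_ndarray(r[0], r[1]))
    return sample_rdd

train = convert_to(x_train, y_train)
val = convert_to(x_val, y_val)
test = convert_to(x_test, y_test)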



After that we took the pb and bin files representing the pretrained model's definition and weights and loaded them using the BigDL Model.load_tensorflow function. It requires knowing the input and output names for the model, but the tensorflow graph summary tool can help out with that. It also requires a pb and bin file specifically, but if what you have is a ckpt file from tensorflow, it can be converted with tools provided by BigDL.



from bigdl.nn.layer import Model

model_def = "tf_model/model.pb"        # graph definition
model_variable = "tf_model/model.bin"  # weights
inputs = ["Placeholder"]
outputs = ["prediction/Softmax"]

trained_tf_model = Model.load_tensorflow(
    model_def, inputs, outputs,
    byte_order="little_endian",
    bigdl_type="float",
    bin_file=model_variable)



Now, with our data already in the correct format, we can run inference against our test dataset. BigDL provides Model.evaluate, and we can pass it our RDD, a batch size and the validation method to use, in this case Top1Accuracy.



results = trained_tf_model.evaluate(test,128,[Top1Accuracy()])
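For readers new to the metric, the sketch below shows what a top-1 accuracy calculation does in plain Python. It is not BigDL's implementation; the `top1_accuracy` helper, the scores and the labels are invented for illustration, and class indices are 0-based here purely for simplicity, so check the label convention of whatever framework you use.

```python
def top1_accuracy(scores, labels):
    """Fraction of records whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        # The predicted class is the index of the largest score in the row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical softmax outputs for four records and two classes.
scores = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
labels = [0, 1, 1, 1]
accuracy = top1_accuracy(scores, labels)  # 3 of 4 predictions match
```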

Defining a Model with BigDL

After testing out loading the pretrained tensorflow model the next experiment we wanted to conduct was to train an LSTM model defined with BigDL. BigDL provides a Sequential API and a Functional API for defining models. The Sequential API is for simpler models, with the Functional API being better for complex models. The Functional API describes the model as a graph. Since our model is LSTM we will use the Sequential API.



Defining an LSTM model is as simple as:



from bigdl.nn.layer import Sequential, Recurrent, LSTM, InferReshape, Select, Linear

def build_model(input_size, hidden_size, output_size):
    model = Sequential()
    recurrent = Recurrent()
    recurrent.add(LSTM(input_size, hidden_size))
    model.add(InferReshape([-1, input_size], True))
    model.add(recurrent)
    # Keep only the output of the last time step.
    model.add(Select(2, -1))
    model.add(Linear(hidden_size, output_size))
    return model

lstm_model = build_model(n_input, n_hidden, n_classes)

After creating our model the next step is the optimizer and validation logic that our model will use to train and learn.



Create the optimizer:



optimizer = Optimizer(
    model=lstm_model,
    training_rdd=train,
    criterion=CrossEntropyCriterion(),
    optim_method=Adam(),
    end_trigger=MaxEpoch(50),
    batch_size=batch_size)

Set the validation logic:

optimizer.set_validation(
    batch_size=batch_size,
    val_rdd=val,
    trigger=EveryEpoch(),
    val_method=[Top1Accuracy()])

Now we can do trained_model = optimizer.optimize() to train our model, in this case for 50 epochs. We also set our TrainSummary folder so that the data was logged. This allowed us to also get visualizations in Tensorboard, something that BigDL supports.
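As a rough illustration of what a cross-entropy criterion measures during training, here is a small standard-library sketch. It assumes the usual softmax cross-entropy definition (the negative log-probability of the true class) and is not BigDL's code; the `softmax_cross_entropy` helper, the logits and the 0-based labels are made up for the example.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating, for stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[label]


# The loss is small when the model favors the right class, large when it
# favors the wrong one, and log(2) when two classes are scored equally.
confident = softmax_cross_entropy([4.0, 0.0], 0)
wrong = softmax_cross_entropy([4.0, 0.0], 1)
uniform = softmax_cross_entropy([0.0, 0.0], 0)
```

The optimizer's job over the 50 epochs is to adjust the model weights so that this kind of loss, averaged over each mini-batch, goes down.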



At this point we had completed the two initial tasks we had set out to do: load a pretrained Tensorflow model using BigDL and train a new model with BigDL. Hopefully you found some of this process interesting, and also got an idea of how easy BigDL is to use for this use case. The ability to run deep learning models inside Hadoop with no specialized hardware such as InfiniBand or GPU accelerators provides a great tool that is sure to change the way you view your existing analytics.





Related:

Accelerating Insights with Distributed Deep Learning

By: Lucas A. Wilson, Ph.D. and Michael Bennett

Artificial intelligence (AI) is transforming the way businesses compete in today’s marketplace. Whether it’s improving business intelligence, streamlining supply chain or operational efficiencies, or creating new products, services, or capabilities for customers, AI should be a strategic component of any company’s digital transformation.

Deep neural networks have demonstrated astonishing abilities to identify objects, detect fraudulent behaviors, predict trends, recommend products, enable enhanced customer support through chatbots, convert voice to text and translate one language to another, and produce a whole host of other benefits for companies and researchers. They can categorize and summarize images, text, and audio recordings with human-level capability, but to do so they first need to be trained.

Deep learning, the process of training a neural network, can sometimes take days, weeks, or months, and effort and expertise is required to produce a neural network of sufficient quality to trust your business or research decisions on its recommendations. Most successful production systems go through many iterations of training, tuning and testing during development. Distributed deep learning can speed up this process, reducing the total time to tune and test so that your data science team can develop the right model faster, but requires a method to allow aggregation of knowledge between systems.

There are several evolving methods for efficiently implementing distributed deep learning, and the way in which you distribute the training of neural networks depends on your technology environment. Whether your compute environment is container native, high performance computing (HPC), or Hadoop/Spark clusters for Big Data analytics, your time to insight can be accelerated by using distributed deep learning. In this article we are going to explain and compare systems that use a centralized or replicated parameter server approach, a peer-to-peer approach, and finally a hybrid of these two developed specifically for Hadoop distributed big data environments.

Container-native platforms (e.g., Kubernetes, Docker Swarm, OpenShift, etc.) have become the standard for many DevOps environments, where rapid, in-production software updates are the norm and bursts of computation may be shifted to public clouds. Most deep learning frameworks support distributed deep learning for these types of environments using a parameter server-based model that allows multiple processes to look at training data simultaneously, while aggregating knowledge into a single, central model.

The process of performing parameter server-based training starts with specifying the number of workers (processes that will look at training data) and parameter servers (processes that will handle the aggregation of error reduction information, backpropagate those adjustments, and update the workers). Additional parameter servers can act as replicas for improved load balancing.

Parameter server model for distributed deep learning

Worker processes are given a mini-batch of training data to test and evaluate and, upon completing that mini-batch, report the differences (gradients) between produced and expected output back to the parameter server(s). The parameter server(s) then update the network and transmit copies of the updated model back to the workers to use in the next round.
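The round trip described above can be sketched as a single-process simulation. This is a toy under stated assumptions, not any framework's implementation: the model is a one-parameter linear fit, the workers run one after another rather than in parallel, and the `worker_gradient` function and `ParameterServer` class are hypothetical names invented for the example.

```python
def worker_gradient(weight, batch):
    """Gradient of mean squared error for y = weight * x on one mini-batch."""
    return sum(2 * (weight * x - y) * x for x, y in batch) / len(batch)


class ParameterServer:
    """Holds the central model and applies gradients reported by workers."""

    def __init__(self, weight=0.0, learning_rate=0.05):
        self.weight = weight
        self.learning_rate = learning_rate

    def update(self, gradients):
        # Aggregate the workers' gradients by averaging, then take one step.
        mean_gradient = sum(gradients) / len(gradients)
        self.weight -= self.learning_rate * mean_gradient
        return self.weight  # the updated model sent back to every worker


# Toy data generated from y = 3x, split into one mini-batch per worker.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
batches = [data[0:2], data[2:4]]

server = ParameterServer()
weight = server.weight
for _ in range(100):
    # Each worker evaluates its mini-batch against the current model copy.
    gradients = [worker_gradient(weight, batch) for batch in batches]
    # The server aggregates, updates, and returns the new model.
    weight = server.update(gradients)
```

After the loop, `weight` has converged to the slope used to generate the data. Note that only the server ever changes the model; the workers just report gradients and receive copies.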

This model is ideal for container native environments, where parameter server processes and worker processes can be naturally separated. Orchestration systems, such as Kubernetes, allow neural network models to be trained in container native environments using multiple hardware resources to improve training time. Additionally, many deep learning frameworks support parameter server-based distributed training, such as TensorFlow, PyTorch, Caffe2, and Cognitive Toolkit.

High performance computing (HPC) environments are generally built to support the execution of multi-node applications that are developed and executed using the single program, multiple data (SPMD) methodology, where data exchange is performed over high-bandwidth, low-latency networks, such as Mellanox InfiniBand and Intel OPA. These multi-node codes take advantage of these networks through the Message Passing Interface (MPI), which abstracts communications into send/receive and collective constructs.

Deep learning can be distributed with MPI using a communication pattern called Ring-AllReduce. In Ring-AllReduce each process is identical, unlike in the parameter-server model where processes are either workers or servers. The Horovod package by Uber (available for TensorFlow, Keras, and PyTorch) and the mpi_collectives contributions from Baidu (available in TensorFlow) use MPI Ring-AllReduce to exchange loss and gradient information between replicas of the neural network being trained. This peer-based approach means that all nodes in the solution are working to train the network, rather than some nodes acting solely as aggregators/distributors (as in the parameter server model). This can potentially lead to faster model convergence.

Ring-AllReduce model for distributed deep learning
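To show the communication pattern itself, the following sketch simulates Ring-AllReduce in one Python process. It uses no MPI and is not taken from Horovod or the Baidu code; the `ring_allreduce` function is a hypothetical illustration that assumes each worker's vector can be split into as many equal chunks as there are workers.

```python
def ring_allreduce(vectors):
    """Sum equal-length vectors across workers arranged in a ring.

    Every worker ends up with the element-wise sum, and in each step every
    worker only sends one chunk to its right-hand neighbor.
    """
    n = len(vectors)
    length = len(vectors[0])
    if length % n != 0:
        raise ValueError("vector length must be divisible by the number of workers")
    size = length // n
    # chunks[w][c] is worker w's current copy of chunk c.
    chunks = [[list(v[c * size:(c + 1) * size]) for c in range(n)] for v in vectors]

    # Phase 1, scatter-reduce: in each step worker w sends chunk (w - step)
    # to worker w + 1, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [list(chunks[w][(w - step) % n]) for w in range(n)]
        for w in range(n):
            c = (w - step) % n
            dest = (w + 1) % n
            chunks[dest][c] = [a + b for a, b in zip(chunks[dest][c], outgoing[w])]

    # Worker w now holds the fully summed chunk (w + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring so that
    # every worker overwrites its partial copies with the complete sums.
    for step in range(n - 1):
        outgoing = [list(chunks[w][(w + 1 - step) % n]) for w in range(n)]
        for w in range(n):
            c = (w + 1 - step) % n
            chunks[(w + 1) % n][c] = outgoing[w]

    return [[x for chunk in worker for x in chunk] for worker in chunks]


# Three workers, each holding a locally computed gradient vector.
gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced = ring_allreduce(gradients)
```

Every process runs the same code and talks only to its neighbor, which is why no node has to act purely as an aggregator.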

The Dell EMC Ready Solutions for AI, Deep Learning with NVIDIA allows users to take advantage of high-bandwidth Mellanox InfiniBand EDR networking, fast Dell EMC Isilon storage, accelerated compute with NVIDIA V100 GPUs, and optimized TensorFlow, Keras, or PyTorch with Horovod (or TensorFlow with tensorflow.contrib.mpi_collectives) frameworks to help produce insights faster.

Hadoop and other Big Data platforms achieve extremely high performance for distributed processing but are not designed to support long running, stateful applications. Several approaches exist for executing distributed training under Apache Spark. Yahoo developed TensorFlowOnSpark, accomplishing the goal with an architecture that leveraged Spark for scheduling Tensorflow operations and RDMA for direct tensor communication between servers.

BigDL is a distributed deep learning library for Apache Spark. Unlike Yahoo’s TensorFlowOnSpark, BigDL not only enables distributed training – it is designed from the ground up to work on Big Data systems. To enable efficient distributed training, BigDL takes a data-parallel approach with synchronous mini-batch SGD (Stochastic Gradient Descent). Training data is partitioned into RDD samples and distributed to each worker. Model training is an iterative process that first computes gradients locally on each worker, using the locally stored partitions of the training data and the model to perform in-memory transformations. Then an AllReduce function schedules workers with tasks to calculate and update weights. Finally, a broadcast syncs the distributed copies of the model with the updated weights.

BigDL implementation of AllReduce functionality
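The three steps described above (local gradients, per-task weight updates, broadcast) can be sketched in a single process as follows. This is an illustration under stated assumptions rather than BigDL's actual code: there is no Spark here, the model is a tiny linear fit, the way parameters are sliced across tasks is assumed for the example, and `local_gradient` and `sync_sgd_step` are hypothetical names.

```python
def local_gradient(weights, partition):
    """MSE gradient of a linear model on one worker's data partition."""
    grad = [0.0] * len(weights)
    for features, target in partition:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grad[i] += 2 * error * x / len(partition)
    return grad


def sync_sgd_step(weights, partitions, learning_rate):
    """One synchronous mini-batch SGD iteration over all partitions."""
    n = len(partitions)
    # 1. Every worker computes a gradient on its local partition.
    grads = [local_gradient(weights, part) for part in partitions]
    # 2. Each of n tasks owns one slice of the parameters: it aggregates that
    #    slice across all workers' gradients and updates those weights.
    size = -(-len(weights) // n)  # ceiling division
    updated = list(weights)
    for task in range(n):
        for i in range(task * size, min((task + 1) * size, len(weights))):
            mean_grad = sum(g[i] for g in grads) / n
            updated[i] = weights[i] - learning_rate * mean_grad
    # 3. The updated weights are broadcast: every worker starts the next
    #    iteration from the same returned copy.
    return updated


# Toy data from y = 2*x0 - 1*x1, split into one partition per worker.
partitions = [
    [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0)],
    [((1.0, 1.0), 1.0), ((1.0, -1.0), 3.0)],
]
weights = [0.0, 0.0]
for _ in range(100):
    weights = sync_sgd_step(weights, partitions, learning_rate=0.2)
```

Because every worker waits for the broadcast before starting the next iteration, all model copies stay identical, which is what makes the training synchronous.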

The Dell EMC Ready Solutions for AI, Machine Learning with Hadoop is configured to allow users to take advantage of the power of distributed deep learning with Intel BigDL and Apache Spark. It supports loading models and weights from other frameworks such as Tensorflow, Caffe and Torch to then be leveraged for training or inferencing. BigDL is a great way for users to quickly begin training neural networks using Apache Spark, widely recognized for how simple it makes data processing.

One more note on Hadoop and Spark environments: The Intel team working on BigDL has built and compiled high-level pipeline APIs, built-in deep learning models, and reference use cases into the Intel Analytics Zoo library. Analytics Zoo is based on BigDL but helps make it even easier to use through these high-level pipeline APIs designed to work with Spark Dataframes and built in models for things like object detection and image classification.

Regardless of whether your preferred server infrastructure is container native, HPC clusters, or Hadoop/Spark-enabled data lakes, distributed deep learning can help your data science team develop neural network models faster. Our Dell EMC Ready Solutions for Artificial Intelligence can work in any of these environments to help jumpstart your business’s AI journey. For more information on the Dell EMC Ready Solutions for Artificial Intelligence, go to dellemc.com/readyforai.




Lucas A. Wilson, Ph.D. is the Lead Data Scientist in Dell EMC’s HPC & AI Engineering group. (Twitter: @lucasawilson)

Michael Bennett is a Senior Principal Engineer at Dell EMC working on Ready Solutions.

Related:

Self-Driving Storage, Part 1: AI’s Role in Intelligent Storage


Artificial Intelligence (AI) is here! With a rapidly growing number of success stories proving the possibilities and some bloopers too, there is no question that AI and machine learning technology have moved from science fiction to reality. Why now? In essence, I see it as a confluence of two trends: multi-layered recursive learning technologies inspired by a deeper understanding of how the human brain learns, and exponentially cheaper and more powerful computing. Some of the latest advances made by leveraging these trends are truly amazing: machines that take advantage of their own “bodies” to learn, machines …




What’s the Difference Between AI, Machine Learning, and Deep Learning?

Peter Jeffcock

Big Data Product Marketing

AI, machine learning, and deep learning – these terms overlap and are easily confused, so let’s start with some short definitions.

AI means getting a computer to mimic human behavior in some way.

Machine learning is a subset of AI, and it consists of the techniques that enable computers to figure things out from the data and deliver AI applications.

Deep learning, meanwhile, is a subset of machine learning that enables computers to solve more complex problems.

Those descriptions are correct, but they are a little concise. So I want to explore each of these areas and provide a little more background.

Difference Between AI, Machine Learning and Deep Learning

What Is AI?

Artificial intelligence as an academic discipline was founded in 1956. The goal then, as now, was to get computers to perform tasks regarded as uniquely human: things that required intelligence. Initially, researchers worked on problems like playing checkers and solving logic problems.

If you looked at the output of one of those checkers playing programs you could see some form of “artificial intelligence” behind those moves, particularly when the computer beat you. Early successes caused the first researchers to exhibit almost boundless enthusiasm for the possibilities of AI, matched only by the extent to which they misjudged just how hard some problems were.

Artificial intelligence, then, refers to the output of a computer. The computer is doing something intelligent, so it’s exhibiting intelligence that is artificial.

The term AI doesn’t say anything about how those problems are solved. There are many different techniques including rule-based or expert systems. And one category of techniques started becoming more widely used in the 1980s: machine learning.

What Is Machine Learning?

The reason that those early researchers found some problems to be much harder is that those problems simply weren’t amenable to the early techniques used for AI. Hard-coded algorithms or fixed, rule-based systems just didn’t work very well for things like image recognition or extracting meaning from text.

The solution turned out to be not just mimicking human behavior (AI) but mimicking how humans learn.

Think about how you learned to read. You didn’t sit down and learn spelling and grammar before picking up your first book. You read simple books, graduating to more complex ones over time. You actually learned the rules (and exceptions) of spelling and grammar from your reading. Put another way, you processed a lot of data and learned from it.

That’s exactly the idea with machine learning. Feed an algorithm (as opposed to your brain) a lot of data and let it figure things out. Feed an algorithm a lot of data on financial transactions, tell it which ones are fraudulent, and let it work out what indicates fraud so it can predict fraud in the future. Or feed it information about your customer base and let it figure out how best to segment them. Find out more about machine learning techniques here.

As these algorithms developed, they could tackle many problems. But some things that humans found easy (like speech or handwriting recognition) were still hard for machines. However, if machine learning is about mimicking how humans learn, why not go all the way and try to mimic the human brain? That’s the idea behind neural networks.

The idea of using artificial neurons (neurons, connected by synapses, are the major elements in your brain) had been around for a while. And neural networks simulated in software started being used for certain problems. They showed a lot of promise and could solve some complex problems that other algorithms couldn’t tackle.

But machine learning still got stuck on many things that elementary school children tackled with ease: how many dogs are in this picture or are they really wolves? Walk over there and bring me the ripe banana. What made this character in the book cry so much?

It turned out that the problem was not with the concept of machine learning. Or even with the idea of mimicking the human brain. It was just that simple neural networks with 100s or even 1000s of neurons, connected in a relatively simple manner, just couldn’t duplicate what the human brain could do. It shouldn’t be a surprise if you think about it; human brains have around 86 billion neurons and very complex interconnectivity.

What is Deep Learning?

Put simply, deep learning is all about using neural networks with more neurons, layers, and interconnectivity. We’re still a long way off from mimicking the human brain in all its complexity, but we’re moving in that direction.

And when you read about advances in computing from autonomous cars to Go-playing supercomputers to speech recognition, that’s deep learning under the covers. You experience some form of artificial intelligence. Behind the scenes, that AI is powered by some form of deep learning.

Let’s look at a couple of problems to see how deep learning is different from simpler neural networks or other forms of machine learning.

How Deep Learning Works

If I give you images of horses, you recognize them as horses, even if you’ve never seen that image before. And it doesn’t matter if the horse is lying on a sofa, or dressed up for Halloween as a hippo. You can recognize a horse because you know about the various elements that define a horse: shape of its muzzle, number and placement of legs, and so on.

Deep learning can do this. And it’s important for many things including autonomous vehicles. Before a car can determine its next action, it needs to know what’s around it. It must be able to recognize people, bikes, other vehicles, road signs, and more. And do so in challenging visual circumstances. Standard machine learning techniques can’t do that.

Take natural language processing, which is used today in chatbots and smartphone voice assistants, to name two. Consider this sentence and work out what the last part should be:

I was born in Italy and, although I lived in Portugal and Brazil most of my life, I still speak fluent ________.

Hopefully you can see that the most likely answer is Italian (though you would also get points for French, Greek, German, Sardinian, Albanian, Occitan, Croatian, Slovene, Ladin, Latin, Friulian, Catalan, Sicilian, Romani and Franco-Provencal and probably several more). But think about what it takes to draw that conclusion.

First you need to know that the missing word is a language. You can do that if you understand “I speak fluent…”. To get Italian you have to go back through that sentence and ignore the red herrings about Portugal and Brazil. “I was born in Italy” implies learning Italian as I grew up (with 93% probability according to Wikipedia), assuming that you understand the implications of born, which go far beyond the day you were delivered. The combination of “although” and “still” makes it clear that I am not talking about Portuguese and brings you back to Italy. So Italian is the likely answer.

Imagine what’s happening in the neural network in your brain. Facts like “born in Italy” and “although…still” are inputs to other parts of your brain as you work things out. And this concept is carried over to deep neural networks via complex feedback loops.

Conclusion

So hopefully that first definition at the beginning of the article makes more sense now. AI refers to devices exhibiting human-like intelligence in some way. There are many techniques for AI, but one subset of that bigger list is machine learning – let the algorithms learn from the data. Finally, deep learning is a subset of machine learning, using many-layered neural networks to solve the hardest (for computers) problems.

Related: