Deep Learning Model

Deep learning models

KC Santosh , ... Swarnendu Ghosh , in Deep Learning Models for Medical Imaging, 2022

3.1.1 Learning different objectives

Deep learning models are generally trained on the basis of an objective function, and the way in which the objective function is designed reveals a lot about the purpose of the model. To make this clearer, in what follows we discuss learning with and without labels, reward-based models, and multiobjective optimization.

1)

Learning with labels:

Standard supervised models require a predefined desired output. Deep learning models take this supervision into account by designing the loss function as the difference between the predicted output and a representation of the desired output. The two basic categories of supervised learning techniques, classification and regression, are both easily handled by deep neural networks [1]. In classification problems a sample is generally associated with one or more classes from a predefined set of categories. The output layer of the network has the same number of neurons as the total number of classes. The objective function generally computes a loss that signifies the difference between the output vector and a probability distribution representing the desired class values. For single-class problems, this probability distribution is a one-hot vector [2]. In such cases, negative log likelihood is an ideal loss function, as it is extremely fast and the number of computations is independent of the number of classes. For multilabel problems, where a sample can belong to more than one class, things can get tricky. One option is to represent the output as a marginal distribution, so that every output neuron follows a Bernoulli distribution; the loss function can then be written as a sum of binary cross entropy terms [3]. Another approach is to model the output as a multinomial distribution, where each class has a certain probability of occurrence; a more generic categorical cross entropy function is appropriate for such cases. Many other tasks can also be reduced to classification problems. For example, image segmentation can be treated as a pixel-level classification problem, where the output generates a probability distribution for each pixel of the image [4]. Similarly, a sequence generation problem can be treated as a sequence of classifications over a vocabulary space. Other supervised learning tasks take the form of regression problems. Deep learning networks can also be used to generate real-valued outputs. In such cases the output tensor has the same shape as the desired tensor, and a mean squared error loss between the two is used for back-propagation. This has several applications in generative tasks such as sample reconstruction or score prediction. Another variant of regression is density estimation, where a KL divergence loss function is used.
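The distinction between these loss functions can be made concrete with a short, hedged sketch. The PyTorch snippet below shows the categorical cross entropy used for single-class problems, the sum of binary cross entropy terms used for multilabel problems, and the mean squared error used for regression; the tensor shapes and values are illustrative assumptions, not taken from the chapter.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for a batch of 4 samples over 5 classes.
logits = torch.randn(4, 5)

# Single-label classification: targets are class indices, and the loss is the
# negative log likelihood of the correct class (categorical cross entropy).
targets = torch.tensor([0, 3, 1, 4])
single_label_loss = F.cross_entropy(logits, targets)

# Multilabel classification: each output neuron models an independent
# Bernoulli probability, so the loss is a sum (here, mean) of binary
# cross entropy terms over the classes.
multi_targets = torch.tensor([[1., 0., 1., 0., 0.],
                              [0., 1., 0., 0., 1.],
                              [0., 0., 0., 1., 0.],
                              [1., 1., 0., 0., 0.]])
multi_label_loss = F.binary_cross_entropy_with_logits(logits, multi_targets)

# Regression: the output tensor has the same shape as the desired tensor
# and a mean squared error loss is back-propagated.
prediction = torch.randn(4, 5)
target_values = torch.randn(4, 5)
regression_loss = F.mse_loss(prediction, target_values)
```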

2)

Learning without labels:

One of the most common implementations of unsupervised learning algorithms can be seen in autoencoders [5]. Autoencoders are networks designed to map samples from an input space to a fixed-dimensional feature space, from which the input is reconstructed. The reconstruction loss can be a simple mean squared error between the predicted and actual inputs; no other ground truth is necessary for computing the feature space. Naturally, if the reconstruction is almost equal to the input, the compressed feature space must be quite successful in encoding the information of the input space. This kind of network has several applications, such as unsupervised clustering, feature extraction, and even transfer learning problems.
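As a minimal illustration of this idea, the following hedged PyTorch sketch defines a small fully connected autoencoder and computes the mean squared reconstruction loss directly against the input; the layer sizes and the 784-dimensional input are illustrative assumptions, not taken from the chapter.

```python
import torch
import torch.nn as nn

# Hypothetical autoencoder for 784-dimensional inputs (e.g., flattened 28x28
# images) with a 32-dimensional bottleneck; all sizes are illustrative.
class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
x = torch.rand(16, 784)                                # unlabeled batch
reconstruction = model(x)
loss = nn.functional.mse_loss(reconstruction, x)       # no labels needed
loss.backward()
```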

3)

Reward-based models:

Rewards are numerical values denoting the success or failure of an operation. The goal of reward-based models is to compute an optimal policy by following gradients that update the decision-making process based on the probability of gaining a reward. The most common application of such models is reinforcement learning [6].
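A brief, hedged sketch of a REINFORCE-style policy-gradient update may help make this concrete; the log probabilities, rewards, and discount factor below are illustrative placeholders rather than anything taken from the chapter.

```python
import torch

# Minimal REINFORCE-style policy-gradient sketch. `log_probs` stand in for the
# log probabilities of the actions actually taken during an episode and
# `rewards` for the rewards received afterwards.
log_probs = torch.randn(10, requires_grad=True)
rewards = torch.rand(10)

# Discounted return for each step (gamma is the discount factor).
gamma, returns, running = 0.99, [], 0.0
for r in reversed(rewards.tolist()):
    running = r + gamma * running
    returns.insert(0, running)
returns = torch.tensor(returns)

# The policy-gradient loss increases the probability of actions that led to
# high returns and decreases it for actions that led to low returns.
loss = -(log_probs * returns).sum()
loss.backward()
```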

4)

Multiobjective optimization:

Deep learning networks need not be limited to a single objective function. Loss values from multiple objective functions can be combined in various ways to learn several objectives at the same time [7]. An example of such an application is the joint minimization of a classification loss along with a parameter norm penalty for regularization.
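A simple way to picture this is a total loss that sums a classification term and an L2 parameter norm penalty, as in the hedged PyTorch sketch below; the model, data, and weighting factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical classifier; sizes are illustrative.
model = nn.Linear(20, 5)
x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))

classification_loss = nn.functional.cross_entropy(model(x), y)

# Parameter norm penalty (L2 regularization term).
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

# Weighted combination of the two objectives; lam is a tunable trade-off.
lam = 1e-4
total_loss = classification_loss + lam * l2_penalty
total_loss.backward()
```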

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128235041000131

Efficient Deep Learning Approaches for Health Informatics

T.M. Navamani ME, PhD , in Deep Learning and Parallel Computing Environment for Bioengineering Systems, 2019

Genomics

Deep learning models are widely used to extract high-level abstract features, provide improved performance over traditional models, increase interpretability, and better understand and process biological data. To predict the splicing action of exons, a fully connected feedforward neural network was designed by Xiong et al. [60]. In recent years, CNNs have been applied directly to DNA data without the need to define features a priori [2], [44]. Compared to a fully connected network, CNNs use fewer parameters by applying a convolution operation over the input space and sharing parameters across regions. Hence, these models can be trained on large DNA sequence datasets and achieve improved pattern detection accuracy. DeepBind, a deep architecture based on CNNs, was proposed by Alipanahi et al. [57] to predict the specificities of DNA- and RNA-binding proteins. CNNs were also used to predict chromatin marks from a DNA sequence [44]. Angermueller et al. [35] incorporated CNNs to predict DNA methylation states. Like CNNs, other deep architectures have also been applied to extract features from raw DNA sequence data and to process the data.
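To illustrate why CNNs need no predefined features for DNA, the hedged sketch below applies a 1D convolution directly to one-hot encoded sequences, loosely in the spirit of models such as DeepBind; the architecture, sequence length, and binary output are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical 1D CNN over one-hot encoded DNA; sizes are illustrative.
class DnaCnn(nn.Module):
    def __init__(self, seq_len=200):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=16, kernel_size=8)  # 4 = A,C,G,T
        self.pool = nn.AdaptiveMaxPool1d(1)     # keep the strongest motif response
        self.fc = nn.Linear(16, 1)              # e.g., binding / no binding

    def forward(self, x):                       # x: (batch, 4, seq_len)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)
        return self.fc(h)

model = DnaCnn()
batch = torch.zeros(2, 4, 200)
batch[:, 0, :] = 1.0                            # toy input: all-'A' sequences
logits = model(batch)
```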

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128167182000142

Hardware Accelerator Systems for Artificial Intelligence and Machine Learning

Won Jeon , ... Won Woo Ro , in Advances in Computers, 2021

5.3 Overcoming GPU memory capacity limits with CPU memory space

Training deep learning models requires a substantial amount of memory, sometimes exceeding the memory capacity of a single GPU. For instance, training VGG-16, a heavy CNN model, with a batch size of 256 requires 28 GB of memory, while high-end GPUs such as the Titan V ship with only 12 GB [17]. The disparity between the memory requirements of deep learning models and the insufficient memory capacity of a GPU forces practitioners to train these models with a smaller batch size, use multiple GPUs [71], or buy a new GPU with higher memory capacity [20]. However, reducing the batch size increases training time, and building a new system with multiple GPUs or a new GPU is costly. Moreover, using GPUs with higher memory capacity is not a panacea: as deep learning models include more layers and become more complex, even current state-of-the-art GPUs may not be able to train future models with a large batch size. To mitigate the memory capacity limits of GPUs when training deep learning models, researchers have proposed memory management schemes based on the memory access patterns of deep learning models.

Rhu et al. noted that the memory allocation scheme in deep learning frameworks does not manage precious GPU memory space effectively [57]. They observed that DNN models are trained with a stochastic gradient descent (SGD) algorithm in a layer-wise manner. To train a neural network model, forward propagation is first computed from the input layer to the output layer, creating intermediate values (feature maps) at each layer. Then a loss function is calculated at the end of the forward propagation. Backward propagation trains each layer with the result of the loss function and the feature maps created during forward propagation, but this time the computation proceeds from the output layer to the input layer. As a result, data in the earlier layers will not be accessed until those layers are reached during backward propagation. Furthermore, once backward propagation is finished on a layer, the data used to compute that layer will not be accessed again, so its memory space can be reclaimed. Deep learning frameworks, however, allocate the entire memory space required for training. This policy wastes precious GPU memory space since a GPU must hold unused data in its memory.

To overcome the memory capacity limit of a single GPU, Rhu et al. proposed virtualized Deep Neural Networks (vDNN), a transparent runtime memory management solution that virtualizes the memory usage of both GPU and CPU [57]. Based on the observation that not all data is used on the GPU during training, vDNN transparently manages data in the following three cases. First, vDNN offloads the data of a layer to the CPU during forward propagation once its computation is finished. Migrating data is overlapped with layer computation on the GPU, so most of the offloading overhead is hidden. Fig. 24A shows data that is no longer needed for the forward propagation of a layer and can therefore be migrated to the CPU. Second, vDNN prefetches the offloaded data back to the GPU before the processor begins the corresponding backward propagation; Fig. 24B illustrates an example of prefetching data for the backward propagation of a layer. Lastly, vDNN saves additional memory by releasing data that will not be accessed again. With this layer-wise memory management, vDNN achieves memory savings of up to 92%, requiring only 4.2 GB of GPU memory when training a VGG-style network containing hundreds of layers.
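vDNN itself is implemented inside the framework's GPU memory manager, but the underlying idea of offloading activations to CPU memory during the forward pass and bringing them back for the backward pass can be sketched with PyTorch's save_on_cpu context manager; the model and sizes below are illustrative assumptions, and this is only a related mechanism, not vDNN itself.

```python
import torch
import torch.nn as nn

# Requires a CUDA-capable GPU; layer and batch sizes are illustrative.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 4096), nn.ReLU(),
                      nn.Linear(4096, 10)).cuda()
x = torch.randn(256, 4096, device="cuda")
target = torch.randint(0, 10, (256,), device="cuda")

# Saved activations are offloaded to (pinned) CPU memory during the forward
# pass and copied back to the GPU when backward propagation needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = nn.functional.cross_entropy(model(x), target)
loss.backward()
```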

Fig. 24

Fig. 24. An example of vDNN saving GPU memory in (A) forward propagation and (B) backward propagation [57].

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245820300905

Introduction to machine reading comprehension

Chenguang Zhu , in Machine Reading Comprehension, 2021

1.3 Deep learning

Deep learning is currently one of the hottest areas of research in AI. Models based on deep learning play major roles in image recognition, speech recognition, NLP, and many other applications. The vast majority of MRC models nowadays are based on deep learning as well. Therefore this section will describe the characteristics of deep learning and the successful use cases.

1.3.1 Features of deep learning

Why does deep learning, as a branch of machine learning, stand out from the numerous other research directions? There are several important reasons, as follows.

First, most deep learning models have high model complexity. Deep learning is based on artificial neural networks (ANNs), and one of the characteristics of ANNs is that their model size is controllable: even with a fixed input dimension, the number of model parameters can be regulated by adjusting the number of network layers, the number of connections, and the layer sizes. As a result, deep learning makes it easy to increase model complexity and make more efficient use of massive data. At the same time, studies have shown that the accuracy of deep learning models can increase with larger amounts of data. As the field of MRC continues to evolve, more and more datasets emerge, making deep learning the most common machine learning architecture in reading comprehension.

Second, deep learning has a powerful feature learning ability. In machine learning, the performance of a model largely depends on how well it learns a good representation of the data, that is, representation learning. Traditional machine learning models require a predefined procedure for extracting task-specific features. Prior to the advent of deep learning, feature extraction was often manual and required knowledge from domain experts. In contrast, deep learning relies on neural networks to automatically learn effective feature representations via nonlinear transformations of the primitive data features, for example, word vectors or image pixels. In other words, deep learning can effectively obtain salient features that are helpful to the target task, without the need for model designers to possess special domain knowledge. As a result, it greatly increases the efficiency of designing deep learning models for tasks in various applications.

Third, deep learning enables end-to-end learning. Previously, many machine learning models proposed multistep solutions in the form of pipelines, such as feature learning→feature categorization→modeling each category→model ensembling. However, since each step can only be independently optimized, it is difficult to simultaneously optimize the whole system to improve its performance. Moreover, if any step within the model is updated, it is likely that all downstream steps have to be adapted as well, which greatly reduces the efficiency. One advantage of deep learning is that it enables end-to-end learning via the featurization ability of neural networks: feed in the raw data as input, and output the required result. This approach can optimize all parameters in an orchestrated manner to boost accuracy. For example, in MRC, the model takes in the article and question text, and outputs the answer text. This greatly simplifies the optimization and is also easy to use and deploy.

Fourth, the hardware for deep learning, especially the Graphics Processing Unit (GPU), is being continuously upgraded. As deep learning models are usually large, computational efficiency has become a very important factor for the progress of deep learning. Fortunately, improved GPU designs have greatly accelerated the computation. Compared with CPUs, GPUs have greater floating-point computing power, faster read–write speed, and better parallelism. The development of GPUs over the last decade has followed Moore's law much as early CPUs did, with computing speed and device complexity growing exponentially over time. The GPU industry, represented by companies such as NVIDIA, Intel, and Google, continues to evolve and develop specialized processors for deep learning, contributing to the development and application of the entire deep learning field.

Fifth, the emergence and prosperity of deep learning frameworks and communities have immensely helped promote the boom of deep learning. With the advent of frameworks such as TensorFlow, PyTorch, and Keras, neural networks can be automatically optimized and the most commonly used network modules are predefined, making deep learning development much simpler. Meanwhile, deep learning communities have quickly thrived. Every time a new research result appears, there are developers who immediately implement, validate, and open-source models, popularizing new technologies at an unprecedented pace. Academic paper repositories (e.g., arXiv) and open-source code platforms (e.g., GitHub) greatly facilitate the communication between researchers and developers, and considerably lower the threshold for participation in deep learning research. For example, within a few months of the publication and open sourcing of the breakthrough Bidirectional Encoder Representations from Transformers (BERT) model in 2018 (more details in Chapter 6, Pretrained Language Model), models utilizing BERT had taken the top places in MRC competitions such as SQuAD and CoQA (Fig. 1.1).

Figure 1.1. The top three models in the machine reading comprehension competition SQuAD 2.0 are all based on BERT.

1.3.2 Achievements of deep learning

Since the advent of deep learning, it has achieved many remarkable results in various fields such as speech, vision, and NLP.

In 2009 the father of deep learning, Geoffrey Hinton, worked with Microsoft Research to significantly improve the accuracy of speech recognition systems through the Deep Belief Network, which was quickly reproduced by IBM, Google, and HKUST. This is also one of the earliest success stories of deep learning. Seven years later, Microsoft further used a large-scale deep learning network to reduce the word error rate of speech recognition to 5.9%. This is the first time ever a computer model achieved the same performance as a professional stenographer.

In 2012 the deep learning model AlexNet achieved 84.6% in Top-5 accuracy in the large-scale image recognition contest ILSVRC2012, outperforming the second place by over 10%.

In 2016 Stanford University introduced the MRC dataset SQuAD, which includes 500 articles and over 100,000 questions. Just 2 years later, Google's pretrained deep learning model BERT reached an accuracy of 87.4% in exact match and 93.2% in F1 score, surpassing the human performance (82.3% in exact match and 91.2% in F1 score), which impressed the whole industry.

In 2018 Microsoft developed a deep learning translation system which for the first time achieved the same level of translation quality and accuracy as a human translator on the Chinese–English News dataset.

These achievements manifest the power of deep learning from different aspects and lay a solid foundation for its adoption in industry. However, we also observe that deep learning has some unresolved issues. For example, many deep learning models are still a "black box," making it impossible to explain how the model produces its output for a particular input instance and very hard to correct specific errors. In addition, most deep learning models lack the ability of reasoning, induction, and common sense. There is much ongoing research aimed at solving these issues. Hopefully, in the near future, deep learning will solve these problems and enable computers to have the same level of intelligence as humans.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323901185000011

Artificial intelligence and deep learning in retinal image analysis

Philippe Burlina , ... Aurélio Campilho , in Computational Retinal Image Analysis, 2019

4 Deep learning for retinal biomarker extraction

Besides the tasks of eye disease diagnosis and semantic segmentation reported in the previous sections, retinal biomarker extraction is an area of interest for the application of deep learning methods. In addition to ophthalmic and retinal diseases, the eye, and the retina in particular, is a window to the human body, allowing direct noninvasive detection and diagnosis of several systemic diseases beyond diabetes, for example, hypertension, cardiovascular disease, and brain diseases such as Alzheimer's, Parkinson's, and other age-related impairments. These are often related to specific biomarkers that can be extracted from retinal images, usually by detecting and segmenting the main anatomical regions such as the retinal vasculature (including the arteriolar and venular networks), the optic disc, the macular region, and the fovea. These landmarks can then be used for the characterization of higher-level retinal biomarkers, namely arteriovenous nicking, retinal arteriolar constriction, vascular calibers, vascular tortuosity, arteriolar-venular ratio (AVR), optic disc/optic cup ratio, etc. These form the basis for computing several biomarkers for systemic diseases such as hypertension, diabetic retinopathy, cardiovascular diseases, and glaucoma.

Deep learning approaches have consistently improved on previous performance in locating, segmenting, and/or analyzing the aforementioned retinal biomarkers. It is beyond the scope of this work to provide a detailed overview of every recent method leveraging deep neural networks for such tasks, but we believe it is worth introducing a concise summary of the most relevant recent advances. Most of them are related to either retinal vasculature or optic disc extraction and analysis. Accordingly, and in order to offer a compact view of these two groups of problems, we provide two tables and refer the reader to recent surveys of the state of the art in these topics, such as [67] or [68], for further information.

Table 1 provides an overview of a selection of recent papers describing the use of DL-based techniques for solving the task of vasculature segmentation and the related goal of artery/vein classification, whereas in Table 2, the detection and segmentation of the optic disc are addressed.

Table 1. AI for retinal image analysis.

Method | Task | Observations
Liskowski et al. [69] | Vessel segmentation | Standard CNN architecture; patch-center pixel classification
Li et al. [70] | Vessel segmentation | U-Net-like autoencoder architecture; full-patch prediction and reconstruction
Maninis et al. [71] | Vessel segmentation | Base CNN + two auxiliary side layers; joint vessel and optic disc segmentation
Fu et al. [72] | Vessel segmentation | Multiscale CNN + auxiliary side layer; conditional random field (CRF) refinement
Mo et al. [73] | Vessel segmentation | U-Net-like autoencoder architecture (FCN); auxiliary classifiers and transfer learning
Zhao et al. [74] | Vessel segmentation | GANs for training data synthesis; independent second trained model
Lin et al. [75] | Vessel segmentation | U-Net-like autoencoder architecture (FCN); holistic integration of CNN and CRFs
Yan et al. [76] | Vessel segmentation | U-Net-like autoencoder architecture; pixel-wise and segment-wise balancing loss
Wu et al. [77] | Vessel segmentation | Concatenation of two CNNs; probability map followed by refinement
Yan et al. [78] | Vessel segmentation | U-Net-like autoencoder architecture; thick/thin vessel segmentation + fusion
Welikala et al. [79] | Artery/vein classification | Standard CNN architecture; vessel centerline pixel classification
Meyer et al. [80] | Artery/vein classification | U-Net-like autoencoder architecture; studies optimal color representation
Xu et al. [81] | Artery/vein segmentation | U-Net-like autoencoder architecture; weighted cross-entropy loss
Galdran et al. [82] | Artery/vein segmentation | U-Net-like autoencoder architecture; takes into account label uncertainty

Notes: Overview of recent DL-based techniques for retinal vasculature analysis.

Table 2. AI for retinal image analysis.

Method | Task | Observations
Fu et al. [83] | OD/cup segmentation | Multiscale network, multilabel loss; application in glaucoma screening
Zilly et al. [84] | OD/cup segmentation | Improved entropy-based sampling; unsupervised graph-cut refinement
Gu et al. [85] | OD segmentation | Pretrained ResNet + atrous convolutions; multiscale spatial pyramid pooling
Liu et al. [86] | OD segmentation | U-Net-like autoencoder architecture; patch-level adversarial module
Sun et al. [87] | OD segmentation | Faster R-CNN for OD detection; bounding box mapped to ellipse
Fu et al. [88] | OD/cup segmentation | Hierarchical ensemble of four networks; application in glaucoma screening
Sedai et al. [89] | OD/cup segmentation | Semisupervised variational autoencoder; supervised training on few examples
Sedai et al. [90] | Fovea segmentation | Coarse-to-fine approach; F-CNNs for localization + segmentation
Al-Bander et al. [91] | OD/fovea localization | Standard CNN architecture; problem formulated as regression
Meyer et al. [92] | OD/fovea localization | U-Net-like autoencoder architecture; bi-distance map regression
Araujo et al. [93] | OD/fovea localization | Combination of YOLO and U-Net; optional OD segmentation

Notes: Overview of selected DL-based techniques for several retinal image analysis tasks for retinal biomarker extraction.

4.1 Automatic retinal biomarker discovery

The ability of deep learning models to automatically extract relevant information from retinal images may make it possible to train such algorithms to identify patterns in these images related to clinical findings that practitioners would not be able to discern. Research in this direction has only recently started to appear. Most notably, in Ref. [94] the authors manage to predict several cardiovascular risk factors (age, with an MAE of 3.26 years; gender, AUC = 0.97; smoking status, AUC = 0.71; systolic blood pressure, MAE = 11.23 mmHg) as well as major adverse cardiac events (AUC = 0.70). The model trained for this was relatively standard (Inception V3), and it was trained on retinal images from 48,101 patients from the UK Biobank and 236,234 patients from EyePACS. The performance of this model was close to that of other cardiovascular risk calculators requiring blood sample analysis. This new and exciting research avenue may result in the understanding and association of retinal findings with other cardiovascular or neurodegenerative diseases.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780081028162000198

Machine learning algorithms for medical image security

J. Jennifer Ranjani , C. Jeyamala , in Intelligent Data Security Solutions for e-Health Applications, 2020

2.1 Brief insight into deep learning networks

Before describing the deep learning models for steganography, it is necessary to understand the overall perspective of deep learning networks. A generic structure of deep learning, along with different steganography and steganalysis scenarios, is discussed in [3]. Deep learning networks for images learn inherent features along with the discriminating boundaries that differentiate the classes. A computing unit in the neural network can be represented as a node in an oriented graph. The network modifies the parameters of these computing units by learning from the training samples. Pixel intensities are fed as input to these computing blocks, which then transmit the values to subsequent blocks. A convolution module is one of the important stages in any deep network. Depending on the quality of the input images, a preprocessing module can precede the convolution module. The preprocessing module can contain one or more filters to make the image suitable for the subsequent stages. The filtered image is then fed as input to the convolution module. The convolution module comprises operations such as convolution, activation, pooling, and normalization.

The depth-wise separable convolution and inception networks contain two or more convolution layers. An activation function is applied to the filtered image after each convolution. Typical activation functions include f(x) = |x|, f(x) = sin(x), the Gaussian f(x) = e^(−x²/σ²), and f(x) = max(0, x) (ReLU). To perform back propagation, the activation function must be differentiable; hence, functions whose derivatives require little computation are typically chosen. However, functions like the hyperbolic tangent may be avoided because they can cause the gradient to vanish during back propagation. A pooling operation computes the average or maximum within a local neighborhood. Often downsampling is coupled with pooling to reduce the size of the feature map. Normalization is required to condition the data. Commonly used types of normalization are batch normalization [4], layer normalization [5], instance normalization [6], group normalization [7], weight normalization [8], cosine normalization [9], etc. Batch normalization is mostly preferred despite its dependence on batch size. Weight and cosine normalization normalize the network weights rather than the input.
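The structure of such a convolution module can be summarized with a short, hedged PyTorch sketch combining convolution, a ReLU activation, pooling with downsampling, and batch normalization; the channel counts and image size are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative convolution module combining the four operations described
# above: convolution, activation, pooling, and (batch) normalization.
conv_module = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, padding=1),  # convolution
    nn.ReLU(),                          # activation f(x) = max(0, x)
    nn.MaxPool2d(kernel_size=2),        # pooling coupled with downsampling
    nn.BatchNorm2d(16),                 # batch normalization
)

image = torch.rand(8, 1, 64, 64)        # e.g., a batch of grayscale cover images
features = conv_module(image)           # shape: (8, 16, 32, 32)
```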

Different types of deep learning models are available in the literature [10] with their share of merits and demerits as indicated in Table 1.

Table 1. Comparison of different deep learning models.

Model name and description | Merits | Demerits
Deep neural network: allows nonlinear relationships and is suitable for regression and classification problems. | Commonly used because of its high accuracy. | Learning is comparatively slow.
Convolutional neural network: transforms 2D data into a 3D feature map using convolutional filters. | Fast learning and improved performance. | Requires an increased amount of labeled data.
Recurrent neural network: has the ability to learn sequential events. | Time-related dependencies in sequential events can be effectively modeled. | Requires large datasets.
Deep Boltzmann machine: can be used to model undirected connections between the hidden layers. | Robust inference is possible even if the data is ambiguous. | Parameter optimization is not feasible for larger datasets.
Deep belief network: makes the hidden layer of each subnetwork visible to the next layer. | The likelihood of each layer can be maximized using a greedy approach. | Computationally complex initialization process.
Deep autoencoder: used in unsupervised learning for feature extraction and dimensionality reduction. | Does not require a labeled dataset. | Requires a pretraining step.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128195116000091

Deep reinforcement learning

Avraam Tsantekidis , ... Anastasios Tefas , in Deep Learning for Robot Perception and Cognition, 2022

6.1 Introduction

Most of the Deep Learning (DL) models discussed until now came to prominence by improving performance on two fundamental machine learning paradigms, namely supervised and unsupervised learning. As [1] notes, although the terms supervised and unsupervised learning seem to cover the whole spectrum of ML methods, that is not the case. Indeed, the most succinct and simple description of learning is the maximization of a reward signal through trial and error. This simple concept, which exists throughout nature as a mechanism that helps humans and animals survive, is systematically explored in the field of Reinforcement Learning (RL). The interaction with the environment and the feedback received shape our behavior depending on the reward signal we receive. When a positive reward is received, such as finding food and enjoying eating, the behavior (or policy) leading up to that feedback is reinforced. Similarly, when negative feedback is experienced, such as falling and hurting ourselves, the behavior that led to it is diminished or avoided in the future.

In some cases trial and error is used to describe how supervised learning works, where a model learns from its mistakes in predicting the desired output. This is very different from RL, since the error signal in supervised learning is generated from prior knowledge that must be extracted beforehand and encoded in the labels used. Although supervised learning from labeled data can generalize to previously unseen data, in RL an agent can continue learning from any new and unknown situation based on the reward signal alone. Unsupervised learning, which aims at discovering structure in unlabeled data, is also different from RL for a similar reason. Although learning structure might benefit an RL agent, the end goal is simply to maximize the reward. Along the way some structure of the observable state might be identified by the employed model, but this is merely a side effect of the learning process and not the end goal, as it is in unsupervised learning.

Reinforcement learning comprises the following basic components: the policy, the reward signal, and the value function. Some methods also include a model of the observable environment and are called model-based approaches. These methods allow RL agents to improve their policies without interacting with a real environment, thus enabling a form of planning based on the understanding of the environment. When an environment can be simulated adequately, without concern about the number of interactions from which an agent can learn, a model-free approach can be used. Fig. 6.1 shows a diagram of how RL components are used by different methods. In this chapter, we cover model-free approaches that utilize value-based and policy-based learning.

Figure 6.1

Figure 6.1. Components utilized in reinforcement learning and where known methods land among them. Diagram recreated from [2].

Both value-based and policy-based methods assign a value to the state of the environment, but value-based methods aim to select the actions that lead to higher-value states, while policy-based methods learn a policy directly, utilizing state values only indirectly. Value-based methods have been shown to be less sample-efficient and slower to train, while policy-based methods can get trapped in local minima because the policy itself is used to sample the environment. Both families of methods have been developed and augmented with techniques that improve upon their shortcomings, which we discuss later in this section.
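For concreteness, a value-based update can be sketched as tabular Q-learning, where the value of a state-action pair is nudged toward the observed reward plus the discounted value of the best next action; the state and action counts and the hyperparameters below are illustrative placeholders, not taken from the chapter.

```python
import numpy as np

# Minimal tabular Q-learning update (a value-based method).
n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99                 # learning rate, discount factor

def q_update(state, action, reward, next_state):
    # Move Q(s, a) toward the observed reward plus the discounted value of
    # the best action in the next state (the temporal-difference target).
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])

q_update(state=0, action=2, reward=1.0, next_state=3)
```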

In recent years, after the unprecedented success of deep learning in several machine learning tasks [3–8], both supervised and unsupervised, existing RL methods started to be adapted to the deep learning paradigm. Newly developed frameworks such as TensorFlow [9] and PyTorch [10], and the ever-growing hardware acceleration capabilities of Graphics Processing Units (GPUs) and Application-Specific Integrated Circuits (ASICs), have allowed for significant performance improvements in RL tasks. In this chapter, we explore the conventional RL methods and how deep learning was applied to augment them and remarkably improve their performance, solving tasks previously thought impossible for machine learning and even surpassing human performance in many of them, such as Atari games, chess [11], Go [12], and others.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323857871000117

Hardware Accelerator Systems for Artificial Intelligence and Machine Learning

Amitabh Biswal , ... Zakir Hussain , in Advances in Computers, 2021

2.5 Benefits of using deep learning approaches

A practical advantage of deep learning models is that they can help preserve user privacy. With today's focus on privacy in the age of the internet, more people are concerned about their data. Training can be done locally, and only the model parameters need to be sent to a centralized system for prediction, so users' data are never stored online, which helps ensure their security and privacy.

Space efficiency: The trained models kept online are generally much smaller than the matrices used in memory-based methods.

Prediction speed: Deep learning models predict much faster than other methods because prediction uses only the learned model rather than the whole database, which greatly reduces prediction time.

Feature extraction: Deep learning can perform tasks such as extracting features from audio and images, which can then be used to provide recommendations. It can also deal with nonnumerical data.

Accuracy: The prediction performance of deep learning is better than that of non-deep-learning methods.

Elimination of feature engineering: In machine learning, feature engineering is one of the important tasks requiring domain knowledge. A benefit of deep learning is the automatic detection of features and computation on them.

Elimination of costs: The use of deep learning eliminates some direct costs, such as the cost of labeling data. It can also detect defects that would otherwise be difficult to find.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S0065245821000139

Knowledge distillation

Nikolaos Passalis , ... Anastasios Tefas , in Deep Learning for Robot Perception and Cognition, 2022

8.1 Introduction

The increasing complexity of Deep Learning (DL) models has led to the development of a wide variety of methods for obtaining faster, more lightweight, and flexible models, ranging from quantization methods [1,2] and pruning techniques [3,4] to architectures that are lightweight by design, such as MobileNets [5]. All of these methods allow for developing smaller and faster models. However, this comes with some shortcomings. These smaller models are typically less accurate, leading to significant challenges for many critical applications where accuracy is as important as speed, for example, crowd detection and avoidance systems for autonomous unmanned aerial vehicles [6]. This phenomenon might sound like a reasonable trade-off, since it is expected that decreasing the number of parameters in a neural network would lead to lower accuracy. However, most (if not all) state-of-the-art DL architectures are largely overparameterized, and the additional parameters/complexity in many cases just help to discover better solutions, instead of merely increasing their representational capacity [7,8]. This means that in many cases the overparameterized DL models could potentially be "compressed" into smaller ones, if we had the appropriate tools for training them [7].

These observations fueled the interest for better understanding the learning dynamics of neural networks, as well as for developing more effective training methods that could mitigate the aforementioned phenomena. Among the most well-known methods for improving the effectiveness of the training process for DL models is knowledge distillation [9], also known as knowledge transfer [10] or neural network distillation [11]. These methods are capable of improving the effectiveness of the training process by transferring the knowledge encoded in a large and complex neural network into a smaller one. Typically, the larger model is called the teacher model, while the smaller one is called the student model, to highlight the similarity with the anthropocentric training approaches, where a teacher transfers its knowledge to a student using the most appropriate teaching techniques. In the case of neural networks, the knowledge encoded by the teacher model, which is also known as dark knowledge [12,13], describes various aspects regarding the inner workings of the larger models, which the student model should mimic.

Even though a solid theoretical underpinning of why and when knowledge distillation methods work has not been fully established yet [15], there are several explanations of why this process is so effective. First, knowledge distillation acts as a regularizer, encoding our prior knowledge and beliefs regarding the regions of the hypothesis space that should be considered. Therefore the teacher model steers the student toward a potentially better region of the optimization landscape, often leading to a better solution and improved generalization. Note that, as mentioned earlier, the overparameterization of the teacher model sometimes only helps to discover a better solution [7,8], thus improving the effective capacity of the teacher, even though the correct hypotheses can be described by a model with lower representational capacity.

Furthermore, knowledge distillation also has information-theoretic consequences, since the student model is trained using more information compared with regular training that relies only on the ground truth. This can be better understood with an example. Consider the case of a neural network trained to classify handwritten digits, as shown in Fig. 8.1. Regular supervised learning algorithms typically aim to maximize the response of the neuron that corresponds to the correct class, ignoring the fact that the similarities of each digit to the rest of the classes might also provide useful information to the model. On the other hand, knowledge distillation methods, such as distillation between the last (decision) layers of the networks [11], extract such information from the teacher model and then train the student model using both the ground truth labels and the implicit (dark) knowledge extracted from the teacher. Even though this dark knowledge can be provided in various forms, most methods employ either similarities or relations between samples and/or classes, as shown in Fig. 8.1.
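A hedged sketch of this last-layer distillation loss [11] is shown below: the student is trained on the ground-truth labels and, through a temperature-softened KL divergence, on the teacher's output distribution. The temperature, mixing weight, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of the classic distillation loss: a weighted sum of the usual
# cross entropy on hard labels and a KL divergence between the softened
# student and teacher output distributions.
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    hard_loss = F.cross_entropy(student_logits, labels)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                     # rescale gradients to the usual magnitude
    return alpha * hard_loss + (1.0 - alpha) * soft_loss

student_logits = torch.randn(16, 10, requires_grad=True)
teacher_logits = torch.randn(16, 10)   # produced by the (frozen) teacher
labels = torch.randint(0, 10, (16,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```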

Figure 8.1

Figure 8.1. The process of knowledge distillation. Instead of directly training a DL model with the ground-truth labels/annotations, we employ another model, which acts as a teacher, and provides additional implicit / dark knowledge (e.g., similarities between the training samples and the labels) to the student model. An image from the well-known MNIST data set was used in this figure [14].

The effectiveness of knowledge distillation methods has led to a very rich literature on the topic. This chapter aims to provide a brief introduction to knowledge distillation and to present some representative methods that will equip the reader with the necessary knowledge and tools to apply these methods in practice, as well as to follow this rapidly advancing field. The interested reader is referred to [16] for an extensive review of knowledge distillation methods, as well as applications that span other areas, such as defending neural networks against adversarial attacks and applications in natural language processing (NLP). The rest of this chapter is structured as follows. In Section 8.2, we present the seminal neural network distillation approach [11], which kick-started the field. Then we present a generalization of this approach (Section 8.3), which provides a general probabilistic view of knowledge distillation, allowing for going beyond classification tasks and overcoming some significant limitations that existed in earlier methods [10]. Note that both of these methods employ only one layer for transferring the knowledge. However, other methods have demonstrated that employing multiple layers of the teacher model allows for training the student networks even more effectively. Thus, in Section 8.4, we present multilayer knowledge distillation methods that employ multiple layers for the process of knowledge distillation, further improving its effectiveness. Finally, in Section 8.5 we present some more advanced ways to train the teacher model, allowing for more effective distillation methods, such as online distillation, which allows for simultaneously training both the teacher and student models, as well as self-distillation methods, which do not even use a separate teacher model and are capable of reusing the knowledge extracted from the student model itself.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780323857871000130

Deep Learning and Its Parallelization

X. Li , ... W. Zheng , in Big Data, 2016

Scalability of deep models

To train large-scale deep learning models faster, an important method is to accelerate the training process with distributed computing and powerful computing devices (e.g., clusters and GPUs). The existing approaches to parallel training include data parallelism, model parallelism, and data-model parallelism. However, when training large-scale deep models, each of them suffers from low efficiency because parameter synchronization requires frequent communication between different computing nodes (such as different server nodes in a distributed system, or heterogeneous computing between the CPU and GPUs). Additionally, the memory limitation of modern GPUs can also limit the scalability of deep networks. The big challenge is how to optimize and balance computation and communication workloads in large-scale deep learning networks.
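As a minimal illustration of data parallelism, the hedged PyTorch sketch below wraps a model in nn.DataParallel, which splits each batch across the available GPUs on a single node and synchronizes gradients implicitly; the model and batch sizes are illustrative assumptions, and this is not meant as a scalable multi-node solution.

```python
import torch
import torch.nn as nn

# Minimal single-node data-parallel sketch; sizes are illustrative.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
if torch.cuda.is_available():
    model = nn.DataParallel(model).cuda()   # replicate the model on every GPU

x = torch.randn(512, 1024)
y = torch.randint(0, 10, (512,))
if torch.cuda.is_available():
    x, y = x.cuda(), y.cuda()

loss = nn.functional.cross_entropy(model(x), y)  # outputs gathered from all replicas
loss.backward()
```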

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780128053942000040