diff --git a/Corpus/CORPUS.txt b/Corpus/CORPUS.txt index b7e7925..24b6584 100644 --- a/Corpus/CORPUS.txt +++ b/Corpus/CORPUS.txt @@ -8786,4 +8786,2257 @@ TABLE III. The Boston underground transportation system (MBTA) consists of N = 1 <>
+<> <> <>


+<> <> <>

 Efficient Processing of Deep Neural Networks: A Tutorial and Survey

 Vivienne Sze, Senior Member, IEEE, Yu-Hsin Chen, Student Member, IEEE, Tien-Ju Yang, Student Member, IEEE, Joel Emer, Fellow, IEEE

 (V. Sze, Y.-H. Chen and T.-J. Yang are with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA; e-mail: sze@mit.edu, yhchen@mit.edu, tjy@mit.edu. J. S. Emer is with the Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139 USA, and also with Nvidia Corporation, Westford, MA 01886 USA; e-mail: jsemer@mit.edu.)

 Abstract

 Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.

 This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.

 The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.

 I. INTRODUCTION

 Deep neural networks (DNNs) are currently the foundation for many modern artificial intelligence (AI) applications [1]. Since the breakthrough application of DNNs to speech recognition [2] and image recognition [3], the number of applications that use DNNs has exploded. These DNNs are employed in a myriad of applications from self-driving cars [4], to detecting cancer [5], to playing complex games [6]. In many of these domains, DNNs are now able to exceed human accuracy. The superior performance of DNNs comes from their ability to extract high-level features from raw sensory data by applying statistical learning over a large amount of data to obtain an effective representation of an input space. This is different from earlier approaches that use hand-crafted features or rules designed by experts.

 The superior accuracy of DNNs, however, comes at the cost of high computational complexity. While general-purpose compute engines, especially graphics processing units (GPUs), have been the mainstay for much DNN processing, increasingly there is interest in providing more specialized acceleration of the DNN computation. This article aims to provide an overview of DNNs, the various tools for understanding their behavior, and the techniques being explored to efficiently accelerate their computation.

 This paper is organized as follows:
 Section II provides background on the context of why DNNs are important, their history and applications.
 Section III gives an overview of the basic components of DNNs and popular DNN models currently in use.
 Section IV describes the various resources used for DNN research and development.
 Section V describes the various hardware platforms used to process DNNs and the various optimizations used to improve throughput and energy efficiency without impacting application accuracy (i.e., they produce bit-wise identical results).
 Section VI discusses how mixed-signal circuits and new memory technologies can be used for near-data processing to address the expensive data movement that dominates the throughput and energy consumption of DNNs.
 Section VII describes various joint algorithm and hardware optimizations that can be performed on DNNs to improve both throughput and energy efficiency while trying to minimize the impact on accuracy.
 Section VIII describes the key metrics that should be considered when comparing various DNN designs.

 II. BACKGROUND ON DEEP NEURAL NETWORKS (DNN)

 In this section, we describe the position of DNNs in the context of AI in general and some of the concepts that motivated their development. We will also present a brief chronology of the major steps in their history, and some current domains to which they are being applied.

 A. Artificial Intelligence and DNNs

 DNNs, also referred to as deep learning, are a part of the broad field of AI, which is the science and engineering of creating intelligent machines that have the ability to
 <
> + + Fig. 2. Connections to a neuron in the brain. <>,<>,<>, and b are the + activations, weights, non-linear function and bias, respectively. (Figure adopted + from [7].)Fig. 1. Deep Learning in the context of Artificial Intelligence. + + + to be10 14 to10 15 synapses in the average human brain. + achieve goals like humans do, according to John McCarthy, A key characteristic of the synapse is that it can scale the + the computer scientist who coined the term in the 1950s. signal (x_i) crossing it as shown in Fig. 2. That scaling factor + The relationship of deep learning to the whole of artificial can be referred to as a weight (<>), and the way the brain is + intelligence is illustrated in Fig. 1. believed to learn is through changes to the weights associated + Within artificial intelligence is a large sub-field called with the synapses. Thus, different weights result in different + machine learning, which was defined in 1959 by Arthur Samuel responses to an input. Note that learning is the adjustment + as the field of study that gives computers the ability to learn of the weights in response to a learning stimulus, while the + without being explicitly programmed. That means a single organization (what might be thought of as the program) of the + program, once created, will be able to learn how to do some brain does not change. This characteristic makes the brain an + intelligent activities outside the notion of programming. This is excellent inspiration for a machine-learning-style algorithm. + in contrast to purpose-built programs whose behavior is defined Within the brain-inspired computing paradigm there is a + by hand-crafted heuristics that explicitly and statically define subarea called spiking computing. In this subarea, inspiration + their behavior. is taken from the fact that the communication on the dendrites + The advantage of an effective machine learning algorithm and axons are spike-like pulses and that the information being + is clear. Instead of the laborious and hit-or-miss approach of conveyed is not just based on a spike’s amplitude. Instead, + creating a distinct, custom program to solve each individual it also depends on the time the pulse arrives and that the + problem in a domain, the single machine learning algorithm computation that happens in the neuron is a function of not just + simply needs to learn, via a processes called training, to handle a single value but the width of pulse and the timing relationship + each new problem. between different pulses. An example of a project that was + Within the machine learning field, there is an area that is inspired by the spiking of the brain is the IBM TrueNorth [8]. + often referred to as brain-inspired computation. Since the brain In contrast to spiking computing, another subarea of brain- + is currently the best ‘machine’ we know for learning and inspired computing is called neural networks, which is the + solving problems, it is a natural place to look for a machine focus of this article. 1 + learning approach. Therefore, a brain-inspired computation is + a program or algorithm that takes some aspects of its basic B. Neural Networks and Deep Neural Networks (DNNs) + form or functionality from the way the brain works. This is in Neural networks take their inspiration from the notion that + contrast to attempts to create a brain, but rather the program a neuron’s computation involves a weighted sum of the input + aims to emulate some aspects of how we understand the brain values. 
These weighted sums correspond to the value scaling + to operate. performed by the synapses and the combining of those values + Although scientists are still exploring the details of how the in the neuron. Furthermore, the neuron doesn’t just output that + brain works, it is generally believed that the main computational weighted sum, since the computation associated with a cascade + element of the brain is the neuron. There are approximately of neurons would then be a simple linear algebra operation. + 86 billion neurons in the average human brain. The neurons Instead there is a functional operation within the neuron that + themselves are connected together with a number of elements is performed on the combined inputs. This operation appears + entering them called dendrites and an element leaving them to be a non-linear function that causes a neuron to generate + called an axon as shown in Fig. 2. The neuron accepts the an output only if the inputs cross some threshold. Thus by + signals entering it via the dendrites, performs a computation on analogy, neural networks apply a non-linear function to the + those signals, and generates a signal on the axon. These input weighted sum of the input values. We look at what some of + and output signals are referred to as activations. The axon of those non-linear functions are in Section III-A1. + one neuron branches out and is connected to the dendrites of + many other neurons. The connections between a branch of the 1 Note: Recent work using TrueNorth in a stylized fashion allows it to be + used to compute reduced precision neural networks [9]. These types of neural axon and a dendrite is called asynapse. There are estimated networks are discussed in Section VII-A. 3 + + <
>

 Fig. 3. Simple neural network example and terminology (Figure adopted from [7]).

 <>

 Fig. 4. An example of backpropagation through a neural network: (a) compute the gradient of the loss relative to the filter inputs; (b) compute the gradient of the loss relative to the weights.

 <
> + + Fig. 3(a) shows a diagrammatic picture of a computational neural network. The neurons in the input layer receive some + values and propagate them to the neurons in the middle layer and is referred to as training the network. + + Once trained, the + of the network, which is also frequently called a ‘hidden program can perform its task by computing the output of + layer’. The weighted sums from one or more hidden layers are the network using the weights determined during the training + ultimately propagated to the output layer, which presents the process. Running the program with these weights is referred + final outputs of the network to the user. To align brain-inspired to as inference. + terminology with neural networks, the outputs of the neurons In this section, we will use image classification, as shown + are often referred to as activations, and the synapses are often in Fig. 6, as a driving example for training and using a DNN. + referred to as weights as shown in Fig. 3(a). We will use the When we perform inference using a DNN, we give an input + activation/weight nomenclature in this article. image and the output of the DNN is a vector of scores, one for + Fig. 3(b) shows an example of the computation at each each object class; the class with the highest score indicates the + most likely class of object in the image. The overarching goal layer: <>, where W_ij ,x_i and y_j are the for training a DNN is to determine the weights that maximize + weights, input activations and output activations, respectively, i=1 the score of the correct class and minimize the scores of the + and <> is a non-linear function described in SectionIII-A1. incorrect classes. When training the network the correct class + The bias term b is omitted from Fig. 3(b) for simplicity. is often known because it is given for the images used for + Within the domain of neural networks, there is an area called training (i.e., the training set of the network). The gap between + deep learning, in which the neural networks have more than the ideal correct scores and the scores computed by the DNN + three layers, i.e., more than one hidden layer. Today, the typical based on its current weights is referred to as theloss(L). + numbers of network layers used in deep learning range from Thus the goal of training DNNs is to find a set of weights to + five to more than a thousand. In this article, we will generally minimize the average loss over a large training set. + use the terminologydeep neural networks (DNNs)to refer to When training a network, the weights (wij ) are usually + the neural networks used in deep learning. updated using a hill-climbing optimization process called + DNNs are capable of learning high-level features with more gradient descent. A multiple of the gradient of the loss relative + complexity and abstraction than shallower neural networks. An to each weight, which is the partial derivative of the loss with + example that demonstrates this point is using DNNs to process respect to the weight, is used to update the weight (i.e., updated + visual data. In these applications, pixels of an image are fed into <>, where <> is called the learning rate). + Note <> the first layer of a DNN, and the outputs of that layer can be that this gradient indicates how the weights should change in ij + + interpreted as representing the presence of different low-level order to reduce the loss. The process is repeated iteratively to + features in the image, such as lines and edges. 
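To make the per-layer computation and the weight-update rule described above concrete, the following is a minimal NumPy sketch (not the paper's implementation) of a single fully-connected layer with a ReLU non-linearity: a forward pass computing y_j = f(sum_i W_ij x_i + b_j), the two backpropagation steps of footnote 2 (gradient with respect to the weights from the forward activations, gradient with respect to the inputs from the weights), and one gradient-descent update. The toy squared-error loss and the layer sizes are illustrative assumptions.

import numpy as np

def fc_forward(x, W, b):
    # y_j = f(sum_i W_ij * x_i + b_j), with f = ReLU
    z = W @ x + b
    return np.maximum(z, 0.0), z

def fc_backward(x, W, z, dLdy):
    # Backpropagation (chain rule) through the layer:
    # gradient w.r.t. weights uses the forward activations x,
    # gradient w.r.t. inputs uses the weights W.
    dLdz = dLdy * (z > 0)            # derivative of ReLU
    dLdW = np.outer(dLdz, x)         # dL/dW_ij = dL/dz_j * x_i
    dLdb = dLdz
    dLdx = W.T @ dLdz                # passed backwards to the previous layer
    return dLdW, dLdb, dLdx

# One gradient-descent step: W <- W - alpha * dL/dW
x = np.random.randn(4)               # input activations
W = np.random.randn(3, 4)            # weights (3 outputs, 4 inputs)
b = np.zeros(3)
target = np.array([1.0, 0.0, 0.0])   # ideal scores for the correct class

y, z = fc_forward(x, W, b)
loss = 0.5 * np.sum((y - target) ** 2)      # a simple squared-error loss L
dLdW, dLdb, _ = fc_backward(x, W, z, dLdy=(y - target))
alpha = 0.01                                # learning rate
W -= alpha * dLdW
b -= alpha * dLdb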
At subsequent reduce the overall loss. + layers, these features are then combined into a measure of the An efficient way to compute the partial derivatives of + likely presence of higher level features, e.g., lines are combined the gradient is through a process called backpropagation. + into shapes, which are further combined into sets of shapes. Backpropagation, which is a computation derived from the + And finally, given all this information, the network provides a chain rule of calculus, operates by passing values backwards + probability that these high-level features comprise a particular through the network to compute how the loss is affected by + object or scene. This deep feature hierarchy enables DNNs to each weight. + achieve superior performance in many tasks. This backpropagation computation is, in fact, very similar + in form to the computation used for inference as shown in Fig. 4 [10]. 2 Thus, techniques for efficiently performing + + C. Inference versus Training + + Since DNNs are an instance of a machine learning algorithm, 2 To backpropagate through each filter: (1) compute the gradient of the loss + the basic program does not change as it learns to perform its relative to the weights from the filter inputs (i.e., the forward activations) and + given tasks. In the specific case of DNNs, this learning involves the gradients of the loss relative to the filter outputs; (2) compute the gradient + of the loss relative to the filter inputs from the filter weights and the gradients determining the value of the weights (and bias) in the network, of the loss relative to the filter outputs. 4 + + + inference can sometimes be useful for performing training. DNN Timeline + It is, however, important to note a couple of points. First, + backpropagation requires intermediate outputs of the network 1940s - Neural networks were proposed + to be preserved for the backwards computation, thus training 1960s - Deep neural networks were proposed + has increased storage requirements. Second, due to the gradients 1989 - Neural networks for recognizing digits (LeNet) + use for hill-climbing, the precision requirement for training 1990s - Hardware for shallow neural nets (Intel ETANN) + is generally higher than inference. Thus many of the reduced 2011 - Breakthrough DNN-based speech recognition + (Microsoft)precision techniques discussed in Section VII are limited to + inference only. 2012 - DNNs for vision start supplanting hand-crafted + approaches (AlexNet)A variety of techniques are used to improve the efficiency + and robustness of training. For example, often the loss from 2014+ - Rise of DNN accelerator research (Neuflow, + DianNao...)multiple sets of input data, i.e., abatch, are collected before a + single pass of weight update is performed; this helps to speed Fig. 5. A concise history of neural networks. ’Deep’ refers to the number of + up and stabilize the training process. layers in the network. + There are multiple ways to train the weights. The most + common approach, as described above, is called supervised + learning, where all the training samples are labeled (e.g., with amount of available information to train the networks. To learn + the correct class).Unsupervised learning is another approach a powerful representation (rather than using a hand-crafted + where all the training samples are not labeled and essentially approach) requires a large amount of training data. 
For example, + the goal is to find the structure or clusters in the data.Semi- Facebook receives over 350 millions images per day, Walmart + supervised learning falls in between the two approaches where creates 2.5 Petabytes of customer data hourly and YouTube + only a small subset of the training data is labeled (e.g., use has 300 hours of video uploaded every minute. As a result, + unlabeled data to define the cluster boundaries, and use the the cloud providers and many businesses have a huge amount + small amount of labeled data to label the clusters). Finally, of data to train their algorithms. + reinforcement learning can be used to the train weights such The second factor is the amount of compute capacity + that given the state of the current environment, the DNN can available. Semiconductor device and computer architecture + output what action the agent should take next to maximize advances have continued to provide increased computing + expected rewards; however, the rewards might not be available capability, and we appear to have crossed a threshold where the + immediately after an action, but instead only after a series of large amount of weighted sum computation in DNNs, which + actions. is required for both inference and training, can be performed + Another commonly used approach to determine weights is in a reasonable amount of time. + fine-tuning, where previously-trained weights are available and The successes of these early DNN applications opened the + are used as a starting point and then those weights are adjusted floodgates of algorithmic development. It has also inspired the + for a new dataset (e.g., transfer learning) or for a new constraint development of several (largely open source) frameworks that + (e.g., reduced precision). This results in faster training than make it even easier for researchers and practitioners to explore + starting from a random starting point, and can sometimes result and use DNNs. Combining these efforts contributes to the third + in better accuracy. factor, which is the evolution of the algorithmic techniques that + This article will focus on the efficient processing of DNN have improved application accuracy significantly and broadened + inference rather than training, since DNN inference is often the domains to which DNNs are being applied. + performed on embedded devices (rather than the cloud) where An excellent example of the successes in deep learning can + resources are limited as discussed in more details later. be illustrated with the ImageNet Challenge [14]. This challenge + is a contest involving several different components. One of the + components is an image classification task where algorithmsD. Development History are given an image and they must identify what is in the image,Although neural nets were proposed in the 1940s, the first as shown in Fig. 6. The training set consists of 1.2 millionpractical application employing multiple digital neurons didn’t images, each of which is labeled with one of 1000 objectappear until the late 1980s with the LeNet network for hand- categories that the image contains. For the evaluation phase,written digit recognition [11]3 . Such systems are widely used the algorithm must accurately identify objects in a test set ofby ATMs for digit recognition on checks. However, the early images, which it hasn’t previously seen.2010s have seen a blossoming of DNN-based applications with Fig. 
7 shows the performance of the best entrants in thehighlights such as Microsoft’s speech recognition system in ImageNet contest over a number of years. One sees that 2011 [2] and the AlexNet system for image recognition in the accuracy of the algorithms initially had an error rate2012 [3]. A brief chronology of deep learning is shown in of 25% or more. In 2012, a group from the University ofFig. 5. Toronto used graphics processing units (GPUs) for their highThe deep learning successes of the early 2010s are believed compute capability and a deep neural network approach, namedto be a confluence of three factors. The first factor is the AlexNet, and dropped the error rate by approximately 10% [3]. + Their accomplishment inspired an outpouring of deep learning In the early 1960s, single analog neuron systems were used for adaptive + style algorithms that have resulted in a steady stream of filtering [12, 13]. 5 + + Speech and LanguageDNNs have significantly improved + the accuracy of speech recognition [21] as well as many + related tasks such as machine translation [2], natural + language processing [22], and audio generation [23]. Machines Learning + MedicalDNNs have played an important role in genomic + to gain insight into the genetics of diseases such as autism, + cancers, and spinal muscular atrophy [24–27]. + <
> They have also been used in medical imaging to detect skin cancer [5], + brain cancer [28] and breast cancer [29]. + Fig. 6. Example of an image classification task. + + The machine learning platform takes in an image and outputs the confidence scores for a predefined set of classes. + Game PlayRecently, many of the grand AI challenges + involving game play have been overcome using DNNs. + These successes also required innovations in training + techniques and many rely on reinforcement learning [30]. + DNNs have surpassed human level accuracy in playing + Atari [31] as well as Go [6], where an exhaustive search + of all possibilities is not feasible due to the unimaginably + huge number of possible moves. + RoboticsDNNs have been successful in the domain of + <
> robotic tasks such as grasping with a robotic arm [32], + motion planning for ground robots [33], visual navigation [4,34], control to stabilize a quadcopter [35] and + Fig. 7. Results from the ImageNet Challenge [14]. driving strategies for autonomous vehicles [36]. + + DNNs are already widely used in multimedia applications + today (e.g., computer vision, speech recognition). Looking + improvements. forward, we expect that DNNs will likely play an increasingly + In conjunction with the trend to deep learning approaches important role in the medical and robotics fields, as discussed + for the ImageNet Challenge, there has been a corresponding above, as well as finance (e.g., for trading, energy forecasting, + increase in the number of entrants using GPUs. From 2012 and risk assessment), infrastructure (e.g., structural safety, and + when only 4 entrants used GPUs to 2014 when almost all traffic control), weather forecasting and event detection [37]. + the entrants (110) were using them. This reflects the almost The myriad application domains pose new challenges to the + complete switch from traditional computer vision approaches efficient processing of DNNs; the solutions then have to be + to deep learning-based approaches for the competition. adaptive and scalable in order to handle the new and varied + In 2015, the ImageNet winning entry, ResNet [15], exceeded forms of DNNs that these applications may employ. + human-level accuracy with a top-5 error rate 4 below 5%. Since + then, the error rate has dropped below 3% and more focus F. Embedded versus Cloud + is now being placed on more challenging components of the The various applications and aspects of DNN processing competition, such as object detection and localization. These (i.e., training versus inference) have different computational successes are clearly a contributing factor to the wide range needs. Specifically, training often requires a large dataset 5 and of applications to which DNNs are being applied. + significant computational resources for multiple weight-update + iterations. In many cases, training a DNN model still takes several hours to multiple days and thus is typically performed + + E. Applications of DNN + + Many applications can benefit from DNNs ranging from in the cloud. Inference, on the other hand, can happen either + multimedia to medical space. In this section, we will provide in the cloud or at the edge (e.g., IoT or mobile). + examples of areas where DNNs are currently making an impact In many applications, it is desirable to have the DNN + and highlight emerging areas where DNNs hope to make an inference processing near the sensor. For instance, in computer + impact in the future. vision applications, such as measuring wait times in stores + Image and VideoVideo is arguably the biggest of the or predicting traffic patterns, it would be desirable to extract + big data. It accounts for over 70% of today’s Internet meaningful information from the video right at the image + traffic [16]. For instance, over 800 million hours of video sensor rather than in the cloud to reduce the communication + is collected daily worldwide for video surveillance [17]. cost. For other applications such as autonomous vehicles, + Computer vision is necessary to extract meaningful infor- drone navigation and robotics, local processing is desired since + mation from video. 
DNNs have significantly improved the the latency and security risks of relying on the cloud are + accuracy of many computer vision tasks such as image too high. However, video involves a large amount of data, + classification [14], object localization and detection [18], which is computationally complex to process; thus, low cost + image segmentation [19], and action recognition [20]. hardware to analyze video is challenging yet critical to enabling + + 4 The top-5 error rate is measured based on whether the correct answer 5 One of the major drawbacks of DNNs is their need for large datasets to + appears in one of the top 5 categories selected by the algorithm. prevent over-fitting during training. 6 + + + attention has been given to hardware acceleration specifically Feed Forward Recurrent Fully-Connected Sparsely-Connected for RNNs. + DNNs can be composed solely offully-connected(FC) + layers (also referred to as multi-layer perceptrons, or MLP) + as shown in the leftmost layer of Fig. 8(b). In a FC layer, + all output activations are composed of a weighted sum of + all input activations (i.e., all outputs are connected to all + inputs). This requires a significant amount of storage and + Thankfully, in many applications, we can remove current) networks some connections between the activations by setting the weights + to zero without affecting accuracy. This results in a sparsely connected layer. A sparsely connected layer is illustrated in + the rightmost layer of Fig. 8(b).these applications. Speech recognition enables us to seamlessly We can also make the computation more efficient by limitinginteract with electronic devices, such as smartphones. While the number of weights that contribute to an output. This sort ofcurrently most of the processing for applications such as Apple structured sparsity can arise if each output is only a functionSiri and Amazon Alexa voice services is in the cloud, it is of a fixed-size window of inputs. Even further efficiency canstill desirable to perform the recognition on the device itself to be gained if the same set of weights are used in the calculationreduce latency and dependency on connectivity, and to improve of every output. This repeated use of the same weight values is privacy and security. calledweight sharingand can significantly reduce the storageMany of the embedded platforms that perform DNN infer- requirements for weights.ence have stringent energy consumption, compute and memory An extremely popular windowed and weight-shared DNNcost limitations; efficient processing of DNNs have thus become layer arises by structuring the computation as a convolution,of prime importance under these constraints. Therefore, in this as shown in Fig. 9(a), where the weighted sum for each outputarticle, we will focus on the compute requirements for inference activation is computed using only a small neighborhood of inputrather than training. activations (i.e., all weights beyond beyond the neighborhood + are set to zero), and where the same set of weights are shared for + every output (i.e., the filter is space invariant). Such convolution- + + III. OVERVIEW OF DNN'S + + DNNs come in a wide variety of shapes and sizes depending based layers are referred to as convolutional (CONV) layers. + on the application. The popular shapes and sizes are also + evolving rapidly to improve accuracy and efficiency. In all A. 
Convolutional Neural Networks (CNNs)cases, the input to a DNN is a set of values representing the A common form of DNNs isConvolutional Neural Netsinformation to be analyzed by the network. For instance, these (CNNs), which are composed of multiple CONV layers asvalues can be pixels of an image, sampled amplitudes of an shown in Fig. 10. In such networks, each layer generates aaudio wave or the numerical representation of the state of some successively higher-level abstraction of the input data, calledsystem or game. afeature map(fmap), which preserves essential yet uniqueThe networks that process the input come in two major information. Modern CNNs are able to achieve superior per-forms: feed forward and recurrent as shown in Fig. 8(a). In formance by employing a very deep hierarchy of layers. CNNfeed-forward networks all of the computation is performed as a are widely used in a variety of applications including imagesequence of operations on the outputs of a previous layer. The understanding [3], speech recognition [39], game play [6],final set of operations generates the output of the network, for robotics [32], etc. This paper will focus on its use in imageexample a probability that an image contains a particular object, processing, specifically for the task of image classification [3].the probability that an audio sequence contains a particular Each of the CONV layers in CNN is primarily composed ofword, a bounding box in an image around an object or the high-dimensional convolutions as shown in Fig. 9(b). In thisproposed action that should be taken. In such DNNs, the computation, the input activations of a layer are structured asnetwork has no memory and the output for an input is always a set of 2-Dinput feature maps(ifmaps), each of which isthe same irrespective of the sequence of inputs previously given called achannel. Each channel is convolved with a distinctto the network. 2-D filter from the stack of filters, one for each channel; thisIn contrast, recurrent neural networks (RNNs), of which stack of 2-D filters is often referred to as a single 3-D filter.Long Short-Term Memory networks (LSTMs) [38] are a The results of the convolution at each point are summed acrosspopular variant, have internal memory to allow long-term all the channels. In addition, a 1-D bias can be added to thedependencies to affect the output. In these networks, some filtering results, but some recent networks [15] remove itsintermediate operations generate values that are stored internally usage from parts of the layers. The result of this computationin the network and used as inputs to other operations in is the output activations that comprise one channel ofoutputconjunction with the processing of a later input. In this article, feature map(ofmap). Additional 3-D filters can be used onwe will focus on feed-forward networks since (1) the major + computation in RNNs is still the weighted sum, which is 6 Note: the structured sparsity in CONV layers is orthogonal to the sparsity covered by the feed-forward networks, and (2) to-date little that occurs from network pruning as described in Section VII-B2. 7 + + after the CONV layers for classification purposes. A FC layer Fully + Connected also applies filters on the ifmaps as in the CONV layers, but + × × the filters are of the same size as the ifmaps. Therefore, it + does not have the weight sharing property of CONV layers. Optional + Eq. (1) still holds for the computation of FC layers with a + Fig. 10. Convolutional Neural Networks. 
few additional constraints on the shape parameters: <>, + <>,<>, and <>. + In addition to CONV and FC layers, various optional layers + the same input to create additional output channels. Finally, can be found in a DNN such as the non-linearity, pooling, + multiple input feature maps may be processed together as a and normalization. The function and computations for each of + batchto potentially improve reuse of the filter weights. these layers are discussed next. + Given the shape parameters in Table I, the computation of 1) Non-Linearity:A non-linear activation function is typically + applied after each CONV or FC layer. Various non-linear + functions are used to introduce non-linearity into the DNN as + shown in Fig. 11. These include historically conventional non- <> + <> linear functions such as sigmoid or hyperbolic tangent as well + <> as rectified linear unit (ReLU) [40], which has become popular + <>; in recent years due to its simplicity and its ability to enable + <>; fast training. Variations of ReLU, such as leaky ReLU [41], (1) parametric ReLU [42], + and exponential LU [43] have also been O,I,W and B are the matrices of the of_maps, if_maps, filters explored for improved accuracy. + Finally, a non-linearity called and biases, respectively.Uis a given stride size. Fig. 9(b) maxout, which takes the max value of two intersecting linear shows a visualization of this computation (ignoring biases). + functions, has shown to be effective in speech recognition To align the terminology of CNNs with the generic DNN, tasks [44, 45]. + filters are composed of weights (i.e., synapses) 2) Pooling: A variety of computations that reduce the + input and output feature maps (if_maps, of_maps) are dimensionality of a feature map are referred to as pooling. + composed of activations (i.e., input and output neurons) Pooling, which is applied to each channel separately, enables + DNN is run only once), which is more consistent with what + would likely be deployed in real-time and/or energy-constrained + LeNet[11] was one of the first CNN approaches introduced + in 1989. It was designed for the task of digit classification in + <
> grayscale images of size 28x28. The most well known version, + LeNet-5, contains two CONV layers and two FC layers [48]. + Fig. 12. Various forms of pooling (Figure adopted from Caffe Tutorial [46]). Each CONV layer uses filters of size 5x5 (1 channel per filter) + with 6 filters in the first layer and 16 filters in the second layer. + the network to be robust and invariant to small shifts and Average pooling of 2x2 is used after each convolution and a + distortions. Pooling combines, or pools, a set of values in sigmoid is used for the non-linearity. In total, LeNet requires + its receptive field into a smaller number of values. It can be 60k weights and 341k multiply-and-accumulates (MACs) per + configured based on the size of its receptive field (e.g., 2x2) image. LeNet led to CNNs’ first commercial success, as it was + and pooling operation (e.g., max or average), as shown in deployed in ATMs to recognize digits for check deposits. + Fig. 12. Typically pooling occurs on non-overlapping blocks AlexNet[3] was the first CNN to win the ImageNet Challenge + (i.e., the stride is equal to the size of the pooling). Usually a in 2012. It consists of five CONV layers and three FC layers. + stride of greater than one is used such that there is a reduction Within each CONV layer, there are 96 to 384 filters and the + in the dimension of the representation (i.e., feature map). filter size ranges from 3x3 to 11x11, with 3 to 256 channels + 3) Normalization:Controlling the input distribution across each. In the first layer, the 3 channels of the filter correspond + layers can help to significantly speed up training and improve to the red, green and blue components of the input image. + accuracy. Accordingly, the distribution of the layer input A ReLU non-linearity is used in each layer. Max pooling of + activations <> are normalized such that it has a zero mean 3x3 is applied to the outputs of layers 1, 2 and 5. To reduce + and a unit standard deviation. In batch normalization (BN), computation, a stride of 4 is used at the first layer of the + the normalized value is further scaled and shifted, as shown network. AlexNet introduced the use of LRN in layers 1 and + in Eq. (2), where the parameters <> are learned from 2 before the max pooling, though LRN is no longer popular + training [47].X is a small constant to avoid numerical problems. in later CNN models. One important factor that differentiates + Prior to this, local response normalization (LRN) [3] was AlexNet from LeNet is that the number of weights is much + used, which was inspired by lateral inhibition in neurobiology larger and the shapes vary from layer to layer. To reduce the + where excited neurons (i.e., high value activations) should amount of weights and computation in the second CONV layer, + subdue its neighbors (i.e., cause low value activations); however, the 96 output channels of the first layer are split into two groups + BN is now considered standard practice in the design of of 48 input channels for the second layer, such that the filters in + CNNs while LRN is mostly deprecated. Note that while LRN the second layer only have 48 channels. Similarly, the weights + usually is performed after the non-linear function, BN is mostly in fourth and fifth layer are also split into two groups. In total, + performed between the CONV or FC layer and the non-linear AlexNet requires 61M weights and 724M MACs to process + one 227x227 input image. 
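To make the CONV layer computation of Eq. (1) concrete, the sketch below is a deliberately naive loop-nest implementation of the high-dimensional convolution of Fig. 9(b). It is only an illustration under assumed shape-parameter names (N input fmaps in a batch, M output channels/filters, C input channels, H×H ifmaps, R×S filters, E×F ofmaps, stride U, no padding), without any of the optimizations discussed later in the article.

import numpy as np

def conv_layer(I, W, B, U=1):
    """Naive CONV layer:
    O[n][m][e][f] = B[m] + sum_{c,r,s} I[n][c][U*e+r][U*f+s] * W[m][c][r][s]."""
    N, C, H, _ = I.shape          # batch, input channels, ifmap height/width
    M, _, R, S = W.shape          # filters (output channels), filter height/width
    E = (H - R) // U + 1          # ofmap height (no padding)
    F = (H - S) // U + 1          # ofmap width
    O = np.zeros((N, M, E, F))
    for n in range(N):            # input fmaps in the batch
        for m in range(M):        # output channels, one per 3-D filter
            for e in range(E):
                for f in range(F):
                    acc = B[m]
                    for c in range(C):          # sum across input channels
                        for r in range(R):
                            for s in range(S):
                                acc += I[n, c, U*e + r, U*f + s] * W[m, c, r, s]
                    O[n, m, e, f] = acc
    return O

# Example shapes: one ifmap with 3 channels of 8x8, six 3x3 filters, stride 1
O = conv_layer(np.random.randn(1, 3, 8, 8), np.random.randn(6, 3, 3, 3), np.zeros(6))
print(O.shape)   # (1, 6, 6, 6)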
+ Overfeat[49] has a very similar architecture to AlexNet with + <> (2) five CONV layers and three FC layers. The main differences <> + are that the number of filters is increased for layers 3 (384 + to 512), 4 (384 to 1024), and 5 (256 to 1024), layer 2 is notB. Popular DNN Models + split into two groups, the first fully connected layer only has + Many DNN models have been developed over the past 3072 channels rather than 4096, and the input size is 231x231 + two decades. Each of these models has a different ‘network rather than 227x227. As a result, the number of weights grows + architecture’ in terms of number of layers, layer types, layer to 146M and the number of MACs grows to 2.8G per image. + shapes (i.e., filter size, number of channels and filters), and Overfeat has two different models: fast (described here) and + connections between layers. Understanding these variations accurate. The accurate model used in the ImageNet Challenge + and trends is important for incorporating the right flexibility gives a 0.65% lower top-5 error rate than the fast model at the + in any efficient DNN engine. cost of 1.9% more MACs + In this section, we will give an overview of various popular VGG-16[50] goes deeper to 16 layers consisting of 13 + DNNs such as LeNet [48] as well as those that competed in CONV layers and 3 FC layers. In order to balance out the + and/or won the ImageNet Challenge [14] as shown in Fig. 7, cost of going deeper, larger filters (e.g., 5x5) are built from + most of whose models with pre-trained weights are publicly multiple smaller filters (e.g., 3x3), which have fewer weights, + available for download; the DNN models are summarized in to achieve the same receptive fields as shown in Fig. 13(a). + Table II. Two results for top-5 error results are reported. In the As a result, all CONV layers have the same filter size of 3x3. + first row, the accuracy is boosted by using multiple crops from In total, VGG-16 requires 138M weights and 15.5G MACs + the image and an ensemble of multiple trained models (i.e., to process one 224224 input image. VGG has two different + the DNN needs to be run several times); these results were models: VGG-16 (described here) and VGG-19. VGG-19 gives + used to compete in the ImageNet Challenge. The second row a 0.1% lower top-5 error rate than VGG-16 at the cost of + reports the accuracy if only a single crop was used (i.e., the 1.27more MACs. 9 + + + <
> <
> + + Fig. 13. Decomposing larger filters into smaller filters. Fig. 14. Inception module from GoogleNet [51] with example channel lengths. + + + GoogLeNet[51] goes even deeper with 22 layers. It in- + troduced an inception module, shown in Fig. 14, which is + composed of parallel connections, whereas previously there + was only a single serial connection. Different sized filters (i.e., + 1x1, 3x3, 5x5), along with 3x3 max-pooling, are used for + each parallel connection and their outputs are concatenated + for the module output. Using multiple filter sizes has the + effect of processing the input at multiple scales. For improved + training speed, GoogLeNet is designed such that the weights + ReLU and the activations, which are stored for backpropagation during <> + training, could all fit into the GPU memory. In order to reduce + the number of weights, 1x1 filters are applied as a ‘bottleneck’ + to reduce the number of channels for each filter [52]. The 22 + layers consist of three CONV layers, followed by 9 inceptions + layers (each of which are two CONV layers deep), and one FC + layer. Since its introduction in 2014, GoogleNet (also referred <
> + to as Inception) has multiple versions: v1 (described here), v3 7 + smaller 1-D filters as shown in Fig. 13(b) to reduce number Fig. 15. Shortcut module from ResNet [15]. + Note that ReLU following last + of MACs and weights in order to go deeper to 42 layers. CONV layer in short cut is after the addition. + In conjunction with batch normalization [47], v3 achieves + over 3% lower top-5 error than v1 with 2.5% increase in is used. This is similar to the LSTM networks that are used for computation [53]. + Inception-v4 uses residual connections [54], sequential data. ResNet also uses the ‘bottleneck’ approach of described in the next section, + for a 0.4% reduction in error. using 1x1 filters to reduce the number of weight parameters.ResNet[15], also known as Residual Net, uses residual + As a result, the two layers in the shortcut module are replace d connections to go even deeper (34 layers or more). It was by three layers (1x1, 3x3, 1x1) where the 1x1 reduces and + the first entry DNN in ImageNet Challenge that exceeded then increases (restores) the number of weights. ResNet-50human-level accuracy with a top-5 error rate below 5%. + One consists of one CONV layer, followed by 16 shortcut layers of the challenges with deep networks is the vanishing gradient (each of which are three CONV layers deep), and one FC + during training: as the error backpropagates through the network layer; it requires 25.5M weights and 3.9G MACs per image.the gradient shrinks, which affects the ability to update the There are various versions of ResNet with multiple depths + weights in the earlier layers for very deep networks. Residual (e.g.,without bottleneck:18, 34;with bottleneck:50, 101, 152).net introduces a ‘shortcut’ module which contains an identity The ResNet with 152 layers was the winner of the ImageNet + connection such that the weight layers (i.e., CONV layers) Challenge requiring 11.3G MACs and 60M weights. Compared can be skipped as shown in Fig. 15. Rather than learning the to ResNet-50, it reduces the top-5 error by around 1% at the + function for the weight layersF(x), the shortcut module learns cost of 2.9% more MACs and 2.5% more weights.the residual mapping <>. Initially, <> is + zero and the identity connection is taken; then gradually during Several trends can be observed in the popular DNNs shown + training, the actual forward connection through the weight layer in Table II. Increasing the depth of the network tends to provide + higher accuracy. Controlling for number of weights, a deeper + 7 v2 is very similar to v3. network can support a wider range of non-linear functions + + + that are more discriminative and also provides more levels B. Models + of hierarchy in the learned representation [15,50,51,55]. Pretrained DNN models can be downloaded from variousThe number of filter shapes continues to vary across layers, websites [56–59] for the various different frameworks. It shouldthus flexibility is still important. Furthermore, most of the be noted that even for the same DNN (e.g., AlexNet) thecomputation has been placed on CONV layers rather than FC accuracy of these models can vary by around 1% to 2%layers. In addition, the number of weights in the FC layers is depending on how the model was trained, and thus the resultsreduced and in most recent networks (since GoogLeNet) the do not always exactly match the original publication.CONV layers also dominate in terms of weights. 
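As a concrete illustration of the ResNet shortcut module described above, the sketch below builds a bottleneck residual block in NumPy: a 1x1 CONV reduces the number of channels, a 3x3 CONV follows, a 1x1 CONV restores the channels, and the identity connection x is added before the final ReLU (matching the note in Fig. 15). The channel sizes and plain NumPy helpers are illustrative assumptions, not the published implementation, which also includes batch normalization.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1x1(x, W):
    # x: (C, H, W) fmap, W: (M, C) -> per-pixel mixing across channels
    return np.tensordot(W, x, axes=([1], [0]))

def conv3x3(x, W):
    # x: (C, H, W), W: (M, C, 3, 3); zero padding of 1 keeps the fmap size
    C, H, Wd = x.shape
    M = W.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((M, H, Wd))
    for m in range(M):
        for i in range(H):
            for j in range(Wd):
                out[m, i, j] = np.sum(W[m] * xp[:, i:i+3, j:j+3])
    return out

def bottleneck_block(x, W_reduce, W_3x3, W_restore):
    # F(x): 1x1 reduce -> 3x3 -> 1x1 restore, then H(x) = F(x) + x
    f = relu(conv1x1(x, W_reduce))
    f = relu(conv3x3(f, W_3x3))
    f = conv1x1(f, W_restore)
    return relu(f + x)            # ReLU applied after the addition

x = np.random.randn(64, 14, 14)                  # 64-channel input fmap
y = bottleneck_block(x,
                     np.random.randn(16, 64),     # reduce 64 -> 16 channels
                     np.random.randn(16, 16, 3, 3),
                     np.random.randn(64, 16))     # restore 16 -> 64 channels
print(y.shape)                                    # (64, 14, 14)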
Thus, the + focus of hardware implementations should be on addressing + the efficiency of the CONV layers, which in many domains C. Popular Datasets for Classification + are increasingly important. It is important to factor in the difficulty of the task when + comparing different DNN models. For instance, the task of + IV. DNN DEVELOPMENT RESOURCES classifying handwritten digits from the MNIST dataset [62] + is much simpler than classifying an object into one of 1000 + One of the key factors that has enabled the rapid development classes as is required for the ImageNet dataset [14](Fig. 16). + of DNNs is the set of development resources that have been It is expected that the size of the DNNs (i.e., number ofmade available by the research community and industry. + These weights) and the number of MACs will be larger for the moreresources are also key to the development of DNN accelerators difficult task than the simpler task and thus + require moreby providing characterizations of the workloads and facilitating energy and have lower throughput. For instance, LeNet-5[48]the exploration of trade-offs in + model complexity and accuracy. is designed for digit classification, while AlexNet[3], VGG-This section will describe these resources such that those who 16[50], GoogLeNet[51], + and ResNet[15] are designed for theare interested in this field can quickly get started. + There are many AI tasks that come with publicly availableA. Frameworks + For ease of DNN development and to enable sharing of Public datasets are important for comparing the accuracy of + trained networks, several deep learning frameworks have been different approaches. The simplest and most common task + developed from various sources. These open source libraries is image classification, which involves being given an entire + contain software libraries for DNNs. Caffe was made available image, and selecting 1 of N classes that the image most likely + in 2014 from UC Berkeley [46]. It supports C, C++, Python belongs to. There is no localization or detection. + and MATLAB. Tensorflow was released by Google in 2015, MNISTis a widely used dataset for digit classification + and supports C++ and python; it also supports multiple CPUs that was introduced in 1998 [62]. It consists of 2828 pixel + and GPUs and has more flexibility than Caffe, with the grayscale images of handwritten digits. There are 10 classes + computation expressed as dataflow graphs to manage the (for 10 digits) and 60,000 training images and 10,000 test + tensors (multidimensional arrays). Another popular framework images. LeNet-5 was able to achieve an accuracy of 99.05% + is Torch, which was developed by Facebook and NYU and when MNIST was first introduced. Since then the accuracy has + supports C, C++ and Lua. There are several other frameworks increased to 99.79% using regularization of neural networks + such as Theano, MXNet, CNTK, which are described in [60]. with dropconnect [63]. Thus, MNIST is now considered a fairly + There are also higher-level libraries that can run on top of easy dataset. + the aforementioned frameworks to provide a more universal CIFARis a dataset that consists of 3232 pixel colored + experience and faster development. One example of such images of of various objects, which was released in 2009 [64]. + libraries is Keras, which is written in Python and supports CIFAR is a subset of the 80 million Tiny Image dataset [65]. + Tensorflow, CNTK and Theano. CIFAR-10 is composed of 10 mutually exclusive classes. 
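As a quick way to inspect the classification datasets just described, the snippet below loads MNIST and CIFAR-10 through one of the frameworks mentioned above; it assumes a TensorFlow installation with the bundled Keras dataset loaders, and other frameworks provide similar utilities.

from tensorflow.keras.datasets import mnist, cifar10

# MNIST: 28x28 grayscale digits, 10 classes, 60,000 train / 10,000 test images
(x_tr, y_tr), (x_te, y_te) = mnist.load_data()
print(x_tr.shape, x_te.shape)      # (60000, 28, 28) (10000, 28, 28)

# CIFAR-10: 32x32 color images, 10 classes, 50,000 train / 10,000 test images
(c_tr, yc_tr), (c_te, yc_te) = cifar10.load_data()
print(c_tr.shape, c_te.shape)      # (50000, 32, 32, 3) (10000, 32, 32, 3)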
There + The existence of such frameworks are not only a convenient are 50,000 training images (5000 per class) and 10,000 test + aid for DNN researchers and application designers, but they images (1000 per class). A two-layer convolutional deep belief + are also invaluable for engineering high performance or more network was able to achieve 64.84% accuracy on CIFAR-10 + efficient DNN computation engines. In particular, because the when it was first introduced [66]. Since then the accuracy has + frameworks make heavy use of a set primitive operations, increased to 96.53% using fractional max pooling [67]. + such processing of a CONV layer, they can incorporate use of ImageNetis a large scale image dataset that was first + optimized software or hardware accelerators. This acceleration introduced in 2010; the dataset stabilized in 2012 [14]. It + is transparent to the user of the framework. Thus, for example, contains images of 256256 pixel in color with 1000 classes. + most frameworks can use Nvidia’s cuDNN library for rapid The classes are defined using the WordNet as a backbone to + execution on Nvidia GPUs. Similarly, transparent incorporation handle ambiguous word meanings and to combine together + of dedicated hardware accelerators can be achieved as was synonyms into the same object category. In otherwords, there + done with the Eyeriss chip [61]. is a hierarchy for the ImageNet categories. The 1000 classes + Finally, these frameworks are a valuable source of workloads were selected such that there is no overlap in the ImageNet + for hardware researchers. They can be used to drive experi- hierarchy. The ImageNet dataset contains many fine-grained + mental designs for different workloads, for profiling different categories including 120 different breeds of dogs. There are + workloads and for exploring hardware-software trade-offs. 1.3M training images (732 to 1300 per class), 100,000 testing 11 + + <
> + TABLE II + SUMMARY OF POPULAR DNN S [3,15,48,50,51]. y ACCURACY IS MEASURED BASED ON TOP -5 ERROR ON IMAGE NET [14]. z THIS VERSION OF LE NET -5 + HAS 431 K WEIGHTS FOR THE FILTERS AND REQUIRES 2.3M MAC S PER IMAGE ,AND USES RE LU RATHER THAN SIGMOID . + + + + be localized and classified (out of 1000 classes). The DNN + outputs the top five categories and top five bounding box + locations. There is no penalty for identifying an object that + is in the image but not included in the ground truth. For + object detection, all objects in the image must be localized + and classified (out of 200 classes). The bounding box for all + objects in these categories must be labeled. Objects that are + not labeled are penalized as are duplicated detections. Fig. 16. + MNIST (10 classes, 60k training, 10k testing) [62] vs. ImageNet + (1000 classes, 1.3M training, 100k testing)[14] dataset. Beyond ImageNet, there are also other popular image + datasets for computer vision tasks. For object detection, there + images (100 per class) and 50,000 validation images (50 per is the PASCAL VOC (2005-2012) dataset that contains 11k + class). images representing 20 classes (27k object instances, 7k of + The accuracy of the ImageNet Challenge are reported using which has detailed segmentation) [68]. For object detection, + two metrics: Top-5 and Top-1 error. Top-5 error means that if segmentation and recognition in context, there is the MS COCO + any of the top five scoring categories are the correct category, dataset with 2.5M labeled instances in 328k images (91 object + it is counted as a correct classification. The Top-1 requires categories) [69]; compared to ImageNet, COCO has fewer + that the top scoring category be correct. In 2012, the winner categories but more instances per category, which is useful for + of the ImageNet Challenge (AlexNet) was able to achieve an precise 2-D localization. COCO also has more labeled instances + accuracy of 83.6% for the top-5 (which is substantially better per image to potentially help with contextual information. + than the 73.8% which was second place that year that did not Most recently even larger scale datasets have been made + use DNNs); it achieved 61.9% on the top-1 of the validation available. For instance, Google has an Open Images dataset + set. In 2017, the highest accuracy was 97.7% for the top-5. with over 9M images [70], spanning 6000 categories. There is + In summary of the various image classification datasets, it also a YouTube dataset with 8M videos (0.5M hours of video) + is clear that MNIST is a fairly easy dataset, while ImageNet covering 4800 classes [71]. Google also released an audio + is a challenging one with a wider coverage of classes. Thus dataset comprised of 632 audio event classes and a collection + in terms of evaluating the accuracy of a given DNN, it is of 2M human-labeled 10-second sound clips [72]. These large + important to consider that dataset upon which the accuracy is datasets will be evermore important as DNNs become deeper + measured. with more weight parameters to train. + Undoubtedly, both larger datasets and datasets for new + D. Datasets for Other Tasks domains will serve as important resources for profiling and + exploring the efficiency of future DNN engines.Since the accuracy of the state-of-the-art DNNs are perform- + ing better than human-level accuracy on image classification + tasks, the ImageNet Challenge has started to focus on more V. 
H ARDWARE FOR DNN P ROCESSING + difficult tasks such as single-object localization and object Due to the popularity of DNNs, many recent hardware + detection. For single-object localization, the target object must platforms have special features that target DNN processing. For 12 + + + instance, the Intel Knights Landing CPU features special vector + instructions for deep learning; the Nvidia PASCAL GP100 + GPU features 16-bit floating point (FP16) arithmetic support + to perform two FP16 operations on a single precision core for + faster deep learning computation. Systems have also been built + specifically for DNN processing such as Nvidia DGX-1 and + Facebook’s Big Basin custom DNN server [73]. DNN inference + has also been demonstrated on various embedded System-on- + Chips (SoC) such as Nvidia Tegra and Samsung Exynos as + well as FPGAs. Accordingly, it’s important to have a good + understanding of how the processing is being performed on + these platforms, and how application-specific accelerators can <
> + be designed for DNNs for further improvement in throughput + and energy efficiency. Fig. 17. Highly-parallel compute paradigms. + The fundamental component of both the CONV and FC lay- + ers are the multiply-and-accumulate (MAC) operations, which + can be easily parallelized. In order to achieve high performance, + highly-parallel compute paradigms are very commonly used, + including both temporal and spatial architectures as shown in <> + Fig. 17. The temporal architectures appear mostly in CPUs + parallelism such as vectors (SIMD) or parallel threads (SIMT). + Such temporal architecture use a centralized control for a large + number of ALUs. These ALUs can only fetch data from the + memory hierarchy and cannot communicate directly with each + other. In contrast, spatial architectures use dataflow processing, + i.e., the ALUs form a processing chain so that they can pass data + from one to another directly. Sometimes each ALU can have + its own control logic and local memory, called a scratchpad or + register file. We refer to the ALU with its own local memory as + a processing engine (PE). Spatial architectures are commonly + used for DNNs in ASIC and FPGA-based designs. In this + section, we will discuss the different design strategies for + efficient processing on these different platforms, without any + impact on accuracy (i.e., all approaches in this section produce + bit-wise identical results); specifically, <
>

* For temporal architectures such as CPUs and GPUs, we will discuss how computational transforms on the kernel can reduce the number of multiplications to increase throughput.
* For spatial architectures used in accelerators, we will discuss how dataflows can increase data reuse from low-cost memories in the memory hierarchy to reduce energy consumption.

A. Accelerate Kernel Computation on CPU and GPU Platforms

CPUs and GPUs use parallelization techniques such as SIMD or SIMT to perform the MACs in parallel. All the ALUs share the same control and memory (register file). On these platforms, both the FC and CONV layers are often mapped to a matrix multiplication (i.e., the kernel computation). Fig. 18 shows how a matrix multiplication is used for the FC layer. The height of the filter matrix is the number of filters and the width is the number of weights per filter (input channels (C) × width (W) × height (H), since R = W and S = H in the FC layer); the height of the input feature maps matrix is the number of activations per input feature map (C × W × H), and the width is the number of input feature maps (one in Fig. 18(a) and N in Fig. 18(b)); finally, the height of the output feature map matrix is the number of channels in the output feature maps (M), and the width is the number of output feature maps (N), where each output feature map of the FC layer has the dimension of 1×1× the number of output channels (M).

Fig. 18. Mapping to matrix multiplication for fully connected layers.

The CONV layer in a DNN can also be mapped to a matrix multiplication using a relaxed form of the Toeplitz matrix as shown in Fig. 19. The downside of using matrix multiplication for the CONV layers is that there is redundant data in the input feature map matrix, as highlighted in Fig. 19(a). This can lead to either inefficiency in storage or a complex memory access pattern.

There are software libraries designed for CPUs (e.g., OpenBLAS, Intel MKL, etc.) and GPUs (e.g., cuBLAS, cuDNN, etc.) that optimize for matrix multiplications. The matrix multiplication is tiled to the storage hierarchy of these platforms, which are on the order of a few megabytes at the higher levels.

<
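As a concrete illustration of the mapping just described, the following sketch (our own, written with NumPy and illustrative shapes, not code from the paper) forms the filter and input matrices for an FC layer and builds the relaxed Toeplitz (im2col) matrix for a CONV layer; the repeated input values in the im2col matrix are exactly the redundancy highlighted in Fig. 19(a).

import numpy as np

# Shapes follow the text: C input channels, HxW inputs, M filters of size CxRxS, batch N.

def fc_as_matmul(weights, activations):
    # weights: (M, C*H*W), one row per filter
    # activations: (C*H*W, N), one column per input feature map in the batch
    return weights @ activations                    # output: (M, N)

def im2col(x, R, S):
    # x: (C, H, W) single input feature map; stride 1, no padding.
    # Columns are the unrolled receptive fields; note the repeated input values.
    C, H, W = x.shape
    E, F = H - R + 1, W - S + 1
    cols = np.empty((C * R * S, E * F), dtype=x.dtype)
    for i in range(E):
        for j in range(F):
            cols[:, i * F + j] = x[:, i:i+R, j:j+S].reshape(-1)
    return cols

def conv_as_matmul(filters, x):
    # filters: (M, C, R, S); x: (C, H, W); returns (M, E, F)
    M, C, R, S = filters.shape
    E, F = x.shape[1] - R + 1, x.shape[2] - S + 1
    W_mat = filters.reshape(M, C * R * S)
    return (W_mat @ im2col(x, R, S)).reshape(M, E, F)

# Quick check against a direct convolution for one output position.
x = np.random.rand(3, 8, 8).astype(np.float32)      # C=3, H=W=8
f = np.random.rand(4, 3, 3, 3).astype(np.float32)   # M=4, R=S=3
out = conv_as_matmul(f, x)
assert np.allclose(out[1, 2, 3], np.sum(f[1] * x[:, 2:5, 3:6]), atol=1e-4)

Libraries such as OpenBLAS, MKL, cuBLAS and cuDNN perform the same kind of mapping, but with the matrix multiplication tiled to the platform's storage hierarchy as noted above.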
> + + Fig. 21. Read and write access per MAC. + + <
>

Fig. 19. Mapping to matrix multiplication for convolutional layers.

The matrix multiplications on these platforms can be further sped up by applying computational transforms to the data to reduce the number of multiplications, while still giving the same bit-wise result. Often this can come at a cost of an increased number of additions and a more irregular data access pattern.

Fast Fourier Transform (FFT) [10, 74] is a well known approach, shown in Fig. 20, that reduces the number of multiplications from O(No^2 Nf^2) to O(No^2 log2 No), where the output size is No × No and the filter size is Nf × Nf. To perform the convolution, we take the FFT of the filter and input feature map, and then perform the multiplication in the frequency domain; we then apply an inverse FFT to the resulting product to recover the output feature map in the spatial domain. However, there are several drawbacks to using FFT: (1) the benefits of FFTs decrease with filter size; (2) the size of the FFT is dictated by the output feature map size, which is often much larger than the filter; (3) the coefficients in the frequency domain are complex. As a result, while FFT reduces computation, it requires larger storage capacity and bandwidth. Finally, a popular approach for reducing complexity is to make the weights sparse, which will be discussed in Section VII-B2; using FFTs makes it difficult for this sparsity to be exploited.

Several optimizations can be performed on FFT to make it more effective for DNNs. To reduce the number of operations, the FFT of the filter can be precomputed and stored. In addition, the FFT of the input feature map can be computed once and used to generate multiple channels in the output feature map. Finally, since an image contains only real values, its Fourier transform is symmetric and this can be exploited to reduce storage and computation cost.

Other approaches include Strassen [75] and Winograd [76], which rearrange the computation such that the number of multiplications reduces from O(N^3) to O(N^2.807), and by 2.25× for a 3×3 filter, respectively, at the cost of reduced numerical stability, increased storage requirements, and specialized processing depending on the size of the filter.

In practice, different algorithms might be used for different layer shapes and sizes (e.g., FFT for filters greater than 5×5, and Winograd for filters 3×3 and below). Existing platform libraries, such as MKL and cuDNN, dynamically choose the appropriate algorithm for a given shape and size [77, 78].

B. Energy-Efficient Dataflow for Accelerators

For DNNs, the bottleneck for processing is in the memory access. Each MAC requires three memory reads (for the filter weight, fmap activation, and partial sum) and one memory write (for the updated partial sum) as shown in Fig. 21. In the worst case, all of the memory accesses have to go through the off-chip DRAM, which will severely impact both throughput and energy efficiency. For example, in AlexNet, to support its 724M MACs, nearly 3000M DRAM accesses will be required. Furthermore, DRAM accesses require up to several orders of magnitude higher energy than computation [79].

Accelerators, such as the spatial architectures shown in Fig. 17, provide an opportunity to reduce the energy cost of data movement by introducing several levels of local memory hierarchy with different energy costs, as shown in Fig. 22. This includes a large global buffer with a size of several hundred kilobytes that connects to DRAM, an inter-PE network that can pass data directly between the ALUs, and a register file (RF) within each processing element (PE) with a size of a few kilobytes or less. The multiple levels of memory hierarchy help to improve energy efficiency by providing low-cost data accesses. For example, fetching the data from the RF or neighbor PEs is going to cost 1 or 2 orders of magnitude lower energy than from DRAM.

Accelerators can be designed to support specialized processing dataflows that leverage this memory hierarchy. The dataflow decides what data gets read into which level of the memory hierarchy and when it gets processed. Since there is no randomness in the processing of DNNs, it is possible to design a fixed dataflow that can adapt to the DNN shapes and sizes and optimize for the best energy efficiency. The optimized dataflow minimizes accesses from the more energy consuming levels of the memory hierarchy. Large memories that can store a significant amount of data consume more energy than smaller memories. For instance, DRAM can store gigabytes of data, but consumes two orders of magnitude higher energy per access than a small on-chip memory of a few kilobytes. Thus, every time a piece of data is moved from an expensive level to a lower-cost level in terms of energy, we want to reuse that piece of data as much as possible to minimize subsequent accesses to the expensive levels. The challenge, however, is that the storage capacity of these low-cost memories is limited. Thus we need to explore different dataflows that maximize reuse under these constraints.

Fig. 22. Memory hierarchy and data movement energy [80].

Fig. 23. Data reuse opportunities in DNNs [80].

For DNNs, we investigate dataflows that exploit three forms of input data reuse (convolutional, feature map and filter) as shown in Fig. 23. For convolutional reuse, the same input feature map activations and filter weights are used within a given channel, just in different combinations for different weighted sums. For feature map reuse, multiple filters are applied to the same feature map, so the input feature map activations are used multiple times across filters. Finally, for filter reuse, when multiple input feature maps are processed at once (referred to as a batch), the same filter weights are used multiple times across input feature maps.

If we can harness the three types of data reuse by storing the data in the local memory hierarchy and accessing it multiple times without going back to the DRAM, we can save a significant number of DRAM accesses. For example, in AlexNet, the number of DRAM reads can be reduced by up to 500× in the CONV layers. The local memory can also be used for partial sum accumulation, so the partial sums do not have to reach DRAM. In the best case, if all data reuse and accumulation can be achieved by the local memory hierarchy, the 3000M DRAM accesses in AlexNet can be reduced to only 61M.

The operation of DNN accelerators is analogous to that of general-purpose processors as illustrated in Fig. 24 [81]. In conventional computer systems, the compiler translates the program into machine-readable binary codes for execution given the hardware architecture (e.g., x86 or ARM); in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible computation mapping for execution given the dataflow. While the compiler usually optimizes for performance, the mapper optimizes for energy efficiency.

Fig. 24. An analogy between the operation of DNN accelerators (texts in black) and that of general-purpose processors (texts in red). Figure adopted from [81].

The following taxonomy (Fig. 25) can be used to classify the DNN dataflows in recent works [82–93] based on their data handling characteristics [80]:

1) Weight stationary (WS): The weight stationary dataflow is designed to minimize the energy consumption of reading weights by maximizing the accesses of weights from the register file (RF) at the PE (Fig. 25(a)). Each weight is read from DRAM into the RF of each PE and stays stationary for further accesses. The processing runs as many MACs that use the same weight as possible while the weight is present in the RF; it maximizes convolutional and filter reuse of weights. The inputs and partial sums must move through the spatial array and global buffer. The input fmap activations are broadcast to all PEs and then the partial sums are spatially accumulated across the PE array.

One example of previous work that implements the weight stationary dataflow is nn-X, or neuFlow [85], which uses eight 2-D convolution engines for processing a 10×10 filter. There are a total of 100 MAC units, i.e. PEs, per engine with each PE having a weight that stays stationary for processing. The
> + + Fig. 26. Variations of output stationary [80].(b) Output Stationary + + are [89], [88], and [90], respectively. + No local reuse (NLR): While small register files are + efficient in terms of energy (pJ/bit), they are inefficient in terms Psum + <> of area (<>). In order to maximize the storage capacity, + and minimize the off-chip memory bandwidth, no local storage + Fig. 25. Dataflows for DNNs [80]. is allocated to the PE and instead all that area is allocated + to the global buffer to increase its capacity (Fig. 25(c)). The + no local reuse dataflow differs from the previous dataflows in + input fmap activations are broadcast to all MAC units and the that nothing stays stationary inside the PE array. As a result, + partial sums are accumulated across the MAC units. In order to there will be increased traffic on the spatial array and to the + accumulate the partial sums correctly, additional delay storage global buffer for all data types. Specifically, it has to multicast + elements are required, which are counted into the required size the activations, single-cast the filter weights, and then spatially + of local storage. Other weight stationary examples are found accumulate the partial sums across the PE array. + in [82–84, 86, 87]. In an example of the no local reuse dataflow from + 2) Output stationary (OS):The output stationary dataflow is UCLA [91], the filter weights and input activations are read + designed to minimize the energy consumption of reading and from the global buffer, processed by the MAC units with custom + writing the partial sums (Fig. 25(b)). It keeps the accumulation adder trees that can complete the accumulation in a single cycle, + of partial sums for the same output activation value local in the and the resulting partial sums or output activations are then put + RF. In order to keep the accumulation of partial sums stationary back to the global buffer. Another example is DianNao [92], + in the RF, one common implementation is to stream the input which also reads input activations and filter weights from + activations across the PE array and broadcast the weight to all the buffer, and processes them through the MAC units with + PEs in the array. custom adder trees. However, DianNao implements specialized + One example that implements the output stationary dataflow registers to keep the partial sums in the PE array, which helps + is ShiDianNao [89], where each PE handles the processing for to further reduce the energy consumption of accessing partial + each output activation value by fetching the corresponding input sums. Another example of no local reuse dataflow is found + activations from neighboring PEs. The PE array implements in [93]. + dedicated networks to pass data horizontally and vertically. 4) Row stationary (RS): A row stationary dataflow is + Each PE also has data delay registers to keep data around for proposed in [80], which aims to maximize the reuse and + the required amount of cycles. At the system level, the global accumulation at the RF level foralltypes of data (weights, + buffer streams the input activations and broadcasts the weights pixels, partial sums) for the overall energy efficiency. This + into the PE array. The partial sums are accumulated inside differs from WS or OS dataflows, which optimize for only + each PE and then get streamed out back to the global buffer. weights and partial sums, respectively. + Other examples of output stationary are found in [88, 90]. 
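To make the data-handling differences concrete, the following sketch (our own illustration, not taken from any of the cited designs) expresses the weight stationary and output stationary ideas as loop nests over a 1-D convolution; which loop sits outermost determines which operand is fetched once and held in the PE-local register file.

# Illustrative loop nests for a 1-D convolution out[e] = sum_r w[r] * x[e + r].
# The point is only which operand stays resident in PE-local storage.

def conv1d_weight_stationary(w, x):
    out = [0.0] * (len(x) - len(w) + 1)
    for r, weight in enumerate(w):        # each weight is fetched once ...
        for e in range(len(out)):         # ... and reused across all outputs;
            out[e] += weight * x[e + r]   # partial sums move on every iteration
    return out

def conv1d_output_stationary(w, x):
    out = []
    for e in range(len(x) - len(w) + 1):  # each partial sum stays local ...
        acc = 0.0
        for r, weight in enumerate(w):    # ... while weights and inputs stream in
            acc += weight * x[e + r]
        out.append(acc)
    return out

assert conv1d_weight_stationary([1, 2, 3], [1, 1, 1, 1]) == \
       conv1d_output_stationary([1, 2, 3], [1, 1, 1, 1])

Both loop nests compute the same outputs; they differ only in which operand is held locally, which is exactly the distinction the taxonomy above draws.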
The row stationary dataflow assigns the processing of a + There are multiple possible variants of output stationary as 1-D row convolution into each PE for processing as shown + shown in Fig. 26 since the output activations that get processed in Fig. 27. It keeps the row of filter weights stationary inside + at the same time can come from different dimensions. For the RF of the PE and then streams the input activations into + example, the variantOS A targets the processing of CONV the PE. The PE does the MACs for each sliding window at a + layers, and therefore focuses on the processing of output time, which uses just one memory space for the accumulation + activations from the same channel at a time in order to of partial sums. Since there are overlaps of input activations + maximize data reuse opportunities. The variantOS C targets between different sliding windows, the input activations can + the processing of FC layers, and focuses on generating output then be kept in the RF and get reused. By going through all the + activations from all different channels, since each channel only sliding windows in the row, it completes the 1-D convolution + has one output activation. The variantOS B is something in and maximize the data reuse and local accumulation of data + betweenOS A andOS C . Example of variantsOS A ,OS B , and in this row. 16 + + <
> + + Fig. 27. 1-D Convolutional reuse within PE for Row Stationary Dataflow [80]. + + <
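As a rough sketch of the 1-D row convolution that Fig. 27 depicts (our own illustration, with a hypothetical filter-row size S), the PE can hold the S filter weights and the most recent S input activations in its register file, so each new sliding window needs only one new input value and a single partial-sum register.

from collections import deque

def row_stationary_1d(filter_row, input_row):
    # filter_row stays resident in the PE's register file for the whole row.
    S = len(filter_row)
    window = deque(maxlen=S)              # last S input activations kept locally
    out = []
    for x in input_row:                   # inputs stream in one at a time
        window.append(x)
        if len(window) == S:
            psum = 0.0                    # one memory space for accumulation
            for w, a in zip(filter_row, window):
                psum += w * a
            out.append(psum)              # completed output of this sliding window
    return out

# Example: a 3-tap filter row over a 5-wide input row produces 3 outputs.
print(row_stationary_1d([1, 0, -1], [3, 5, 7, 9, 11]))   # [-4.0, -4.0, -4.0]

Only the final sum of each window would leave the PE; the overlapping inputs are reused directly from the register file, which is the convolutional reuse the row stationary dataflow is designed to maximize.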
>

Fig. 29. Multiple rows of different input feature maps, filters and channels are mapped to the same PE within the array for additional reuse in the Row Stationary Dataflow [80].

<
> + Fig. 28. 2-D convolutional reuse within spatial array for Row Stationary + + shown in Fig. 28. For example, to generate the first row of + output activations with a filter having three rows, three 1-D Fig. 30. Mapping optimization takes in hardware and DNNs shape constraints + convolutions are required. Therefore, we can use three PEs in to determine optimal energy dataflow [80]. + a column, each running one of the three 1-D convolutions. The + partial sums are further accumulated vertically across the three + PEs to generate the first output row. To generate the second different channels are interleaved, and run through the same PE + row of output, we use another column of PEs, where three as a 1-D convolution. The partial sums from different channels + rows of input activations are shifted down by one row, and use then naturally get accumulated inside the PE. + the same rows of filters to perform the three 1-D convolutions. The number of filters, channels, and fmaps that can be + Additional columns of PEs are added until all rows of the processed at the same time is programmable, and there exists an + output are completed (i.e., the number of PE columns equals optimal mapping for the best energy efficiency, which depends + the number of output rows). on the shape configuration of the DNN as well as the hardware + This 2-D array of PEs enables other forms of reuse to reduce resources provided, e.g., the number of PEs and the size of the + accesses to the more expensive global buffer. For example, each memory in the hierarchy. Since all of the variables are known + filter row is reused across multiple PEs horizontally. Each row before runtime, it is possible to build a compiler (i.e., mapper) + of input activations is reused across multiple PEs diagonally. to perform this optimization off-line to configure the hardware + And each row of partial sums are further accumulated across for different mappings of the RS dataflow for different DNNs + the PEs vertically. Therefore, 2-D convolutional data reuse and as shown in Fig. 30. + accumulation are maximized inside the 2-D PE array. One example that implements the row stationary dataflow + To address the high-dimensional convolution of the CONV is Eyeriss [94]. It consists of a 14x12 PE array, a 108KB + layer (i.e., multiple fmaps, filters, and channels), multiple rows global buffer, ReLU and fmap compression units as shown + can be mapped onto the same PE as shown in Fig. 29. The in Fig. 31. The chip communicates with the off-chip DRAM + 2-D convolution is mapped to a set of PEs, and the additional using a 64-bit bidirectional data bus to fetch data into the + dimensions are handled by interleaving or concatenating the global buffer. The global buffer then streams the data into the + additional data. For filter reuse within the PE, different rows PE array for processing. + of fmaps are concatenated and run through the same PE In order to support the RS dataflow, two problems need to be + as a 1-D convolution. For input fmap reuse within the PE, solved in the hardware design. First, how can the fixed-size PE + different filter rows are interleaved and run through the same array accommodate different layer shapes? Second, although + PE as a 1-D convolution. Finally, to increase local partial sum the data will be passed in a very specific pattern, it still changes + accumulation within the PE, filter rows and fmap rows from with different shape configurations. 
How can the fixed design pass data in different patterns?

Two mapping strategies can be used to solve the first problem, as shown in Fig. 32. First, replication can be used to map shapes that do not use up the entire PE array. For example, in the third to fifth layers of AlexNet, each 2-D convolution only uses a 13×3 PE array. This structure is then replicated four times, and runs different channels and filters in each replication. The second strategy is called folding. For example, the second layer of AlexNet requires a 27×5 PE array to complete the 2-D convolution. In order to fit it into the 14×12 physical PE array, it is folded into two parts, 14×5 and 13×5, and each is vertically mapped into the physical PE array. Since not all PEs are used by the mapping, the unused PEs can be clock gated to save energy consumption.

Fig. 32. Mapping uses replication and folding to maximize utilization of the PE array [94].

A custom multicast network is used to solve the second problem of flexible data delivery. The simplest way to pass data to multiple destinations is to broadcast the data to all PEs and let each PE decide if it has to process the data or not. However, this is not very energy efficient, especially when the size of the PE array is large. Instead, a multicast network is used to send data to only the places where it is needed.

5) Energy comparison of different dataflows: To evaluate and compare different dataflows, the same total hardware area and number of PEs (256) are used in the simulation of a spatial architecture for all dataflows. The local memory (register file) at each processing element (PE) is on the order of 0.5 – 1.0 kB and a shared memory (global buffer) is on the order of 100 – 500 kB. The sizes of these memories are selected to be comparable to a typical accelerator for multimedia processing, such as video coding [95]. The memory sizes are further adjusted for the needs of each dataflow under the same area constraint. For example, since the no local reuse dataflow does not require any RF in the PE, it is allocated a much larger global buffer. The simulation uses the layer configurations from AlexNet with a batch size of 16. The simulation also takes into account the fact that accessing different levels of the memory hierarchy requires different energy costs.

Fig. 33(b) shows the energy consumption of each dataflow for the CONV layers of AlexNet with a batch size of 16, broken down by data type. The WS and OS dataflows have the lowest energy consumption for accessing weights and partial sums, respectively. However, the RS dataflow has the lowest total energy consumption since it optimizes for the overall energy efficiency instead of only for a certain data type.

Fig. 33(a) shows the same results with a breakdown in terms of memory hierarchy. The RS dataflow consumes the most energy in the RF, since by design most of the accesses have been moved to the lowest level of the memory hierarchy. This helps to achieve the lowest total energy consumption since the RF has the lowest energy per access. The NLR dataflow has the lowest energy consumption at the DRAM level, since it has a much larger global buffer and thus higher on-chip storage capacity compared to others. However, most of the data accesses in the NLR dataflow are from the global buffer, which still has a relatively large energy consumption per access compared to accessing data from the RF or inside the PE array. As a result, the overall energy consumption of the NLR dataflow is still fairly high. Overall, the RS dataflow uses 1.4× to 2.5× lower energy than the other dataflows.

Fig. 34 shows the energy efficiency of the different dataflows in the FC layers of AlexNet with a batch size of 16. Since there is not as much data reuse in the FC layers as in the CONV layers, all dataflows spend a significant amount of energy on reading weights. However, the RS dataflow still has the lowest energy consumption because it optimizes for the energy of accessing input activations and partial sums. For the OS dataflows, OS_C now consumes lower energy than OS_A since it is designed for the FC layers. Overall, RS still consumes 1.3× lower energy compared to the other dataflows at the batch size of 16.

Fig. 35 shows the RS dataflow design with an energy breakdown in terms of the different layers of AlexNet. In the CONV layers, the energy is mostly consumed by the RF, while in the FC layers, the energy is mostly consumed by DRAM. However, most of the energy is consumed by the CONV layers, which take around 80% of the energy. As recent DNN models go deeper with more CONV layers, the ratio between the number of CONV and FC layers only gets larger. Therefore, moving forward, significant effort should be placed on energy optimizations for CONV layers.

Finally, up until now, we have been looking at architectures with relatively limited storage on the order of a few hundred kilobytes. With much larger storage on the order of a few megabytes, additional dataflows can be considered. For example, Fused-Layer looks at dataflow optimizations across layers [96].

<>

Fig. 35. Energy breakdown across layers of the AlexNet [80].
RF energy + dominates in convolutional layers. DRAM energy dominates in the fully + connected layer. Convolutional layer dominate energy consumption. + In this section, we will discuss how moving compute and data Normalized + closer to reduce data movement (i.e., near-data processing) can pixels + be achieved using mixed-signal circuit design and advanced + memory technologies. + Many of these works use analog processing which has the + drawback of increased sensitivity to circuit and device non- + idealities. Consequentially, the computation is often performed + at reduced precision, which can be accounted for during (b) Energy breakdown across data type + the training of the DNNs using the techniques discussed in + Section VII. Another factor to take into consideration is that Fig. 33. + Comparison of energy efficiency between different dataflows in the DNNs are + often trained in the digital domain; thus for analog CONV layers of AlexNet with a batch size of 16 [3]: + (a) breakdown in terms of storage levels and ALU, (b) breakdown in terms of data types. OS + processing, there is an additional overhead cost for analog- A , OS B and OS C are three variants of the + OS dataflow that are commonly seen in to-digital conversion (ADC) and digital-to-analog conversion different implementations [80]. (DAC). + + A. DRAM + + Advanced memory technology can reduce the access energy + for high density memories such as DRAMs. For instance, psums + embedded DRAM (eDRAM)brings high density memory on- + chip to avoid the high energy cost of switching off-chip pixels + capacitance [97]; eDRAM is 2.85higher density than SRAM 0.5 + and 32% more energy efficient than DRAM (DDR3) [93]. + eDRAM also offers higher bandwidth and lower latency + compared to DRAM. In DNN processing, eDRAM can be used DNN Dataflows + to store tens of megabytes of weights and activations on-chip + to avoid off-chip access, as demonstrated in DaDianNao [93]. + off-chip DRAM and can increase the cost of the chip. + Rather than integrating DRAM into the chip itself, the + DRAM can also be stacked on top of the chip using throughVI. N EAR -D ATA PROCESSING silicon vias (TSV). This technology is often referred to as3-D + The previous section highlighted that data movement domi- memory, and has been commercialized in the form of Hybrid + nates energy consumption. While spatial architectures distribute Memory Cube (HMC) [98] and High Bandwidth Memory + the on-chip memory such that it is closer to the computation (HBM) [99]. 3-D memory delivers an order of magnitude higher + (e.g., into the PE), there have also been efforts to bring the bandwidth and reduces access energy by up to 5relative to + off-chip high density memory closer to the computation or to existing 2-D DRAMs, as TSV have lower capacitance than + integrate the computation into the memory itself; the latter is typical off-chip interconnects. Recent works have explored the + often referred to asprocessing-in-memoryorlogic-in-memory. use of HMC for efficient DNN processing in a variety of ways. + In embedded systems, there have also been efforts to bring the For instance, Neurocube [100] integrates SIMD processors into + computation into the sensor where the data is first collected. the logic die of the HMC to bring the memory and computation 19 + voltage as the input, and the current as the output as shown in resistive memory. + + <
> + + Fig. 36. Analog computation by (a) SRAM bit-cell and (b) non-volatile + + + Processing with non-volatile resistive memories has several drawbacks as described in [108]. + First, it suffers from the + reduced precision and ADC/DAC overhead of analog process- + ing described earlier. Second, the array size is limited by thecloser together. Tetris [101] explores the use of HMC with wires that connect the resistive devices; specifically, wire energythe Eyeriss spatial architecture and row stationary dataflow. dominates for large arrays (e.g., 1k1k), and the IR drop alongIt proposes allocating more area to computation than on-chip wire can degrade the read accuracy. Third, the write energymemory (i.e., larger PE array and smaller global buffer) in to program the resistive devices can be costly, in some casesorder to exploit the low energy and high throughput properties requiring multiple pulses. Finally, the resistive devices can alsoof the HMC. It also adapts the dataflow to account for the suffer from device-to-device and cycle-to-cycle variations withHMC memory and smaller on-chip memory. Tetris achieves non-linear conductance across the conductance range.a 1.5reduction in energy consumption and 4.1increase There have been several recent works that explore the use ofin throughput over a baseline system with conventional 2-D memristors for DNNs. ISAAC [104] replaces the eDRAM inDRAM. DaDianNao with memristors. To address the limited precision + support, ISAAC computes a 16-bit dot product operation with + B. SRAM 8 memristors each storing 2-bits; a 1-bit2-bit multiplication + Rather than bringing the memory near the compute, recent is performed at each memristor, where a 16-bit input requires + work has also investigated bringing the compute into the 16 cycles to complete. In other words, the ISAAC architecture + memory. For instance, the multiply and accumulate operation trades off area and time for increased precision. Finally, ISAAC + can be directly integrated into the bit-cells of an SRAM arranges its 25.1M memristors in a hierarchical structure to + array [102], as shown in Fig. 36(a). In this work, a 5-bit avoid issues with large arrays. PRIME [109] also replaces the + DAC is used to drive the word line (WL) to an analog voltage DRAM main memory with memristors; specifically, it uses + that represents the feature vector, while the bit-cells store the 256256 memristor arrays that can be configured for 4-bit + binary weights1. The bit-cell current (I multi-level cell computation or 1-bit single level cell storage. BC ) is effectively + a product of the value of the feature vector and the value of It should be noted that results from ISAAC and PRIME are + the weight stored in the bit-cell; the currents from the bit- obtained from simulations. The task of actually fabricating + cells within a column add together to discharge the bitline large memristors arrays is still very much a research challenge; + (V for instance, [110] uses a fabricated 1212 memristor array BL ). This approach gives 12energy savings compared to + reading the 1-bit weights from the SRAM and performing the to demonstrate a linear classifier. + computation separately. To counter circuit non-idealities, the + DAC accounts for the non-linear bit-line discharge with respect D. 
Sensors + to the WL voltage, and boosting is used to combine the weak In certain applications, such as image processing, the dataclassifiers that are susceptible to device variations to form a movement from the sensor itself can account for a significantstrong classifier [103]. portion of the system energy consumption. Thus there has + also been research on performing the computation as close + C. Non-volatile Resistive Memories as possible to the sensor. In particular, much of the work + focuses on moving the computation into the analog domain toThe multiply and accumulate operation can also be directly avoid using the ADC within the sensor, which accounts for aintegrated into advancednon-volatilehigh density memories significant portion of the sensor power. However, as mentionedby using them as programmable resistive elements, commonly + referred to asmemristors[105]. Specifically, a multiplication 8 The resistive devices can be inserted between the cross-point of two wires is performed with the resistor’s conductance as the weight, the and in certain cases can avoid the need for an access transistor. 20 + + + earlier, lower precision is required for analog computation due + to circuit non-idealities. + In [111], the matrix multiplication is integrated into the + ADC, where the most significant bits of the multiplications + are performed using switched capacitors in an 8-bit successive + approximation format. This is extended in [112] to not only + perform the multiplications, but also the accumulations in the + analog domain. In this work, it is assumed that 3-bits and + 6-bits are sufficient to represent the weights and activations, + respectively. This reduces the number of ADC conversions in + the sensor by 21. RedEye [113] takes this approach even + further by performing the entire convolution layer (including + convolution, max pooling and quantization) in the analog + domain at the sensor. It should be noted that [111] and [112] + report measured results from fabricated test chips, while results + in [113] are from simulations. <
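To give a feel for the precision assumed in [112] (3-bit weights and 6-bit activations), the following sketch (our own, purely illustrative) uniformly quantizes a small dot product to those bitwidths and compares it against the full-precision result; the quantization step sizes and value ranges are arbitrary choices, not taken from the cited designs.

import numpy as np

def quantize_uniform(x, n_bits, x_max):
    # Linear (uniform) quantization of values in [-x_max, x_max] to 2**n_bits levels.
    levels = 2 ** n_bits
    step = 2 * x_max / (levels - 1)
    return np.clip(np.round(x / step), -(levels // 2), levels // 2 - 1) * step

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, 64)          # weights
a = rng.uniform(0, 1, 64)           # activations (e.g., non-negative after ReLU)

w_q = quantize_uniform(w, n_bits=3, x_max=1.0)   # 8 weight levels
a_q = quantize_uniform(a, n_bits=6, x_max=1.0)   # 64 activation levels

print("full precision :", float(w @ a))
print("3b x 6b        :", float(w_q @ a_q))

The error such coarse quantization introduces is what must be absorbed during training, as discussed in Section VII.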
> + It is also feasible to embed the computation not just before + the ADC, but into the sensor itself. For instance, in [114] an Fig. 37. Various methods of quantization (Figures from [117, 118]). + Angle Sensitive Pixels sensor is used to compute the gradient + of the input, which along with compression, reduces the data the number of bits. The benefits of reduced precision includemovement from the sensor by 10. In addition, since the reduced storage cost and/or reduced computation requirements.first layer of the DNN often outputs a gradient-like feature + map, it maybe possible to skip the computations in the first There are several ways to map the data to quantization levels. + layer, which further reduces energy consumption as discussed The simplest method is a linear mapping with uniform distance + in [115, 116]. between each quantization level (Fig. 37(a)). Another approach + is to use a simple mapping function such as alog function + (Fig. 37(b)) where the distance between the levels varies; thisVII. C O -DESIGN OF DNN MODELS AND HARDWARE mapping can often be implemented with simple logic such as aIn earlier work, the DNN models were designed to maximize shift. Alternatively, a more complex mapping function can beaccuracy without much consideration of the implementation used where the quantization levels are determined or learnedcomplexity. However, this can lead to designs that are chal- from the data (Fig. 37(c)), e.g., using k-means clustering; forlenging to implement and deploy. To address this, recent this approach, the mapping is usually implemented with a lookwork has shown that DNN models and hardware can be co- up table.designed to jointly maximize accuracy and throughput, while Finally, the quantization can be fixed (i.e., the same methodminimizing energy and cost, which increases the likelihood of of quantization is used for all data types and layers, filters, andadoption. In this section, we will highlight various efforts that channels in the network); or it can be variable (i.e., differenthave been made towards the co-design of DNN models and methods of quantization can be used for weights and activations,hardware. Note that unlike Section V, the techniques discussed and different layers, filters, and channels in the network).in this section can affect the accuracy; thus, the goal is to Reduced precision research initially focused on reducingnot only substantially reduce energy consumption and increase the precision of the weights rather than the activations, sincethroughput, but also to minimize any degradation in accuracy. weights directly increase the storage capacity requirement,The co-design approaches can be loosely grouped into the while the impact of activations on storage capacity depends onfollowing categories: the network architecture and dataflow. However, more recent + Reduce precision of operations and operands.This in- works have also started to look at the impact of quantizationcludes going from floating point to fixed point, reducing on activations. Most reduced precision research also focusesthe bitwidth, non-linear quantization and weight sharing. on reducing the precision for inference rather than training + Reduce number of operations and model size. This (with some exceptions [88,119,120]) due to the sensitivity ofincludes techniques such as compression, pruning and the gradients to quantization.compact network architectures. The key techniques used in recent work to reduce precision + are summarized in Table III; both linear and non-linear + A. 
Reduce Precision quantization applied to weights and activations are explored. + Quantization involves mapping data to a smaller set of The impact on accuracy is reported relative to a baseline + quantization levels. The ultimate goal is to minimize the error precision of 32-bit floating point, which is the default precision + between the reconstructed data from the quantization levels and used on platforms such as GPUs and CPUs. + the original data. The number of quantization levels reflects the 1) Linear quantization:The first step of reducing precision + precisionand ultimately the number of bits required to represent is usually to convert values and operations from floating point + the data (usuallylog 2 of the number of levels); thus,reduced to fixed point. A 32-bit floating point number, as shown in + precisionrefers to reducing the number of levels, and thus Fig. 38(a), is represented by <>, wheres + product; that output would need to be accumulated with <> + bit precision, where M is determined based on the largest filter (b) 8-bit dynamic fixed point examples + size <> (<> from Fig. 9(b)), which is in the range of 0 to 16 bits for the popular DNNs described in SectionIII-B. + + Fig. 38. Various methods of number representations. 1 + + After accumulation, the precision of the final output activation + is typically reduced to N-bits [88,121], as shown in Fig. 39.is the sign bit, e is the + 8-bit exponent, andmis the 23-bit The reduced output precision does not have a significant impact + mantisa, and covers the range of <>. + on accuracy if the distribution of the weights and activationsAn N-bit fixed point number is + represented by <> are centered near zero such that the accumulation would not + 2f , wheresis the sign bit,mis the (N-1)-bit mantissa, and move only in one direction; + this is particularly true when batchfdetermines the location of the decimal point and acts as a normalization is used. + scale factor. For instance, for an 8-bit integer, whenf= 0, + The reduced precision is not only explored in research,the dynamic range is -128 to 127, + whereas whenf= 10, the but has been used in recent commercial platforms for DNN + dynamic range is -0.125 to 0.124023438.Dynamicfixed point processing. For instance, Google’s + Tensor Processing Unitrepresentation allowsfto vary based on the desired dynamic (TPU) + which was announced in May 2016, was designed forrange as shown in Fig. 38(b). + This is useful for DNNs, since 8-bit integer arithmetic [123]. Similarly, Nvidia’s PASCAL + the dynamic range of the weights and activations can be quite GPU, which was announced in + April 2016, also has 8-bitdifferent. In addition, the dynamic range can also vary across + \integer instructions for deep learning inference [124]. In generallayers and layer types + (e.g., convolutional vs. fully connected). purpose platforms such as CPUs and GPUs, the main benefit + Using dynamic fixed point, the bitwidth can be reduced to 8 of using 8-bit computation is an increase + in throughput, asbits for the weights and 10 bits for the activations without any four 8-bit + operations rather than one 32-bit operation can befine-tuning of the weights [121]; with fine-tuning, + both weights performed for a given clock cycle.and activations can reach 8-bits [122]. 
Using 8-bit fixed point has the following impact on energy and area [79]:

* An 8-bit fixed point add consumes 3.3× less energy (3.8× less area) than a 32-bit fixed point add, and 30× less energy (116× less area) than a 32-bit floating point add. The energy and area of a fixed-point add scale approximately linearly with the number of bits.
* An 8-bit fixed point multiply consumes 15.5× less energy (12.4× less area) than a 32-bit fixed point multiply, and 18.5× less energy (27.5× less area) than a 32-bit floating point multiply. The energy and area of a fixed-point multiply scale approximately quadratically with the number of bits.

Reducing the precision also reduces the energy and area cost for storage, which is important since memory access and data movement dominate energy consumption as described earlier. The energy and area of the memory scale approximately linearly with the number of bits. It should be noted, however, that changing from floating point to fixed point, without reducing bit-width, does not reduce the energy or area cost of the memory.

For completeness, it should be noted that the precision of the internal values of a fixed-point multiply and accumulate (MAC) operation is typically higher than that of the weights and activations. To guarantee no precision loss, weights and input activations with N-bit fixed-point precision would require an N-bit × N-bit multiplication, which generates a 2N-bit output product.

While general purpose platforms usually support 8-bit, 16-bit and/or 32-bit operations, it has been shown that the minimum bit precision for DNNs can actually vary in a more fine-grained manner. For instance, the weight and activation precision can vary between 4 and 9 bits for AlexNet across different layers without significant impact on accuracy (i.e., a change of less than 1%) [125, 126]. This fine-grained variation can be exploited for increased throughput or reduced energy consumption with specialized hardware. For instance, if bit-serial processing is used, where the number of clock cycles to complete an operation is proportional to the bitwidth, adapting to fine-grain variations in bit precision can result in a 2.24× speed up versus 16-bits [125]. Alternatively, a multiplier can be designed such that its critical path reduces based on the bit precision as fewer adders are needed to resolve the product; this can be combined with voltage scaling for a 2.56× energy savings versus 16-bits [126]. While these bit scaling results are reported relative to 16-bit, it would be interesting to see their impact relative to the maximum precision required across layers (i.e., 9-bits for [125, 126]).

The precision can be reduced even more aggressively to a single bit; this area of research is often referred to as binary nets. BinaryConnect (BC) [127] introduced the concept of binary weights (i.e., -1 and 1), where using a binary weight reduced the multiplication in the MAC to addition and subtraction only. This was later extended in Binarized Neural Networks (BNN) [128] that uses binary weights and activations, which

<
> + + Fig. 40. Weight sharing hardware. + + w, where w is the average of the absolute values of the + weights in the filter) 9 , keeping the first and last layers at 32-bit + floating point precision, and performing normalization before VGG-16 [117]. Furthermore, when weights are quantized to + convolution to reduce the dynamic range of the activations. powers of two, the multiplication can be replaced with a bit- + With these changes, BWN reduced the accuracy loss to 0.8%, shift [122,135]. 10 Incremental Network Quantization (INQ) + while XNOR-Nets reduced the loss to 11%. The loss of XNOR- can be used to further reduce the loss in accuracy by dividing + Net can be further reduced by increasing the precision of the the large and small weights into different groups, and then + activations to be slightly larger than one bit. For instance, iteratively quantizing and re-training the weights [136]. + Quantized Neural Networks (QNN) [119], DoReFa-Net [120], Weight Sharingforces several weights to share a single value. + and HWGQ-Net [130] allow the activations to have 2-bits, This reduces the number of unique weights in a filter or a + while the weights remain at 1-bit; in HWGQ-Net, this reduces layer. One example is to group the weights by using a hashing + the accuracy loss to 5.2%. function and use one value for each group [137]. Alternatively, + All the previously described binary nets limit the weights the weights can be grouped by the k-means algorithm [118]. + to two values (-wandw); however, there may be benefits Both the shared weights and the indexes indicating which + for allowing weights to be zero (i.e., -w, 0,w). Although weight to use at each position of the filter are stored. This + this requires an additional bit per weight compared to binary leads to a two step process to fetch the weight: (1) read the + weights, the sparsity of the weights can be exploited to reduce weight index; (2) using the weight index, read the shared + computation and storage cost, which can potentially cancel weights. This approach can reduce the cost of reading and + out the cost of the additional bit. This is explored in Ternary storing the weights if the weight index (log 2 of the number of + Weight Nets (TWN) [131] and then extended in Trained Ternary unique weights) is less than the bitwidth of the weight itself. + Quantization (TTQ) where a different scale is trained for each For instance, in Deep Compression [118], the number of + weight (i.e., -w unique weights per layer is reduced to 256 for convolutional 1 , 0,w2 ) for an accuracy loss of 0.6% [132], + assuming 32-bit floating point for the activations. layers and 16 for fully-connected layers in AlexNet, requiring + Hardware implementations for binary/ternary nets have 8-bit and 4-bit weight indexes, respectively. Assuming there + been explored in recent publications. YodaNN [133] uses areUunique weights and the size of the filters in the layer + binary weights, while BRein [134] uses binary weights and is <> from Fig. 9(b), there will be energy savings + activations. Binary weights are also used in the compute if reading from a CRSM <> U-bit memory plus aU16- + in SRAM work [102] described in Section VI. Finally, the bit memory (as shown in Fig. 40) cost less than reading + nominally spike-inspired TrueNorth chip can implement a from a CRSM 16-bit memory. 
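The following sketch (our own illustration, using a minimal k-means implemented with NumPy and arbitrary sizes) shows the two-step fetch that weight sharing implies: a small codebook of U shared values plus a log2(U)-bit index per weight position, as in the hardware of Fig. 40.

import numpy as np

def kmeans_1d(values, k, iters=20, seed=0):
    # Minimal Lloyd's algorithm for clustering weights into k shared values.
    rng = np.random.default_rng(seed)
    centers = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(idx == c):
                centers[c] = values[idx == c].mean()
    return centers, idx

weights = np.random.default_rng(1).normal(0, 0.05, size=512).astype(np.float32)

U = 16                                     # number of unique shared weights
codebook, indexes = kmeans_1d(weights, U)  # indexes need only log2(U) = 4 bits

# Two-step fetch at each filter position: read the 4-bit index, then the shared weight.
reconstructed = codebook[indexes]
print("storage: %d x 4-bit indexes + %d x 16-bit codebook entries" % (len(indexes), U))
print("mean abs error:", float(np.abs(weights - reconstructed).mean()))

Reading a weight then costs one small index access plus a codebook access instead of a full-width weight access, which is the saving quantified in the memory-cost comparison above.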
Note that unlike the previous + reduced precision neural network with binary activations and quantization methods, the weight sharing approach does not + ternary weights using TrueNorth’s quantized weight table [9]. reduce the precision of the MAC computation itself and only + These works tend not to support state-of-the-art DNN models reduces the weight storage requirement. + (with the exception of YodaNN). + 2) Non-linear quantization:The previous works described B. Reduce Number of Operations and Model Size + involve linear quantization where the levels are uniformly In addition to reducing the size of each operation or operandspaced out. It has been shown that the distributions of the (weight/activation), there is also a significant amount of researchweights and activations are not uniform [118,135], and thus on methods to reduce the number of operations and modela non-linear quantization can potentially improve accuracy. size. These techniques can be loosely classified as exploitingSpecifically, there have been two popular approaches taken activation statistics, network pruning, network architecturein recent works: (1) log domain quantization; (2) learned design and knowledge distillation.quantization or weight sharing. 1) Exploiting Activation Statistics: As discussed in Sec-Log domain quantizationIf the quantization levels are tionIII-A1, ReLU is a popular form of non-linearity used inassigned based on a logarithmic distribution as shown in DNNs that sets all negative values to zero as shown in Fig. 41(a). Fig 37(b), the weights and activations are more equally As a result, the output activations of the feature maps after the distributed across the different levels and each level is used ReLU are sparse; for instance, the feature maps in AlexNetmore efficiently resulting in less quantization error. For instance, have sparsity between 19% to 63% as shown in Fig. 41(b).using 4 bits in linear quantization results in a 27.8% loss in This sparsity gives ReLU an implementation advantage overaccuracy versus a 5% loss for log base-2 quantization for other non-linearities such as sigmoid, etc. + + 9 This can also be thought of as a form of weights sharing, where only two 10 Note however that multiplications do not account for a significant portion + weights are used per filter. of the total energy. + + <> + + TABLE III + METHODS TO REDUCE NUMERICAL PRECISION FOR ALEX NET . ACCURACY MEASURED FOR TOP-5 ERROR ON IMAGE NET . + + + + a cost of reduced accuracy. + 2) Network Pruning:To make network training easier, the + networks are usually over-parameterized. Therefore, a large + amount of the weights in a network are redundant and can + be removed (i.e., set to zero). This process is called network + pruning. Aggressive network pruning often requires some fine- + tuning of the weights to maintain the original accuracy. This + was first proposed in 1989 through a technique called Optimal + Brain Damage [140]. The idea was to compute the impact of + each weight on the training loss (discussed in SectionII-C), + referred to as the weight saliency. The low-saliency weights (Normalized) + were removed and the remaining weights were fine-tuned; this + process was repeated until the desired weight reduction and + accuracy were reached. + In 2015, a similar idea was applied to modern DNNs in [141]. + <> Rather than using the saliency as a metric, which is too difficult + to compute for the large-scaled DNNs, the pruning was simply + Fig. 41. Sparsity in activations due to ReLU. 
based on the magnitude of the weights. Small weights were pruned and the model was fine-tuned to restore the accuracy. Without fine-tuning the weights, about 50% of the weights could be pruned. With fine-tuning, over 80% of the weights were pruned. Overall this approach can reduce the number of weights in AlexNet by 9× and the number of MACs by 3×. Most of the weight reduction comes from the fully-connected layers (9.9× for fully-connected layers versus 2.7× for convolutional layers).

The sparsity can be exploited for energy and area savings using compression, particularly for off-chip DRAM access, which is expensive. For instance, a simple run length coding that involves signaling non-zero values of 16-bits and then runs of zeros up to 31 can reduce the external memory bandwidth of the activations by 2.1× and the overall external bandwidth (including weights) by 1.5× [61]. 11 In addition to compression, the hardware can also be modified such that it skips reading the weights and performing the MAC for zero-valued activations to reduce energy cost by 45% [94]. Rather than just gating the read and MAC computation, the hardware could also skip the cycle to increase the throughput by 1.37× [138]. The activations can be made to be even more sparse by pruning the low-valued activations. For instance, if all activations with small values are pruned, this can be translated into an additional 11% speed up [138] or 2× power reduction [139] with little impact on accuracy. Aggressively pruning more activations can provide additional throughput improvement at a cost of reduced accuracy.

11. This simple run length compression is within 5-10% of the theoretical entropy limit.

However, the number of weights alone is not a good metric for energy. For instance, in AlexNet, the number of weights in the fully-connected layers is much larger than in the convolutional layers; however, the energy of the convolutional layers is much higher than that of the fully-connected layers, as shown in Fig. 35 [80]. Rather than using the number of weights and MAC operations as proxies for energy, the pruning of the weights can be directly driven by energy itself [142]. An energy evaluation method can be used to estimate the DNN energy that accounts for the data movement from different levels of the memory hierarchy, the number of MACs, and the data sparsity, as shown in Fig. 42; this energy estimation tool is available at [143]. The resulting energy values for popular DNN models are shown in Fig. 43(a). Energy-aware pruning

<
>

Fig. 42. Energy estimation methodology from [142], which estimates the energy based on data movement from different levels of the memory hierarchy, the number of MACs, and the data sparsity.

<
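The estimation methodology of [142] is not reproduced here, but the following toy model (our own, with made-up per-access energy ratios loosely inspired by the memory hierarchy of Fig. 22) illustrates the general form such an estimate takes: a weighted sum of accesses at each memory level plus the MAC count, with sparsity reducing the effective counts.

# Toy energy model: illustrative relative costs only (normalized to one MAC);
# the actual numbers and counting method in [142] differ.
COST = {"MAC": 1.0, "RF": 1.0, "buffer": 6.0, "DRAM": 200.0}

def estimate_energy(macs, accesses, weight_sparsity=0.0, act_sparsity=0.0):
    # accesses: dict mapping memory level -> number of accesses.
    # Sparsity is modeled (crudely) as skipping the corresponding fraction
    # of MACs and data accesses.
    scale = (1.0 - weight_sparsity) * (1.0 - act_sparsity)
    energy = COST["MAC"] * macs * scale
    for level, count in accesses.items():
        energy += COST[level] * count * scale
    return energy

dense  = estimate_energy(724e6, {"RF": 2000e6, "buffer": 300e6, "DRAM": 61e6})
pruned = estimate_energy(724e6, {"RF": 2000e6, "buffer": 300e6, "DRAM": 61e6},
                         weight_sparsity=0.6)
print("relative energy after pruning: %.2f" % (pruned / dense))

A model of this form is what allows pruning decisions to be driven by estimated energy rather than by weight count alone.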
> + + Fig. 43. Energy values estimated with methodology in [142]. a time [144]. The CSC format will provide an overall lower + memory bandwidth than CSR if the output is smaller than the + input, or in the case of DNN, if the number of filters isnot + can then be used to prune weights based on energy to reduce significantly larger than the number of weights in the filter + the overall energy across all layers by 3.7% for AlexNet, which (<> from Fig. 9(b)). Since this is often true, CSC can + is 1.74more efficient than magnitude-based approaches [141] be an effective format for sparse DNN processing. + as shown in Fig. 43(b). As mentioned previously, it is well Custom hardware has been explored to efficiently supportknown that AlexNet is over-parameterized. The energy-aware pruned DNN models. Many works aim to perform the process-pruning can also be applied to GoogleNet, which is already a ing without decompressing the weights or activations. EIE [145]small DNN model, for a 1.6energy reduction. performs the sparse matrix-vector multiplication specifically for + Recent works have examine how to efficiently support the fully connected layers. It stores the weights in a CSC format + processing of sparse weights in hardware. One area of interest along with the start location of each column, which needs to be + is how to best store the sparse weights after pruning. Similar to stored since the compressed weights have variable length. When + compressing the sparse activations discussed in SectionVII-B1, the input is not zero, the compressed weight column is read and + the sparse weights can be compressed to reduce memory access the output is updated. To handle the sparsity, additional logic + bandwidth by 20 to 30% [118]. is used to keep track of the location of the output that should + When DNN processing is performed as a matrix-vector be updated. SCNN [146] supports processing of convolutional 25 + + + layers in a compressed format. It uses an input stationary weights [154]. It proposes afiremodule that first ‘squeezes’ + dataflow to deliver the compressed weights and activations to the network with 1x1 convolution filters and then expands + a multiplier array followed by a scatter network to add the it with multiple 1x1 and 3x3 convolution filters. It achieves + scattered partial sums. an overall 50% reduction in number of weights compared to + Recent works have also explored the use of structured AlexNet, while maintaining the same accuracy. It should be + pruning to avoid the need for custom hardware [147,148]. noted, however, that reducing the number of weights does not + Rather than pruning individual weights (also referred to as fine- necessarily reduce energy; for instance, SqueezeNet consumes + grained pruning), structured pruning involves pruning groups more energy than AlexNet, as shown in Fig. 43(a). + of weights (also referred to as coarse-grained pruning). The b) After Training:Tensor decomposition can be used to + benefits of structured pruning are (1) the resulting weights can decompose filters in a trained network without impacting the + better align with the data-parallel architecture (e.g., SIMD) accuracy. It treats weights in a layer as a 4-D tensor and breaks + found in existing general purpose hardware, which results in it into a combination of smaller tensors (i.e., several layers). 
+ more efficient processing [149]; (2) it amortizes the overhead Low-rank approximation can then be applied to further increase + cost required to signal the location of the non-zero weights the compression rate at the cost of accuracy degradation, which + across a group of weights, which improves compression and can be restored by fine-tuning the weights. + thus reduces storage cost. These groups of weights can include This approach is demonstrated using Canonical Polyadic (CP) + a pair of neighboring weights, an entire row or column of a decomposition, a high-order extension of singular value decom- + filter, an entire channel of a filter or the entire filter itself; using position that can be solved by various methods, such as a greedy + larger groups tends to result in higher loss in accuracy [150]. algorithm [155] or a non-linear least-square method [156]. + 3) Compact Network Architectures:The number of weights Combining CP-decomposition with low-rank approximation + and operations can also be reduced by improving the network achieves a 4.5% speed-up on CPUs [156]. However, CP- + architecture itself. The trend is to replace a large filter with a decomposition cannot be computed in a numerically stable + series of smaller filters, which have fewer weights in total; when way when the dimension of the tensor, which represents the + the filters are applied sequentially, they achieve the same overall weights, is larger than two [156]. To alleviate this problem, + effective receptive field (i.e., the region the filter uses from input Tucker decomposition is adopted instead in [157]. + image to compute an output). This approach can be applied 4) Knowledge Distillation:Using a deep network or av- + during the network architecture design (before training) or by eraging the predictions of different models (i.e., ensemble) + decomposing the filters of a trained network (after training). gives a better accuracy than using a single shallower network. + The latter one avoids the hassle of training networks from However, the computational complexity is also higher. To get + scratch. However, it is less flexible than the former one. For the best of both worlds, knowledge distillation transfers the + example, existing methods can only decompose a filter in a knowledge learned by the complex model (teacher) to the + trained network into a series of filters without non-linearity simpler model (student). The student network can therefore + between them. achieve an accuracy that would be unachievable if it was + a) Before Training:In recent DNN models, filters with directly trained with the same dataset [158,159]. For example, + a smaller width and height are used more frequently because [160] shows how using knowledge distillation can improve the + concatenating several of them can emulate a larger filter as speech recognition accuracy of a student net by 2%, which is + shown in Fig. 13. For example, one 5x5 convolution can be similar to the accuracy of a teacher net that is composed of + replaced with two 3x3 convolutions. Alternatively, one NxN an ensemble of 10 networks. + convolution can be decomposed into two 1-D convolutions, one Fig. 45 shows the simplest knowledge distillation + 1xN and one Nx1 convolution [53]; this basically imposes method [158]. The softmax layer is commonly used as the + a restriction that the 2-D filter must be separable, which is output layer in the image classification networks to generate + a common constraint in image processing [151]. 
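As a quick sanity check on the filter-decomposition arguments above, the short script below (our own arithmetic, for a hypothetical layer shape) counts weights for a 5×5 filter versus two stacked 3×3 filters with the same effective receptive field, and for a standard 3×3 convolution versus its depthwise-separable counterpart.

# Hypothetical layer: C input channels, M output channels; weight counts only.
C, M = 64, 64

five_by_five    = C * 5 * 5 * M                  # one 5x5 CONV layer
two_three_by_3  = C * 3 * 3 * M + M * 3 * 3 * M  # two stacked 3x3 CONV layers
print("5x5 vs two 3x3 weights:", five_by_five, two_three_by_3)

standard_3x3    = C * 3 * 3 * M                  # standard convolution
depthwise_sep   = C * 3 * 3 + C * 1 * 1 * M      # depthwise 3x3 + pointwise 1x1
print("standard vs depthwise-separable weights:", standard_3x3, depthwise_sep)

The receptive-field equivalence holds only when the two 3×3 filters are applied sequentially, as noted in the text; whether the reduction also saves energy depends on data movement, which is why SqueezeNet can consume more energy than AlexNet despite having fewer weights.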
b) After Training: Tensor decomposition can be used to decompose filters in a trained network without impacting the accuracy. It treats the weights in a layer as a 4-D tensor and breaks it into a combination of smaller tensors (i.e., several layers). Low-rank approximation can then be applied to further increase the compression rate at the cost of accuracy degradation, which can be restored by fine-tuning the weights.

This approach is demonstrated using Canonical Polyadic (CP) decomposition, a high-order extension of singular value decomposition that can be solved by various methods, such as a greedy algorithm [155] or a non-linear least-square method [156]. Combining CP-decomposition with low-rank approximation achieves a 4.5x speed-up on CPUs [156]. However, CP-decomposition cannot be computed in a numerically stable way when the dimension of the tensor, which represents the weights, is larger than two [156]. To alleviate this problem, Tucker decomposition is adopted instead in [157].

4) Knowledge Distillation: Using a deep network or averaging the predictions of different models (i.e., an ensemble) gives a better accuracy than using a single shallower network. However, the computational complexity is also higher. To get the best of both worlds, knowledge distillation transfers the knowledge learned by the complex model (teacher) to the simpler model (student). The student network can therefore achieve an accuracy that would be unachievable if it was directly trained with the same dataset [158, 159]. For example, [160] shows how using knowledge distillation can improve the speech recognition accuracy of a student net by 2%, which is similar to the accuracy of a teacher net that is composed of an ensemble of 10 networks.

Fig. 45 shows the simplest knowledge distillation method [158]. The softmax layer is commonly used as the output layer in image classification networks to generate the class probabilities from the class scores (also commonly referred to as logits); it squashes the class scores into values between 0 and 1 that sum up to 1. For this knowledge distillation method, soft targets (values between 0 and 1) such as the class scores of the teacher DNN (or an ensemble of teacher DNNs) are used instead of the hard targets (values of either 0 or 1) such as the labels in the dataset. The objective is to minimize the squared difference between the soft targets and the class scores of the student DNN. Class scores are used as the soft targets instead of the class probabilities because small values in the class scores contain important information that may be eliminated by the softmax. Alternatively, class probabilities after the softmax layer can be used as soft targets if the softmax is configured to generate softer class probabilities where the smaller values retain more information [160]. Finally, the intermediate representations of the teacher DNN can also be incorporated as extra hints to train the student DNN [161].

Fig. 45. Knowledge distillation matches the class scores of a small DNN to an ensemble of large DNNs.
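The two soft-target variants described above can be written down compactly. The sketch below implements the squared-difference objective on class scores and, as the alternative, a cross-entropy against temperature-softened teacher probabilities; the temperature value and function names are illustrative assumptions, not settings taken from [158] or [160].

import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; larger T keeps more of the small values."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_loss_logits(student_scores, teacher_scores):
    """Variant described above: squared difference between the class scores
    (logits) of the student and the soft targets from the teacher."""
    s = np.asarray(student_scores, dtype=float)
    t = np.asarray(teacher_scores, dtype=float)
    return np.mean((s - t) ** 2)

def distill_loss_soft_probs(student_scores, teacher_scores, T=4.0):
    """Alternative: cross-entropy against softened teacher probabilities
    (the temperature T=4.0 is an illustrative choice)."""
    p_teacher = softmax(teacher_scores, T)
    log_p_student = np.log(softmax(student_scores, T) + 1e-12)
    return -np.sum(p_teacher * log_p_student)

# Teacher soft targets could also come from averaging an ensemble of teachers
teacher = np.mean([[4.0, 1.0, -2.0], [3.5, 1.5, -1.0]], axis=0)
student = [2.0, 0.5, -1.0]
print(distill_loss_logits(student, teacher))
print(distill_loss_soft_probs(student, teacher))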
VIII. BENCHMARKING METRICS FOR DNN EVALUATION AND COMPARISON

As we have seen in this article, there has been a significant amount of research on efficient processing of DNNs. We should consider several key metrics to compare the various strengths and weaknesses of different designs and proposed techniques. These metrics should cover important attributes such as accuracy/robustness, power/energy consumption, throughput/latency and cost. Reporting all these metrics is important in order to provide a complete picture of the trade-offs made by a proposed design or technique. We have prepared a website to collect these metrics from various publications [162].

In terms of accuracy and robustness, it is important that the accuracy be reported on widely-accepted datasets as discussed in Section IV. The difficulty of the dataset and/or task should be considered when measuring the accuracy. For instance, the MNIST dataset for digit recognition is significantly easier than the ImageNet dataset. As a result, a DNN that performs well on MNIST may not necessarily perform well on ImageNet. Thus it is important that the same dataset and task is used when comparing the accuracy of different DNN models; currently ImageNet is preferred since it presents a challenge for DNNs, as opposed to MNIST, which can also be addressed with simple non-DNN techniques. To demonstrate primarily hardware innovations, it would be desirable to report results for widely-used DNN models (e.g., AlexNet, GoogLeNet) whose accuracy and robustness have been well studied and tested.

Energy and power are important when processing DNNs at the edge in embedded devices with limited battery capacity (e.g., smart phones, smart sensors, UAVs, and wearables), or in the cloud in data centers with stringent power ceilings due to cooling costs, respectively. Edge processing is preferred over the cloud for certain applications due to latency, privacy or communication bandwidth limitations. When evaluating the power and energy consumption, it is important to account for all aspects of the system, including the chip and external memory accesses.

High throughput is necessary to deliver real-time performance for interactive applications such as navigation and robotics. For data analytics, high throughput means that more data can be analyzed in a given amount of time. As the amount of visual data is growing exponentially, high-throughput big data analytics becomes important, particularly if an action needs to be taken based on the analysis (e.g., security or terrorist prevention; medical diagnosis).

Low latency is necessary for real-time interactive applications. Latency measures the time between when the pixel arrives at a system and when the result is generated. Latency is measured in terms of seconds, while throughput is measured in operations/second. Often high throughput is obtained by batching multiple images/frames together for processing; this results in multiple-frame latency (e.g., at 30 frames per second, a batch of 100 frames results in a 3 second delay). This delay is not acceptable for real-time applications, such as high-speed navigation where it would reduce the time available for course correction. Thus achieving low latency and high throughput simultaneously can be a challenge.

Hardware cost is in large part dictated by the amount of on-chip storage and the number of cores. Typical embedded processors have limited on-chip storage on the order of a few hundred kilobytes. Since there is a trade-off between the amount of on-chip memory and the external memory bandwidth, both metrics should be reported. Similarly, there is a correlation between the number of cores and the throughput. In addition, while many cores can be built on a chip, the number of cores that can actually be used at a given time should be reported. It is often unrealistic to assume peak utilization and performance due to limitations of mapping and memory bandwidth. Accordingly, the power and throughput should be reported for running actual DNNs as opposed to only reporting theoretical limits.

A. Metrics for DNN Models

To evaluate the properties of a given DNN model, we should consider the following metrics:

- The accuracy of the model in terms of the top-5 error on datasets such as ImageNet. Also, the type of data augmentation used (e.g., multiple crops, ensemble models) should be reported.
- The network architecture of the model should be reported, including number of layers, filter sizes, number of filters and number of channels.
- The number of weights impacts the storage requirement of the model and should be reported. If possible, the number of non-zero weights should be reported since this reflects the theoretical minimum storage requirements.
- The number of MACs that needs to be performed should be reported as it is somewhat indicative of the number of operations and potential throughput of the given DNN. If possible, the number of non-zero MACs should also be reported since this reflects the theoretical minimum compute requirements (a simple counting sketch follows below).
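As a rough illustration of how the weight and MAC counts above can be derived for CONV layers, the following sketch multiplies out filter and output-map dimensions. The layer shapes listed are illustrative placeholders rather than values from Table IV, and biases and fully-connected layers are ignored.

def conv_layer_metrics(in_ch, out_ch, k, out_h, out_w):
    """Weights and MACs for one CONV layer (square k x k filters, no bias)."""
    weights = in_ch * out_ch * k * k
    macs = weights * out_h * out_w      # each filter is applied at every output position
    return weights, macs

# Illustrative layer shapes only (not taken from Table IV)
layers = [
    dict(in_ch=3,   out_ch=96,  k=11, out_h=55, out_w=55),
    dict(in_ch=96,  out_ch=256, k=5,  out_h=27, out_w=27),
]
total_w = total_m = 0
for cfg in layers:
    w, m = conv_layer_metrics(**cfg)
    total_w += w
    total_m += m
    print(cfg, "weights:", w, "MACs:", m)
print("total weights:", total_w, "total MACs:", total_m)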
Table IV shows how these metrics are reported for various well known DNNs. The accuracy is reported for the case where only a single crop for a single model is used for classification,13 such that the number of weights and MACs in the table are consistent. Note that accounting for the number of non-zero (NZ) operations significantly reduces the number of MACs and weights. Since the number of NZ MACs depends on the input data, we propose using the publicly available 50,000 validation images from ImageNet for the computation. Finally, there are various methods to reduce the weights in a DNN (e.g., network pruning in Section VII-B2). Table IV shows another example of these DNN model metrics, by comparing sparse DNNs pruned using [142] to dense DNNs.

13 Data augmentation is often used to increase accuracy. This includes using multiple crops of an image to account for misalignment; in addition, an ensemble of multiple models can be used where each model has different weights due to different training settings, such as using different initializations or datasets, or even different network architectures. If multiple crops and models are used, then the number of MACs and weights required would increase.

<
>
TABLE IV
METRICS FOR POPULAR DNN MODELS. SPARSITY IS ACCOUNTED FOR BY REPORTING NON-ZERO (NZ) WEIGHTS AND MACS.
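Because the non-zero MAC count depends on the input, it has to be measured with actual data rather than read off the network architecture. The sketch below counts NZ weights and, for a fully-connected layer, the MACs that remain effectual when both the weight and the input activation are non-zero; the pruning threshold and layer sizes are arbitrary choices for illustration.

import numpy as np

def nonzero_weight_count(W):
    """Non-zero weights reflect the minimum storage after pruning."""
    return int(np.count_nonzero(W))

def nonzero_mac_count(W, x):
    """A MAC is only effectual when both the weight and the input
    activation are non-zero, so NZ MACs depend on the input data."""
    nz_w = (W != 0).astype(int)          # n_out x n_in
    nz_x = (x != 0).astype(int)          # n_in
    return int(nz_w.dot(nz_x).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256))
W[np.abs(W) < 1.0] = 0.0                 # crude magnitude pruning for illustration
x = rng.normal(size=256)
x[x < 0] = 0.0                           # ReLU-style activation sparsity

print("dense MACs:", W.size)
print("NZ weights:", nonzero_weight_count(W))
print("NZ MACs:", nonzero_mac_count(W, x))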
B. Metrics for DNN Hardware

To measure the efficiency of the DNN hardware, we should consider the following additional metrics:

- The power and energy consumption of the design should be reported for various DNN models; the DNN model specifications should be provided, including which layers and bit precision are supported by the hardware during measurement. In addition, the amount of off-chip accesses (e.g., DRAM accesses) should be included since it accounts for a significant portion of the system power; it can be reported in terms of the total amount of data that is read and written off-chip per inference (a first-order example of such an estimate is sketched after Table V below).
- The latency and throughput should be reported in terms of the batch size and the actual run time for various DNN models, which accounts for mapping and memory bandwidth effects. This provides a more useful and informative metric than peak throughput.
- The cost of the chip depends on the area efficiency, which accounts for the size and type of memory (e.g., registers or SRAM) and the amount of control logic. It should be reported in terms of the core area in squared millimeters per multiplier along with process technology.

In terms of cost, different platforms will have different implementation-specific metrics. For instance, for an FPGA, the specific device should be reported, along with the utilization of resources such as DSP, BRAM, LUT and FF; performance density such as GOPs/slice can also be reported.

Each processor should report various specifications for each metric as shown in Table V, using the Eyeriss chip as an example. It is important that all metrics and specifications are accounted for in order to fairly evaluate all the design trade-offs. For instance, without the accuracy given for a specific dataset and task, one could run a simple DNN and easily claim low power, high throughput, and low cost; however, the processor might not be usable for a meaningful task. Alternatively, without reporting the off-chip bandwidth, one could build a processor with only multipliers and easily claim low cost, high throughput, high accuracy, and low chip power; however, when evaluating system power, the off-chip memory access would be substantial. Finally, the test setup should also be reported, including whether the results are measured or obtained from simulation and how many images were tested.

In summary, the evaluation process for whether a DNN system is a viable solution for a given application might go as follows: (1) the accuracy determines if it can perform the given task; (2) the latency and throughput determine if it can run fast enough and in real-time; (3) the energy and power consumption will primarily dictate the form factor of the device where the processing can operate; (4) the cost, which is primarily dictated by the chip area, determines how much one would pay for this solution.

<
>
TABLE V
EXAMPLE BENCHMARK METRICS FOR EYERISS [94].
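To illustrate why the off-chip traffic has to be included when reporting energy, the sketch below forms a first-order system-energy estimate from a MAC count and the DRAM bytes moved per inference. The per-MAC and per-byte energy constants and the workload sizes are placeholder assumptions for illustration only, not measured values for any particular chip or process technology.

def system_energy_per_inference(num_macs, offchip_bytes,
                                e_mac_pj=2.0, e_dram_pj_per_byte=160.0):
    """First-order energy estimate in microjoules. The per-MAC and per-byte
    DRAM energies here are placeholder constants for illustration only;
    real numbers depend on the process technology and memory interface."""
    chip_pj = num_macs * e_mac_pj
    dram_pj = offchip_bytes * e_dram_pj_per_byte
    return (chip_pj + dram_pj) / 1e6, chip_pj / 1e6, dram_pj / 1e6

total, chip, dram = system_energy_per_inference(
    num_macs=724_000_000,        # illustrative workload size
    offchip_bytes=60_000_000)    # illustrative DRAM traffic per inference
print(f"total {total:.1f} uJ (chip {chip:.1f} uJ, DRAM {dram:.1f} uJ)")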
IX. SUMMARY

The use of deep neural networks (DNNs) has seen explosive growth in the past few years. They are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition and robotics, and are often delivering better than human accuracy. However, while DNNs can deliver this outstanding accuracy, it comes at the cost of high computational complexity. Consequently, techniques that enable efficient processing of deep neural networks to improve energy efficiency and throughput without sacrificing accuracy with cost-effective hardware are critical to expanding the deployment of DNNs in both existing and new domains.

Creating a system for efficient DNN processing should begin with understanding the current and future applications and the specific computations required both now and the potential evolution of those computations. This article surveys a number of the current applications, focusing on computer vision applications, the associated algorithms, and the data being used to drive the algorithms. These applications, algorithms and input data are experiencing rapid change, so extrapolating these trends to determine the degree of flexibility desired to handle next generation computations becomes an important ingredient of any design project.

During the design-space exploration process, it is critical to understand and balance the important system metrics. For DNN computation these include the accuracy, energy, throughput and hardware cost. Evaluating these metrics is, of course, key, so this article surveys the important components of a DNN workload. Specifically, a DNN workload has two major components. First, the workload is the form of each DNN network, including the 'shape' of each layer and the interconnections between layers. These can vary both within and between applications. Second, the workload consists of the specific data input to the DNN. This data will vary with the input set used for training or the data input during operation for inference.

This article also surveys a number of avenues that prior work has taken to optimize DNN processing. Since data movement dominates energy consumption, a primary focus of some recent research has been to reduce data movement while maintaining accuracy, throughput and cost. This means selecting architectures with favorable memory hierarchies like a spatial array, and developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. We have included a taxonomy of dataflows and an analysis of their characteristics. Other work is presented that aims to save space and energy by changing the representation of data values in the DNN. Still other work saves energy and sometimes increases throughput by exploiting the sparsity of weights and/or activations.

The DNN domain also affords an excellent opportunity for joint hardware/software co-design. For example, various efforts have noted that efficiency can be improved by increasing sparsity (increasing the number of zero values) or optimizing the representation of data by reducing the precision of values or using more complex mappings of the stored value to the actual value used for computation. However, to avoid losing accuracy it is often useful to modify the network or fine-tune the network's weights to accommodate these changes. Thus, this article both reviews a variety of these techniques and discusses the frameworks that are available for describing, running and training networks.

Finally, DNNs afford the opportunity to use mixed-signal circuit design and advanced technologies to improve efficiency. These include using memristors for analog computation and 3-D stacked memory. Advanced technologies can also facilitate moving computation closer to the source by embedding computation near or within the sensor and the memories. Of course, all of these techniques should also be considered in combination, while being careful to understand their interactions and looking for opportunities for joint hardware/algorithm co-optimization.

In conclusion, although much work has been done, deep neural networks remain an important area of research with many promising applications and opportunities for innovation at various levels of hardware design.

ACKNOWLEDGMENTS

Funding provided by DARPA YFA, MIT CICS, and gifts from Nvidia and Intel. The authors thank the anonymous reviewers as well as James Noraky, Mehul Tikekar and Zhengdong Zhang for providing valuable feedback on this paper.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] L. Deng, J. Li, J.-T. Huang, K. Yao, D. Yu, F. Seide, M. Seltzer, G. Zweig, X. He, J. Williams et al., "Recent advances in deep learning for speech research at Microsoft," in ICASSP, 2013.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in NIPS, 2012.
[4] C. Chen, A. Seff, A. Kornhauser, and J. Xiao, "Deepdriving: Learning affordance for direct perception in autonomous driving," in ICCV, 2015.
[5] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, no. 7639, pp. 115–118, 2017.
[6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, Jan. 2016.
[7] F.-F. Li, A. Karpathy, and J. Johnson, "Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition," http://cs231n.stanford.edu/.
[8] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, 2014.
[9] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch et al., "Convolutional networks for fast, energy-efficient neuromorphic computing," Proceedings of the National Academy of Sciences, 2016.
[10] M. Mathieu, M. Henaff, and Y. LeCun, "Fast training of convolutional networks through FFTs," in ICLR, 2014.
[11] Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson, R. E. Howard, and W. Hubbard, "Handwritten digit recognition: applications of neural network chips and automatic learning," IEEE Commun. Mag., vol. 27, no. 11, pp. 41–46, Nov 1989.
[12] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in 1960 IRE WESCON Convention Record, 1960.
[13] B. Widrow, "Thinking about thinking: the discovery of the LMS algorithm," IEEE Signal Process. Mag., 2005.
[14] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," in CVPR, 2016.
[16] "Complete Visual Networking Index (VNI) Forecast," Cisco, June 2016.
[17] J. Woodhouse, "Big, big, big data: higher and higher resolution video surveillance," technology.ihs.com, January 2016.
[18] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation," in CVPR, 2014.
[19] J. Long, E. Shelhamer, and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," in CVPR, 2015.
[20] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in NIPS, 2014.
[21] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, 2012.
[22] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
[23] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "Wavenet: A generative model for raw audio," CoRR abs/1609.03499, 2016.
[24] H. Y. Xiong, B. Alipanahi, L. J. Lee, H. Bretschneider, D. Merico, R. K. Yuen, Y. Hua, S. Gueroussov, H. S. Najafabadi, T. R. Hughes et al., "The human splicing code reveals new insights into the genetic determinants of disease," Science, vol. 347, no. 6218, p. 1254806, 2015.
[25] J. Zhou and O. G. Troyanskaya, "Predicting effects of noncoding variants with deep learning-based sequence model," Nature Methods, vol. 12, no. 10, pp. 931–934, 2015.
[26] B. Alipanahi, A. Delong, M. T. Weirauch, and B. J. Frey, "Predicting the sequence specificities of dna- and rna-binding proteins by deep learning," Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
[27] H. Zeng, M. D. Edwards, G. Liu, and D. K. Gifford, "Convolutional neural network architectures for predicting dna-protein binding," Bioinformatics, vol. 32, no. 12, pp. i121–i127, 2016.
[28] M. Jermyn, J. Desroches, J. Mercier, M.-A. Tremblay, K. St-Arnaud, M.-C. Guiot, K. Petrecca, and F. Leblond, "Neural networks improve brain cancer detection with raman spectroscopy in the presence of operating room light artifacts," Journal of Biomedical Optics, vol. 21, no. 9, pp. 094002–094002, 2016.
[29] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, "Deep learning for identifying metastatic breast cancer," arXiv preprint arXiv:1606.05718, 2016.
[30] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with Deep Reinforcement Learning," in NIPS Deep Learning Workshop, 2013.
[32] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," Journal of Machine Learning Research, vol. 17, no. 39, pp. 1–40, 2016.
[33] M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, "From Perception to Decision: A Data-driven Approach to End-to-end Motion Planning for Autonomous Ground Robots," in ICRA, 2017.
[34] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, "Cognitive mapping and planning for visual navigation," in CVPR, 2017.
[35] T. Zhang, G. Kahn, S. Levine, and P. Abbeel, "Learning deep control policies for autonomous aerial vehicles with mpc-guided policy search," in ICRA, 2016.
[36] S. Shalev-Shwartz, S. Shammah, and A. Shashua, "Safe, multi-agent, reinforcement learning for autonomous driving," in NIPS Workshop on Learning, Inference and Control of Multi-Agent Systems, 2016.
[37] N. Hemsoth, "The Next Wave of Deep Learning Applications," Next Platform, September 2016.
[38] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[39] T. N. Sainath, A.-r. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," in ICASSP, 2013.
[40] V. Nair and G. E. Hinton, "Rectified Linear Units Improve Restricted Boltzmann Machines," in ICML, 2010.
[41] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML, 2013.
[42] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in ICCV, 2015.
[43] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," ICLR, 2016.
[44] X. Zhang, J. Trmal, D. Povey, and S. Khudanpur, "Improving deep neural network acoustic models using generalized maxout networks," in ICASSP, 2014.
[45] Y. Zhang, M. Pezeshki, P. Brakel, S. Zhang, C. Laurent, Y. Bengio, and A. Courville, "Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks," in Interspeech, 2016.
[46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in ACM International Conference on Multimedia, 2014.
[47] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in ICML, 2015.
[48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[49] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks," in ICLR, 2014.
[50] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," in ICLR, 2015.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper With Convolutions," in CVPR, 2015.
[52] M. Lin, Q. Chen, and S. Yan, "Network in Network," in ICLR, 2014.
[53] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in CVPR, 2016.
[54] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," in AAAI, 2017.
[55] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, R. Caruana, A. Mohamed, M. Philipose, and M. Richardson, "Do Deep Convolutional Nets Really Need to be Deep and Convolutional?" ICLR, 2017.
[56] "Caffe LeNet MNIST," http://caffe.berkeleyvision.org/gathered/examples/mnist.html.
[57] "Caffe Model Zoo," http://caffe.berkeleyvision.org/model_zoo.html.
[58] "Matconvnet Pretrained Models," http://www.vlfeat.org/matconvnet/pretrained/.
[59] "TensorFlow-Slim image classification library," https://github.com/tensorflow/models/tree/master/slim.
[60] "Deep Learning Frameworks," https://developer.nvidia.com/deep-learning-frameworks.
[61] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE J. Solid-State Circuits, vol. 51, no. 1, 2017.
[62] C. J. B. Yann LeCun, Corinna Cortes, "THE MNIST DATABASE of handwritten digits," http://yann.lecun.com/exdb/mnist/.
[63] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, "Regularization of neural networks using dropconnect," in ICML, 2013.
[64] A. Krizhevsky, V. Nair, and G. Hinton, "The CIFAR-10 dataset," https://www.cs.toronto.edu/~kriz/cifar.html.
[65] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, 2008.
[66] A. Krizhevsky and G. Hinton, "Convolutional deep belief networks on cifar-10," Unpublished manuscript, vol. 40, 2010.
[67] B. Graham, "Fractional max-pooling," arXiv preprint arXiv:1412.6071, 2014.
[68] "Pascal VOC data sets," http://host.robots.ox.ac.uk/pascal/VOC/.
[69] "Microsoft Common Objects in Context (COCO) dataset," http://mscoco.org/.
[70] "Google Open Images," https://github.com/openimages/dataset.
[71] "YouTube-8M," https://research.google.com/youtube8m/.
[72] "AudioSet," https://research.google.com/audioset/index.html.
[73] S. Condon, "Facebook unveils Big Basin, new server geared for deep learning," ZDNet, March 2017.
[74] C. Dubout and F. Fleuret, "Exact acceleration of linear object detectors," in ECCV, 2012.
[75] J. Cong and B. Xiao, "Minimizing computation in convolutional neural networks," in ICANN, 2014.
[76] A. Lavin and S. Gray, "Fast algorithms for convolutional neural networks," in CVPR, 2016.
[77] "Intel Math Kernel Library," https://software.intel.com/en-us/mkl.
[78] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, "cuDNN: Efficient Primitives for Deep Learning," arXiv preprint arXiv:1410.0759, 2014.
[79] M. Horowitz, "Computing's energy problem (and what we can do about it)," in ISSCC, 2014.
[80] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," in ISCA, 2016.
[81] Y.-H. Chen, J. Emer, and V. Sze, "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators," IEEE Micro's Top Picks from the Computer Architecture Conferences, vol. 37, no. 3, May-June 2017.
[82] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic, E. Cosatto, and H. P. Graf, "A Massively Parallel Coprocessor for Convolutional Neural Networks," in ASAP, 2009.
[83] V. Sriram, D. Cox, K. H. Tsoi, and W. Luk, "Towards an embedded biologically-inspired machine vision processor," in FPT, 2010.
[84] S. Chakradhar, M. Sankaradas, V. Jakkula, and S. Cadambi, "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," in ISCA, 2010.
[85] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," in CVPR Workshop, 2014.
[86] S. Park, K. Bong, D. Shin, J. Lee, S. Choi, and H.-J. Yoo, "A 1.93TOPS/W scalable deep learning/inference processor with tetra-parallel MIMD architecture for big-data applications," in ISSCC, 2015.
[87] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, "Origami: A Convolutional Network Accelerator," in GLVLSI, 2015.
[88] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep Learning with Limited Numerical Precision," in ICML, 2015.
[89] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting Vision Processing Closer to the Sensor," in ISCA, 2015.
[90] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for Convolutional Neural Networks," in ICCD, 2013.
[91] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," in FPGA, 2015.
[92] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning," in ASPLOS, 2014.
[93] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A Machine-Learning Supercomputer," in MICRO, 2014.
[94] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," in ISSCC, 2016.
[95] V. Sze, M. Budagavi, and G. J. Sullivan, "High Efficiency Video Coding (HEVC): Algorithms and Architectures," in Integrated Circuit and Systems. Springer, 2014, pp. 1–375.
[96] M. Alwani, H. Chen, M. Ferdman, and P. Milder, "Fused-layer CNN accelerators," in MICRO, 2016.
[97] D. Keitel-Schulz and N. Wehn, "Embedded DRAM development: Technology, physical design, and application issues," IEEE Des. Test. Comput., vol. 18, no. 3, pp. 7–15, 2001.
[98] J. Jeddeloh and B. Keeth, "Hybrid memory cube new DRAM architecture increases density and performance," in Symp. on VLSI, 2012.
[99] J. Standard, "High bandwidth memory (HBM) DRAM," JESD235, 2013.
[100] D. Kim, J. Kung, S. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in ISCA, 2016.
[101] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory," in ASPLOS, 2017.
[102] J. Zhang, Z. Wang, and N. Verma, "A machine-learning classifier implemented in a standard 6T SRAM array," in Symp. on VLSI, 2016.
[103] Z. Wang, R. Schapire, and N. Verma, "Error-adaptive classifier boosting (EACB): Exploiting data-driven training for highly fault-tolerant hardware," in ICASSP, 2014.
[104] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars," in ISCA, 2016.
[105] L. Chua, "Memristor-the missing circuit element," IEEE Trans. Circuit Theory, vol. 18, no. 5, pp. 507–519, 1971.
[106] L. Wilson, "International technology roadmap for semiconductors (ITRS)," Semiconductor Industry Association, 2013.
[107] D. Lu, "Tutorial on Emerging Memory Devices," 2016.
[108] S. B. Eryilmaz, S. Joshi, E. Neftci, W. Wan, G. Cauwenberghs, and H.-S. P. Wong, "Neuromorphic architectures with electronic synapses," in ISQED, 2016.
[109] P. Chi, S. Li, Z. Qi, P. Gu, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A Novel Processing-In-Memory Architecture for Neural Network Computation in ReRAM-based Main Memory," in ISCA, 2016.
[110] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K. K. Likharev, and D. B. Strukov, "Training and operation of an integrated neuromorphic network based on metal-oxide memristors," Nature, vol. 521, no. 7550, pp. 61–64, 2015.
[111] J. Zhang, Z. Wang, and N. Verma, "A matrix-multiplying ADC implementing a machine-learning classifier directly with data conversion," in ISSCC, 2015.
[112] E. H. Lee and S. S. Wong, "A 2.5 GHz 7.7 TOPS/W switched-capacitor matrix multiplier with co-designed local memory in 40nm," in ISSCC, 2016.
[113] R. LiKamWa, Y. Hou, J. Gao, M. Polansky, and L. Zhong, "RedEye: analog ConvNet image sensor architecture for continuous mobile vision," in ISCA, 2016.
[114] A. Wang, S. Sivaramakrishnan, and A. Molnar, "A 180nm CMOS image sensor with on-chip optoelectronic image compression," in CICC, 2012.
[115] H. Chen, S. Jayasuriya, J. Yang, J. Stephen, S. Sivaramakrishnan, A. Veeraraghavan, and A. Molnar, "ASP Vision: Optically Computing the First Layer of Convolutional Neural Networks using Angle Sensitive Pixels," in CVPR, 2016.
[116] A. Suleiman and V. Sze, "Energy-efficient HOG-based object detection at 1080HD 60 fps with multi-scale support," in SiPS, 2014.
[117] E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, "Lognet: Energy-Efficient Neural Networks Using Logarithmic Computations," in ICASSP, 2017.
[118] S. Han, H. Mao, and W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," in ICLR, 2016.
[119] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Quantized neural networks: Training neural networks with low precision weights and activations," arXiv preprint arXiv:1609.07061, 2016.
[120] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, "DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.
[121] Y. Ma, N. Suda, Y. Cao, J.-S. Seo, and S. Vrudhula, "Scalable and modularized RTL compilation of Convolutional Neural Networks onto FPGA," in FPL, 2016.
[122] P. Gysel, M. Motamedi, and S. Ghiasi, "Hardware-oriented Approximation of Convolutional Neural Networks," in ICLR, 2016.
[123] S. Higginbotham, "Google Takes Unconventional Route with Homegrown Machine Learning Chips," Next Platform, May 2016.
[124] T. P. Morgan, "Nvidia Pushes Deep Learning Inference With New Pascal GPUs," Next Platform, September 2016.
[125] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos, "Stripes: Bit-serial deep neural network computing," in MICRO, 2016.
[126] B. Moons and M. Verhelst, "A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets," in Symp. on VLSI, 2016.
[127] M. Courbariaux, Y. Bengio, and J.-P. David, "Binaryconnect: Training deep neural networks with binary weights during propagations," in NIPS, 2015.
[128] M. Courbariaux and Y. Bengio, "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.
[129] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," in ECCV, 2016.
[130] Z. Cai, X. He, J. Sun, and N. Vasconcelos, "Deep learning with low precision by half-wave gaussian quantization," in CVPR, 2017.
[131] F. Li and B. Liu, "Ternary weight networks," in NIPS Workshop on Efficient Methods for Deep Neural Networks, 2016.
[132] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained Ternary Quantization," ICLR, 2017.
[133] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights," in ISVLSI, 2016.
[134] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, M. Ikebe, T. Asai, S. Takamaeda-Yamazaki, T. Kuroda, and M. Motomura, "BRein Memory: A 13-Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable In-Memory Deep Neural Network Accelerator in 65nm CMOS," in Symp. on VLSI, 2017.
[135] D. Miyashita, E. H. Lee, and B. Murmann, "Convolutional Neural Networks using Logarithmic Data Representation," arXiv preprint arXiv:1603.01025, 2016.
[136] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights," in ICLR, 2017.
[137] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen, "Compressing Neural Networks with the Hashing Trick," in ICML, 2015.
[138] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos, "Cnvlutin: ineffectual-neuron-free deep neural network computing," in ISCA, 2016.
[139] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernandez-Lobato, G.-Y. Wei, and D. Brooks, "Minerva: Enabling low-power, highly-accurate deep neural network accelerators," in ISCA, 2016.
[140] Y. LeCun, J. S. Denker, and S. A. Solla, "Optimal Brain Damage," in NIPS, 1990.
[141] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," in NIPS, 2015.
[142] T.-J. Yang, Y.-H. Chen, and V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," in CVPR, 2017.
[143] "DNN Energy Estimation," http://eyeriss.mit.edu/energy.html.
[144] R. Dorrance, F. Ren, and D. Marković, "A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs," in ISFPGA, 2014.
[145] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: efficient inference engine on compressed deep neural network," in ISCA, 2016.
[146] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, "SCNN: An accelerator for compressed-sparse convolutional neural networks," in ISCA, 2017.
[147] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in NIPS, 2016.
[148] S. Anwar, K. Hwang, and W. Sung, "Structured pruning of deep convolutional neural networks," ACM Journal of Emerging Technologies in Computing Systems, vol. 13, no. 3, p. 32, 2017.
[149] J. Yu, A. Lukefahr, D. Palframan, G. Dasika, R. Das, and S. Mahlke, "Scalpel: Customizing DNN pruning to the underlying hardware parallelism," in ISCA, 2017.
[150] H. Mao, S. Han, J. Pool, W. Li, X. Liu, Y. Wang, and W. J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks," in CVPR Workshop on Tensor Methods In Computer Vision, 2017.
[151] J. S. Lim, "Two-dimensional signal and image processing," Englewood Cliffs, NJ: Prentice Hall, 1990.
[152] F. Chollet, "Xception: Deep Learning With Depthwise Separable Convolutions," CVPR, 2017.
[153] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "Mobilenets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, 2017.
[154] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size," ICLR, 2017.
[155] E. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, "Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation," in NIPS, 2014.
[156] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky, "Speeding-Up Convolutional Neural Networks Using Fine-tuned CP-Decomposition," ICLR, 2015.
[157] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin, "Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications," in ICLR, 2016.
[158] C. Bucilu, R. Caruana, and A. Niculescu-Mizil, "Model Compression," in SIGKDD, 2006.
[159] L. Ba and R. Caruana, "Do Deep Nets Really Need to be Deep?" NIPS, 2014.
[160] G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," in NIPS Deep Learning Workshop, 2014.
[161] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "Fitnets: Hints for Thin Deep Nets," ICLR, 2015.
[162] "Benchmarking DNN Processors," http://eyeriss.mit.edu/benchmarking.html.
<> <> <>


<> <> <>
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

Abstract

Convolutional Neural Networks (ConvNets) are commonly developed at a fixed resource budget, and then scaled up for better accuracy if more resources are available. In this paper, we systematically study model scaling and identify that carefully balancing network depth, width, and resolution can lead to better performance. Based on this observation, we propose a new scaling method that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient. We demonstrate the effectiveness of this method on scaling up MobileNets and ResNet.
To go even further, we use neural architecture search to design a new baseline network and scale it up to obtain a family of models, called EfficientNets, which achieve much better accuracy and efficiency than previous ConvNets. In particular, our EfficientNet-B7 achieves state-of-the-art 84.4% top-1 / 97.1% top-5 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet. Our EfficientNets also transfer well and achieve state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with an order of magnitude fewer parameters. Source code is at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet.

1. Introduction

Scaling up ConvNets is widely used to achieve better accuracy. For example, ResNet (He et al., 2016) can be scaled up from ResNet-18 to ResNet-200 by using more layers; recently, GPipe (Huang et al., 2018) achieved 84.3% ImageNet top-1 accuracy by scaling up a baseline model four times larger.
<
>

Figure 1. Model Size vs. ImageNet Accuracy. All numbers are for single-crop, single-model. Our EfficientNets significantly outperform other ConvNets. In particular, EfficientNet-B7 achieves new state-of-the-art 84.4% top-1 accuracy while being 8.4x smaller and 6.1x faster than GPipe. EfficientNet-B1 is 7.6x smaller and 5.7x faster than ResNet-152. Details are in Table 2 and 4.

However, the process of scaling up ConvNets has never been well understood and there are currently many ways to do it. The most common way is to scale up ConvNets by their depth (He et al., 2016) or width (Zagoruyko & Komodakis, 2016). Another less common, but increasingly popular, method is to scale up models by image resolution (Huang et al., 2018). In previous work, it is common to scale only one of the three dimensions: depth, width, and image size. Though it is possible to scale two or three dimensions arbitrarily, arbitrary scaling requires tedious manual tuning and still often yields sub-optimal accuracy and efficiency.

In this paper, we want to study and rethink the process of scaling up ConvNets. In particular, we investigate the central question: is there a principled method to scale up ConvNets that can achieve better accuracy and efficiency? Our empirical study shows that it is critical to balance all dimensions of network width/depth/resolution, and surprisingly such balance can be achieved by simply scaling each of them with a constant ratio. Based on this observation, we propose a simple yet effective compound scaling method. Unlike conventional practice that arbitrarily scales these factors, our method uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.
<
>

Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.

For example, if we want to use <<2^N>> times more computational resources, then we can simply increase the network depth by <>, width by <>, and image size by <>, where these constant coefficients are determined by a small grid search on the original small model. Figure 2 illustrates the difference between our scaling method and conventional methods.

Intuitively, the compound scaling method makes sense because if the input image is bigger, then the network needs more layers to increase the receptive field and more channels to capture more fine-grained patterns on the bigger image. In fact, previous theoretical (Raghu et al., 2017; Lu et al., 2018) and empirical results (Zagoruyko & Komodakis, 2016) both show that there exists a certain relationship between network width and depth, but to our best knowledge, we are the first to empirically quantify the relationship among all three dimensions of network width, depth, and resolution.

We demonstrate that our scaling method works well on existing MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Notably, the effectiveness of model scaling heavily depends on the baseline network; to go even further, we use neural architecture search (Zoph & Le, 2017; Tan et al., 2019) to develop a new baseline network, and scale it up to obtain a family of models, called EfficientNets. Figure 1 summarizes the ImageNet performance, where our EfficientNets significantly outperform other ConvNets. In particular, our EfficientNet-B7 surpasses the best existing GPipe accuracy (Huang et al., 2018), but using 8.4x fewer parameters and running 6.1x faster on inference. Compared to the widely used ResNet-50 (He et al., 2016), our EfficientNet-B4 improves the top-1 accuracy from 76.3% to 83.0% (+6.7%) with similar FLOPS. Besides ImageNet, EfficientNets also transfer well and achieve state-of-the-art accuracy on 5 out of 8 widely used datasets, while reducing parameters by up to 21x compared to existing ConvNets.

2. Related Work

ConvNet Accuracy: Since AlexNet (Krizhevsky et al., 2012) won the 2012 ImageNet competition, ConvNets have become increasingly more accurate by going bigger: while the 2014 ImageNet winner GoogleNet (Szegedy et al., 2015) achieves 74.8% top-1 accuracy with about 6.8M parameters, the 2017 ImageNet winner SENet (Hu et al., 2018) achieves 82.7% top-1 accuracy with 145M parameters. Recently, GPipe (Huang et al., 2018) further pushes the state-of-the-art ImageNet top-1 validation accuracy to 84.3% using 557M parameters: it is so big that it can only be trained with a specialized pipeline parallelism library by partitioning the network and spreading each part to a different accelerator. While these models are mainly designed for ImageNet, recent studies have shown better ImageNet models also perform better across a variety of transfer learning datasets (Kornblith et al., 2019), and other computer vision tasks such as object detection (He et al., 2016; Tan et al., 2019). Although higher accuracy is critical for many applications, we have already hit the hardware memory limit, and thus further accuracy gain needs better efficiency.

ConvNet Efficiency: Deep ConvNets are often over-parameterized.
Model compression (Han et al., 2016; He et al., 2018; Yang et al., 2018) is a common way to reduce model size by trading accuracy for efficiency. As mobile phones become ubiquitous, it is also common to hand-craft efficient mobile-size ConvNets, such as SqueezeNets (Iandola et al., 2016; Gholami et al., 2018), MobileNets (Howard et al., 2017; Sandler et al., 2018), and ShuffleNets (Zhang et al., 2018; Ma et al., 2018). Recently, neural architecture search has become increasingly popular in designing efficient mobile-size ConvNets (Tan et al., 2019; Cai et al., 2019), and achieves even better efficiency than hand-crafted mobile ConvNets by extensively tuning the network width, depth, convolution kernel types and sizes. However, it is unclear how to apply these techniques to larger models that have a much larger design space and much more expensive tuning cost. In this paper, we aim to study model efficiency for super large ConvNets that surpass state-of-the-art accuracy. To achieve this goal, we resort to model scaling.

Model Scaling: There are many ways to scale a ConvNet for different resource constraints: ResNet (He et al., 2016) can be scaled down (e.g., ResNet-18) or up (e.g., ResNet-200) by adjusting network depth (#layers), while WideResNet (Zagoruyko & Komodakis, 2016) and MobileNets (Howard et al., 2017) can be scaled by network width (#channels). It is also well-recognized that bigger input image size will help accuracy with the overhead of more FLOPS. Although prior studies (Raghu et al., 2017; Lin & Jegelka, 2018; Sharir & Shashua, 2018; Lu et al., 2018) have shown that network depth and width are both important for ConvNets' expressive power, it still remains an open question of how to effectively scale a ConvNet to achieve better efficiency and accuracy. Our work systematically and empirically studies ConvNet scaling for all three dimensions of network width, depth, and resolution.

3. Compound Model Scaling

In this section, we will formulate the scaling problem, study different approaches, and propose our new scaling method.

3.1. Problem Formulation
A ConvNet Layer i can be defined as a function: <>, where F_i is the operator, Y_i is the output tensor, X_i is the input tensor, with tensor shape <>, where H_i and W_i are the spatial dimensions and C_i is the channel dimension. A ConvNet N can be represented by a list of composed layers:

<>

In practice, ConvNet layers are often partitioned into multiple stages and all layers in each stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type except the first layer performs down-sampling. Therefore, we can define a ConvNet as:

<>

where <> denotes layer F_i is repeated L_i times in stage i, and <> denotes the shape of the input tensor X of layer i (for the sake of simplicity, we omit the batch dimension). Figure 2(a) illustrates a representative ConvNet, where the spatial dimension is gradually shrunk but the channel dimension is expanded over layers, for example, from initial input shape <224, 224, 3> to final output shape <7, 7, 512>.

Unlike regular ConvNet designs that mostly focus on finding the best layer architecture F_i, model scaling tries to expand the network length (L_i), width (C_i), and/or resolution (H_i, W_i) without changing F_i predefined in the baseline network. By fixing F_i, model scaling simplifies the design problem for new resource constraints, but it still remains a large design space to explore different <> for each layer.
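As a lightweight illustration of the stage-wise description used in this formulation, the sketch below records each stage as an operator F repeated L times on an input of shape <H, W, C>. The dataclass name, the three-stage example, and all shape numbers are assumptions for illustration and are not the values used in the paper.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    """One stage of a ConvNet: an operator F repeated `repeats` times on an
    input tensor of shape <H, W, C> (batch dimension omitted)."""
    op: Callable            # the layer operator F_i (kept fixed under scaling)
    repeats: int            # L_i
    height: int             # H_i
    width: int              # W_i
    channels: int           # C_i

def describe(net: List[Stage]) -> None:
    for i, s in enumerate(net, start=1):
        print(f"stage {i}: {s.op.__name__} x{s.repeats}, "
              f"input <{s.height}, {s.width}, {s.channels}>")

# Illustrative 3-stage baseline (shapes are made up, not Table 1's values):
def conv3x3(x): return x
def conv5x5(x): return x

baseline = [
    Stage(conv3x3, repeats=2, height=224, width=224, channels=32),
    Stage(conv3x3, repeats=3, height=112, width=112, channels=64),
    Stage(conv5x5, repeats=4, height=56,  width=56,  channels=128),
]
describe(baseline)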
In order to further reduce the design space, we restrict that all layers must be scaled uniformly with a constant ratio. Our target is to maximize the model accuracy for any given resource constraints, which can be formulated as an optimization problem:

<> (2)

where <> are coefficients for scaling network width, depth, and resolution; <> are predefined parameters in the baseline network (see Table 1 as an example).

3.2. Scaling Dimensions
The main difficulty of problem 2 is that the optimal d, w, r depend on each other and the values change under different resource constraints. Due to this difficulty, conventional methods mostly scale ConvNets in one of these dimensions:

Depth (d): Scaling network depth is the most common way used by many ConvNets (He et al., 2016; Huang et al., 2017; Szegedy et al., 2015; 2016). The intuition is that a deeper ConvNet can capture richer and more complex features, and generalize well on new tasks. However, deeper networks are also more difficult to train due to the vanishing gradient problem (Zagoruyko & Komodakis, 2016). Although several techniques, such as skip connections (He et al., 2016) and batch normalization (Ioffe & Szegedy, 2015), alleviate the training problem, the accuracy gain of very deep networks diminishes: for example, ResNet-1000 has similar accuracy as ResNet-101 even though it has many more layers. Figure 3 (middle) shows our empirical study on scaling a baseline model with different depth coefficient d, further suggesting the diminishing accuracy return for very deep ConvNets.
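The effect of the width/depth/resolution coefficients in the optimization problem above can be sanity-checked numerically. The sketch below scales one stage by (d, w, r) and compares its convolution FLOPS against the baseline, which grows roughly as d * w^2 * r^2; the baseline shape, the kernel size, and the rounding scheme are illustrative assumptions.

import math

def scale_stage(layers, channels, height, width, d, w, r):
    """Scale one stage's depth, width, and input resolution by (d, w, r),
    keeping the layer operator itself unchanged."""
    return (math.ceil(d * layers), math.ceil(w * channels),
            math.ceil(r * height), math.ceil(r * width))

def conv_flops(layers, channels, height, width, k=3):
    """Rough FLOPS of a stage of k x k convolutions with equal input/output
    channels; constants cancel when comparing scaled vs. baseline."""
    return layers * (k * k * channels * channels) * height * width

baseline = (4, 128, 56, 56)                    # (L, C, H, W), illustrative
for d, w, r in [(1.0, 1.0, 1.0), (2.0, 1.0, 1.0), (1.0, 1.41, 1.0), (1.2, 1.1, 1.15)]:
    scaled = scale_stage(*baseline, d, w, r)
    ratio = conv_flops(*scaled) / conv_flops(*baseline)
    print(f"d={d}, w={w}, r={r} -> FLOPS x{ratio:.2f}")   # roughly d * w^2 * r^2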
> + +Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) coefficients. Bigger networks with larger width, depth, or resolution tend to achieve higher accuracy, but the accuracy gain quickly saturate after reaching 80%, demonstrating the limitation of single dimension scaling. Baseline network is described in Table 1. +Tan et al., 2019)2. As discussed in (Zagoruyko & Komodakis, 2016), wider networks tend to be able to capture more fine-grained features and are easier to train. However, extremely wide but shallow networks tend to have dif�cul.ties in capturing higher level features. Our empirical results in Figure 3 (left) show that the accuracy quickly saturates when networks become much wider with larger w. +Resolution (r): With higher resolution input images, Con.vNets can potentially capture more fine-grained patterns. Starting from 224x224 in early ConvNets, modern Con.vNets tend to use 299x299 (Szegedy et al., 2016) or 331x331 (Zoph et al., 2018) for better accuracy. Recently, GPipe (Huang et al., 2018) achieves state-of-the-art ImageNet ac.curacy with 480x480 resolution. Higher resolutions, such as 600x600, are also widely used in object detection ConvNets (He et al., 2017; Lin et al., 2017). Figure 3 (right) shows the results of scaling network resolutions, where indeed higher resolutions improve accuracy, but the accuracy gain dimin.ishes for very high resolutions (r =1.0 denotes resolution 224x224 and r =2.5 denotes resolution 560x560). +The above analyses lead us to the first observation: +Observation 1 � Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models. +3.3. Compound Scaling +We empirically observe that different scaling dimensions are not independent. Intuitively, for higher resolution images, we should increase network depth, such that the larger receptive fields can help capture similar features that include more pixels in bigger images. Correspondingly, we should also increase network width when resolution is higher, in +In some literature, scaling number of channels is called depth multiplier, which means the same as our width coefficient w. + +<
> + +Figure 4. Scaling Network Width for Different Baseline Net.works. Each dot in a line denotes a model with different width coefficient (w). All baseline networks are from Table 1. The first baseline network <> has 18 convolutional layers with resolution 224x224, while the last baseline <> has 36 layers with resolution 299x299. +order to capture more fine-grained patterns with more pixels in high resolution images. These intuitions suggest that we need to coordinate and balance different scaling dimensions rather than conventional single-dimension scaling. +To validate our intuitions, we compare width scaling under different network depths and resolutions, as shown in Figure 4. If we only scale network width w without changing depth <<(d=1.0)>> and resolution <<(r=1.0)>>, the accuracy saturates quickly. With deeper (d=2.0) and higher resolution <<(r=2.0)>>, width scaling achieves much better accuracy under the same FLOPS cost. These results lead us to the second observation: +Observation 2 In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling. + +In fact, a few prior work (Zoph et al., 2018; Real et al., 2019) have already tried to arbitrarily balance network width and depth, but they all require tedious manual tuning. +In this paper, we propose a new compound scaling method, which use a compound coefficient . to uniformly scales network width, depth, and resolution in a principled way: + +<> (3) + +where <> are constants that can be determined by a small grid search. Intuitively, . is a user-specified coefficient that controls how many more resources are available for model scaling, while <> specify how to assign these extra resources to network width, depth, and resolution respectively. Notably, the FLOPS of a regular convolution op +is proportional to <> i.e., doubling network depth will double FLOPS, but doubling network width or resolution will increase FLOPS by four times. Since convolution ops usually dominate the computation cost in ConvNets, scaling a ConvNet with equation 3 will approximately in. +crease total FLOPS by <> In this paper, we constraint <> such that for any new <>, the total FLOPS will approximately3 increase by 2. +4. EfficientNet Architecture +Since model scaling does not change layer operators F_i in baseline network, having a good baseline network is also critical. We will evaluate our scaling method using existing ConvNets, but in order to better demonstrate the effectiveness of our scaling method, we have also developed a new mobile-size baseline, called EfficientNet. +Inspired by (Tan et al., 2019), we develop our baseline net.work by leveraging a multi-objective neural architecture search that optimizes both accuracy and FLOPS. Specifically, we use the same search space as (Tan et al., 2019), and use <> as the optimization goal, where <> and <> denote the accuracy and FLOPS of model m, T is the target FLOPS and w=-0.07 is a hyperparameter for controlling the trade-off between accuracy and FLOPS. Unlike (Tan et al., 2019; Cai et al., 2019), here we optimize FLOPS rather than latency since we are not targeting any specific hardware de.vice. Our search produces an efficient network, which we name EfficientNet-B0. Since we use the same search space as (Tan et al., 2019), the architecture is similar to <>. +FLOPS may differ from theoretical value due to rounding. + +Table 1. 
Table 1. EfficientNet-B0 baseline network <>. Each row describes a stage i with L_i layers, with input resolution <> and output channels C_i. Notations are adopted from Equation 2.

<>

Net, except that our EfficientNet-B0 is slightly bigger due to the larger FLOPS target (our FLOPS target is 400M). Table 1 shows the architecture of EfficientNet-B0. Its main building block is the mobile inverted bottleneck MBConv (Sandler et al., 2018; Tan et al., 2019), to which we also add squeeze-and-excitation optimization (Hu et al., 2018).
Starting from the baseline EfficientNet-B0, we apply our compound scaling method to scale it up in two steps:

STEP 1: we first <>, assuming twice more resources are available, and do a small grid search of <> based on Equations 2 and 3. In particular, we find the best values for EfficientNet-B0 are <>, under the constraint of <>.

STEP 2: we then fix <> as constants and scale up the baseline network with different φ using Equation 3, to obtain EfficientNet-B1 to B7 (details in Table 2).

Notably, it is possible to achieve even better performance by searching for <> directly around a large model, but the search cost becomes prohibitively more expensive on larger models. Our method solves this issue by only doing the search once on the small baseline network (STEP 1), and then using the same scaling coefficients for all other models (STEP 2).

5. Experiments

In this section, we will first evaluate our scaling method on existing ConvNets and the newly proposed EfficientNets.

5.1. Scaling Up MobileNets and ResNets

As a proof of concept, we first apply our scaling method to the widely used MobileNets (Howard et al., 2017; Sandler et al., 2018) and ResNet (He et al., 2016). Table 3 shows the ImageNet results of scaling them in different ways. Compared to other single-dimension scaling methods, our compound scaling method improves the accuracy on all these models, suggesting the effectiveness of our proposed scaling method for general existing ConvNets.

Table 2. EfficientNet Performance Results on ImageNet (Russakovsky et al., 2015). All EfficientNet models are scaled from our baseline EfficientNet-B0 using different compound coefficients φ in Equation 3. ConvNets with similar top-1/top-5 accuracy are grouped together for efficiency comparison. Our scaled EfficientNet models consistently reduce parameters and FLOPS by an order of magnitude (up to 8.4x parameter reduction and up to 16x FLOPS reduction) compared to existing ConvNets.

<
> + +We omit ensemble and multi-crop models (Hu et al., 2018), or models pretrained on 3.5B Instagram images (Mahajan et al., 2018). + +Table 3. Scaling Up MobileNets and ResNet. + +<
> + +Table 5. EfficientNet Performance Results on Transfer Learning Datasets. Our scaled EfficientNet models achieve new state-of-the.art accuracy for 5 out of 8 datasets, with 9.6x fewer parameters on average. +Comparison to best public-available results Comparison to best reported results Model Accuracy. + +<
> 

Figure 6. Model Parameters vs. Transfer Learning Accuracy.

<
>

Training uses weight decay 1e-5 and an initial learning rate of 0.256 that decays by 0.97 every 2.4 epochs. We also use swish activation (Ramachandran et al., 2018; Elfwing et al., 2018), a fixed AutoAugment policy (Cubuk et al., 2019), and stochastic depth (Huang et al., 2016) with survival probability 0.8. Since bigger models need more regularization, we linearly increase the dropout (Srivastava et al., 2014) ratio from 0.2 for EfficientNet-B0 to 0.5 for EfficientNet-B7.
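The linear dropout schedule mentioned above can be realized by interpolating the ratio over the variant index; the small sketch below is our own illustration of that schedule, not code from the paper.

def dropout_rate(variant_index, start=0.2, end=0.5, last_index=7):
    # Linearly interpolate the dropout ratio from 0.2 (EfficientNet-B0)
    # to 0.5 (EfficientNet-B7), assuming evenly spaced steps per variant.
    return start + (end - start) * variant_index / last_index

rates = {f"EfficientNet-B{i}": round(dropout_rate(i), 3) for i in range(8)}
# e.g. B0 -> 0.2, B3 -> about 0.33, B7 -> 0.5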
Table 2 shows the performance of all EfficientNet models that are scaled from the same baseline EfficientNet-B0. Our EfficientNet models generally use an order of magnitude fewer parameters and FLOPS than other ConvNets with similar accuracy. In particular, our EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy with 66M parameters and 37B FLOPS, being more accurate but 8.4x smaller than the previous best GPipe (Huang et al., 2018).
All models are pretrained on ImageNet and fine-tuned on new datasets. Figure 1 and Figure 5 illustrate the parameters-accuracy and FLOPS-accuracy curves for representative ConvNets, where our scaled EfficientNet models achieve better accuracy with far fewer parameters and FLOPS than other ConvNets. Notably, our EfficientNet models are not only small, but also computationally cheaper. For example, our EfficientNet-B3 achieves higher accuracy than ResNeXt-101 (Xie et al., 2017) using 18x fewer FLOPS.
To validate the computational cost, we have also measured the inference latency for a few representative ConvNets on a real CPU, as shown in Table 4, where we report the average latency over 20 runs. Our EfficientNet-B1 runs 5.7x faster than the widely used ResNet-152 (He et al., 2016), while EfficientNet-B7 runs about 6.1x faster than GPipe (Huang et al., 2018), suggesting our EfficientNets are indeed fast on real hardware.

Figure 7. Class Activation Map (CAM) (Zhou et al., 2016) for models with different scaling methods. Our compound scaling method allows the scaled model (last column) to focus on more relevant regions with more object details. Model details are in Table 7.

<
> + +Table 6. Transfer Learning Datasets. + +<
> 

5.3. Transfer Learning Results for EfficientNet

We have also evaluated our EfficientNet on a list of commonly used transfer learning datasets, as shown in Table 6. We borrow the same training settings from (Kornblith et al., 2019) and (Huang et al., 2018), which take ImageNet pretrained checkpoints and fine-tune on new datasets.
Table 5 shows the transfer learning performance: (1) Compared to publicly available models, such as NASNet-A (Zoph et al., 2018) and Inception-v4 (Szegedy et al., 2017), our EfficientNet models achieve better accuracy with an average of 4.7x (up to 21x) parameter reduction. (2) Compared to state-of-the-art models, including DAT (Ngiam et al., 2018), which dynamically synthesizes training data, and GPipe (Huang et al., 2018), which is trained with specialized pipeline parallelism, our EfficientNet models still surpass their accuracy on 5 out of 8 datasets, while using 9.6x fewer parameters.
Figure 6 compares the accuracy-parameters curves for a variety of models. In general, our EfficientNets consistently achieve better accuracy with an order of magnitude fewer parameters than existing models, including ResNet (He et al., 2016), DenseNet (Huang et al., 2017), Inception (Szegedy et al., 2017), and NASNet (Zoph et al., 2018).

6. Discussion

Figure 8. Scaling Up EfficientNet-B0 with Different Methods.
Table 7. Scaled Models Used in Figure 7.

<
> + +To disentangle the contribution of our proposed scaling method from the EfficientNet architecture, Figure 8 com.pares the ImageNet performance of different scaling methods for the same EfficientNet-B0 baseline network. In general, all scaling methods improve accuracy with the cost of more FLOPS, but our compound scaling method can further improve accuracy, by up to 2.5%, than other single-dimension scaling methods, suggesting the importance of our proposed compound scaling. +In order to further understand why our compound scaling method is better than others, Figure 7 compares the class activation map (Zhou et al., 2016) for a few representative models with different scaling methods. All these models are scaled from the same baseline, and their statistics are shown in Table 7. Images are randomly picked from ImageNet validation set. As shown in the figure, the model with com.pound scaling tends to focus on more relevant regions with more object details, while other models are either lack of object details or unable to capture all objects in the images. + +7. Conclusion + +In this paper, we systematically study ConvNet scaling and identify that carefully balancing network width, depth, and resolution is an important but missing piece, preventing us from better accuracy and efficiency. To address this issue, we propose a simple and highly effective compound scaling method, which enables us to easily scale up a baseline Con.vNet to any target resource constraints in a more principled way, while maintaining model efficiency. Powered by this compound scaling method, we demonstrate that a mobile-size EfficientNet model can be scaled up very effectively, surpassing state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPS, on both ImageNet and five commonly used transfer learning datasets. + +Acknowledgements + +We thank Ruoming Pang, Vijay Vasudevan, Alok Aggarwal, Barret Zoph, Hongkun Yu, Xiaodan Song, Samy Bengio, Jeff Dean, and Google Brain team for their help. + +References + +Berg, T., Liu, J., Woo Lee, S., Alexander, M. L., Jacobs, +D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. CVPR, pp. 2011�2018, 2014. +Bossard, L., Guillaumin, M., and Van Gool, L. Food-101� mining discriminative components with random forests. ECCV, pp. 446�461, 2014. +Cai, H., Zhu, L., and Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. ICLR, 2019. +Chollet, F. Xception: Deep learning with depthwise separa.ble convolutions. CVPR, pp. 1610�02357, 2017. +Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. CVPR, 2019. +Elfwing, S., Uchibe, E., and Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3�11, 2018. +Gholami, A., Kwon, K., Wu, B., Tai, Z., Yue, X., Jin, P., Zhao, S., and Keutzer, K. Squeezenext: Hardware-aware neural network design. ECV Workshop at CVPR�18, 2018. +Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR, 2016. +He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. CVPR, pp. 770�778, 2016. +He, K., Gkioxari, G., Dollar,� P., and Girshick, R. Mask r-cnn. ICCV, pp. 2980�2988, 2017. +He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. Amc: Automl for model compression and acceleration on mobile devices. ECCV, 2018. 
+Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017. +Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation net.works. CVPR, 2018. +Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, +K. Q. Deep networks with stochastic depth. ECCV, pp. 646�661, 2016. +Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, +K. Q. Densely connected convolutional networks. CVPR, 2017. +Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, +Q. V., and Chen, Z. Gpipe: efficient training of giant neural networks using pipeline parallelism. arXiv preprint arXiv:1808.07233, 2018. +Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., and Keutzer, K. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016. +Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, pp. 448�456, 2015. +Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? CVPR, 2019. +Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. Second Workshop on Fine-Grained Visual Categorizatio, 2013. +Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report, 2009. +Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classi�cation with deep convolutional neural networks. In NIPS, pp. 1097�1105, 2012. +Lin, H. and Jegelka, S. Resnet with one-neuron hidden layers is a universal approximator. NeurIPS, pp. 6172� 6181, 2018. +Lin, T.-Y., Dollar,� P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. CVPR, 2017. +Liu, C., Zoph, B., Shlens, J., Hua, W., Li, L.-J., Fei-Fei, L., Yuille, A., Huang, J., and Murphy, K. Progressive neural architecture search. ECCV, 2018. +Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. The expres.sive power of neural networks: A view from the width. NeurIPS, 2018. +Ma, N., Zhang, X., Zheng, H.-T., and Sun, J. Shuf�enet v2: Practical guidelines for efficient cnn architecture design. ECCV, 2018. +Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Explor.ing the limits of weakly supervised pretraining. arXiv preprint arXiv:1805.00932, 2018. +Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, +A. Fine-grained visual classi�cation of aircraft. arXiv preprint arXiv:1306.5151, 2013. +Ngiam, J., Peng, D., Vasudevan, V., Kornblith, S., Le, Q. V., and Pang, R. Domain adaptive transfer learning with spe.cialist models. arXiv preprint arXiv:1811.07056, 2018. +Nilsback, M.-E. and Zisserman, A. Automated �ower clas.si�cation over a large number of classes. ICVGIP, pp. 722�729, 2008. +Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. CVPR, pp. 3498�3505, 2012. +Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., and Sohl-Dickstein, J. On the expressive power of deep neural networks. ICML, 2017. +Ramachandran, P., Zoph, B., and Le, Q. V. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2018. +Real, E., Aggarwal, A., Huang, Y., and Le, Q. V. Regu.larized evolution for image classi�er architecture search. AAAI, 2019. +Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. 
Imagenet large scale visual recognition chal.lenge. International Journal of Computer Vision, 115(3): 211�252, 2015. +Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. CVPR, 2018. +Sharir, O. and Shashua, A. On the expressive power of overlapping architectures of deep learning. ICLR, 2018. +Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from over�tting. The Journal of Machine Learning Research, 15(1):1929�1958, 2014. +Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, +A. Going deeper with convolutions. CVPR, pp. 1�9, 2015. +Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, +Z. Rethinking the inception architecture for computer vision. CVPR, pp. 2818�2826, 2016. +Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Inception-v4, inception-resnet and the impact of residual connections on learning. AAAI, 4:12, 2017. +Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. MnasNet: Platform-aware neural architecture search for mobile. CVPR, 2019. +Xie, S., Girshick, R., Doll�ar, P., Tu, Z., and He, K. Aggre.gated residual transformations for deep neural networks. CVPR, pp. 5987�5995, 2017. +Yang, T.-J., Howard, A., Chen, B., Zhang, X., Go, A., Sze, V., and Adam, H. Netadapt: Platform-aware neural net.work adaptation for mobile applications. ECCV, 2018. +Zagoruyko, S. and Komodakis, N. Wide residual networks. BMVC, 2016. +Zhang, X., Li, Z., Loy, C. C., and Lin, D. Polynet: A pursuit of structural diversity in very deep networks. CVPR, pp. 3900�3908, 2017. +Zhang, X., Zhou, X., Lin, M., and Sun, J. Shuf�enet: An ex.tremely efficient convolutional neural network for mobile devices. CVPR, 2018. +Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, +A. Learning deep features for discriminative localization. CVPR, pp. 2921�2929, 2016. +Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. ICLR, 2017. +Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning transferable architectures for scalable image recognition. CVPR, 2018. +<> <> <> + + +<> <> <> +Energy and Policy Considerations for Deep Learning in NLP + +Emma Strubell Ananya Ganesh Andrew McCallum College of Information and Computer Sciences University of Massachusetts Amherst +{strubell, aganesh, + +Abstract + +Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exception.ally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the car.bon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice. 
+ +1 Introduction + +Advances in techniques and hardware for train.ing deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now re.quires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring re.training to experiment with model architectures and hyperparameters. Whereas a decade ago most + +<
> 

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.1

NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.
Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited by the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.
1Sources: (1) Air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.

2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.
We measure energy use as follows. We train the models described in Section 2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.

<
>

Table 2: Percent energy sourced from: Renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States, China and Germany (Burger, 2019).
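The GPU power sampling described in the measurement paragraph above (repeatedly querying the NVIDIA System Management Interface) can be reproduced, for example, by polling nvidia-smi from a small script; the following is a minimal sketch of such a loop, written by us rather than taken from the paper.

import subprocess
import time

def sample_gpu_power_watts():
    # One nvidia-smi query; returns the instantaneous power draw (watts) per visible GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        text=True)
    return [float(line) for line in out.strip().splitlines()]

def average_gpu_power(duration_s=3600, interval_s=1.0):
    # Average per-GPU power draw over a sampling window, as described in the text.
    samples = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        samples.append(sample_gpu_power_watts())
        time.sleep(interval_s)
    n_gpus = len(samples[0])
    return [sum(s[i] for s in samples) / len(samples) for i in range(n_gpus)]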
We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let pc be the average power draw (in watts) from all CPU sockets during training, let pr be the average power draw from all DRAM (main memory) sockets, let pg be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). Then the total power pt required at a given instance during training is given by:

<> (1)

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

<> (2)

This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used.
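Equations (1) and (2) are not reproduced in this copy, but the prose above fully specifies the computation: combined CPU, DRAM, and GPU draw, scaled by the PUE coefficient and the training time, then converted to CO2 with an average emissions factor. The sketch below follows that description; the variable names are ours, and the EPA emissions coefficient is left as an explicit parameter rather than hard-coded.

def total_energy_kwh(hours, cpu_watts, dram_watts, gpu_watts, n_gpus, pue=1.58):
    # Combined CPU + DRAM + GPU draw (watts), scaled by PUE (1.58, the 2018
    # global data-center average) and the training time, converted to kWh.
    return pue * hours * (cpu_watts + dram_watts + n_gpus * gpu_watts) / 1000.0

def co2_lbs(kwh, lbs_co2_per_kwh):
    # Convert energy to estimated CO2 emissions using an average emissions factor
    # in pounds per kWh (e.g. the U.S. average from the EPA report cited above;
    # the exact coefficient is not reproduced here).
    return kwh * lbs_co2_per_kwh

# Hypothetical example: a 336-hour run on 3 GPUs with assumed average power draws.
energy = total_energy_kwh(hours=336, cpu_watts=90, dram_watts=30, gpu_watts=200, n_gpus=3)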
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.
Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in 4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.
ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pretrained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).
BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).
GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.6

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of carbon and dollar cost of training.
Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.
6Via the authors on Reddit.
7GPU lower bound computed using pre-emptible <> U.S. resources priced at <>, upper bound uses on-demand U.S. resources priced at <>. We similarly use pre-emptible (<>) and on-demand (<>) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.

<
> + +Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).7 Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware. + +4 Experimental results + +4.1 Cost of training +Table 3 lists CO2 emissions and estimated cost of training the models described in 2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American fight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to Ger.man machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions. + +4.2 Cost of development: Case study +To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP. +Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.8 +The sum GPU time required for the project totaled 9998 days (27 years). This averages to + +<
> + +Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model (2) a single tune and (3) all models trained during R&D. +about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity re.quired to develop and deploy this model.9 We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive. + +5 Conclusions + +Authors should report training time and sensitivity to hyperparameters. +Our experiments suggest that it would be beneficial to directly compare different models to per.form a cost-bene�t (accuracy) analysis. To ad.dress this, when proposing a model that is meant to be re-trained for downstream use, such as re.training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources + +We approximate cloud compute cost using P100 pricing. 9Based on average U.S cost of electricity of $0.12/kWh. +are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched. +Academic researchers need equitable access to computation resources. + +Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute. +Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, instead constrained to focus on different problems. Second, it prohibits certain types of Research on the basis of access to financial resources. This even more deeply promotes the already problematic rich get richer cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure. +While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for nonprofit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. 
For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, that invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers. +Researchers should prioritize computationally efficient hardware and algorithms. +We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy.to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,10 they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the work�ows with which NLP researchers and practitioners are already familiar could have notable im.pact on the cost of developing and tuning in NLP. + +Acknowledgements + +We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan-Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations ex.pressed in this material are those of the authors and do not necessarily reflect those of the sponsor. +For example, the Hyperopt Python library. + +References + +Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute. +Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben.gio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd Inter.national Conference for Learning Representations (ICLR), San Diego, California, USA. +James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281�305. +James S Bergstra, R�emi Bardenet, Yoshua Bengio, and Bal�azs K�egl. 2011. Algorithms for hyper-parameter optimization. In Advances in neural information processing systems, pages 2546�2554. +Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE. +Alfredo Canziani, Adam Paszke, and Eugenio Culur.ciello. 2016. An analysis of deep neural network models for practical applications. 
+Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace. +Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Un.derstanding. In NAACL. +Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency pars.ing. In ICLR. +EPA. 2018. Emissions & Generation Resource Inte.grated Database (eGRID). Technical report, U.S. Environmental Protection Agency. +Christopher Forster, Thor Johnsen, Swetha Man.dava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI. +Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy ef�ciency of deep con.volutional neural networks on cpus and gpus. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Comput.ing and Communications (SustainCom) (BDCloud.SocialCom-SustainCom), pages 477�484. +Thang Luong, Hieu Pham, and Christopher D. Man.ning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412�1421. Associa.tion for Computational Linguistics. +Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word rep.resentations. In NAACL. +Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. +Jasper Snoek, Hugo Larochelle, and Ryan P Adams. 2012. Practical bayesian optimization of machine learning algorithms. In Advances in neural informa.tion processing systems, pages 2951�2959. +David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML). +Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Se.mantic Role Labeling. In Conference on Empir.ical Methods in Natural Language Processing (EMNLP), Brussels, Belgium. +Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS). +<> <> <> + + +<> <> <> +Finite-Element Neural Networks for Solving Differential Equations +Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE + +Abstract + +The solution of partial differential equations (PDE) arises in a wide variety of engineering problems. Solutions to most practical problems use numerical analysis techniques such as finite-element or finite-difference methods. The drawbacks of these approaches include computational costs associated with the modeling of complex geometries. This paper proposes a finite-element neural network (FENN) obtained by embedding a finite-element model in a neural network architecture that enables fast and ac.curate solution of the forward problem. Results of applying the FENN to several simple electromagnetic forward and inverse problems are presented. 
Initial results indicate that the FENN performance as a forward model is comparable to that of the conventional finite-element method (FEM). The FENN can also be used in an iterative approach to solve inverse problems associated with the PDE. Results showing the ability of the FENN to solve the in.verse problem given the measured signal are also presented. The parallel nature of the FENN also makes it an attractive solution for parallel implementation in hardware and software. + +I. INTRODUCTION + +Solutions of differential equations arise in a wide variety of engineering applications in electromagnetics, signal processing, computational fluid dynamics, etc. These equations are typically solved using either analytical or numerical methods. Analytical solution methods are however feasible only for simple geometries, which limits their applicability. In most practical problems with complex boundary conditions, numerical analysis methods are required in order to obtain a reasonable solution. An example is the solution of Maxwell's equations in electromagnetics. Solutions to Maxwell's equations are used in a variety of applications for calculating the interaction of electromagnetic (EM) fields with different types of media. +Very often, the solution to differential equations is necessary for solving the corresponding inverse problems. Inverse problems in general are ill-posed, lacking continuous dependence of the measurements on the input. This has resulted in the development of a variety of solution techniques ranging from simple calibration procedures to other direct (analytical) and iterative approaches [1]. Iterative methods typically employ a forward model that simulates the underlying physical process (Fig. 1) [2]. An initial estimate of the solution of the inverse problem (represented by +in Fig. 1) is applied to the forward model, +Manuscript received January 17, 2004; revised April 2, 2005. + +<
> + +Fig. 1. Iterative inversion method for solving inverse problems. +resulting in the corresponding solution to the forward problem + +<> + +Although finite-element methods (FEMs) [3], [4] are extremely popular for solving differential equations, their major drawback is computational complexity. This problem becomes more acute when three-dimensional (3-D) finite-element models are used in an iterative algorithm for solving the inverse problem. Recently, several authors have suggested the use of neural networks (MLP or RBF networks [5]) for solving differential equations [6][9]. +In these techniques, a neural network is trained using a large database containing the input data and the solution of the differential equation. The neural network during generalization learns the mapping corresponding to the PDE. Alternatively, in [10], the solution to a differential equation is written as a constant term, and an adjustable term with parameters that need to be determined. A neural network is used to determine the optimal values of the parameters. This approach is applicable only to problems with regular boundaries. An extension of the approach to problems with irregular boundaries is given in [11]. Other neural network based differential equation solvers use multilayer perceptron networks or variations on the MLP to approximate the unknown function in a PDE [12][14]. A combination of the PDE and boundary conditions is used to construct an objective function that is minimized during the training process. +A major limitation of these approaches is that the network architecture is selected somewhat arbitrarily. A second drawback is that the performance of the neural networks depends on the data used in training and testing. As long the test data is similar to the training data, the network can interpolate between the training data points to obtain a reasonable prediction. However, when the test signal is no longer similar to the training data, the +network is forced to extrapolate and the performance degrades. One way around this difficulty is to ensure that the training data.base has a diverse set of signals. However, this is difficult to ensure in practice. Alternatively, we have to design neural net.works that are capable of extrapolation. Extrapolation methods are discussed extensively in literature [15][18], but the design of an extrapolation neural network involves several issues particularly for ensuring that the error in the network prediction stays within reasonable bounds during the extrapolation procedure. +An ideal solution to this problem would be to combine the power of numerical models with the computational speed of neural networks, i.e., to embed a numerical model in a neural network structure. One such finite-element neural network (FENN) formulation has been reported by Takeuchi and Kosugi [19]. This approach, based on error minimization, derives the neural network using the energy functional resulting from the finite-element formulation. Other reports of FENN combinations are either similar to the Takeuchi method [20], [21] or use Hopfield neural networks to solve the forward problem [22], [23]. Kalkkuhl et al. [24] provide a description of a FEM-based approach to NARX modeling that may be interpreted both as a local model network, as well as a single layer feedforward network. 
A slightly different approach to merging numerical methods and neural networks is given in [25], where the finite-difference time domain (FDTD) method is cast in a neural network framework for the purpose of solving electromagnetic forward problems. The related problem of mesh generation in finite-element models has also been tackled using neural networks (for instance, [26]). Generally, these networks are designed to solve the forward problem, and must be modified to solve inverse problems. +This paper proposes a new approach that embeds a finite-element model commonly used in the solution of differential equations in a neural network. The network, called the FENN, can solve the forward problem and can also be used in an iterative algorithm to solve inverse problems. The primary advantage of this approach is that the FEM is represented in a parallel form. Thus, it has the potential to alleviate the computational cost associated with using the FEM in an iterative algorithm for solving inverse problems. More importantly, the FENN does not need any training, and the computation of the weights is a one-time process. The proposed approach is also different in that the neural network architecture developed can be used to solve the forward and inverse problems. The structure of the neural network is also simpler than those reported in the literature, making it easier to implement in parallel in both hardware and software. +The rest of this paper is organized as follows. Section II briefly describes the FEM, and derives the proposed FENN. In this paper, we focus on the problem of solving typical equations encountered in electromagnetic nondestructive evaluation (NDE). However, the same concepts can be easily applied to solve differential equations encountered in other fields. Sections III, IV and V present the application of the FENN to solving forward and inverse problems, along with initial results. A discussion of the advantages and disadvantages of the proposed FENN architecture is given in Section IV. Finally, Section V draws conclusions from the results and presents ideas for future work. + +II. THE FENN + +This section briefly describes the FEM and proposes its reformulation into a parallel neural network structure. Details about the FEM can be found in [3] and [4]. + +A. The FEM + +Consider a typical boundary value problem with the governing differential equation + +<> (1) + +where <> is a differential operator, <> is the applied source or forcing function, and +is the unknown quantity. This differential equation can be solved in conjunction with boundary conditions on the boundary +enclosing the domain +The variational formulation used in finite-element analysis determines the unknown + +by minimizing the functional [3], [4] (2) with respect to the trial function + +The minimization procedure starts by dividing into small subdomains called elements (Fig. 2) and representing in each element by means of basis functions defined over the element (3) where +is the unknown solution in element + +<> (3) + +is the basis function associated with node in element , is the value of the unknown quantity at node and is the total number of nodes associated with element <> In general, the basis functions (also referred to as interpolation functions or shape functions) can be linear, quadratic, or of higher order. Typically, finite-element models use either linear or polynomial spline basis functions. 
+The functional within an element is expressed as + +<> (4) + +By substituting (3) in (4), we obtain the discrete version of the functional within each element + +<> (5) +where is the transpose of a matrix, mental matrix with elements is the ele. + +<> (6) + +and is an vector with elements + +<> (7) + +Combining the values in (5) for each of the elements (8) where is the global matrix derived from the terms of the elemental matrices for different elements, and +is the total number of nodes, also called the stiffness matrix, is a sparse, banded matrix. Equation (8) is the discrete version of the functional and can be minimized with respect to the nodal parameters +by taking the derivative of with respect to <> and setting it equal to zero, which results in the matrix equation + +<> (9) + +Boundary conditions for these problems are usually of two types: natural boundary conditions and essential boundary conditions. Essential boundary conditions (also referred to as Dirichlet boundary conditions) impose constraints on the value of the unknown +at several nodes. Natural boundary conditions (of which Neumann boundary conditions are a special case) impose constraints on the change in +across a boundary. Dirichlet boundary conditions are imposed on the functional minimization (9), by deleting the rows and columns of the matrix corresponding to the nodes on the Dirichlet boundary and modifying +in (9). + + +Natural boundary conditions are applied in the FEM by adding an additional term to the functional. These boundary conditions are then incorporated into the functional and are satisfied automatically during the solution procedure. As an example, consider the natural boundary condition represented by the following equation [3] on + +<> (10) + +where <> represents the Neumann boundary, is its outward normal unit vector, is some constant, and , <>, and are known parameters associated with the boundary. Assuming that the boundary +is made up of segments, we can define boundary matrices and with elements + +<> (11) + +where <>are basis functions defined over segment and is the length of the segment. The elements of <> are added to the elements of that correspond to the nodes on the boundary. Similarly, the elements of <> are added to the corresponding elements of +<> The global matrix (9) is thus modified as follows before solving for + + +<> (12) + + + <
> + +Fig. 3. FEM domain discretization using two elements and four nodes. + +This process ensures that natural boundary conditions are implicitly and automatically satisfied during the FEM solution procedure. + +B. The FENN + +This section describes how the finite-element model can be converted into a parallel network form. We focus on solving typical inverse problems arising in electromagnetic NDE, but the basic idea is applicable to other areas as well. NDE inverse problems can be formulated as the problem of finding the material properties (such as the conductivity or the permeability) within the domain of the problem. Since the domain is discretized in the FEM method by a large number of elements, the problem can be posed as one of finding the material properties in each of these elements. These properties are usually embedded in the differential operator <> or equivalently, in the global matrix +<> Thus, in order to be able to iteratively estimate these properties from the measurements, the material properties need to be separated out from +<> This separation is easier to achieve at the element matrix level. For nodes <> and in element + +> (13) + +where <> is the parameter representing the material property in element <> and <> represents the differential operator at the + +<
> + +Fig. 4. FENN. + +element level without embedded in it. Substituting (13) into the functional, we get + +<> (14) +If we define + +<> (15) + +where + +<> (16) + +<> (17) + + +Equation (17) expresses the functional explicitly in terms of <> The assumption that is constant within each element is implicit in this expression. This assumption is usually satisfied in problems in NDE where each element in the FEM mesh is defined within the confines of a domain, and at no time does a single element cross domain boundaries. Furthermore, each element is small enough that minor variations in +within an element may be ignored. Equation (17) can be easily converted into a parallel network form. The neural network comprises an input, output and hidden layer. In the general case with +<> elements and <> nodes in the FEM mesh, the input layer with network inputs takes the values in each element as input. The hidden layer has +neurons arranged in groups of neurons, corresponding to the members of the global <> matrix + +. The output of each group of hidden layer neurons is the corresponding row vector of +. The weights from the input to the hidden layer are set to the appropriate values of +. Each neuron in the hidden layer acts as a summation unit, (equivalent to a summation followed by a linear activation function [5]). The outputs of the hidden layer neurons are the elements of the global matrix + +as given in (15). Each group of hidden neurons is connected to one output neuron (giving a total of output neurons) by a set of weights with each element of +representing the nodal values. Note that the set of weights +between the first group of hidden neurons and the first output neuron are the same as the set of weights between the second group of hidden neurons and the second output neuron (as well as between successive groups of hidden neurons and the corresponding output neuron). Each output neuron is also a summation unit followed by a linear activation function, and the output of each neuron is equal to + +<> (18) + +where the second part of (18) is obtained by using (15). As an example, the FENN architecture for a two-element, four-node FEM mesh (Fig. 3) is shown in Fig. 4. In this case, the FENN has two input neurons, 16 hidden layer neurons and four output neurons. The �gure illustrates the grouping of the hidden layer neurons, as well as the similarity inherent in the weights that connect each group of hidden layer neurons to the corresponding output neuron. To simplify the �gure, the weights between the network input and hidden layer neurons are depicted by means of vectors +(for , 2, 3, 4 and , 2), where the individual weight values <> are defined as in (16). +1) Boundary Conditions in the FENN: Note that the elements of <> and in (11) do not depend on the material properties <> and need to be added appropriately to the global matrix +and the source vector as shown in (12). + +<
> + +Fig. 5. Geometry of mesh for 1-D FEM. + +<
> 

Fig. 6. Flowchart (with example) for designing the FENN for a general PDE.

Equation (12) thus implies that natural boundary conditions can be applied in the FENN as bias inputs to the hidden layer neurons that are a part of the boundary, and the corresponding output neurons. Dirichlet boundary conditions are applied by clamping the corresponding weights between the hidden layer and output layer neurons. These weights will be referred to as the clamped weights, while the remaining weights will be referred to as the free weights. An example of these weights is presented later.
The FENN architecture was derived without consideration of the dimensionality of the problem at hand, and thus can be used for 1-, 2-, 3-, or higher dimensional problems. The number of nodes and elements in the FEM mesh dictates the number of neurons in the different layers. The weights between the input and hidden layer change depending on node-element connectivity information.
The major drawback of the FENN is the number of neurons and weights necessary. However, the memory requirements can be reduced considerably, since most of the weights between the input and hidden layer are zero. These weights, and the corresponding connections, can be discarded. Similarly, most of the elements of the global matrix <> are also zero (<> is a banded matrix). The corresponding neurons in the hidden layer can also be discarded, reducing memory and computation requirements considerably. Furthermore, the weights between each group of hidden layer neurons and the output layer are the same. Weight-sharing approaches can be used here to further reduce the storage requirements.

C. A 1-D Example

Consider the 1-D equation

<> (19)

on the boundary <> defined by <>, where <> and <> are constants depending on the material and <> is the applied source. Laplace's equation and Poisson's equation are special cases of this equation. The FENN formulation for this problem starts by discretizing the domain of interest with <> elements and <> nodes. In one dimension, each element is defined by two nodes (Fig. 5). We define basis functions <> and <> over each element <>, and let <> be the value of <> on node <> in element <>. An example of the basis functions is shown in Fig. 5. For these basis functions, i.e.,

<> (20)

the element matrices are given by [3]

<> (21)

<> (22)

Here, <> is the length of element <>. The global matrix <> is then constructed by selectively adding the element matrices based on the nodes that form an element. Specifically, <> is a sparse tridiagonal matrix, and its nonzero elements are given by

<> (23)

Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) Problem description using symmetry considerations.

The network implementation of (23) can be derived as follows. If the <> and <> values at each element are the inputs to the network, <> and <> form the weights between the input and hidden layers. The network thus uses <> input neurons and <> hidden neurons. The values of <> at each of the nodes are assigned as weights between the hidden and output layers, and the source <> is the desired output of this network (corresponding to the output neurons). Dirichlet boundary conditions on <> are applied as explained earlier.
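For concreteness, the sketch below assembles the sparse tridiagonal global matrix described in (23) for a generic 1-D problem of the form -(d/dx)(a du/dx) + b u = f with linear basis functions. Since (19)-(22) are not reproduced in this copy, the element matrices used here are the standard textbook ones for that form rather than the paper's exact expressions, and boundary conditions are omitted.

import numpy as np

def assemble_1d(alpha, beta, f, lengths):
    # alpha, beta, f, lengths: per-element arrays; nodes are numbered 0..n_elem.
    n_elem = len(lengths)
    K = np.zeros((n_elem + 1, n_elem + 1))
    b = np.zeros(n_elem + 1)
    for e in range(n_elem):
        l = lengths[e]
        # Element matrix written as material parameters times geometry-only parts,
        # mirroring the separation of material properties in (13).
        Ke = (alpha[e] / l) * np.array([[1.0, -1.0], [-1.0, 1.0]]) \
           + (beta[e] * l / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])
        be = (f[e] * l / 2.0) * np.ones(2)
        nodes = [e, e + 1]              # the two nodes defining element e
        K[np.ix_(nodes, nodes)] += Ke   # selective addition -> tridiagonal K
        b[nodes] += be
    return K, b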
D. General Case

Fig. 6 shows a flowchart of the general scheme for converting a differential equation into the FENN structure. An example in two dimensions is also provided next to the flowchart. We start with the differential equation and the boundary conditions and formulate the FEM using the variational method. This involves discretizing the domain of interest with <> elements and <> nodes, selecting basis functions, writing the functional for each element, and obtaining the element matrices and the source vector. The example presented uses the FEM mesh shown in Fig. 3, with <> elements, <> nodes, and linear basis functions. The unknown solution to the differential equation <> is represented by its values at each of the nodes in the finite-element mesh <>. The element matrices <> are then separated into two parts, with one part dependent on the material properties <> and <>, while the other is independent of them. The FENN is then designed to have <> input neurons, <> hidden neurons, and <> output neurons, where <> is the number of material property parameters. In the example under consideration, <>, since we have two material property parameters (<> and <>). The first group of input neurons takes in the <> values, while the second group takes in the <> values in each element. The weights from the input to the hidden layer are set to the appropriate values of <>. In the example, since nodes 1, 2, and 3 are part of element 1 (see Fig. 3), the weights from the first input node <> to the first group of four neurons in the hidden layer are given by

<> (24)

The last weight is zero since node 4 is not a part of element 1. Each group of hidden neurons is connected to one output neuron (giving a total of <> output neurons) by a set of weights <>, with each element of <> representing the nodal values <>. The output of each neuron in the output layer is equal to <>.
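The connectivity-driven weight assignment of (16) and (24) can be sketched as follows. Here elem_nodes and elem_matrix are assumed helpers (the latter returning an element matrix with the material property factored out), and the dense tensor is used only for clarity.

    # Sketch of building the fixed input-to-hidden weights from connectivity.
    import numpy as np

    def build_input_weights(elem_nodes, elem_matrix, n_nodes):
        """elem_nodes:  list of tuples of global node indices for each element
        elem_matrix: callable e -> local element matrix with the material
                     property factored out, as in (16)
        Returns the (n_elems, n_nodes, n_nodes) input-to-hidden weight tensor."""
        n_elems = len(elem_nodes)
        w = np.zeros((n_elems, n_nodes, n_nodes))
        for e, nodes in enumerate(elem_nodes):
            Ke = elem_matrix(e)
            for a, i in enumerate(nodes):
                for b, j in enumerate(nodes):
                    # entries for nodes outside element e stay zero, as in (24)
                    w[e, i, j] = Ke[a, b]
        return w

Entries associated with nodes that do not belong to an element remain zero; as noted above, these zero weights and their connections can be discarded (or stored sparsely) to reduce memory requirements.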
Fig. 8. Forward problem solutions for the shielded microstrip problem show the contours of constant potential for: (a) FEM solution and (b) FENN solution. (c) Error between (a) and (b). The x- and y-axes show the nodes in the FEM discretization of the domain, and the z-axis in (c) shows the error at each of these nodes in volts.

III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN

The FENN architecture and algorithm lends itself to solving both the forward and inverse problems. The forward problem involves determining the weights <> given the material parameters <> and <> and the applied source <>, while the inverse problem involves determining <> and <> given <> and <>. A gradient-based approach can be used to solve both these problems. Suppose we define the error at the output of the FENN as

<>

where <> is the output of the FENN. Then, for a gradient-based approach, the gradient of the error with respect to the free hidden layer weights is given by

<> (27)

Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to <> and <> (the inputs of the FENN) are necessary, and are given by <>.
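As a concrete illustration of the forward procedure, the sketch below minimizes a squared output error by gradient descent, assuming the error is E = 0.5*||K(alpha)*phi - b||^2 so that its gradient with respect to the free nodal weights is K^T(K*phi - b), in the spirit of (27). The function name, learning rate, iteration count, and masking scheme are illustrative assumptions rather than the paper's exact algorithm.

    # Minimal gradient-descent forward solver (sketch, not the paper's code).
    import numpy as np

    def solve_forward(K, b, phi0, free, lr=0.1, n_iter=5000):
        """K: assembled global matrix; b: desired source vector;
        phi0: initial nodal values with Dirichlet entries already clamped;
        free: boolean mask marking the free (non-Dirichlet) weights."""
        phi = phi0.copy()
        for _ in range(n_iter):
            residual = K @ phi - b        # FENN output minus desired output
            grad = K.T @ residual         # gradient of 0.5*||K phi - b||^2 w.r.t. phi
            phi[free] -= lr * grad[free]  # clamped weights are never updated
        return phi

Updating only the free weights in this way is the iterative counterpart of solving (9) directly, which is consistent with the observation below that such an approach is equivalent to iterative FEM solvers [4].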
TABLE I. SUMMARY OF PERFORMANCE OF THE FENN ALGORITHM FOR VARIOUS PDES

For the forward problem, such an approach is equivalent to the iterative approaches used to solve for the unknown nodal values in the FEM [4].

IV. RESULTS

A. Forward Model Results

The FENN was tested using both 1-D and 2-D versions of Poisson's equation

<> (30)

where <> represents the material property and <> is the applied source. For instance, in electromagnetics <> may represent the permittivity while <> represents the charge density. As the first example, consider the following 2-D equation

<> (31)

with boundary conditions

<> on <> (32)

<> on <> (33)

This is the governing equation for the shielded microstrip transmission line problem shown in Fig. 7. The forward problem computes the electric potential due to the shielded microstrip shown in Fig. 7(a). The potentials are zero on the shielding conductor. Since the geometry is symmetric, we can solve the equivalent problem shown in Fig. 7(b) by applying the homogeneous Neumann condition on the plane of symmetry. The inner conductor (microstrip) is held at a constant potential of <> volts. Finally, we also assume that the material inside the shielding conductor has a permittivity <>, where K is a constant. The permittivity in this case corresponds to the material property <>. Specifically, <> and <>. The homogeneous Neumann boundary condition is equivalent to setting <>. The microstrip and the shielding conductor correspond to the Dirichlet boundary, with <> on the microstrip and <> on the outer boundary [Fig. 7(b)]. Finally, there is no source term in this example (the source term would correspond to a charge distribution in the domain of interest), i.e., <>. In this example, we assume that <> volts. Further, we assume that the domain of interest is <>.

The solution to the forward problem is presented in Fig. 8, with the FEM solution using 11 nodes in each direction shown in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). These figures show contours of constant potential. The error between the FEM and FENN solutions is presented in Fig. 8(c). As seen from the figure, the FENN is seen to match the FEM solution accurately, with the peak error at any node on the order of <>.

Several other examples were also used to test the FENN, and the results are summarized in Table I. Column 1 shows the PDE used to evaluate the FENN performance, while Column 2 shows the boundary conditions used. The analytic solution to the problem is indicated in Column 3. The FENN structure and the number of iterations for convergence using a gradient descent approach are indicated in Columns 4 and 5, respectively. The FENN structure, as explained earlier, is <>, where <> and <> are the number of elements and nodes in the FEM mesh, respectively, and <> is the number of hidden neurons, which corresponds to the number of nonzero elements in the FEM global matrix <>. Finally, Columns 6 and 7 present the sum-squared error (SSE) and the maximum error in the solution, respectively, where the errors are computed with respect to the analytical solution. These results indicate that the FENN is capable of accurately determining the potential <>.

One advantage of the FENN approach is that the computation of the input-to-hidden layer weights is a one-time process, as long as the differential equation does not change. The only changes necessary to solve the different problems are changes in the input <> and the desired output.
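Error figures of the kind reported in Columns 6 and 7 of Table I could be computed as in the short sketch below; phi_fenn and phi_exact are assumed to hold the FENN and analytical nodal solutions, and the names are illustrative.

    # Sketch of the error metrics used to compare against the analytic solution.
    import numpy as np

    def solution_errors(phi_fenn, phi_exact):
        err = np.asarray(phi_fenn) - np.asarray(phi_exact)
        sse = float(np.sum(err ** 2))           # sum-squared error (column 6)
        max_err = float(np.max(np.abs(err)))    # maximum nodal error (column 7)
        return sse, max_err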
B. Inverse Model Results

The FENN was also used to solve several simple inverse problems based on (30). In all cases, the objective was to determine
the value of <> and <> for given values of <> and <>.

Fig. 9. FENN inversion results for Poisson's equation with initial solutions (a) and (b).

The first example is a 1-D problem that involves determining <> given <> and <> for the differential equation

<> (34)

with boundary conditions <> and <>. The analytical solution to this inverse problem is <> and

<> (35)

As seen from (35), the problem has an infinite number of solutions, and we expect the solution procedure to converge to one of these solutions depending on the initial value.

Fig. 9(a) and (b) shows two solutions to this inverse problem for two different initializations (shown using triangles). In both cases, the FENN solution (in stars) is seen to match the analytical solution (squares). The SSE in both cases was on the order of <>.

In order to obtain a unique solution, we need to constrain the value of <> at the boundary as well. Consider the same differential equation as (34), but with <> and <> specified as follows:

<> (36)

The analytical solution for this equation is <>. To solve this problem, we set <> and clamp the value of <> at <> and <> as follows: <>, <>. The results of the constrained inversion obtained using 11 nodes and 10 elements in the corresponding finite-element mesh are shown in Fig. 10. Fig. 10(a) shows the comparison between the analytical solution (solid line with squares) and the FENN result (solid line with stars). The initial value of <> is shown in the figure as a dashed line. Fig. 10(b) shows the comparison between the actual and desired forcing function at the FENN output. This result indicates that the SSE in the forcing function, as well as the SSE in the inversion result, is fairly large (0.0148 and 0.0197, respectively). The reason for this was traced back to the mesh discretization. Fig. 11 shows the SSE in the output of the FENN and the SSE in the inverse problem solution as a function of FEM discretization. It is seen that increasing the discretization significantly improves the solution. Similar results were observed for other problems.
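The constrained inversion described above can be sketched in the same style as the earlier forward solver: the nodal values phi are treated as measured and stay fixed, the per-element material values alpha are updated by gradient descent on the squared output error, and clamped entries are excluded from the update. All names and hyperparameters below are assumptions for illustration, not the paper's settings.

    # Sketch of gradient-based inversion for the per-element material property.
    import numpy as np

    def solve_inverse(w, phi, f, alpha0, free, lr=0.05, n_iter=20000):
        """w: fixed input-to-hidden weight tensor; phi: measured nodal values;
        f: desired source; alpha0: initial material estimate;
        free: boolean mask of elements whose alpha is not clamped."""
        alpha = alpha0.copy()
        for _ in range(n_iter):
            K = np.einsum('e,eij->ij', alpha, w)   # hidden-layer output, cf. (15)
            residual = K @ phi - f                 # FENN output minus desired source
            # dE/dalpha_e = residual . (W_e phi) for E = 0.5*||K phi - f||^2
            grad = np.einsum('i,eij,j->e', residual, w, phi)
            alpha[free] -= lr * grad[free]
        return alpha

Because the element weights w are fixed, each iteration only requires matrix-vector products; no inversion of the global matrix is needed, which is the advantage emphasized in the discussion that follows.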
V. DISCUSSION AND CONCLUSION

The FENN is closely related to the finite-element model used to solve differential equations. The FENN architecture has a weight structure that allows both the forward and inverse problems to be solved using simple gradient-based algorithms. Initial results indicate that the proposed FENN algorithm is capable of accurately solving both the forward and inverse problems. In addition, the forward problem solution from the FENN is seen to exactly match the FEM solution, indicating that the FENN represents the finite-element model exactly in a parallel configuration.

The major advantage of the FENN is that it represents the finite-element model in a parallel form, enabling parallel implementation in either hardware or software. Further, computing gradients in the FENN is very simple. This is an advantage in solving both forward and inverse problems using gradient-based methods. The gradients can also be computed in parallel, and the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network.

Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, such as conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method.

REFERENCES

[1] L. Udpa and S. S. Udpa, "Application of signal processing and pattern recognition techniques to inverse problems in NDE," Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99–117, 1997.
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, "Iterative algorithms for electromagnetic NDE signal inversion," in ENDE '97, Reggio Calabria, Italy, Sep. 14–16, 1997.
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993.
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994.
[6] C. A. Jensen et al., "Inversion of feedforward neural networks: algorithms and applications," Proc. IEEE, vol. 87, no. 9, pp. 1536–1549, 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa, "Neural network algorithm for electromagnetic NDE signal inversion," in ENDE 2000, Budapest, Hungary, Jun. 2000.
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, "Automation of SQUID nondestructive evaluation of steel plates by neural networks," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475–3478, 1999.
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, "Using wavelet neural networks for the optimal design of electromagnetic devices," IEEE Trans. Magn., vol. 33, no. 2, pp. 1928–1930, 1997.
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, "Artificial neural networks for solving ordinary and partial differential equations," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987–1000, 1998.
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, "Neural-network methods for boundary value problems with irregular boundaries," IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041–1049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, "Neural network differential equation and plasma equilibrium solver," Phys. Rev. Lett., vol. 75, no. 20, pp. 3594–3597, 1995.
[13] M. W. M. G. Dissanayake and N. Phan-Thien, "Neural-network-based approximations for solving partial differential equations," Commun. Numer. Meth. Eng., vol. 10, pp. 195–201, 1994.
[14] R. Masuoka, "Neural networks learning differential data," IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291–1300, 2000.
[15] D. C. Youla, "Generalized image restoration by the method of alternating orthogonal projections," IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694–702, 1978.
[16] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part I - theory," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81–94, 1982.
[17] A. Lent and H. Tuy, "An iterative method for the extrapolation of band-limited functions," J. Math. Analysis and Applicat., vol. 83, pp. 554–565, 1981.
[18] W. Chen, "A new extrapolation algorithm for band-limited signals using the regularization method," IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048–1060, 1993.
[19] J. Takeuchi and Y. Kosugi, "Neural network representation of the finite element method," Neural Netw., vol. 7, no. 2, pp. 389–395, 1994.
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, "Artificial neural network application for material evaluation by electromagnetic methods," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027–4032.
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, "Application of FE-based neural networks to dynamic problems," in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039–1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, "Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399–4403.
[23] H. Lee and I. S. Kang, "Neural algorithm for solving differential equations," J. Computat. Phys., vol. 91, pp. 110–131, 1990.
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, "FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control," IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885–897, 1999.
[25] R. K. Mishra and P. S. Hall, "NFDTD concept," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484–490, 2005.
[26] D. G. Triantafyllidis and D. P. Labridis, "A finite-element mesh generator based on growing neural networks," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482–1496, 2002.

<> <> <>
\ No newline at end of file
diff --git a/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt b/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt
deleted file mode 100644
index bf1fb21..0000000
Binary files a/Corpus/Efficient Processing of Deep Neural Networks- A Tutorial and Survey.txt and /dev/null differ
diff --git a/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt b/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt
deleted file mode 100644
index 64f926a..0000000
Binary files a/Corpus/EfficientNet Rethinking Model Scaling for Convolutional Neural Networks.txt and /dev/null differ
diff --git a/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt b/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt
deleted file mode 100644
index 2c16ab6..0000000
--- a/Corpus/Energy and Policy Considerations for Deep Learning in NLP - Emma Strubell.txt
+++ /dev/null
@@ -1,261 +0,0 @@
- Energy and Policy Considerations for Deep Learning in NLP
-
- Emma Strubell Ananya Ganesh Andrew McCallum
- College of Information and Computer Sciences
- University of Massachusetts Amherst
- {strubell, aganesh, mccallum}@cs.umass.edu
-
- Abstract Consumption CO 2 e (lbs)
- Air travel, 1 passenger, NY↔SF 1984 Recent progress in hardware and methodol-
- arXiv:1906.02243v1 [cs.CL] 5 Jun 2019 Human life, avg, 1 year 11,023 ogy for training neural networks has ushered
- in a new generation of large networks trained American life, avg, 1 year 36,156
- on abundant data. These models have ob- Car, avg incl.
fuel, 1 lifetime 126,000 - tained notable gains in accuracy across many - NLP tasks. However, these accuracy improve- Training one model (GPU) - ments depend on the availability of exception- NLP pipeline (parsing, SRL) 39 ally large computational resources that neces- w/ tuning & experimentation 78,468 sitate similarly substantial energy consump- Transformer (big) 192 tion. As a result these models are costly to - train and develop, both financially, due to the w/ neural architecture search 626,155 - cost of hardware and electricity or cloud com- Table 1: Estimated COpute time, and environmentally,due to the car- 2 emissions from training com- - mon NLP models, compared to familiar consumption. 1 bon footprint required to fuel modern tensor - processing hardware. In this paper we bring - this issue to the attention of NLP researchers NLP models could be trained and developed on by quantifying the approximate financial and a commodity laptop or server, many now require environmental costs of training a variety of re- - cently successful neural network models for multiple instances of specialized hardware such as - NLP. Based on these findings, we propose ac- GPUs or TPUs, therefore limiting access to these - tionable recommendations to reduce costs and highly accurate models on the basis of finances. - improve equity in NLP research and practice. Even when these expensive computational re- - 1 Introduction sources are available, model training also incurs a - substantial cost to the environment due to the en- - Advances in techniques and hardware for train- ergy required to power this hardware for weeks or - ing deep neural networks have recently en- months at a time. Though some of this energy may - abled impressive accuracy improvements across come from renewable or carbon credit-offset re- - many fundamental NLP tasks ( Bahdanau et al., sources, the high energy demands of these models - 2015; Luong et al., 2015; Dozat and Man- are still a concern since (1) energy is not currently - ning, 2017; Vaswani et al., 2017), with the derived from carbon-neural sources in many loca- - most computationally-hungry models obtaining tions, and (2) when renewable energy is available, - the highest scores (Peters et al.,2018;Devlin et al., it is still limited to the equipment we have to pro- - 2019;Radford et al.,2019;So et al.,2019). As duce and store it, and energy spent training a neu- - a result, training a state-of-the-art model now re- ral network might better be allocated to heating a - quires substantial computational resources which family’s home. It is estimated that we must cut - demand considerable energy, along with the as- carbon emissions by half over the next decade to - sociated financial and environmental costs. Re- deter escalating rates of natural disaster, and based - search and development of new models multiplies on the estimated CO 2 emissions listed in Table 1, - these costs by thousands of times by requiring re- - training to experiment with model architectures 1 Sources: (1) Air travel and per-capita consump- - tion: https://bit.ly/2Hw0xWc; (2) car lifetime: and hyperparameters. Whereas a decade ago most https://bit.ly/2Qbr0w1. model training and development likely make up Consumer Renew. Gas Coal Nuc. - a substantial portion of the greenhouse gas emis- China 22% 3% 65% 4% - sions attributed to many NLP researchers. 
Germany 40% 7% 38% 13% - To heighten the awareness of the NLP commu- United States 17% 35% 27% 19% - nity to this issue and promote mindful practice and Amazon-AWS 17% 24% 30% 26% - policy, we characterize the dollar cost and carbon Google 56% 14% 15% 10% - emissions that result from training the neural net- Microsoft 32% 23% 31% 10% - works at the core of many state-of-the-art NLP - models. We do this by estimating the kilowatts Table 2: Percent energy sourced from: Renewable (e.g. - of energy required to train a variety of popular hydro, solar, wind), natural gas, coal and nuclear for - off-the-shelf NLP models, which can be converted the top 3 cloud compute providers (Cook et al.,2017), - to approximate carbon emissions and electricity compared to the United States, 4 China 5 and Germany - costs. To estimate the even greater resources re- (Burger,2019). - quired to transfer an existing model to a new task - or develop new models, we perform a case study We estimate the total time expected for mod- - of the full computational resources required for the els to train to completion using training times and - development and tuning of a recent state-of-the-art hardware reported in the original papers. We then - NLP pipeline (Strubell et al.,2018). We conclude calculate the power consumption in kilowatt-hours - with recommendations to the community based on (kWh) as follows. Letpc be the average power - our findings, namely: (1) Time to retrain and sen- draw (in watts) from all CPU sockets during train- - sitivity to hyperparameters should be reported for ing, letpr be the average power draw from all - NLP machine learning models; (2) academic re- DRAM (main memory) sockets, letpg be the aver- - searchers need equitable access to computational age power draw of a GPU during training, and let - resources; and (3) researchers should prioritize de- gbe the number of GPUs used to train. We esti- - veloping efficient models and hardware. mate total power consumption as combined GPU, - CPU and DRAM consumption, then multiply this - 2 Methods by Power Usage Effectiveness (PUE), which ac- - counts for the additional energy required to sup-To quantify the computational and environmen- port the compute infrastructure (mainly cooling).tal cost of training deep neural network mod- We use a PUE coefficient of 1.58, the 2018 globalels for NLP, we perform an analysis of the en- average for data centers (Ascierto,2018). Then theergy required to train a variety of popular off- total powerpthe-shelf NLP models, as well as a case study of t required at a given instance during - training is given by:the complete sum of resources required to develop - LISA (Strubell et al.,2018), a state-of-the-art NLP 1.58t(pp c +pr +gp g ) - model from EMNLP 2018, including all tuning t = (1)1000 - and experimentation. The U.S. Environmental Protection Agency (EPA)We measure energy use as follows. We train the provides average COmodels described in§2.1using the default settings 2 produced (in pounds per - kilowatt-hour) for power consumed in the U.S.provided, and sample GPU and CPU power con- (EPA,2018), which we use to convert power tosumption during training. Each model was trained estimated COfor a maximum of 1 day. We train all models on 2 emissions: - - a single NVIDIA Titan X GPU, with the excep- CO 2 e = 0.954pt (2) - tion of ELMo which was trained on 3 NVIDIA This conversion takes into account the relative pro-GTX 1080 Ti GPUs. 
While training, we repeat- portions of different energy sources (primarily nat-edly query the NVIDIA System Management In- ural gas, coal, nuclear and renewable) consumedterface 2 to sample the GPU power consumption to produce energy in the United States. Table2and report the average over all samples. To sample lists the relative energy sources for China, Ger-CPU power consumption, we use Intel’s Running many and the United States compared to the topAverage Power Limit interface. 3 - 5 U.S. Dept. of Energy:https://bit.ly/2JTbGnI - 2 nvidia-smi:https://bit.ly/30sGEbi 5 China Electricity Council; trans. China Energy Portal: - 3 RAPL power meter:https://bit.ly/2LObQhV https://bit.ly/2QHE5O3 three cloud service providers. The U.S. break- ence. Devlin et al.(2019) report that the BERT - down of energy is comparable to that of the most base model (110M parameters) was trained on 16 - popular cloud compute service, Amazon Web Ser- TPU chips for 4 days (96 hours). NVIDIA reports - vices, so we believe this conversion to provide a that they can train a BERT model in 3.3 days (79.2 - reasonable estimate of CO 2 emissions per kilowatt hours) using 4 DGX-2H servers, totaling 64 Tesla - hour of compute energy used. V100 GPUs (Forster et al.,2019). - GPT-2. This model is the latest edition of - 2.1 Models OpenAI’s GPT general-purpose token encoder, - We analyze four models, the computational re- also based on Transformer-style self-attention and - quirements of which we describe below. All mod- trained with a language modeling objective (Rad- - els have code freely available online, which we ford et al.,2019). By training a very large model - used out-of-the-box. For more details on the mod- on massive data,Radford et al.(2019) show high - els themselves, please refer to the original papers. zero-shot performance on question answering and - language modeling benchmarks. The large modelTransformer. The Transformer model (Vaswani described inRadford et al.(2019) has 1542M pa-et al.,2017) is an encoder-decoder architecture rameters and is reported to require 1 week (168primarily recognized for efficient and accurate ma- hours) of training on 32 TPUv3 chips. 6 chine translation. The encoder and decoder each - consist of 6 stacked layers of multi-head self- - attention. Vaswani et al.(2017) report that the 3 Related work - Transformerbasemodel (65M parameters) was - trained on 8 NVIDIA P100 GPUs for 12 hours, There is some precedent for work characterizing - and the Transformerbigmodel (213M parame- the computational requirements of training and in- - ters) was trained for 3.5 days (84 hours; 300k ference in modern neural network architectures in - steps). This model is also the basis for recent the computer vision community.Li et al.(2016) - work on neural architecture search (NAS) for ma- present a detailed study of the energy use required - chine translation and language modeling (So et al., for training and inference in popular convolutional - 2019), and the NLP pipeline that we study in more models for image classification in computer vi- - detail in§4.2(Strubell et al.,2018). So et al. sion, including fine-grained analysis comparing - (2019) report that their full architecture search ran different neural network layer types. Canziani - for a total of 979M training steps, and that their et al.(2016) assess image classification model ac- - base model requires 10 hours to train for 300k curacy as a function of model size and gigaflops - steps on one TPUv2 core. 
This equates to 32,623 required during inference. They also measure av- - hours of TPU or 274,120 hours on 8 P100 GPUs. erage power draw required during inference on - GPUs as a function of batch size. Neither work an-ELMo. The ELMo model (Peters et al.,2018) alyzes the recurrent and self-attention models thatis based on stacked LSTMs and provides rich have become commonplace in NLP, nor do theyword representations in context by pre-training on extrapolate power to estimates of carbon and dol-a large amount of data using a language model- lar cost of training.ing objective. Replacing context-independent pre- - trained word embeddings with ELMo has been Analysis of hyperparameter tuning has been - shown to increase performance on downstream performed in the context of improved algorithms - tasks such as named entity recognition, semantic for hyperparameter search (Bergstra et al.,2011; - role labeling, and coreference.Peters et al.(2018) Bergstra and Bengio,2012;Snoek et al.,2012). To - report that ELMo was trained on 3 NVIDIA GTX our knowledge there exists to date no analysis of - 1080 GPUs for 2 weeks (336 hours). the computation required for R&D and hyperpa- - rameter tuning of neural network models in NLP.BERT.The BERT model (Devlin et al.,2019) pro- - vides a Transformer-based architecture for build- - ing contextual representations similar to ELMo, 6 Via the authorson Reddit. - 7 GPU lower bound computed using pre-emptible but trained with a different language modeling ob- P100/V100 U.S. resources priced at $0.43–$0.74/hr, upper - jective. BERT substantially improves accuracy on bound uses on-demand U.S. resources priced at $1.46– - tasks requiring sentence-level representations such $2.48/hr. We similarly use pre-emptible ($1.46/hr–$2.40/hr) - and on-demand ($4.50/hr–$8/hr) pricing as lower and upper as question answering and natural language infer- bounds for TPU v2/3; cheaper bulk contracts are available. Model Hardware Power (W) Hours kWh·PUE CO 2 e Cloud compute cost - Transformer base P100x8 1415.78 12 27 26 $41–$140 - Transformer big P100x8 1515.43 84 201 192 $289–$981 - ELMo P100x3 517.66 336 275 262 $433–$1472 - BERT base V100x64 12,041.51 79 1507 1438 $3751–$12,571 - BERT base TPUv2x16 — 96 — — $2074–$6912 - NAS P100x8 1515.43 274,120 656,347 626,155 $942,973–$3,201,722 - NAS TPUv2x1 — 32,623 — — $44,055–$146,848 - GPT-2 TPUv3x32 — 168 — — $12,902–$43,008 - - Table 3: Estimated cost of training a model in terms of CO 2 emissions (lbs) and cloud compute cost (USD). 7 Power - and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware. - - - 4 Experimental results Estimated cost (USD) - Models Hours Cloud compute Electricity4.1 Cost of training 1 120 $52–$175 $5Table3lists CO 2 emissions and estimated cost of 24 2880 $1238–$4205 $118training the models described in§2.1. Of note is 4789 239,942 $103k–$350k $9870that TPUs are more cost-efficient than GPUs on - workloads that make sense for that hardware (e.g. Table 4: Estimated cost in terms of cloud compute and - BERT). We also see that models emit substan- electricity for training: (1) a single model (2) a single - tial carbon emissions; training BERT on GPU is tune and (3) all models trained during R&D. - roughly equivalent to a trans-American flight.So - et al.(2019) report that NAS achieves a new state- about 60 GPUs running constantly throughout theof-the-art BLEU score of 29.7 for English to Ger- 6 month duration of the project. 
Table4lists upperman machine translation, an increase of just 0.1 and lower bounds of the estimated cost in termsBLEU at the cost of at least $150k in on-demand of Google Cloud compute and raw electricity re-compute time and non-trivial carbon emissions. quired to develop and deploy this model. 9 We see - that while training a single model is relatively in-4.2 Cost of development: Case study expensive, the cost of tuning a model for a newTo quantify the computational requirements of dataset, which we estimate here to require 24 jobs,R&D for a new model we study the logs of or performing the full R&D required to developall training required to develop Linguistically- this model, quickly becomes extremely expensive.Informed Self-Attention (Strubell et al.,2018), a - multi-task model that performs part-of-speech tag- 5 Conclusions - ging, labeled dependency parsing, predicate detec- - tion and semantic role labeling. This model makes Authors should report training time and - for an interesting case study as a representative sensitivity to hyperparameters. - NLP pipeline and as a Best Long Paper at EMNLP. Our experiments suggest that it would be benefi- - Model training associated with the project cial to directly compare different models to per- - spanned a period of 172 days (approx. 6 months). form a cost-benefit (accuracy) analysis. To ad- - During that time 123 small hyperparameter grid dress this, when proposing a model that is meant - searches were performed, resulting in 4789 jobs to be re-trained for downstream use, such as re- - in total. Jobs varied in length ranging from a min- training on a new domain or fine-tuning on a new - imum of 3 minutes, indicating a crash, to a maxi- task, authors should report training time and com- - mum of 9 days, with an average job length of 52 putational resources required, as well as model - hours. All training was done on a combination of sensitivity to hyperparameters. This will enable - NVIDIA Titan X (72%) and M40 (28%) GPUs. 8 direct comparison across models, allowing subse- - The sum GPU time required for the project quent consumers of these models to accurately as- - totaled 9998 days (27 years). This averages to sess whether the required computational resources - 8 We approximate cloud compute cost using P100 pricing. 9 Based on average U.S cost of electricity of $0.12/kWh. are compatible with their setting. More explicit half the estimated cost to use on-demand cloud - characterization of tuning time could also reveal GPUs. Unlike money spent on cloud compute, - inconsistencies in time spent tuning baseline mod- however, that invested in centralized resources - els compared to proposed contributions. Realiz- would continue to pay off as resources are shared - ing this will require: (1) a standard, hardware- across many projects. A government-funded aca- - independent measurement of training time, such demic compute cloud would provide equitable ac- - as gigaflops required to convergence, and (2) a cess to all researchers. - standard measurement of model sensitivity to data - and hyperparameters, such as variance with re- Researchers should prioritize computationally - spect to hyperparameters searched. efficient hardware and algorithms. - We recommend a concerted effort by industry and - Academic researchers need equitable access to academia to promote research of more computa- - computation resources. tionally efficient algorithms, as well as hardware - that requires less energy. 
An effort can also beRecent advances in available compute come at a made in terms of software. There is already ahigh price not attainable to all who desire access. precedent for NLP software packages prioritizingMost of the models studied in this paper were de- efficient models. An additional avenue throughveloped outside academia; recent improvements in which NLP and machine learning software de-state-of-the-art accuracy are possible thanks to in- velopers could aid in reducing the energy asso-dustry access to large-scale compute. ciated with model tuning is by providing easy-Limiting this style of research to industry labs to-use APIs implementing more efficient alterna-hurts the NLP research community in many ways. tives to brute-force grid search for hyperparameterFirst, it stifles creativity. Researchers with good tuning, e.g. random or Bayesian hyperparameterideas but without access to large-scale compute search techniques (Bergstra et al.,2011;Bergstrawill simply not be able to execute their ideas, and Bengio,2012;Snoek et al.,2012). Whileinstead constrained to focus on different prob- software packages implementing these techniqueslems. Second, it prohibits certain types of re- do exist, 10 they are rarely employed in practicesearch on the basis of access to financial resources. for tuning NLP models. This is likely becauseThis even more deeply promotes the already prob- their interoperability with popular deep learninglematic “rich get richer” cycle of research fund- frameworks such as PyTorch and TensorFlow ising, where groups that are already successful and not optimized, i.e. there are not simple exam-thus well-funded tend to receive more funding ples of how to tune TensorFlow Estimators usingdue to their existing accomplishments. Third, the Bayesian search. Integrating these tools into theprohibitive start-up cost of building in-house re- workflows with which NLP researchers and practi-sources forces resource-poor groups to rely on tioners are already familiar could have notable im-cloud compute services such as AWS, Google pact on the cost of developing and tuning in NLP.Cloud and Microsoft Azure. - While these services provide valuable, flexi- Acknowledgements - ble, and often relatively environmentally friendly We are grateful to Sherief Farouk and the anony- compute resources, it is more cost effective for mous reviewers for helpful feedback on earlieracademic researchers, who often work for non- drafts. This work was supported in part by theprofit educational institutions and whose research Centers for Data Science and Intelligent Infor-is funded by government entities, to pool resources mation Retrieval, the Chan Zuckerberg Initiativeto build shared compute centers at the level of under the Scientific Knowledge Base Construc-funding agencies, such as the U.S. National Sci- tion project, the IBM Cognitive Horizons Networkence Foundation. For example, an off-the-shelf agreement no. W1668553, and National ScienceGPU server containing 8 NVIDIA 1080 Ti GPUs Foundation grant no. IIS-1514053. Any opinions,and supporting hardware can be purchased for findings and conclusions or recommendations ex-approximately $20,000 USD. At that cost, the pressed in this material are those of the authors andhardware required to develop the model in our do not necessarily reflect those of the sponsor.case study (approximately 58 GPUs for 172 days) - would cost $145,000 USD plus electricity, about 10 For example, theHyperopt Python library. References Matthew E. 
Peters, Mark Neumann, Mohit Iyyer, Matt - Gardner, Christopher Clark, Kenton Lee, and LukeRhonda Ascierto. 2018.Uptime Institute Global Data Zettlemoyer. 2018. Deep contextualized word rep-Center Survey. Technical report, Uptime Institute. resentations. InNAACL. - Dzmitry Bahdanau, KyunghyunCho, and Yoshua Ben- - gio. 2015. Neural Machine Translation by Jointly Alec Radford, Jeffrey Wu, Rewon Child, David Luan, - Learning to Align and Translate. In3rd Inter- Dario Amodei, and Ilya Sutskever. 2019.Language - national Conference for Learning Representations models are unsupervised multitask learners. - (ICLR), San Diego, California, USA. Jasper Snoek, Hugo Larochelle, and Ryan P Adams. - James Bergstra and Yoshua Bengio. 2012. Random 2012. Practical bayesian optimization of machine - search for hyper-parameter optimization.Journal of learning algorithms. InAdvances in neural informa- - Machine Learning Research, 13(Feb):281–305. tion processing systems, pages 2951–2959. - - James S Bergstra, R´emi Bardenet, Yoshua Bengio, and David R. So, Chen Liang, and Quoc V. Le. 2019. - Bal´azs K´egl. 2011. Algorithms for hyper-parameter The evolved transformer. InProceedings of the - optimization. InAdvances in neural information 36th InternationalConference on Machine Learning - processing systems, pages 2546–2554. (ICML). - - Bruno Burger. 2019.Net Public Electricity Generation Emma Strubell, Patrick Verga, Daniel Andor, - in Germany in 2018. Technical report, Fraunhofer David Weiss, and Andrew McCallum. 2018. - Institute for Solar Energy Systems ISE. Linguistically-Informed Self-Attention for Se- - mantic Role Labeling. InConference on Empir-Alfredo Canziani, Adam Paszke, and Eugenio Culur- ical Methods in Natural Language Processingciello. 2016. An analysis of deep neural network (EMNLP), Brussels, Belgium. models for practical applications . - Ashish Vaswani, Noam Shazeer, Niki Parmar, JakobGary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Uszkoreit, Llion Jones, Aidan N Gomez, LukaszDeans, Brian Johnson, Elizabeth Jardim, and Brian Kaiser, and Illia Polosukhin. 2017. Attention is allJohnson. 2017. Clicking Clean: Who is winning you need. In31st Conference on Neural Informationthe race to build a green internet?Technical report, Processing Systems (NIPS).Greenpeace. - Jacob Devlin, Ming-Wei Chang, Kenton Lee, and - Kristina Toutanova. 2019. BERT: Pre-training of - Deep Bidirectional Transformers for Language Un- - derstanding. InNAACL. - Timothy Dozat and Christopher D. Manning. 2017. - Deep biaffine attention for neural dependency pars- - ing. InICLR. - EPA. 2018. Emissions & Generation Resource Inte- - grated Database (eGRID). Technical report, U.S. - Environmental Protection Agency. - Christopher Forster, Thor Johnsen, Swetha Man- - dava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie - Bernauer, Allison Gray, Sharan Chetlur, and Raul - Puri. 2019. BERT Meets GPUs. Technical report, - NVIDIA AI. - Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. - 2016. Evaluating the energy efficiency of deep con- - volutional neural networks on cpus and gpus.2016 - IEEE International Conferences on Big Data and - Cloud Computing (BDCloud), Social Computing - and Networking (SocialCom), Sustainable Comput- - ing and Communications (SustainCom) (BDCloud- - SocialCom-SustainCom), pages 477–484. - Thang Luong, Hieu Pham, and Christopher D. Man- - ning. 2015.Effective approaches to attention-based - neural machine translation. 
InProceedings of the - 2015 Conference on Empirical Methods in Natural - Language Processing, pages 1412–1421. Associa- - tion for Computational Linguistics. \ No newline at end of file diff --git a/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt b/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt deleted file mode 100644 index e2f2323..0000000 --- a/Corpus/Finite-Element Neural Networks for Solving Differential Equations.txt +++ /dev/null @@ -1,793 +0,0 @@ - IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 1381 - Finite-Element Neural Networks for Solving - Differential Equations - Pradeep Ramuhalli, Member, IEEE, Lalita Udpa, Senior Member, IEEE, and Satish S. Udpa, Fellow, IEEE - - Abstract—The solution of partial differential equations (PDE) - arises in a wide variety of engineering problems. Solutions to most - practical problems use numerical analysis techniques such as fi- - nite-element or finite-difference methods. The drawbacks of these - approaches include computational costs associated with the mod- - eling of complex geometries. This paper proposes a finite-element - neural network (FENN) obtained by embedding a finite-element - model in a neural network architecture that enables fast and ac- - curate solution of the forward problem. Results of applying the - FENN to severalsimpleelectromagnetic forward and inverseprob- - lems are presented. Initial results indicate that the FENN perfor- - mance as a forward model is comparable to that of the conven- - tional finite-element method (FEM). The FENN can also be used - in an iterative approach to solve inverse problems associated with Fig. 1. Iterative inversion method for solving inverse problems. the PDE. Results showing the ability of the FENN to solve the in- - verse problem given the measured signal are also presented. The - parallel nature of the FENN also makes it an attractive solution resulting in the corresponding solution to the forward problem - for parallel implementation in hardware and software. . The model output is compared to the measurement , - Index Terms—Finite-element method (FEM), finite-element using a cost function .If is less than a toler- - neural network (FENN), inverse problems. ance, the estimateis used as the desired solution. If not, - is updated to minimize the cost function. - S I. I Although finite-element methods (FEMs) [3], [4] are ex- NTRODUCTION tremely popular for solving differential equations, their majorOLUTIONS of differential equations arise in a widedrawback is computational complexity. This problem becomesvariety of engineering applications in electromagnetics,more acute when three-dimensional (3-D) finite-elementsignal processing, computational fluid dynamics, etc. Thesemodels are used in an iterative algorithm for solving the inverseequations are typically solved using either analytical or numer-problem. Recently, several authors have suggested the use ofical methods. Analytical solution methods are however feasibleneural networks (MLP or RBF networks [5]) for solving differ-only for simple geometries, which limits their applicability. Inential equations [6]–[9]. In these techniques, a neural networkmost practical problems with complex boundary conditions,is trained using a large database containing the input data andnumerical analysis methods are required in order to obtain athe solution of the differential equation. The neural networkreasonable solution. 
An example is the solution of Maxwell’sduring generalization learns the mapping corresponding toequations in electromagnetics. Solutions to Maxwell’s equa-the PDE. Alternatively, in [10], the solution to a differentialtions are used in a variety of applications for calculating theequation is written as a constant term, and an adjustable term interaction of electromagnetic (EM) fields with different typeswith parameters that need to be determined. A neural networkof media. is used to determine the optimal values of the parameters.Very often, the solution to differential equations is necessaryThis approach is applicable only to problems with regularfor solving the corresponding inverse problems. Inverse prob-boundaries. An extension of the approach to problems withlems in general are ill-posed, lacking continuous dependence ofirregular boundaries is given in [11]. Other neural networkthe measurements on the input. This has resulted in the devel-based differential equation solvers use multilayer perceptronopment of a variety of solution techniques ranging from simplenetworks or variations on the MLP to approximate the unknowncalibration procedures to other direct (analytical) and iterativefunction in a PDE [12]–[14]. A combination of the PDE andapproaches [1]. Iterative methods typically employ a forwardboundary conditions is used to construct an objective functionmodel that simulates the underlying physical process (Fig. 1)that is minimized during the training process.[2]. An initial estimate of the solution of the inverse problem A major limitation of these approaches is that the network ar- (represented byin Fig. 1) is applied to the forward model,chitecture is selected somewhat arbitrarily. A second drawback - is that the performance of the neural networks depends on the - Manuscript received January 17, 2004; revised April 2, 2005. data used in training and testing. As long the test data is sim- - The authors are with the Department of Electrical and Computer Engi- ilar to the training data, the network can interpolate between the neering, Michigan State University, East Lansing, MI 48824 USA (e-mail: training data points to obtain a reasonable prediction. However, rpradeep@egr.msu.edu; udpal@egr.msu.edu; udpa@egr.msu.edu). - Digital Object Identifier 10.1109/TNN.2005.857945 when the test signal is no longer similar to the training data, the - 1045-9227/$20.00 © 2005 IEEE 1382 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 - - - network is forced to extrapolate and the performance degrades. Section V draws conclusions from the results and presents - One way around this difficulty is to ensure that the training data- ideas for future work. - base has a diverse set of signals. However, this is difficult to - ensure in practice. Alternatively, we have to design neural net- II. T HE FENN - works that are capable of extrapolation. Extrapolation methods This section briefly describes the FEM and proposes its refor-are discussed extensively in literature [15]–[18], but the design mulation into a parallel neural network structure. Details aboutof an extrapolation neural network involves several issues par- the FEM can be found in [3] and [4].ticularly for ensuring that the error in the network prediction - stays within reasonable bounds during the extrapolation proce- A. The FEMdure. 
Consider a typical boundary value problem with the gov-An ideal solution to this problem would be to combine the erning differential equationpower of numerical models with the computational speed of - neural networks, i.e., to embed a numerical model in a neural (1)network structure. One suchfinite-element neural network - (FENN) formulation has been reported by Takeuchi and Kosugi where is a differential operator, is the applied source or - [19]. This approach, based on error minimization, derives the forcing function, and is the unknown quantity. This differen- - neural network using the energy functional resulting from the tial equation can be solved in conjunction with boundary condi- - finite-element formulation. Other reports of FENN combina- tionson theboundary enclosingthedomain .Thevariational - tions are either similar to the Takeuchi method [20], [21] or use formulation used infinite-element analysis determines the un- - Hopfield neural networks to solve the forward problem [22], known by minimizing the functional [3], [4] - [23]. Kalkkuhlet al.[24] provide a description of a FEM-based - approach to NARX modeling that may be interpreted both as (2) - a local model network, as well as a single layer feedforward - network. A slightly different approach to merging numerical with respect to the trial function . The minimization procedure - methods and neural networks is given in [25], where thefi- starts by dividing into small subdomains called elements - nite-difference time domain (FDTD) method is cast in a neural (Fig. 2) and representing in each element by means of basis - network framework for the purpose of solving electromagnetic functions defined over the element - forward problems. The related problem of mesh generation - infinite-element models has also been tackled using neural (3)networks (for instance, [26]). Generally, these networks are - designed to solve the forward problem, and must be modified - to solve inverse problems. where is the unknown solution in element , is the basis - This paper proposes a new approach that embeds afinite-ele- function associated with node in element , is the value - ment model commonly used in the solution of differential equa- of the unknown quantity at node and is the total number of - tions in a neural network. The network, called the FENN, can nodes associated with element . In general, the basis functions - solve the forward problem and can also be used in an itera- (also referred to as interpolation functions or shape functions) - tive algorithm to solve inverse problems. The primary advan- can be linear, quadratic, or of higher order. Typically,finite-el- - tage of this approach is that the FEM is represented in a parallel ement models use either linear or polynomial spline basis func- - form. Thus, it has the potential to alleviate the computational tions. - cost associated with using the FEM in an iterative algorithm The functional within an element is expressed as - for solving inverse problems. More importantly, the FENN does - not need any training, and the computation of the weights is (4) - a one-time process. The proposed approach is also different in - that the neural network architecture developed can be used to - solve the forward and inverse problems. The structure of the By substituting (3) in (4), we obtain the discrete version of the - neural network is also simpler than those reported in the litera- functional within each element - ture, making it easier to implement in parallel in both hardware (5)and software. 
- The rest of this paper is organized as follows. Section II where is the transpose of a matrix, is the ele-briefly describes the FEM, and derives the proposed FENN. In mental matrix with elements this paper, we focus on the problem of solving typical equa- - tions encountered in electromagnetic nondestructive evaluation (6)(NDE). However, the same concepts can be easily applied - to solve differential equations encountered in otherfields. - Sections III, IV and V present the application of the FENN and is an vector with elements - to solving forward and inverse problems, along with initial - results. A discussion of the advantages and disadvantages of (7) - the proposed FENN architecture is given in Section IV. Finally, RAMUHALLI et al.: FENNs FOR SOLVING DIFFERENTIAL EQUATIONS 1383 - - - Combining the values in (5) for each of the elements - - (8) - - where is the global matrix derived from the terms - of the elemental matrices for different elements, and is the - total number of nodes. , also called the stiffness matrix, is a - sparse, banded matrix. Equation (8) is the discrete version of - the functional and can be minimized with respect to the nodal - parameters by taking the derivative of with respect to and - setting it equal to zero, which results in the matrix equation Fig.2. (a)Schematicrepresentationofdomainandboundary. (b)SampleFEM - mesh for the domain. - (9) - - Boundary conditions for these problems are usually of two - types: natural boundary conditions and essential boundary - conditions. Essential boundary conditions (also referred to as - Dirichlet boundary conditions) impose constraints on the value - of the unknown at several nodes. Natural boundary condi- - tions (of which Neumann boundary conditions are a special - case) impose constraints on the change in across a boundary. - Dirichlet boundary conditions are imposed on the functional - minimization (9), by deleting the rows and columns of the - matrix corresponding to the nodes on the Dirichlet boundary - and modifying in (9). Fig. 3. FEM domain discretization using two elements and four nodes. - Natural boundary conditions are applied in the FEM by - adding an additional term to the functional. These boundary This process ensures that natural boundary conditions are im-conditions are then incorporated into the functional and are plicitlyandautomatically satisfiedduring theFEMsolutionpro-satisfied automatically during the solution procedure. As an cedure.example, consider the natural boundary condition represented - by the following equation [3] B. The FENN - on (10) This section describes how thefinite-element model can be - converted intoa parallel network form. Wefocus on solving typ- - where represents the Neumann boundary, is its outward ical inverse problems arising in electromagnetic NDE, but the - normal unit vector, is some constant, and , , and are basicideaisapplicabletootherareas aswell.NDEinverseprob- - known parameters associated with the boundary. Assuming that lems can be formulated as the problem offinding the material - the boundary is made up of segments, we can define properties (such as the conductivity or the permeability) within - boundary matrices and with elements the domain of the problem. Since the domain is discretized in - the FEM method by a large number of elements, the problem - can be posed as one offinding the material properties in each - of these elements. These properties are usually embedded in the - differential operator , or equivalently, in the global matrix . 
- Thus, in order to be able to iteratively estimate these properties - from the measurements, the material properties need to be sep- - arated out from . This separation is easier to achieve at the - element matrix level. For nodes and in element - (11) - - where are basis functions defined over segment and is - the length of the segment. The elements of are added to the - elementsof that correspond tothe nodeson the boundary . - Similarly, the elements of are added to the corresponding - elements of . The global matrix (9) is thus modified as follows - before solving for (13) - - where is the parameter representing the material property(12) in element and represents the differential operator at the 1384 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 6, NOVEMBER 2005 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - Fig. 4. FENN. - - - element level without embedded in it. Substituting (13) into neurons, corresponding to the members of the global ma- - the functional, we get trix . The output of each group of hidden layer neurons is the - corresponding row vector of . The weights from the input to - the hidden layer are set to the appropriate values of . Each(14) neuron in the hidden layer acts as a summation unit, (equivalent - toasummationfollowedbyalinearactivationfunction[5]).The - If we define outputs of the hidden layer neurons are the elements of the - global matrix as given in (15). - (15) Each group of hidden neurons is connected to one output - neuron (giving a total of output neurons) by a set of weights - , with each element of representing the nodal values .where Note that the set of weights between thefirst group of hidden - neurons and thefirst output neuron are the same as the set of(16)else weights between the second group of hidden neurons and the - second output neuron (as well as between successive groups - of hidden neurons and the corresponding output neuron). Each - output neuron is also a summation unit followed by a linear ac- - tivation function, and the output of each neuron is equal to : - - - (18) - (17) - - where the second part of (18) is obtained by using (15). As an - Equation (17) expresses the functional explicitly in terms of . example, the FENN architecture for a two-element, four-node - The assumption that is constant within each element is im- FEM mesh (Fig. 3) is shown in Fig. 4. In this - plicit in this expression. This assumption is usually satisfied in case, the FENN has two input neurons, 16 hidden layer neurons - problems in NDE where each element in the FEM mesh is de- and four output neurons. Thefigure illustrates the grouping of - fined within the confines of a domain, and at no time does a the hidden layer neurons, as well as the similarity inherent in - single element cross domain boundaries. Furthermore, each el- the weights that connect each group of hidden layer neurons - ement is small enough that minor variations in within an el- to the corresponding output neuron. To simplify thefigure, the - ement may be ignored. Equation (17) can be easily converted weights between the network input and hidden layer neurons - into a parallel network form. The neural network comprises an are depicted by means of vectors (for - input, output and hidden layer. In the general case with el- , 2, 3, 4 and , 2), where the individual weight values - ements and nodes in the FEM mesh, the input layer with are defined as in (16). - network inputs takes the values in each element as input. 
1) Boundary Conditions in the FENN: Note that the elements of K^s and b^s in (11) do not depend on the material properties \alpha. K^s and b^s need to be added appropriately to the global matrix K and the source vector b, as shown in (12). Equation (12) thus implies that natural boundary conditions can be applied in the FENN as bias inputs to the hidden layer neurons that are a part of the boundary, and to the corresponding output neurons. Dirichlet boundary conditions are applied by clamping the corresponding weights between the hidden layer and output layer neurons. These weights will be referred to as the clamped weights, while the remaining weights will be referred to as the free weights. An example of these weights is presented later.

The FENN architecture was derived without consideration of the dimensionality of the problem at hand, and thus can be used for 1-, 2-, 3-, or higher dimensional problems. The number of nodes and elements in the FEM mesh dictates the number of neurons in the different layers. The weights between the input and hidden layer change depending on node-element connectivity information.

The major drawback of the FENN is the number of neurons and weights necessary. However, the memory requirements can be reduced considerably, since most of the weights between the input and hidden layer are zero. These weights, and the corresponding connections, can be discarded. Similarly, most of the elements of the matrix K are also zero (K is a banded matrix). The corresponding neurons in the hidden layer can also be discarded, reducing memory and computation requirements considerably. Furthermore, the weights between each group of hidden layer neurons and the output layer are the same (\boldsymbol{\phi}). Weight-sharing approaches can be used here to further reduce the storage requirements.
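To make the clamped/free distinction and the bias treatment of natural boundary conditions concrete, the following Python fragment extends the forward-pass sketch given at the end of Section II-B. The dictionary-based clamping and the name K_bias are our own illustrative assumptions, not notation from the paper.

import numpy as np

# Sketch (our notation) of the boundary-condition bookkeeping described above.
# K_bias collects the alpha-independent natural-boundary terms of (11)-(12),
# which enter as fixed biases on the hidden-layer outputs; "clamp" lists the
# Dirichlet nodes whose hidden-to-output weights are held fixed (clamped weights).

def fenn_forward_bc(w, alpha, phi, K_bias=None, clamp=None):
    K = np.einsum('e,ejk->jk', alpha, w)        # hidden-layer outputs, cf. (15)
    if K_bias is not None:
        K = K + K_bias                          # natural boundary terms as biases
    if clamp:
        phi = phi.copy()
        for node, value in clamp.items():       # clamped weights keep their values
            phi[node] = value
    return K, K @ phi                           # outputs, cf. (12) and (18)

# During a gradient-based solution only the free weights are updated, e.g.
# free = [k for k in range(len(phi)) if k not in clamp].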
C. A 1-D Example

Consider the 1-D equation

-\frac{d}{dx}\left(\alpha\frac{d\phi}{dx}\right) + \beta\phi = f, \qquad 0 \le x \le L    (19)

with boundary conditions on the boundary defined by x = 0 and x = L. Here, \alpha and \beta are constants depending on the material and f is the applied source. Laplace's equation and Poisson's equation are special cases of this equation. The FENN formulation for this problem starts by discretizing the domain of interest with M elements and N nodes. In one dimension, each element is defined by two nodes (Fig. 5). Define basis functions N^e_1 and N^e_2 over each element and let \phi^e_i be the value of \phi on node i in element e. An example of the basis functions is shown in Fig. 5. For these basis functions, i.e.,

N^e_1(x) = \frac{x^e_2 - x}{l_e}, \qquad N^e_2(x) = \frac{x - x^e_1}{l_e}    (20)

the element matrices are given by [3]

\mathbf{K}^e = \frac{\alpha_e}{l_e}\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} + \frac{\beta_e l_e}{6}\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}    (21)

\mathbf{b}^e = \frac{f\, l_e}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix}.    (22)

Here, x^e_1 and x^e_2 are the coordinates of the two nodes of element e and l_e = x^e_2 - x^e_1 is the length of element e. The global matrix K is then constructed by selectively adding the element matrices based on the nodes that form an element. Specifically, K is a sparse tridiagonal matrix, and its nonzero elements are given by

K_{k,k} = \frac{\alpha_{k-1}}{l_{k-1}} + \frac{\alpha_k}{l_k} + \frac{\beta_{k-1} l_{k-1} + \beta_k l_k}{3}, \qquad K_{k,k+1} = K_{k+1,k} = -\frac{\alpha_k}{l_k} + \frac{\beta_k l_k}{6}    (23)

with the obvious modifications at the two end nodes, where only one element contributes.

Fig. 5. Geometry of mesh for 1-D FEM.

The network implementation of (23) can be derived as follows. The \alpha and \beta values at each element are the inputs to the network, and the corresponding weight tensors, defined as in (16), form the weights between the input and hidden layers. The network thus uses 2M input neurons and N^2 hidden neurons. The values of \phi at each of the nodes are assigned as weights between the hidden and output layers, and the source is the desired output of this network (corresponding to the output neurons). Dirichlet boundary conditions on \phi are applied as explained earlier.

D. General Case

Fig. 6 shows a flowchart of the general scheme for converting a differential equation into the FENN structure. An example in two dimensions is also provided next to the flowchart. We start with the differential equation and the boundary conditions and formulate the FEM using the variational method. This involves discretizing the domain of interest with elements and nodes, selecting basis functions, writing the functional for each element, and obtaining the element matrices and the source vector. The example presented uses the FEM mesh shown in Fig. 3, with M = 2 elements, N = 4 nodes, and linear basis functions. The unknown solution to the differential equation is represented by its values at each of the N nodes in the finite-element mesh. The element matrices are then separated into two parts, with one part dependent on the material properties \alpha and \beta while the other is independent of them.

The FENN is then designed to have pM input neurons, N^2 hidden neurons, and N output neurons, where p is the number of material property parameters. In the example under consideration, p = 2, since we have two material property parameters (\alpha and \beta). The first group of M input neurons takes in the \alpha values while the second group takes in the \beta values in each element. The weights from the input to the hidden layer are set to the appropriate values of w. In the example, since nodes 1, 2, and 3 are part of element 1 (see Fig. 3), the weights from the first input node (\alpha_1) to the first group of four neurons in the hidden layer are given by

\mathbf{w}^{1}_{1} = \begin{bmatrix} w^1_{11} & w^1_{12} & w^1_{13} & 0 \end{bmatrix}.    (24)

The last weight is zero since node 4 is not a part of element 1. Each group of hidden neurons is connected to one output neuron (giving a total of N output neurons) by a set of weights \boldsymbol{\phi}, with each element of \boldsymbol{\phi} representing the nodal values \phi_k. The output of each neuron in the output layer is equal to the corresponding element of \mathbf{K}\boldsymbol{\phi}, as in (18).

Fig. 6. Flowchart (with example) for designing the FENN for a general PDE.
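The Python sketch below builds the input-to-hidden weights of the FENN for the 1-D example using the element matrices (21) and the node-element connectivity, and then evaluates the global matrix for a given set of material inputs. The mesh size and the material values are illustrative assumptions.

import numpy as np

# Sketch (our notation) of the input-to-hidden weights of the FENN for the 1-D
# equation (19) with linear elements, following (16) and (21)-(23). Element e
# connects nodes e and e+1; w_alpha[e] and w_beta[e] hold the alpha- and
# beta-parts of its contribution to the global matrix K.

def build_1d_fenn_weights(n_elem, lengths):
    n_nodes = n_elem + 1
    w_alpha = np.zeros((n_elem, n_nodes, n_nodes))
    w_beta = np.zeros((n_elem, n_nodes, n_nodes))
    for e in range(n_elem):
        le = lengths[e]
        ka = (1.0 / le) * np.array([[1.0, -1.0], [-1.0, 1.0]])   # alpha part of (21)
        kb = (le / 6.0) * np.array([[2.0, 1.0], [1.0, 2.0]])     # beta part of (21)
        nodes = [e, e + 1]                                       # connectivity
        w_alpha[e][np.ix_(nodes, nodes)] = ka
        w_beta[e][np.ix_(nodes, nodes)] = kb
    return w_alpha, w_beta

n_elem = 4
lengths = np.full(n_elem, 1.0 / n_elem)
w_alpha, w_beta = build_1d_fenn_weights(n_elem, lengths)
alpha = np.ones(n_elem)
beta = np.zeros(n_elem)                    # beta = 0 reduces (19) to a Poisson-type equation
K = np.einsum('e,ejk->jk', alpha, w_alpha) + np.einsum('e,ejk->jk', beta, w_beta)
print(np.round(K, 3))                      # sparse tridiagonal matrix, as noted for (23)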
III. FORWARD AND INVERSE PROBLEM FORMULATION USING FENN

The FENN architecture and algorithm lend themselves to solving both the forward and inverse problems. The forward problem involves determining the weights \boldsymbol{\phi} given the material parameters \alpha and \beta and the applied source f, while the inverse problem involves determining \alpha and \beta given \boldsymbol{\phi} and f. Any optimization approach can be used to solve both of these problems. Suppose we define the error at the output of the FENN as

E = \frac{1}{2}\sum_{j=1}^{N}\left(b_j - z_j\right)^2    (26)

where z_j is the output of the FENN. Then, for a gradient-based approach, the gradient of the error with respect to the free weights \phi_k between the hidden and output layers is given by

\frac{\partial E}{\partial \phi_k} = -\sum_{j=1}^{N}\left(b_j - z_j\right)K_{jk}.    (27)

Equation (27) can be used to solve the forward problem. Similarly, to solve the inverse problem, the gradients of the error with respect to \alpha and \beta (the inputs of the FENN) are necessary, and are given by

\frac{\partial E}{\partial \alpha_e} = -\sum_{j=1}^{N}\left(b_j - z_j\right)\sum_{k=1}^{N} w^{e,\alpha}_{jk}\,\phi_k    (28)

\frac{\partial E}{\partial \beta_e} = -\sum_{j=1}^{N}\left(b_j - z_j\right)\sum_{k=1}^{N} w^{e,\beta}_{jk}\,\phi_k.    (29)

For the forward problem, such an approach is equivalent to the iterative approaches used to solve for the unknown nodal values in the FEM [4].

IV. RESULTS

A. Forward Model Results

The FENN was tested using both 1- and 2-D versions of Poisson's equation

\nabla\cdot\left(\alpha\nabla\phi\right) = f    (30)

where \alpha represents the material property and f is the applied source. For instance, in electromagnetics \alpha may represent the permittivity while f represents the charge density.

As the first example, consider the following 2-D equation

\frac{\partial}{\partial x}\left(\alpha\frac{\partial\phi}{\partial x}\right) + \frac{\partial}{\partial y}\left(\alpha\frac{\partial\phi}{\partial y}\right) = f    (31)

with boundary conditions

\phi = p \quad \text{on } \Gamma_1    (32)

and

\frac{\partial\phi}{\partial n} = 0 \quad \text{on } \Gamma_2.    (33)

This is the governing equation for the shielded microstrip transmission line problem shown in Fig. 7. The forward problem computes the electric potential due to the shielded microstrip shown in Fig. 7(a). The potentials are zero on the shielding conductor. Since the geometry is symmetric, we can solve the equivalent problem shown in Fig. 7(b) by applying the homogeneous Neumann condition on the plane of symmetry. The inner conductor (microstrip) is held at a constant potential. We also assume that the permittivity of the material inside the shielding conductor is known, characterized by a constant K. The permittivity in this case corresponds to the material property \alpha, and the homogeneous Neumann boundary condition is equivalent to setting \gamma = q = 0 in (10). The microstrip and the shielding conductor correspond to the Dirichlet boundary, with the prescribed potential on the microstrip and zero potential on the outer boundary [Fig. 7(b)]. Finally, there is no source term in this example (the source term would correspond to a charge distribution in the domain of interest), i.e., f = 0. In this example, specific values are assumed for the strip potential, for K, and for the extent of the domain of interest.

Fig. 7. Shielded microstrip geometry. (a) Complete problem description. (b) Problem description using symmetry considerations.

The solution to the forward problem is presented in Fig. 8, with the FEM solution using 11 nodes in each direction shown in Fig. 8(a) and the corresponding FENN solution in Fig. 8(b). These figures show contours of constant potential. The error between the FEM and FENN solutions is presented in Fig. 8(c). As seen from the figure, the FENN matches the FEM solution accurately, with a negligibly small peak error at any node.

Fig. 8. Forward problem solutions for the shielded microstrip problem show the contours of constant potential for (a) the FEM solution and (b) the FENN solution. (c) Error between (a) and (b). The x- and y-axes show the nodes in the FEM discretization of the domain, and the z-axis in (c) shows the error at each of these nodes in volts.

Several other examples were also used to test the FENN, and the results are summarized in Table I. Column 1 shows the PDE used to evaluate the FENN performance, while Column 2 shows the boundary conditions used. The analytic solution to the problem is indicated in Column 3. The FENN structure and the number of iterations for convergence using a gradient descent approach are indicated in Columns 4 and 5, respectively. The FENN structure, as explained earlier, has M inputs, H hidden neurons, and N output neurons, where M and N are the number of elements and nodes in the FEM mesh, respectively, and H, the number of hidden neurons, corresponds to the number of nonzero elements in the FEM global matrix K. Finally, Columns 6 and 7 present the sum-squared error (SSE) and the maximum error in the solution, respectively, where the errors are computed with respect to the analytical solution. These results indicate that the FENN is capable of accurately determining the potential \phi. One advantage of the FENN approach is that the computation of the input-hidden layer weights is a one-time process, as long as the differential equation does not change.

TABLE I. Summary of performance of the FENN algorithm for various PDEs.
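The forward iteration of (26)-(27) can be sketched in a few lines. The Python example below uses an assumed 4-element, 1-D Poisson setup; it is an illustration of the gradient update, not a reproduction of the examples summarized in Table I.

import numpy as np

# Sketch (our notation) of the gradient-based forward solve implied by (26)-(27),
# applied to an assumed problem -d/dx(alpha dphi/dx) = f with alpha = 1, f = 1,
# and phi = 0 at both ends. With the material inputs fixed, K is constant and the
# free entries of phi are updated so that the output z = K phi approaches b.
# The error is accumulated over the free output neurons only, consistent with
# the removal of the Dirichlet rows described in Section II-A.

n_elem, h, f = 4, 0.25, 1.0
n_nodes = n_elem + 1
K = np.zeros((n_nodes, n_nodes))
for e in range(n_elem):                    # K = sum_e alpha_e w^e with alpha_e = 1
    K[np.ix_([e, e + 1], [e, e + 1])] += (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.full(n_nodes, f * h)
b[[0, -1]] *= 0.5                          # consistent source vector
free = np.arange(1, n_nodes - 1)           # nodes 0 and 4 carry Dirichlet conditions

phi = np.zeros(n_nodes)                    # clamped entries stay at zero
lr, n_iter = 5e-3, 5000
for _ in range(n_iter):
    r = b[free] - K[free] @ phi                        # output error (b_j - z_j)
    phi[free] += lr * (K[np.ix_(free, free)].T @ r)    # gradient step, cf. (27)

print(np.round(phi, 4))    # ~[0, 0.0938, 0.125, 0.0938, 0], the FEM solution of K phi = b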
The only changes necessary to solve the different problems are changes in the input \alpha and the desired output b.

B. Inverse Model Results

The FENN was also used to solve several simple inverse problems based on (30). In all cases, the objective was to determine the material property \alpha for given values of \phi and f. The first example is a 1-D problem that involves determining \alpha given \phi and f for the differential equation

\frac{d}{dx}\left(\alpha\frac{d\phi}{dx}\right) = f    (34)

with Dirichlet boundary conditions on \phi at x = 0 and x = 1. As seen from the analytical solution (35) of this inverse problem, the problem has an infinite number of solutions, and we expect the solution procedure to converge to one of these solutions depending on the initial value.

Fig. 9(a) and (b) shows two solutions to this inverse problem for two different initializations (shown using triangles). In both cases, the FENN solution (in stars) is seen to match the analytical solution (squares). The SSE in both cases was negligibly small.

Fig. 9. FENN inversion results for Poisson's equation with initial solutions (a) \alpha = x and (b) \alpha = 1 + x.

In order to obtain a unique solution, we need to constrain the value of \alpha at the boundary as well. Consider the same differential equation as (34), but with \phi and f specified as in (36). The analytical solution for \alpha is then known in closed form. To solve this problem, we clamp the value of \alpha at x = 0 and x = 1 to the corresponding analytical values.

The results of the constrained inversion obtained using 11 nodes and 10 elements in the corresponding finite-element mesh are shown in Fig. 10. Fig. 10(a) shows the comparison between the analytical solution (solid line with squares) and the FENN result (solid line with stars). The initial value of \alpha is shown in the figure as a dashed line. Fig. 10(b) shows the comparison between the actual and desired forcing function at the FENN output. This result indicates that the SSE in the forcing function, as well as the SSE in the inversion result, is fairly large (0.0148 and 0.0197, respectively). The reason for this was traced back to the mesh discretization. Fig. 11 shows the SSE in the output of the FENN and the SSE in the inverse problem solution as a function of FEM discretization. It is seen that increasing the discretization significantly improves the solution. Similar results were observed for other problems.

Fig. 10. Constrained inversion result with eleven nodes. (a) Comparison of analytic and simulation results for \alpha. (b) Comparison of actual and desired NN outputs.

Fig. 11. SSE in FENN output and inversion results as a function of discretization.
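The inverse iteration driven by (28) can be sketched in the same style. The Python example below uses an assumed 1-D Poisson setup with a known solution; it is a schematic illustration of the gradient update on the material inputs, not a reproduction of the experiments reported above.

import numpy as np

# Sketch (our notation) of the gradient-based inversion implied by (28). The
# assumed setup solves -d/dx(alpha dphi/dx) = f with true alpha = 1, f = 1,
# 8 elements, and phi = 0 at both ends. The nodal values phi and the source b
# are known, and the per-element inputs alpha are updated so that the FENN
# output matches b. As noted for Fig. 9, the unconstrained problem has many
# solutions, so the result depends on the initial guess.

n_elem, f = 8, 1.0
h = 1.0 / n_elem
n_nodes = n_elem + 1
x = np.linspace(0.0, 1.0, n_nodes)
phi = 0.5 * x * (1.0 - x)                    # "measured" solution for alpha = 1
b = np.full(n_nodes, f * h)
b[[0, -1]] *= 0.5

w = np.zeros((n_elem, n_nodes, n_nodes))     # per-element weights, alpha factored out
for e in range(n_elem):
    w[e][np.ix_([e, e + 1], [e, e + 1])] = (1.0 / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])

rows = np.arange(1, n_nodes - 1)             # error over free (non-Dirichlet) outputs
A = np.einsum('ejk,k->je', w, phi)[rows]     # z = A @ alpha is linear in the inputs

alpha = 1.0 + x[:-1]                         # initial guess, cf. Fig. 9(b)
lr = 0.5
for _ in range(5000):
    r = b[rows] - A @ alpha                  # output error
    alpha += lr * (A.T @ r)                  # gradient step, cf. (28)

print(np.round(alpha, 3))                    # close to one admissible profile
print(float(np.sum((b[rows] - A @ alpha) ** 2)))   # SSE, driven toward zero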
V. DISCUSSION AND CONCLUSION

The FENN is closely related to the finite-element model used to solve differential equations. The FENN architecture has a weight structure that allows both the forward and inverse problems to be solved using simple gradient-based algorithms. Initial results indicate that the proposed FENN algorithm is capable of accurately solving both the forward and inverse problems. In addition, the forward problem solution from the FENN is seen to exactly match the FEM solution, indicating that the FENN represents the finite-element model exactly in a parallel configuration.

The major advantage of the FENN is that it represents the finite-element model in a parallel form, enabling parallel implementation in either hardware or software. Further, computing gradients in the FENN is very simple. This is an advantage in solving both forward and inverse problems using gradient-based methods. The gradients can also be computed in parallel, and the lack of nonlinearities in the neuron activation functions makes the computation of gradients simpler. A major advantage of this approach for solving inverse problems is that it avoids inverting the global matrix in each iteration. The FENN also does not require any training, since most of its weights can be computed in advance and stored. The weights depend on the governing differential equation and its associated boundary conditions, and as long as these two factors do not change, the weights do not change. This is especially an advantage in solving inverse problems in electromagnetic NDE. This approach also reduces the computational effort associated with the network.

Future work will concentrate on applying the FENN to 3-D electromagnetic NDE problems. The robustness of the approach will also be tested, since the ability of these approaches to invert practical noisy measurements is important. Furthermore, the use of better optimization algorithms, like conjugate gradient methods, is expected to improve the solution speed. In addition, parallel implementation of the FENN in both hardware and software is under investigation. The approach described in this paper is very general in that it can be applied to a variety of inverse problems in fields other than electromagnetic NDE. Some of these other applications will also be investigated to show the general nature of the proposed method.

REFERENCES

[1] L. Udpa and S. S. Udpa, "Application of signal processing and pattern recognition techniques to inverse problems in NDE," Int. J. Appl. Electromagn. Mechan., vol. 8, pp. 99–117, 1997.
[2] M. Yan, M. Afzal, S. Udpa, S. Mandayam, Y. Sun, L. Udpa, and P. Sacks, "Iterative algorithms for electromagnetic NDE signal inversion," in ENDE '97, Reggio Calabria, Italy, Sep. 14–16, 1997.
[3] J. Jin, The Finite Element Method in Electromagnetics. New York: Wiley, 1993.
[4] P. Zhou, Numerical Analysis of Electromagnetic Fields. Berlin, Germany: Springer-Verlag, 1993.
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. Upper Saddle River, NJ: Prentice-Hall, 1994.
[6] C. A. Jensen et al., "Inversion of feedforward neural networks: algorithms and applications," Proc. IEEE, vol. 87, no. 9, pp. 1536–1549, 1999.
[7] P. Ramuhalli, L. Udpa, and S. Udpa, "Neural network algorithm for electromagnetic NDE signal inversion," in ENDE 2000, Budapest, Hungary, Jun. 2000.
[8] C. H. Barbosa, A. C. Bruno, M. Vellasco, M. Pacheco, J. P. Wikswo Jr., and A. P. Ewing, "Automation of SQUID nondestructive evaluation of steel plates by neural networks," IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 3475–3478, 1999.
[9] W. Qing, S. Xueqin, Y. Qingxin, and Y. Weili, "Using wavelet neural networks for the optimal design of electromagnetic devices," IEEE Trans. Magn., vol. 33, no. 2, pp. 1928–1930, 1997.
[10] I. E. Lagaris, A. C. Likas, and D. I. Fotiadis, "Artificial neural networks for solving ordinary and partial differential equations," IEEE Trans. Neural Netw., vol. 9, no. 5, pp. 987–1000, 1998.
[11] I. E. Lagaris, A. C. Likas, and D. G. Papageorgiou, "Neural-network methods for boundary value problems with irregular boundaries," IEEE Trans. Neural Netw., vol. 11, no. 5, pp. 1041–1049, 2000.
[12] B. P. Van Milligen, V. Tribaldos, and J. A. Jimenez, "Neural network differential equation and plasma equilibrium solver," Phys. Rev. Lett., vol. 75, no. 20, pp. 3594–3597, 1995.
[13] M. W. M. G. Dissanayake and N. Phan-Thien, "Neural-network-based approximations for solving partial differential equations," Commun. Numer. Meth. Eng., vol. 10, pp. 195–201, 1994.
[14] R. Masuoka, "Neural networks learning differential data," IEICE Trans. Inform. Syst., vol. E83-D, no. 6, pp. 1291–1300, 2000.
[15] D. C. Youla, "Generalized image restoration by the method of alternating orthogonal projections," IEEE Trans. Circuits Syst., vol. CAS-25, no. 9, pp. 694–702, 1978.
[16] D. C. Youla and H. Webb, "Image restoration by the method of convex projections: part I—theory," IEEE Trans. Med. Imag., vol. MI-1, no. 2, pp. 81–94, 1982.
[17] A. Lent and H. Tuy, "An iterative method for the extrapolation of band-limited functions," J. Math. Analysis and Applicat., vol. 83, pp. 554–565, 1981.
[18] W. Chen, "A new extrapolation algorithm for band-limited signals using the regularization method," IEEE Trans. Signal Process., vol. 41, no. 3, pp. 1048–1060, 1993.
[19] J. Takeuchi and Y. Kosugi, "Neural network representation of the finite element method," Neural Netw., vol. 7, no. 2, pp. 389–395, 1994.
[20] R. Sikora, J. Sikora, E. Cardelli, and T. Chady, "Artificial neural network application for material evaluation by electromagnetic methods," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4027–4032.
[21] G. Xu, G. Littlefair, R. Penson, and R. Callan, "Application of FE-based neural networks to dynamic problems," in Proc. Int. Conf. Neural Information Processing, vol. 3, 1999, pp. 1039–1044.
[22] F. Guo, P. Zhang, F. Wang, X. Ma, and G. Qiu, "Finite element analysis-based Hopfield neural network model for solving nonlinear electromagnetic field problems," in Proc. Int. Joint Conf. Neural Networks, vol. 6, 1999, pp. 4399–4403.
[23] H. Lee and I. S. Kang, "Neural algorithm for solving differential equations," J. Computat. Phys., vol. 91, pp. 110–131, 1990.
[24] J. Kalkkuhl, K. J. Hunt, and H. Fritz, "FEM-based neural-network approach to nonlinear modeling with application to longitudinal vehicle dynamics control," IEEE Trans. Neural Netw., vol. 10, no. 4, pp. 885–897, 1999.
[25] R. K. Mishra and P. S. Hall, "NFDTD concept," IEEE Trans. Neural Netw., vol. 16, no. 2, pp. 484–490, 2005.
[26] D. G. Triantafyllidis and D. P. Labridis, "A finite-element mesh generator based on growing neural networks," IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1482–1496, 2002.

Pradeep Ramuhalli (S'92–M'02) received the B.Tech. degree from J.N.T. University, Hyderabad, India, in electronics and communications engineering in 1995, and the M.S. and Ph.D. degrees in electrical engineering from Iowa State University, Ames, in 1998 and 2002, respectively.
He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. His research is in the general area of nondestructive evaluation and materials characterization. His research interests include the application of signal and image processing methods, pattern recognition and neural networks for nondestructive evaluation applications, development of model-based solutions for inverse problems in NDE, and the development of information fusion algorithms for multimodal data fusion.
Dr. Ramuhalli is a Member of Phi Kappa Phi.

Lalita Udpa (S'84–M'86–SM'96) received the Ph.D. degree in electrical engineering from Colorado State University, Fort Collins, in 1986.
She is currently a Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing. She works primarily in the broad areas of nondestructive evaluation, signal processing, and biomedical applications. Her research interests include various aspects of NDE, such as development of computational models for the forward problem in NDE, signal and image processing, pattern recognition and neural networks, and development of solution techniques for inverse problems. Her current projects include finite-element modeling of electromagnetic NDE phenomena, application of neural network and signal processing algorithms to NDE data, and development of image processing techniques for the analysis of NDE and biomedical images.
Dr. Udpa is a Member of Eta Kappa Nu and Sigma Xi.

Satish S. Udpa (S'82–M'82–SM'91–F'03) received the B.Tech. degree in 1975 and the Post Graduate Diploma in electrical engineering in 1977 from J.N.T. University, Hyderabad, India. He received the M.S. degree in 1980 and the Ph.D. degree in electrical engineering in 1983, both from Colorado State University, Fort Collins.
He has been with Michigan State University, East Lansing, since 2001 and is currently Acting Dean for the College of Engineering and a Professor with the Electrical and Computer Engineering Department. Prior to joining Michigan State, he was a Professor with Iowa State University, Ames, from 1990 to 2001 and was associated with the Materials Assessment Research Group. Prior to joining Iowa State, he was an Associate Professor with the Department of Electrical Engineering at Colorado State University. His research interests span the broad area of materials characterization and nondestructive evaluation (NDE). Work done by him to date in the area includes an extensive repertoire of forward models for simulating physical processes underlying several inspection techniques. Coupled with careful experimental work, such forward models can be used for designing new sensors, optimizing test conditions, estimating the probability of detection, assessing designs for inspectability, and training inverse models for characterizing defects. He has also been involved in the development of system-, as well as model-based, inverse solutions for defect and material property characterization. His interests have expanded in recent years to include the development of noninvasive tools for clinical applications. Work done to date in this field includes the development of new electromagnetic-acoustic (EMAT) methods for detecting single leg separation failures in artificial heart valves and microwave imaging and ablation therapy systems. He and his research group have been engaged in the design and development of high-performance instrumentation including acoustic microscopes and single and multifrequency eddy current NDE instruments.
These systems, as well as software packages embodying algorithms developed by Udpa for defect classification and characterization, have been licensed to industry.
He is a Fellow of the American Society for Nondestructive Testing (ASNT) and a Fellow of the Indian Society of Nondestructive Testing.