Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell Ananya Ganesh Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, aganesh, mccallum}@cs.umass.edu
arXiv:1906.02243v1 [cs.CL] 5 Jun 2019

Abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Consumption                              CO2e (lbs)
Air travel, 1 passenger, NY↔SF                1,984
Human life, avg, 1 year                      11,023
American life, avg, 1 year                   36,156
Car, avg incl. fuel, 1 lifetime             126,000

Training one model (GPU)
NLP pipeline (parsing, SRL)                      39
  w/ tuning & experimentation                78,468
Transformer (big)                               192
  w/ neural architecture search             626,155

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.[1]

[1] Sources: (1) air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
1 Introduction

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.

Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited by the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.
2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
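As a rough illustration of the sampling procedure described above, the sketch below repeatedly polls nvidia-smi for instantaneous GPU power draw and averages the samples. The specific query flags, the one-second polling interval, and the function name are our own choices rather than details reported here, and CPU/DRAM sampling via Intel's RAPL interface is not shown.

    # Sketch: estimate average GPU power draw (watts) by polling nvidia-smi.
    # Assumes nvidia-smi is on PATH; the sampling window and interval are
    # illustrative, not the exact values used for the measurements above.
    import subprocess
    import time

    def average_gpu_power(duration_s=60, interval_s=1.0):
        """Return the average power draw (W), summed over all visible GPUs."""
        samples = []
        end = time.time() + duration_s
        while time.time() < end:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=power.draw",
                 "--format=csv,noheader,nounits"],
                encoding="utf-8")
            # One value per GPU per sample, e.g. "187.43"; sum across GPUs.
            samples.append(sum(float(x) for x in out.split()))
            time.sleep(interval_s)
        return sum(samples) / len(samples)

    if __name__ == "__main__":
        print(f"Average GPU power draw: {average_gpu_power():.1f} W")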
We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). The total power p_t consumed over t hours of training is then given by:

    p_t = 1.58 t (p_c + p_r + g p_g) / 1000                    (1)

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

    CO2e = 0.954 p_t                                           (2)
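The following minimal sketch implements Equations (1) and (2) directly, using the 1.58 PUE coefficient and the EPA's 0.954 lbs CO2 per kWh factor from the text; the function and argument names are our own.

    PUE = 1.58                # 2018 global average data center PUE (Ascierto, 2018)
    LBS_CO2E_PER_KWH = 0.954  # average U.S. emissions per kWh (EPA, 2018)

    def total_energy_kwh(t_hours, p_c, p_r, p_g, g):
        """Equation (1): combined GPU, CPU and DRAM energy in kWh, incl. PUE.

        t_hours -- training time t in hours
        p_c     -- average power draw (W) over all CPU sockets
        p_r     -- average power draw (W) over all DRAM sockets
        p_g     -- average power draw (W) of a single GPU
        g       -- number of GPUs used for training
        """
        return PUE * t_hours * (p_c + p_r + g * p_g) / 1000.0

    def co2e_lbs(kwh):
        """Equation (2): estimated CO2-equivalent emissions in pounds."""
        return LBS_CO2E_PER_KWH * kwh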
This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used.

Consumer          Renew.   Gas   Coal   Nuc.
China               22%     3%    65%    4%
Germany             40%     7%    38%   13%
United States       17%    35%    27%   19%
Amazon-AWS          17%    24%    30%   26%
Google              56%    14%    15%   10%
Microsoft           32%    23%    31%   10%

Table 2: Percent energy sourced from renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear sources for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.[6]

[6] Via the authors on Reddit.

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of the carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.
Model              Hardware    Power (W)   Hours     kWh·PUE    CO2e       Cloud compute cost
Transformer base   P100x8      1415.78     12        27         26         $41-$140
Transformer big    P100x8      1515.43     84        201        192        $289-$981
ELMo               P100x3      517.66      336       275        262        $433-$1472
BERT base          V100x64     12,041.51   79        1507       1438       $3751-$12,571
BERT base          TPUv2x16    —           96        —          —          $2074-$6912
NAS                P100x8      1515.43     274,120   656,347    626,155    $942,973-$3,201,722
NAS                TPUv2x1     —           32,623    —          —          $44,055-$146,848
GPT-2              TPUv3x32    —           168       —          —          $12,902-$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).[7] Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.

[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43-$0.74/hr; the upper bound uses on-demand U.S. resources priced at $1.46-$2.48/hr. We similarly use pre-emptible ($1.46/hr-$2.40/hr) and on-demand ($4.50/hr-$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
4 Experimental results

4.1 Cost of training

Table 3 lists CO2 emissions and estimated cost of training the models described in §2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.
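As a consistency check connecting Table 3 back to Section 2, the Transformer (big) row can be reproduced from its reported combined power draw (1515.43 W across 8 GPUs, CPU and DRAM) and training time (84 hours); this is only a re-application of Equations (1) and (2), not an independent measurement.

    # Reproduce the Transformer (big) row of Table 3.
    power_w, hours = 1515.43, 84                # combined draw and hours from Table 3
    kwh_pue = 1.58 * hours * power_w / 1000.0   # Equation (1)
    co2e = 0.954 * kwh_pue                      # Equation (2)
    print(f"{kwh_pue:.0f} kWh*PUE, {co2e:.0f} lbs CO2e")  # -> 201 kWh*PUE, 192 lbs CO2e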
4.2 Cost of development: Case study

To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.

Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length, ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.[8]

The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model.[9] We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

Models   Hours     Cloud compute (USD)   Electricity (USD)
1        120       $52-$175              $5
24       2880      $1238-$4205           $118
4789     239,942   $103k-$350k           $9870

Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune, and (3) all models trained during R&D.

[8] We approximate cloud compute cost using P100 pricing.
[9] Based on the average U.S. cost of electricity of $0.12/kWh.
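The Table 4 estimates follow the same recipe: multiply GPU hours by an hourly cloud price range for the compute bounds, and multiply estimated energy use (Equation (1)) by the average U.S. electricity price for the raw electricity cost. The sketch below illustrates that recipe; the $0.43/hr and $1.46/hr figures are presumably the pre-emptible and on-demand P100 ends of the ranges in footnote 7, and they reproduce the single-model row, while the electricity helper simply applies the $0.12/kWh rate from footnote 9.

    # Sketch of the Table 4 cost recipe (illustrative, not the authors' script).
    ELECTRICITY_USD_PER_KWH = 0.12   # average U.S. electricity price (footnote 9)

    def cloud_cost_bounds(gpu_hours, usd_per_hour_low, usd_per_hour_high):
        """Lower/upper cloud compute cost for a given number of GPU hours."""
        return gpu_hours * usd_per_hour_low, gpu_hours * usd_per_hour_high

    def electricity_cost(kwh):
        """Raw electricity cost for an estimated energy use in kWh."""
        return kwh * ELECTRICITY_USD_PER_KWH

    # Single-model row of Table 4: 120 hours of training.
    low, high = cloud_cost_bounds(120, 0.43, 1.46)   # pre-emptible vs. on-demand P100
    print(f"cloud compute: ${low:.0f}-${high:.0f}")  # -> $52-$175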
5 Conclusions

Authors should report training time and sensitivity to hyperparameters.

Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as re-training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.
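One possible instantiation of item (2) above is sketched below: it summarizes a hyperparameter search by the mean, variance and range of development-set scores across trials, alongside the total GPU hours spent. The record format and the particular statistics are illustrative assumptions on our part, not a standard this paper prescribes.

    # Sketch: report hyperparameter sensitivity and tuning cost from a list of
    # (hyperparameters, dev_score, gpu_hours) trial records (illustrative data).
    import statistics

    def summarize_search(trials):
        scores = [score for _, score, _ in trials]
        return {
            "num_trials": len(trials),
            "best_score": max(scores),
            "mean_score": statistics.mean(scores),
            "score_variance": statistics.pvariance(scores),
            "score_range": max(scores) - min(scores),
            "total_gpu_hours": sum(hours for _, _, hours in trials),
        }

    trials = [
        ({"lr": 1e-3, "dropout": 0.1}, 88.2, 48.0),
        ({"lr": 5e-4, "dropout": 0.3}, 89.1, 52.5),
        ({"lr": 1e-4, "dropout": 0.5}, 86.4, 61.0),
    ]
    print(summarize_search(trials))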
Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute.

Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, and will instead be constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for nonprofit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, money invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.

Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,[10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP. A sketch of this kind of integration follows below.

[10] For example, the Hyperopt Python library.
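As a concrete example of the integration suggested above, the sketch below tunes two hyperparameters with Bayesian (TPE) search using the Hyperopt library mentioned in footnote 10, in place of an exhaustive grid. The objective, search space and evaluation budget are placeholders, and train_and_evaluate is a hypothetical surrogate standing in for a real training and evaluation routine.

    # Sketch: Bayesian (TPE) hyperparameter search with Hyperopt instead of
    # brute-force grid search. train_and_evaluate is a hypothetical surrogate
    # standing in for an actual NLP training/evaluation run.
    from hyperopt import fmin, hp, tpe

    def train_and_evaluate(lr, dropout):
        # Placeholder surrogate returning a fake dev score so the sketch runs.
        return 90.0 - abs(lr - 1e-3) * 1000 - abs(dropout - 0.2) * 5

    def objective(params):
        score = train_and_evaluate(lr=params["lr"], dropout=params["dropout"])
        return -score  # Hyperopt minimizes, so negate the dev score.

    space = {
        "lr": hp.loguniform("lr", -9, -3),        # roughly 1e-4 to 5e-2
        "dropout": hp.uniform("dropout", 0.0, 0.5),
    }

    # 20 evaluations rather than an exhaustive grid over both dimensions.
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
    print(best)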
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References
Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477-484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).