Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell Ananya Ganesh Andrew McCallum
College of Information and Computer Sciences
University of Massachusetts Amherst
{strubell, aganesh, mccallum}@cs.umass.edu
arXiv:1906.02243v1 [cs.CL] 5 Jun 2019

Abstract

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Consumption                              CO2e (lbs)
Air travel, 1 passenger, NY↔SF                1,984
Human life, avg, 1 year                      11,023
American life, avg, 1 year                   36,156
Car, avg incl. fuel, 1 lifetime             126,000

Training one model (GPU)
NLP pipeline (parsing, SRL)                      39
  w/ tuning & experimentation                78,468
Transformer (big)                               192
  w/ neural architecture search             626,155

Table 1: Estimated CO2 emissions from training common NLP models, compared to familiar consumption.[1]

[1] Sources: (1) air travel and per-capita consumption: https://bit.ly/2Hw0xWc; (2) car lifetime: https://bit.ly/2Qbr0w1.
1 Introduction

Advances in techniques and hardware for training deep neural networks have recently enabled impressive accuracy improvements across many fundamental NLP tasks (Bahdanau et al., 2015; Luong et al., 2015; Dozat and Manning, 2017; Vaswani et al., 2017), with the most computationally-hungry models obtaining the highest scores (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019; So et al., 2019). As a result, training a state-of-the-art model now requires substantial computational resources which demand considerable energy, along with the associated financial and environmental costs. Research and development of new models multiplies these costs by thousands of times by requiring retraining to experiment with model architectures and hyperparameters. Whereas a decade ago most NLP models could be trained and developed on a commodity laptop or server, many now require multiple instances of specialized hardware such as GPUs or TPUs, therefore limiting access to these highly accurate models on the basis of finances.

Even when these expensive computational resources are available, model training also incurs a substantial cost to the environment due to the energy required to power this hardware for weeks or months at a time. Though some of this energy may come from renewable or carbon credit-offset resources, the high energy demands of these models are still a concern since (1) energy is not currently derived from carbon-neutral sources in many locations, and (2) when renewable energy is available, it is still limited by the equipment we have to produce and store it, and energy spent training a neural network might better be allocated to heating a family's home. It is estimated that we must cut carbon emissions by half over the next decade to deter escalating rates of natural disaster, and based on the estimated CO2 emissions listed in Table 1, model training and development likely make up a substantial portion of the greenhouse gas emissions attributed to many NLP researchers.

To heighten the awareness of the NLP community to this issue and promote mindful practice and policy, we characterize the dollar cost and carbon emissions that result from training the neural networks at the core of many state-of-the-art NLP models. We do this by estimating the kilowatts of energy required to train a variety of popular off-the-shelf NLP models, which can be converted to approximate carbon emissions and electricity costs. To estimate the even greater resources required to transfer an existing model to a new task or develop new models, we perform a case study of the full computational resources required for the development and tuning of a recent state-of-the-art NLP pipeline (Strubell et al., 2018). We conclude with recommendations to the community based on our findings, namely: (1) time to retrain and sensitivity to hyperparameters should be reported for NLP machine learning models; (2) academic researchers need equitable access to computational resources; and (3) researchers should prioritize developing efficient models and hardware.
2 Methods

To quantify the computational and environmental cost of training deep neural network models for NLP, we perform an analysis of the energy required to train a variety of popular off-the-shelf NLP models, as well as a case study of the complete sum of resources required to develop LISA (Strubell et al., 2018), a state-of-the-art NLP model from EMNLP 2018, including all tuning and experimentation.

We measure energy use as follows. We train the models described in §2.1 using the default settings provided, and sample GPU and CPU power consumption during training. Each model was trained for a maximum of 1 day. We train all models on a single NVIDIA Titan X GPU, with the exception of ELMo, which was trained on 3 NVIDIA GTX 1080 Ti GPUs. While training, we repeatedly query the NVIDIA System Management Interface[2] to sample the GPU power consumption and report the average over all samples. To sample CPU power consumption, we use Intel's Running Average Power Limit interface.[3]

[2] nvidia-smi: https://bit.ly/30sGEbi
[3] RAPL power meter: https://bit.ly/2LObQhV
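As a rough illustration of the sampling procedure described above, the sketch below repeatedly polls nvidia-smi for instantaneous GPU power draw and averages the samples. The specific query flags, the one-second polling interval, and the function name are our own choices rather than details reported here, and CPU/DRAM sampling via Intel's RAPL interface is not shown.

    # Sketch: estimate average GPU power draw (watts) by polling nvidia-smi.
    # Assumes nvidia-smi is on PATH; the sampling window and interval are
    # illustrative, not the exact values used for the measurements above.
    import subprocess
    import time

    def average_gpu_power(duration_s=60, interval_s=1.0):
        """Return the average power draw (W), summed over all visible GPUs."""
        samples = []
        end = time.time() + duration_s
        while time.time() < end:
            out = subprocess.check_output(
                ["nvidia-smi", "--query-gpu=power.draw",
                 "--format=csv,noheader,nounits"],
                encoding="utf-8")
            # One value per GPU per sample, e.g. "187.43"; sum across GPUs.
            samples.append(sum(float(x) for x in out.split()))
            time.sleep(interval_s)
        return sum(samples) / len(samples)

    if __name__ == "__main__":
        print(f"Average GPU power draw: {average_gpu_power():.1f} W")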
We estimate the total time expected for models to train to completion using training times and hardware reported in the original papers. We then calculate the power consumption in kilowatt-hours (kWh) as follows. Let p_c be the average power draw (in watts) from all CPU sockets during training, let p_r be the average power draw from all DRAM (main memory) sockets, let p_g be the average power draw of a GPU during training, and let g be the number of GPUs used to train. We estimate total power consumption as combined GPU, CPU and DRAM consumption, then multiply this by Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure (mainly cooling). We use a PUE coefficient of 1.58, the 2018 global average for data centers (Ascierto, 2018). The total power p_t consumed over t hours of training is then given by:

    p_t = 1.58 t (p_c + p_r + g p_g) / 1000                    (1)

The U.S. Environmental Protection Agency (EPA) provides average CO2 produced (in pounds per kilowatt-hour) for power consumed in the U.S. (EPA, 2018), which we use to convert power to estimated CO2 emissions:

    CO2e = 0.954 p_t                                           (2)
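The following minimal sketch implements Equations (1) and (2) directly, using the 1.58 PUE coefficient and the EPA's 0.954 lbs CO2 per kWh factor from the text; the function and argument names are our own.

    PUE = 1.58                # 2018 global average data center PUE (Ascierto, 2018)
    LBS_CO2E_PER_KWH = 0.954  # average U.S. emissions per kWh (EPA, 2018)

    def total_energy_kwh(t_hours, p_c, p_r, p_g, g):
        """Equation (1): combined GPU, CPU and DRAM energy in kWh, incl. PUE.

        t_hours -- training time t in hours
        p_c     -- average power draw (W) over all CPU sockets
        p_r     -- average power draw (W) over all DRAM sockets
        p_g     -- average power draw (W) of a single GPU
        g       -- number of GPUs used for training
        """
        return PUE * t_hours * (p_c + p_r + g * p_g) / 1000.0

    def co2e_lbs(kwh):
        """Equation (2): estimated CO2-equivalent emissions in pounds."""
        return LBS_CO2E_PER_KWH * kwh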
This conversion takes into account the relative proportions of different energy sources (primarily natural gas, coal, nuclear and renewable) consumed to produce energy in the United States. Table 2 lists the relative energy sources for China, Germany and the United States compared to the top three cloud service providers. The U.S. breakdown of energy is comparable to that of the most popular cloud compute service, Amazon Web Services, so we believe this conversion to provide a reasonable estimate of CO2 emissions per kilowatt-hour of compute energy used.

Consumer          Renew.   Gas   Coal   Nuc.
China               22%     3%    65%    4%
Germany             40%     7%    38%   13%
United States       17%    35%    27%   19%
Amazon-AWS          17%    24%    30%   26%
Google              56%    14%    15%   10%
Microsoft           32%    23%    31%   10%

Table 2: Percent energy sourced from renewable (e.g. hydro, solar, wind), natural gas, coal and nuclear sources for the top 3 cloud compute providers (Cook et al., 2017), compared to the United States,[4] China[5] and Germany (Burger, 2019).

[4] U.S. Dept. of Energy: https://bit.ly/2JTbGnI
[5] China Electricity Council; trans. China Energy Portal: https://bit.ly/2QHE5O3
2.1 Models

We analyze four models, the computational requirements of which we describe below. All models have code freely available online, which we used out-of-the-box. For more details on the models themselves, please refer to the original papers.

Transformer. The Transformer model (Vaswani et al., 2017) is an encoder-decoder architecture primarily recognized for efficient and accurate machine translation. The encoder and decoder each consist of 6 stacked layers of multi-head self-attention. Vaswani et al. (2017) report that the Transformer base model (65M parameters) was trained on 8 NVIDIA P100 GPUs for 12 hours, and the Transformer big model (213M parameters) was trained for 3.5 days (84 hours; 300k steps). This model is also the basis for recent work on neural architecture search (NAS) for machine translation and language modeling (So et al., 2019), and the NLP pipeline that we study in more detail in §4.2 (Strubell et al., 2018). So et al. (2019) report that their full architecture search ran for a total of 979M training steps, and that their base model requires 10 hours to train for 300k steps on one TPUv2 core. This equates to 32,623 hours of TPU or 274,120 hours on 8 P100 GPUs.

ELMo. The ELMo model (Peters et al., 2018) is based on stacked LSTMs and provides rich word representations in context by pre-training on a large amount of data using a language modeling objective. Replacing context-independent pre-trained word embeddings with ELMo has been shown to increase performance on downstream tasks such as named entity recognition, semantic role labeling, and coreference. Peters et al. (2018) report that ELMo was trained on 3 NVIDIA GTX 1080 GPUs for 2 weeks (336 hours).

BERT. The BERT model (Devlin et al., 2019) provides a Transformer-based architecture for building contextual representations similar to ELMo, but trained with a different language modeling objective. BERT substantially improves accuracy on tasks requiring sentence-level representations such as question answering and natural language inference. Devlin et al. (2019) report that the BERT base model (110M parameters) was trained on 16 TPU chips for 4 days (96 hours). NVIDIA reports that they can train a BERT model in 3.3 days (79.2 hours) using 4 DGX-2H servers, totaling 64 Tesla V100 GPUs (Forster et al., 2019).

GPT-2. This model is the latest edition of OpenAI's GPT general-purpose token encoder, also based on Transformer-style self-attention and trained with a language modeling objective (Radford et al., 2019). By training a very large model on massive data, Radford et al. (2019) show high zero-shot performance on question answering and language modeling benchmarks. The large model described in Radford et al. (2019) has 1542M parameters and is reported to require 1 week (168 hours) of training on 32 TPUv3 chips.[6]

[6] Via the authors on Reddit.

3 Related work

There is some precedent for work characterizing the computational requirements of training and inference in modern neural network architectures in the computer vision community. Li et al. (2016) present a detailed study of the energy use required for training and inference in popular convolutional models for image classification in computer vision, including fine-grained analysis comparing different neural network layer types. Canziani et al. (2016) assess image classification model accuracy as a function of model size and gigaflops required during inference. They also measure average power draw required during inference on GPUs as a function of batch size. Neither work analyzes the recurrent and self-attention models that have become commonplace in NLP, nor do they extrapolate power to estimates of the carbon and dollar cost of training.

Analysis of hyperparameter tuning has been performed in the context of improved algorithms for hyperparameter search (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). To our knowledge there exists to date no analysis of the computation required for R&D and hyperparameter tuning of neural network models in NLP.
Model              Hardware    Power (W)   Hours     kWh·PUE    CO2e       Cloud compute cost
Transformer base   P100x8      1415.78     12        27         26         $41-$140
Transformer big    P100x8      1515.43     84        201        192        $289-$981
ELMo               P100x3      517.66      336       275        262        $433-$1472
BERT base          V100x64     12,041.51   79        1507       1438       $3751-$12,571
BERT base          TPUv2x16    —           96        —          —          $2074-$6912
NAS                P100x8      1515.43     274,120   656,347    626,155    $942,973-$3,201,722
NAS                TPUv2x1     —           32,623    —          —          $44,055-$146,848
GPT-2              TPUv3x32    —           168       —          —          $12,902-$43,008

Table 3: Estimated cost of training a model in terms of CO2 emissions (lbs) and cloud compute cost (USD).[7] Power and carbon footprint are omitted for TPUs due to lack of public information on power draw for this hardware.

[7] GPU lower bound computed using pre-emptible P100/V100 U.S. resources priced at $0.43-$0.74/hr; the upper bound uses on-demand U.S. resources priced at $1.46-$2.48/hr. We similarly use pre-emptible ($1.46/hr-$2.40/hr) and on-demand ($4.50/hr-$8/hr) pricing as lower and upper bounds for TPU v2/3; cheaper bulk contracts are available.
4 Experimental results

4.1 Cost of training

Table 3 lists CO2 emissions and estimated cost of training the models described in §2.1. Of note is that TPUs are more cost-efficient than GPUs on workloads that make sense for that hardware (e.g. BERT). We also see that models emit substantial carbon emissions; training BERT on GPU is roughly equivalent to a trans-American flight. So et al. (2019) report that NAS achieves a new state-of-the-art BLEU score of 29.7 for English to German machine translation, an increase of just 0.1 BLEU at the cost of at least $150k in on-demand compute time and non-trivial carbon emissions.
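As a consistency check connecting Table 3 back to Section 2, the Transformer (big) row can be reproduced from its reported combined power draw (1515.43 W across 8 GPUs, CPU and DRAM) and training time (84 hours); this is only a re-application of Equations (1) and (2), not an independent measurement.

    # Reproduce the Transformer (big) row of Table 3.
    power_w, hours = 1515.43, 84                # combined draw and hours from Table 3
    kwh_pue = 1.58 * hours * power_w / 1000.0   # Equation (1)
    co2e = 0.954 * kwh_pue                      # Equation (2)
    print(f"{kwh_pue:.0f} kWh*PUE, {co2e:.0f} lbs CO2e")  # -> 201 kWh*PUE, 192 lbs CO2e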
4.2 Cost of development: Case study

To quantify the computational requirements of R&D for a new model we study the logs of all training required to develop Linguistically-Informed Self-Attention (Strubell et al., 2018), a multi-task model that performs part-of-speech tagging, labeled dependency parsing, predicate detection and semantic role labeling. This model makes for an interesting case study as a representative NLP pipeline and as a Best Long Paper at EMNLP.

Model training associated with the project spanned a period of 172 days (approx. 6 months). During that time 123 small hyperparameter grid searches were performed, resulting in 4789 jobs in total. Jobs varied in length, ranging from a minimum of 3 minutes, indicating a crash, to a maximum of 9 days, with an average job length of 52 hours. All training was done on a combination of NVIDIA Titan X (72%) and M40 (28%) GPUs.[8]

The sum GPU time required for the project totaled 9998 days (27 years). This averages to about 60 GPUs running constantly throughout the 6 month duration of the project. Table 4 lists upper and lower bounds of the estimated cost in terms of Google Cloud compute and raw electricity required to develop and deploy this model.[9] We see that while training a single model is relatively inexpensive, the cost of tuning a model for a new dataset, which we estimate here to require 24 jobs, or performing the full R&D required to develop this model, quickly becomes extremely expensive.

Models   Hours     Cloud compute (USD)   Electricity (USD)
1        120       $52-$175              $5
24       2880      $1238-$4205           $118
4789     239,942   $103k-$350k           $9870

Table 4: Estimated cost in terms of cloud compute and electricity for training: (1) a single model, (2) a single tune, and (3) all models trained during R&D.

[8] We approximate cloud compute cost using P100 pricing.
[9] Based on the average U.S. cost of electricity of $0.12/kWh.
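The Table 4 estimates follow the same recipe: multiply GPU hours by an hourly cloud price range for the compute bounds, and multiply estimated energy use (Equation (1)) by the average U.S. electricity price for the raw electricity cost. The sketch below illustrates that recipe; the $0.43/hr and $1.46/hr figures are presumably the pre-emptible and on-demand P100 ends of the ranges in footnote 7, and they reproduce the single-model row, while the electricity helper simply applies the $0.12/kWh rate from footnote 9.

    # Sketch of the Table 4 cost recipe (illustrative, not the authors' script).
    ELECTRICITY_USD_PER_KWH = 0.12   # average U.S. electricity price (footnote 9)

    def cloud_cost_bounds(gpu_hours, usd_per_hour_low, usd_per_hour_high):
        """Lower/upper cloud compute cost for a given number of GPU hours."""
        return gpu_hours * usd_per_hour_low, gpu_hours * usd_per_hour_high

    def electricity_cost(kwh):
        """Raw electricity cost for an estimated energy use in kWh."""
        return kwh * ELECTRICITY_USD_PER_KWH

    # Single-model row of Table 4: 120 hours of training.
    low, high = cloud_cost_bounds(120, 0.43, 1.46)   # pre-emptible vs. on-demand P100
    print(f"cloud compute: ${low:.0f}-${high:.0f}")  # -> $52-$175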
5 Conclusions

Authors should report training time and sensitivity to hyperparameters.

Our experiments suggest that it would be beneficial to directly compare different models to perform a cost-benefit (accuracy) analysis. To address this, when proposing a model that is meant to be re-trained for downstream use, such as re-training on a new domain or fine-tuning on a new task, authors should report training time and computational resources required, as well as model sensitivity to hyperparameters. This will enable direct comparison across models, allowing subsequent consumers of these models to accurately assess whether the required computational resources are compatible with their setting. More explicit characterization of tuning time could also reveal inconsistencies in time spent tuning baseline models compared to proposed contributions. Realizing this will require: (1) a standard, hardware-independent measurement of training time, such as gigaflops required to convergence, and (2) a standard measurement of model sensitivity to data and hyperparameters, such as variance with respect to hyperparameters searched.
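One possible instantiation of item (2) above is sketched below: it summarizes a hyperparameter search by the mean, variance and range of development-set scores across trials, alongside the total GPU hours spent. The record format and the particular statistics are illustrative assumptions on our part, not a standard this paper prescribes.

    # Sketch: report hyperparameter sensitivity and tuning cost from a list of
    # (hyperparameters, dev_score, gpu_hours) trial records (illustrative data).
    import statistics

    def summarize_search(trials):
        scores = [score for _, score, _ in trials]
        return {
            "num_trials": len(trials),
            "best_score": max(scores),
            "mean_score": statistics.mean(scores),
            "score_variance": statistics.pvariance(scores),
            "score_range": max(scores) - min(scores),
            "total_gpu_hours": sum(hours for _, _, hours in trials),
        }

    trials = [
        ({"lr": 1e-3, "dropout": 0.1}, 88.2, 48.0),
        ({"lr": 5e-4, "dropout": 0.3}, 89.1, 52.5),
        ({"lr": 1e-4, "dropout": 0.5}, 86.4, 61.0),
    ]
    print(summarize_search(trials))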
Academic researchers need equitable access to computation resources.

Recent advances in available compute come at a high price not attainable to all who desire access. Most of the models studied in this paper were developed outside academia; recent improvements in state-of-the-art accuracy are possible thanks to industry access to large-scale compute.

Limiting this style of research to industry labs hurts the NLP research community in many ways. First, it stifles creativity. Researchers with good ideas but without access to large-scale compute will simply not be able to execute their ideas, and will instead be constrained to focus on different problems. Second, it prohibits certain types of research on the basis of access to financial resources. This even more deeply promotes the already problematic "rich get richer" cycle of research funding, where groups that are already successful and thus well-funded tend to receive more funding due to their existing accomplishments. Third, the prohibitive start-up cost of building in-house resources forces resource-poor groups to rely on cloud compute services such as AWS, Google Cloud and Microsoft Azure.

While these services provide valuable, flexible, and often relatively environmentally friendly compute resources, it is more cost effective for academic researchers, who often work for nonprofit educational institutions and whose research is funded by government entities, to pool resources to build shared compute centers at the level of funding agencies, such as the U.S. National Science Foundation. For example, an off-the-shelf GPU server containing 8 NVIDIA 1080 Ti GPUs and supporting hardware can be purchased for approximately $20,000 USD. At that cost, the hardware required to develop the model in our case study (approximately 58 GPUs for 172 days) would cost $145,000 USD plus electricity, about half the estimated cost to use on-demand cloud GPUs. Unlike money spent on cloud compute, however, money invested in centralized resources would continue to pay off as resources are shared across many projects. A government-funded academic compute cloud would provide equitable access to all researchers.

Researchers should prioritize computationally efficient hardware and algorithms.

We recommend a concerted effort by industry and academia to promote research of more computationally efficient algorithms, as well as hardware that requires less energy. An effort can also be made in terms of software. There is already a precedent for NLP software packages prioritizing efficient models. An additional avenue through which NLP and machine learning software developers could aid in reducing the energy associated with model tuning is by providing easy-to-use APIs implementing more efficient alternatives to brute-force grid search for hyperparameter tuning, e.g. random or Bayesian hyperparameter search techniques (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). While software packages implementing these techniques do exist,[10] they are rarely employed in practice for tuning NLP models. This is likely because their interoperability with popular deep learning frameworks such as PyTorch and TensorFlow is not optimized, i.e. there are not simple examples of how to tune TensorFlow Estimators using Bayesian search. Integrating these tools into the workflows with which NLP researchers and practitioners are already familiar could have notable impact on the cost of developing and tuning in NLP. A sketch of this kind of integration follows below.

[10] For example, the Hyperopt Python library.
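As a concrete example of the integration suggested above, the sketch below tunes two hyperparameters with Bayesian (TPE) search using the Hyperopt library mentioned in footnote 10, in place of an exhaustive grid. The objective, search space and evaluation budget are placeholders, and train_and_evaluate is a hypothetical surrogate standing in for a real training and evaluation routine.

    # Sketch: Bayesian (TPE) hyperparameter search with Hyperopt instead of
    # brute-force grid search. train_and_evaluate is a hypothetical surrogate
    # standing in for an actual NLP training/evaluation run.
    from hyperopt import fmin, hp, tpe

    def train_and_evaluate(lr, dropout):
        # Placeholder surrogate returning a fake dev score so the sketch runs.
        return 90.0 - abs(lr - 1e-3) * 1000 - abs(dropout - 0.2) * 5

    def objective(params):
        score = train_and_evaluate(lr=params["lr"], dropout=params["dropout"])
        return -score  # Hyperopt minimizes, so negate the dev score.

    space = {
        "lr": hp.loguniform("lr", -9, -3),        # roughly 1e-4 to 5e-2
        "dropout": hp.uniform("dropout", 0.0, 0.5),
    }

    # 20 evaluations rather than an exhaustive grid over both dimensions.
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20)
    print(best)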
Acknowledgements

We are grateful to Sherief Farouk and the anonymous reviewers for helpful feedback on earlier drafts. This work was supported in part by the Centers for Data Science and Intelligent Information Retrieval, the Chan Zuckerberg Initiative under the Scientific Knowledge Base Construction project, the IBM Cognitive Horizons Network agreement no. W1668553, and National Science Foundation grant no. IIS-1514053. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References
Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR), San Diego, California, USA.

James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281-305.

James S. Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, pages 2546-2554.

Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.

Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. 2016. An analysis of deep neural network models for practical applications.

Gary Cook, Jude Lee, Tamina Tsai, Ada Kongn, John Deans, Brian Johnson, Elizabeth Jardim, and Brian Johnson. 2017. Clicking Clean: Who is winning the race to build a green internet? Technical report, Greenpeace.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

EPA. 2018. Emissions & Generation Resource Integrated Database (eGRID). Technical report, U.S. Environmental Protection Agency.

Christopher Forster, Thor Johnsen, Swetha Mandava, Sharath Turuvekere Sreenivas, Deyu Fu, Julie Bernauer, Allison Gray, Sharan Chetlur, and Raul Puri. 2019. BERT Meets GPUs. Technical report, NVIDIA AI.

Da Li, Xinbo Chen, Michela Becchi, and Ziliang Zong. 2016. Evaluating the energy efficiency of deep convolutional neural networks on CPUs and GPUs. 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 477-484.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412-1421. Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. 2012. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951-2959.

David R. So, Chen Liang, and Quoc V. Le. 2019. The evolved transformer. In Proceedings of the 36th International Conference on Machine Learning (ICML).

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).