I’m a proud plant parent, though admittedly I’m not always the best at keeping my plants alive. One of my favorites is the Chinese money plant.
I have killed 3 of these in the past 3 years. It goes like this: my plant is happy and green, I water it regularly and give it lots of sunlight, it stays happy and green, and starts to grow lots of new leaves. Then, out of nowhere, little black spots start to form and those leaves begin to wither and die.
“Clearly I’m not watering it enough, I’ll just add a bit more water!”
That killed my first one.
“Maybe I’m watering this one too much! I’ll hold back a bit this time.”
That killed my second one, even faster than the first.
When this started happening to my third money plant and I began to despair, I had a friend take a look at it.
“Looks like you have little bugs in the soil.”
I took a close look and sure enough, almost imperceptibly, there were tiny green bugs worming their way through the soil wreaking havoc on my poor plant. It was unfortunately too late at that point, but I knew for the future to defend my plants against these minuscule pests.
My problem wasn’t that I was a horrible and negligent plant owner; it was simply that I lacked data about the problem and a set of protocols to solve it.
When we lack an understanding of a problem and appropriate actions we can take to deal with it, we leave a lot of potential progress towards solving that problem on the table.
A foundation for green ML
During my PhD, I was trying to understand and solve problems in scientific fact checking with machine learning. Despite my concern about the effects of man-made climate change, the environmental impact of my own work wasn’t really something I thought about. I was perfectly content to run experiment after experiment in pursuit of that elusive exciting result.
Why would I have thought about environmental impact? Neither the incentives nor the awareness were there to motivate me towards greener practices. That was true until I started my postdoc, which is literally about sustainable machine learning.
The reality is, there are a lot of really simple things we can do to improve our awareness of the extent of emissions in ML, as well as reduce them outright: things that take only 2 lines of code or a tiny amount of planning. My colleague Raghav Selvan puts it well in his recent MICCAI paper, where he and his co-authors define THETA [1]:
Track-log-report for control and transparency
Hyperparameter optimization frameworks instead of grid search
Energy-efficient hardware and settings are useful
Training location and time of day/year are important
Automatic mixed precision training always
These five concepts serve as a solid, easy-to-adopt foundation for a sustainable mindset and workflow in machine learning. Anyone can apply them to both understand and reduce their emissions.
Track-log-report
The less data we have about the actual carbon emissions from machine learning, the harder it is to know how to appropriately address them. Without that data, you get papers with diametrically opposed conclusions: some analyses say that the carbon emissions from machine learning are increasing exponentially [8]; others say that emissions will plateau and then shrink [6].
Hans Rosling put it well in his book Factfulness in relation to tackling an outbreak of Ebola in West Africa in 2014 [2]:
If you can’t track progress, you don’t know whether your actions are working.
You need to collect data to know what to do. So to know how to mitigate the environmental impact of machine learning, we need to systematically track our emissions.
This is insanely easy today. It only takes a few lines of code with any of the major tools currently available.
CodeCarbon: 2 lines of code (https://github.com/mlco2/codecarbon)
from codecarbon import track_emissions

@track_emissions()
def your_function_to_track():
    # your code
    ...
Carbontracker: 5 lines of code (https://github.com/lfwa/carbontracker)
from carbontracker.tracker import CarbonTracker

tracker = CarbonTracker(epochs=max_epochs)

# Training loop.
for epoch in range(max_epochs):
    tracker.epoch_start()
    # Your model training.
    tracker.epoch_end()

# Optional: Add a stop in case of early termination before all
# monitor_epochs has been monitored to ensure that actual consumption is
# reported.
tracker.stop()
Experiment-Impact-Tracker: 3 lines of code (https://github.com/Breakend/experiment-impact-tracker)
from experiment_impact_tracker.compute_tracker import ImpactTracker
tracker = ImpactTracker(<your log directory here>)
tracker.launch_impact_monitor()
It sucks that there isn’t really any honor or glory in writing a paper where you “only” emitted 20 kg of CO2 vs 200 kg (for now [3]). But at least with tracking you can see when your project starts to emit more carbon than the yearly emissions of an average home in some wealthy region, and take appropriate action to curb this. And if we start to do this field-wide, we can begin to have a more informed discussion about where exactly to tackle environmental sustainability in ML.
Hyperparameter optimization
I love experiment trackers; my favorite is Weights & Biases. They have a really nice framework for running hyperparameter sweeps, with options for doing efficient searches. It’s incredibly easy to set up: you specify the type of search you want to perform in your configuration, so switching to a more efficient search strategy, e.g. random or Bayesian, is as simple as changing a single string, as in the sketch below. In addition to reducing energy consumption, it saves lots of time.
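For instance, a minimal sweep setup with the W&B Python API might look like this (a sketch: the project name is a placeholder, and train stands in for your own training function, which should read its hyperparameters from wandb.config):
import wandb

sweep_config = {
    "method": "bayes",  # change this single string: "grid", "random", or "bayes"
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "learning_rate": {"min": 1e-4, "max": 1e-1},
        "batch_size": {"values": [16, 32, 64]},
    },
}

sweep_id = wandb.sweep(sweep_config, project="my-project")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=20)  # cap the total number of runs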
Energy-efficient hardware
Different people have different constraints on the hardware available to them based on their budget and workflow, so it can be somewhat difficult to ensure you are using the most efficient hardware. Cloud data centers are a good option if you want more control over which hardware you select from. TPUs will generally be more energy efficient than GPUs, and you can also select data centers which are more efficient overall than the average data center (i.e. in terms of cooling and operating energy overhead) [4].
When selecting between GPUs, actual measurements of energy efficiency are the most reliable guide, though the thermal design power (TDP), a rough upper bound on the power draw of the device, combined with the expected performance of the device, can give a very loose estimate for comparison. If your setup supports Intel RAPL (for CPUs) and/or NVIDIA NVML (for GPUs), you can measure power consumption in real time to compare devices; these are the same interfaces used by all of the carbon tracking tools listed above for measuring real-time energy consumption.
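As a quick illustration, here’s how you might spot-check a GPU’s instantaneous power draw with the Python bindings for NVML (a sketch, assuming a single GPU at index 0 and the pynvml package installed):
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Instantaneous board power draw, reported by NVML in milliwatts.
power_mw = pynvml.nvmlDeviceGetPowerUsage(handle)
print(f"GPU power draw: {power_mw / 1000:.1f} W")

pynvml.nvmlShutdown()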
Training location and time of day/year
How urgent is it that you run your experiment right this very second?
Alright, there’s always a deadline around the corner. I’m also subject to the slot-machine-esque pull of running a million experiments in a row, all with tiny tweaks, in the hopes that one will have the particular configuration that really sets the model off. And I want those results right here, right now.
Carbon emissions are super dependent on where and when you run your code. This is because emissions are a function of energy consumption and carbon intensity. Carbon intensity is basically a measure of how clean the energy mixture is in a given location at a given time: higher carbon intensity means that every watt-hour of energy produces more carbon emissions. And carbon intensity can fluctuate much like the stock market.
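To make that concrete with hypothetical but plausible numbers: emissions (gCO2eq) are just energy (kWh) times carbon intensity (gCO2eq/kWh). A 10-hour run on a GPU drawing 300 W consumes 3 kWh; at a carbon intensity of 400 gCO2eq/kWh that run emits 1.2 kg CO2eq, while at 50 gCO2eq/kWh the very same run emits just 0.15 kg.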
So the carbon emissions from your experiments can vary dramatically depending on when (and where) you run them. Maybe you happen to be running your experiments at midnight in the middle of summer when the weather has been alright and it isn’t too hot or too cold and everyone is asleep, so demand is low and your local machine is running on power that mostly comes from hydroelectricity because you live in Norway. But maybe it’s actually 3pm and you live in southern California, and everyone is running their air conditioning and the grid is currently pulling in more power from gas or some other high emission source to compensate for increased demand.
Fortunately, you don’t have to factor in all of these possibilities to make a semi-educated guess about when is a “clean” time to run your experiments. You can take a cursory look at a website like ElectricityMaps to see what the local carbon intensity is in the region where your code will run. If it’s super high compared to the average, maybe just wait an hour to launch your job. Carbon intensity has a linear relationship with emissions, so if you start your job when the carbon intensity is half of what it was when you originally wanted to launch, and it stays that low, you’ve just cut your emissions in half. Again, it fluctuates a lot, so if you want to be fancy you can use a scheduling algorithm to start and stop your job and potentially reduce emissions by 20-80% [5]. You might also run your code in a data center which specifically pulls in cleaner energy [4].
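Here is a rough sketch of the “wait for a cleaner hour” idea. The endpoint and zone code follow the ElectricityMaps API docs but should be treated as assumptions; the token and launch_training_job are hypothetical placeholders:
import time
import requests

API_URL = "https://api.electricitymap.org/v3/carbon-intensity/latest"
TOKEN = "your-api-token"  # hypothetical placeholder
ZONE = "DK-DK2"           # example zone code (East Denmark)
THRESHOLD = 200           # gCO2eq/kWh; pick a sensible cutoff for your region

def current_carbon_intensity(zone):
    # Query the latest carbon intensity for the given zone.
    resp = requests.get(API_URL, params={"zone": zone},
                        headers={"auth-token": TOKEN})
    resp.raise_for_status()
    return resp.json()["carbonIntensity"]

# Naive version: sleep until the grid is cleaner, then launch.
while current_carbon_intensity(ZONE) > THRESHOLD:
    time.sleep(3600)
launch_training_job()  # hypothetical stand-in for your own launcher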
Automatic mixed precision training always
Mixed precision training represents parts of the model with 16-bit floating point numbers instead of 32-bit ones. This is a super minimal intervention which reduces the memory footprint of the model, speeds up training, reduces computation, and thus reduces energy consumption, all without sacrificing performance. In PyTorch, you can run mixed precision training with just a few modifications to your training loop:
import torch

use_amp = True

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for epoch in range(epochs):
    for input, target in zip(data, targets):
        # The forward pass runs in float16 where it is safe to do so.
        with torch.autocast(device_type='cuda',
                            dtype=torch.float16,
                            enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # Scale the loss to avoid float16 gradient underflow, then step.
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
TensorFlow also supports mixed precision training with relative ease.
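In Keras, for instance, it can be as simple as setting a global policy before building your model (a minimal sketch using the Keras mixed precision API):
import tensorflow as tf

# Compute in float16 where safe; variables stay in float32 for stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
# Build and train your model as usual.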
A note on efficient models
I’d add just one more easy efficiency lever to THETA, echoing a couple of recent papers [4,6]: use efficient versions of architectures where possible. You can use a distilled or quantized version of your favorite large transformer and fine-tune it to achieve almost the same results as the full model on many tasks. You can also use efficient versions of popular architecture types such as the Evolved Transformer and Primer. But make sure not to conflate number of parameters with efficiency: sometimes smaller models actually consume more energy [3].
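For instance, swapping in a distilled checkpoint is often a drop-in change; here is a sketch with Hugging Face Transformers (the checkpoint names are the public ones on the Hub, and the label count is illustrative):
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # instead of "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=2)
# Fine-tune as usual; per the DistilBERT paper, the distilled model retains
# ~97% of BERT's language understanding performance while being ~40% smaller.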
Take care of your plants
I’m a better plant parent now that I know to keep the bugs away; I just have to check for them and spray. A little knowledge and a few small actions have a strong positive impact, extending the life of my plants.
A sustainable mindset in machine learning is about having the right information to take appropriate action to minimize carbon emissions. We’ve been getting a sense of the carbon footprint of machine learning for a few years now [7]. It’s still hard to judge the environmental impact of machine learning as a whole (though people are trying [8]). Many conferences are doing a good job by asking for different metrics of compute and energy consumption in submissions; it wouldn’t be too much of an ask to also include carbon emissions, given the ease with which one can estimate them (and I am certainly not the first to call for this [3,5,7,8]).
Although I’m skeptical that we’ll see widespread adoption of efficient practices until the incentives are there (e.g. carbon leaderboards and energy badges to encourage competition for efficiency [3]), the nice thing is that it’s actually super easy to start to track and reduce emissions. A handful of lines of code and a few minutes of planning can have a large effect on the environmental impact of a project (see Selvan et al. 2022 [1] for an analysis in the medical imaging domain or these papers which talk about efficient practices at Google and Facebook [4,6,9]).
By adopting these practices, we can understand and minimize the emissions of machine learning. It’s a worthy goal: mitigating the extent of climate change means future generations can, like me, do battle against tiny little green bugs with an appetite for money plants. Hopefully their plants will fare better than mine.
References
[1] Selvan, Raghavendra, Nikhil Bhagwat, Lasse F. Wolff Anthony, Benjamin Kanding, and Erik B. Dam. "Carbon footprint of selecting and training deep learning models for medical image analysis." In Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part V, pp. 506-516. Cham: Springer Nature Switzerland, 2022.
[2] Rosling, Hans. Factfulness. Sceptre, 2018.
[3] Henderson, Peter, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. "Towards the systematic reporting of the energy and carbon footprints of machine learning." The Journal of Machine Learning Research 21, no. 1 (2020): 10039-10081.
[4] Patterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. "Carbon emissions and large neural network training." arXiv preprint arXiv:2104.10350 (2021).
[5] Dodge, Jesse, Taylor Prewitt, Remi Tachet des Combes, Erika Odmark, Roy Schwartz, Emma Strubell, Alexandra Sasha Luccioni, Noah A. Smith, Nicole DeCario, and Will Buchanan. "Measuring the Carbon Intensity of AI in Cloud Instances." In 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1877-1894. 2022.
[6] Patterson, David, Joseph Gonzalez, Urs Hölzle, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David R. So, Maud Texier, and Jeff Dean. "The carbon footprint of machine learning training will plateau, then shrink." Computer 55, no. 7 (2022): 18-28.
[7] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning in NLP." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645-3650. 2019.
[8] Luccioni, Alexandra Sasha, and Alex Hernandez-Garcia. "Counting Carbon: A Survey of Factors Influencing the Emissions of Machine Learning." arXiv preprint arXiv:2302.08476 (2023).
[9] Wu, Carole-Jean, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng, Gloria Chang et al. "Sustainable AI: Environmental Implications, Challenges and Opportunities." Proceedings of Machine Learning and Systems 4 (2022): 795-813.
I've been tracking some of the progress in hardware design that can address much of the energy loss in current GPU- and TPU-based architectures, mainly NVIDIA A100-class systems. The most advanced seems to be the WSE (Wafer Scale Engine) architecture from Cerebras. Rather than printing 49-50 usable NVIDIA GPUs on a wafer and deploying them into separate racks, Cerebras prints a single wafer with 850,000 cores all connected through extremely short conductors. The short distances relative to rack-based systems reduce the inductance losses of conventional circuits, saving huge amounts of energy while increasing the potential clock speeds and performance per watt.
Of course, this comes with a price tag of "several millions" for their latest CS-2 system, but with efficient sharing and scheduling of a central system this cost can be amortized over many users.
A very interesting publication comparing these systems is available here:
https://khairy2011.medium.com/tpu-vs-gpu-vs-cerebras-vs-graphcore-a-fair-comparison-between-ml-hardware-3f5a19d89e38
Thanks! It is easy, but as you say, you have to care enough to spend the 10 minutes learning how. How do we get people to care enough? Even I find it unreasonably hard to get my students to follow these simple steps, since it would mean having less time for the things they *are* rewarded for. What do you do, for example, when you work on or supervise a project that's not itself about efficiency/carbon etc.? How do you motivate yourself and others to follow the best practices?