Software Consumes Power
When I was still studying electronics in college, I remember a project in microcontrollers class where the professor let us choose what we wanted to do. At that time, I was almost obsessed with neural networks and proposed to the professor that I would implement a network on a PIC16F877A (my favorite). It was a very simple implementation, no more than 64 nodes, which I trained to recognize patterns that I entered manually using an array of jumpers.
Sometime after starting the training, I noticed that the microcontroller began to heat up, to the point of losing the serial connection I was using to monitor it. My partner and I decided to try again with a small fan, which, to our surprise, allowed the training to be completed. Against all odds, the experiment worked, the grade was good, and I passed Microcontrollers 2.
This was nowhere near one of the most impressive projects I did during my student days, but I always remember it as an important moment in my training as an engineer. It was the first time I could clearly see that software consumes power; a for loop can drain a battery and divert important energy resources. Nowadays it seems obvious, but I think we forget more often than we should.
A lot of time has passed, and many things have changed since then. Now we have large language models (LLMs), a type of artificial intelligence designed to interpret and generate text that looks like it was written by humans. These models can answer questions, hold conversations, tell stories, and more.
The basic structure of LLMs is the same as the code that ran in my little Micros 2 project: they use artificial neural networks, which are inspired by the way natural neurons in our brains process information. Artificial neural networks are made up of layers of neurons that each perform the same simple computation, and the interconnections between neurons allow the network to encode and recognize patterns.
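To make that computation concrete, here is a minimal sketch of a single artificial neuron: it weighs its inputs, sums them, and squashes the result through an activation function. The weights and inputs are made-up illustrative numbers, not taken from any real model.

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus a bias term
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation squashes the output into the range (0, 1)
    return 1 / (1 + math.exp(-total))

# Three inputs, three illustrative weights
output = neuron([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-0.2)
print(round(output, 3))  # 0.75
```

A network is just many of these stacked in layers, with each neuron's output feeding the inputs of the next layer.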
In my experiment, the patterns the network learned were configured with an array of 8 switches. To create an LLM, a huge amount of text (books, articles, websites) is used as input to the neural network. The model learns patterns that allow it to predict the next word in a sentence based on the previous words. (Sorry to spoil the magic, but that’s what ChatGPT really does: predict the next word.) Through this training process, the model develops an apparent ability to understand language, grammar, and knowledge about the presented contexts.
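A toy way to see "predict the next word" in action: count which word follows which in a small corpus, then pick the most frequent successor. An LLM learns this from billions of parameters rather than a count table, but the objective is the same. The corpus here is an invented example.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate the fish".split()

# For each word, count the words observed immediately after it
successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(word):
    # Return the most common word seen after `word`
    return successors[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat": it follows "the" most often here
```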
According to information provided by OpenAI, GPT-3, a model developed in 2020, is made up of 96 layers of 12,288 neurons each, totalling around 1.18 million neurons, certainly far from the 64 nodes I presented to my professor. The complexity of GPT-3 is better reflected in its 175 billion parameters (the name given to the interconnections between its neurons) than in the neuron count itself.
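The neuron count is simple arithmetic on the figures above:

```python
layers = 96
neurons_per_layer = 12_288

print(layers * neurons_per_layer)  # 1179648, about 1.18 million neurons
```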
Having this context makes it easier to understand how the computational demand of these models drives their intense energy consumption. In the training phase, vast volumes of text are processed using computationally complex algorithms, such as backpropagation, that involve intensive mathematical calculations over billions of parameters. The large size and intricate complexity of LLMs, combined with iterative training phases and real-time processing during inference, require particularly powerful hardware in continuous operation.
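A minimal sketch of the kind of calculation backpropagation repeats billions of times: compute an error, take its gradient with respect to each parameter, and nudge the parameters downhill. Here it is reduced to one neuron with one weight and a squared-error loss; the numbers are illustrative.

```python
w = 0.0                 # the single trainable parameter
x, target = 2.0, 1.0    # one training example
lr = 0.1                # learning rate

for _ in range(50):
    pred = w * x                    # forward pass
    grad = 2 * (pred - target) * x  # dLoss/dw for loss = (pred - target)**2
    w -= lr * grad                  # gradient descent step

print(round(w, 3))  # converges toward 0.5, since 0.5 * 2.0 = 1.0
```

Scale that loop up to 175 billion parameters, repeated over terabytes of text, and the energy bill starts to make sense.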
How much energy are we talking about? Reports of leaked information from OpenAI about GPT-4 training mention 25,000 Nvidia A100 GPUs running for more than 90 days. The estimated consumption of that configuration was around 51,000 to 62,000 MWh, roughly equivalent to 5 or 6 years of energy consumption for 1,000 average American homes, or 5,700 average Colombian homes. Alex de Vries, a PhD candidate at the University of Amsterdam, estimated that a single average interaction between a user and ChatGPT uses the same amount of energy as an LED light bulb on for an hour.
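The household comparison checks out with back-of-the-envelope arithmetic. The average-consumption figure here is my own assumption (roughly 10.6 MWh per year per US home, a commonly cited EIA ballpark), not a number from the leaked reports.

```python
training_mwh = 62_000      # upper estimate for GPT-4 training
home_mwh_per_year = 10.6   # assumed average US household consumption
homes = 1_000

years = training_mwh / (homes * home_mwh_per_year)
print(round(years, 1))  # roughly 5.8 years for 1,000 homes
```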
The current trend in LLM development seems aimed at smaller models, some of which can now run on phones. Techniques such as pruning and distillation allow the creation of smaller models derived from larger ones while retaining much of their performance. Also helping are developments in TPUs, improvements in GPUs, and the increasing adoption of ARM-based architectures, which have shown better energy performance than CISC x86 designs.
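As a flavor of how these techniques work, here is a sketch of magnitude pruning: weights close to zero contribute little, so they are zeroed out, shrinking the effective model. Real pruning operates on tensors of millions of weights and usually retrains afterward; this list and threshold are invented for illustration.

```python
weights = [0.91, -0.02, 0.44, 0.003, -0.76, 0.01]
threshold = 0.05  # weights smaller than this are considered negligible

# Zero out every weight whose magnitude falls below the threshold
pruned = [w if abs(w) >= threshold else 0.0 for w in weights]
print(pruned)  # [0.91, 0.0, 0.44, 0.0, -0.76, 0.0]
```

Zeroed weights can be skipped or stored sparsely, which is where the energy and memory savings come from.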
The future of this technology is promising, and its adoption will continue to grow, probably even more rapidly than at present. As software developers, we must assume the responsibility of staying informed and aware of the energy consumption and efficiency of the solutions we create.
My PIC16F877A needed a fan to finish training 64 nodes. We gave GPT-4 the equivalent of a small power plant for three months. The scale is different. The principle is the same.