Unpacking Google's methodology for measuring AI inference energy, emissions, and water, and what "median prompt" actually means.
Draft · Elsworth et al., Aug 2025 · arXiv:2508.15734
Draft note: this explainer is still in progress and may change as I refine the framing, calculations, and sourcing.
How much energy does one AI prompt use?
Published estimates span an order of magnitude. If we can't agree on what a single prompt costs, how do we make procurement, architecture, or policy decisions?
Energy per prompt (Wh), published estimates:
- Google (this paper, 2025): 0.24 Wh
- Epoch AI (2024): 0.3 Wh
- EcoLogits (2024): 1.8–7 Wh
- De Vries (2023): 3 Wh
The paper & what it gets right
Elsworth et al. (Aug 2025) — Google infra team, Jeff Dean, David Patterson. The first detailed, first-party measurement of AI inference energy at production scale.
- First-party: measured inside Google's own infrastructure, not estimated from outside.
- Full-stack: covers accelerator, CPU, idle reserve, and facility overhead.
- Production-scale: fleet-wide telemetry across thousands of machines serving real traffic.
Measurement pipeline
Accelerator (TPU/GPU) + CPU & DRAM (host system) + Idle reserve (availability) → E_active
E_active → × PUE (1.09) for total energy; × emission factor (94 gCO2e/kWh) for emissions; × WUE (1.15 L/kWh) for water
In August 2025, Google published the first detailed, first-party measurement of AI inference environmental costs in production. The headline: a median Gemini text prompt uses 0.24 Wh of energy, less than nine seconds of TV viewing. But what goes into that number, and how is "median" defined?
01
Reproduce their numbers
Google's methodology measures four energy components for each prompt, then applies conversion factors to get emissions and water use. Adjust the sliders to see how each piece contributes. The defaults match Google's May 2025 numbers.
Energy components (defaults; slider ranges in parentheses)
- Active AI Accelerator: 0.14 Wh (0.01–0.5 Wh). TPU/GPU power during inference.
- CPU & DRAM: 0.06 Wh (0.01–0.2 Wh). Host system power for active machines.
- Idle Machines: 0.02 Wh (0–0.1 Wh). Reserved capacity for availability.
Conversion factors
- PUE: 1.09 (1.0–1.6). Power Usage Effectiveness; 1.0 means zero cooling overhead. Google fleet: 1.09. Industry average: about 1.55.
- Grid emission factor: 94 gCO2e/kWh (0–600 gCO2e/kWh). Google market-based: 94. US average: about 380. Coal-heavy grids: 500 and above.
- WUE: 1.15 L/kWh (0–3 L/kWh). Water Usage Effectiveness. Google: 1.15 L/kWh.
- Energy: 0.24 Wh (about 9 seconds of TV)
- Emissions: 0.023 g CO2e
- Water: 0.25 mL (about 5 drops)
Energy breakdown
The math:
E_active = 0.14 + 0.06 + 0.02 = 0.22 Wh
E_total = E_active × PUE = 0.22 × 1.09 = 0.24 Wh
CO2e = E_total × emission factor = 0.24 Wh × 94 gCO2e/kWh ÷ 1000 = 0.023 g
Water = E_active × WUE = 0.22 Wh × 1.15 L/kWh = 0.25 mL (note that WUE is applied to active energy, before the PUE multiplier)
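The arithmetic above can be sketched as a small function. This is a simplified reproduction of the paper's per-prompt accounting, not Google's actual telemetry pipeline; the function and parameter names are mine:

```python
def prompt_impact(accel_wh, cpu_dram_wh, idle_wh,
                  pue=1.09, grid_gco2e_per_kwh=94, wue_l_per_kwh=1.15):
    """Per-prompt energy (Wh), emissions (gCO2e), and water (mL).

    PUE scales active energy up to total energy; emissions are charged
    on total energy; water is charged on active energy (WUE is applied
    before the PUE multiplier, matching the worked example above).
    """
    e_active = accel_wh + cpu_dram_wh + idle_wh      # Wh at the machines
    e_total = e_active * pue                         # Wh including facility overhead
    co2e_g = e_total * grid_gco2e_per_kwh / 1000     # gCO2e (Wh -> kWh)
    water_ml = e_active * wue_l_per_kwh              # Wh x L/kWh lands in mL
    return e_total, co2e_g, water_ml

# Google's May 2025 defaults: 0.14 + 0.06 + 0.02 Wh
energy, co2e, water = prompt_impact(0.14, 0.06, 0.02)
print(round(energy, 2), round(co2e, 3), round(water, 2))  # 0.24 0.023 0.25
```

Swapping in the US-average grid factor (about 380 gCO2e/kWh) roughly quadruples the emissions figure while leaving energy and water unchanged, which is why the grid slider dominates the CO2e result.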
02
The median trick
Google reports the energy of the median prompt, but "median" here is not what many readers expect. Models are ranked from cheapest to most expensive per prompt, and the reported figure is the energy of whichever model serves the 50th-percentile prompt by volume. If a cheap model handles most traffic, the headline median stays cheap even when expensive models sit in the mix.
Models (defaults; sliders range 100–20,000 prompts/day and 0.01–3 Wh/prompt)
- Model 1: 5,000 prompts/day at 0.08 Wh/prompt
- Model 2: 3,000 prompts/day at 0.18 Wh/prompt
- Model 3: 1,500 prompts/day at 0.45 Wh/prompt
- Model 4: 500 prompts/day at 1.20 Wh/prompt
- Median prompt: 0.08 Wh (via Flash-Lite)
- Mean prompt: 0.22 Wh (177% higher than the median)
[Chart: cumulative prompt distribution, ranked from cheapest to most expensive model (Flash-Lite, Flash, Pro), with the p50 marker]
[Chart: energy per prompt by model]
Why it matters: The median is Flash-Lite at 0.08 Wh, but the volume-weighted mean is 0.22 Wh. The 177% gap means heavy users who trigger expensive models are not reflected in the headline figure. Try shifting prompt volume toward the expensive models to see the median jump.
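The median-vs-mean gap can be checked in a few lines. The numbers are the slider defaults above; the chart names only Flash-Lite, Flash, and Pro, so "model-4" is my placeholder for the unnamed fourth tier:

```python
# (model, prompts/day, Wh/prompt) from the slider defaults above.
# "model-4" is a placeholder name for the fourth, most expensive tier.
fleet = [
    ("Flash-Lite", 5000, 0.08),
    ("Flash",      3000, 0.18),
    ("Pro",        1500, 0.45),
    ("model-4",     500, 1.20),
]

def median_and_mean(fleet):
    """Google-style median (the model serving the 50th-percentile prompt,
    with models ranked cheapest-first) vs the volume-weighted mean."""
    total = sum(n for _, n, _ in fleet)
    cum = 0
    for name, n, wh in sorted(fleet, key=lambda m: m[2]):
        cum += n
        if cum >= total / 2:   # crossed the 50th-percentile prompt
            median = (name, wh)
            break
    mean = sum(n * wh for _, n, wh in fleet) / total
    return median, mean

median, mean = median_and_mean(fleet)
print(median, round(mean, 2))  # ('Flash-Lite', 0.08) 0.22
```

With these defaults Flash-Lite alone covers half of all prompts, so the median lands on the cheapest model; shifting a few thousand prompts/day toward Pro pushes the 50th-percentile prompt into a pricier tier and the median jumps.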
Steelman: defending the median
Before we critique, what's the strongest case for Google's choice? Is the median misleading, or is it just a different question than the one we want answered?
The median is the user experience
Most people interact with Flash or Flash-Lite. The median reflects what a typical user actually triggers, not the tail of agentic or multimodal heavy use.
Countering inflated estimates
Estimates of 3 Wh (De Vries) or 1.8–7 Wh (EcoLogits) were driving public alarm. Google may be anchoring the debate to measured reality rather than external speculation.
Different question, not a wrong one
They answered "what does a typical prompt cost?" — but practitioners need "what does the marginal prompt cost?" or "what does the fleet cost?" These are different questions with different answers.
Mean would be dominated by outliers
A small tail of expensive multimodal and agentic queries would pull the mean far from what most users experience. Median is more robust to this skew.
So what do we do?
Given finite time and attention, what's the highest-leverage thing this group can actually influence?
Open floor: what's the highest-leverage action for this community?
Source: Elsworth et al., "Measuring the environmental impact of delivering AI at Google Scale" (arXiv:2508.15734, Aug 2025). Numbers are based on Table 1 and Section 3 of the paper. This explainer is a simplified interactive reproduction; the actual methodology involves fleet-wide telemetry across thousands of machines.