AlphaEvolve is a Gemini-based evolutionary coding agent released by Google DeepMind. It combines the creative proposals of a large language model with verification by an automated evaluator inside an evolutionary framework to discover and optimize general-purpose algorithms. Rather than humans telling it the answer, it is given only the problem and the evaluation criteria, then generates candidate solutions, scores them itself, and evolves toward better ones.

How does AlphaEvolve work?

Gemini proposes code variations (candidate solutions), an automated evaluator actually runs them and scores them, and the higher-scoring candidates survive to become the starting point for the next generation. This propose → verify → select → repeat cycle runs thousands of times, reaching solutions that would be hard for a human to conceive. The core premise is that the problem must be automatically verifiable.

What did it achieve in matrix multiplication?

AlphaEvolve found an algorithm that computes the product of two 4×4 complex-valued matrices using 48 scalar multiplications. It is the first improvement on Strassen's 1969 algorithm, which had been the best known for this setting. The record had stood for 56 years, and because matrix multiplication underlies so much computation, from AI training to graphics, the improvement matters well beyond this one case.

What were its results in math and infrastructure?

Across more than 50 open problems in analysis, geometry, combinatorics, and number theory, it matched the best known results in about 75% of cases and improved on the best known solutions in about 20%. It also improved scheduling in Google's data centers to recover computing resources, and made a key kernel in Gemini's training 23% faster, cutting overall training time by 1%.

Is AlphaEvolve a case of recursive self-improvement (RSI)?

Because AlphaEvolve even optimized the training of the Gemini that underpins it, it comes close to a real-world case of recursive self-improvement (RSI), in which an AI improves the infrastructure of AI. However, its power comes from problems that can be automatically scored, so it does not transfer directly to domains where it is hard to verify a correct answer cleanly. Designing a verifiable evaluator is the true key.

AlphaEvolve Deep Dive: How an AI Broke a 56-Year-Old Math Record

Google DeepMind's AlphaEvolve is a concrete answer to the question of whether an AI can discover new algorithms on its own. Through an evolutionary loop in which Gemini proposes code and an automated evaluator verifies it, it overturned a matrix-multiplication record that had stood unbroken for 56 years since 1969. It also took on more than 50 open math problems and improved some of them, and it even optimized Google's own data centers and Gemini's training. This article lays out what AlphaEvolve is, how it works, what it achieved, why it differs fundamentally from existing AI coding tools, and its connection to recursive self-improvement (RSI) along with its limits.

What Is AlphaEvolve?

AlphaEvolve is a "Gemini-based evolutionary coding agent" released by Google DeepMind. To discover and optimize general-purpose algorithms, it combines the creative proposals of a large language model with verification by an automated evaluator inside an evolutionary framework.

The key point is that "humans don't tell it the answer." Given only the problem and the evaluation criteria, AlphaEvolve generates candidate solutions, scores them itself, and evolves toward better ones.

How It Works

The way it works is evolution itself. Gemini proposes code variations (candidate solutions), an automated evaluator actually runs that code and scores it, and the higher-scoring candidates survive to become the starting point for the next generation.

As this cycle of "propose → verify → select → repeat" runs thousands of times, it reaches solutions that would be hard for a human to think of. The core premise is that the problem must be "automatically verifiable," which is why it's especially strong in algorithmic and mathematical domains where answers can be scored.
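The cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not AlphaEvolve's implementation: `propose` is a random mutation standing in for Gemini, candidates are integer lists rather than programs, and the target, function names, and parameters are all invented for the example.

```python
import random

TARGET = [3, 1, 4, 1, 5]  # hypothetical "ideal" solution, known only to the evaluator


def evaluate(candidate):
    """Automated evaluator: higher is better, 0 means a perfect match."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))


def propose(parent, rng):
    """Stand-in for the LLM proposal step: nudge one position of a parent."""
    child = list(parent)
    i = rng.randrange(len(child))
    child[i] += rng.choice([-1, 1])
    return child


def evolve(generations=200, population_size=8, children_per_parent=4, seed=0):
    rng = random.Random(seed)
    population = [[0, 0, 0, 0, 0]]  # start from a trivial candidate
    for _ in range(generations):
        # Propose: every survivor spawns several variations.
        children = [propose(p, rng)
                    for p in population
                    for _ in range(children_per_parent)]
        # Verify and select: score parents and children, keep the best.
        pool = population + children
        pool.sort(key=evaluate, reverse=True)
        population = pool[:population_size]
    best = population[0]
    return best, evaluate(best)


best, score = evolve()
print(best, score)
```

Because the current best always survives into the next generation, a bad proposal costs nothing beyond the time spent scoring it. The loop never needs to know why a candidate is good; it only needs the evaluator's number.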

How It Differs From Existing AI Coding Tools

Grasping this point is what reveals AlphaEvolve's true character. Copilot-style coding assistants speed up the rate at which a human writes code. In essence, they help you type out more quickly an answer you already know or could find by searching. AlphaEvolve's goal, by contrast, is to find "a better solution nobody knows yet." That it improved on Strassen's method after 56 years captures the difference in a single example. This was not a matter of tidying up or polishing existing code; it was an actual update to the best known result.

A second difference lies in "who does the judging." A general LLM cannot be confident that its own answer is correct, and struggles to filter out plausible-looking wrong answers (hallucinations). AlphaEvolve routes around this weakness structurally. Gemini makes the proposals, but the final verdict of good or bad is handed to an automated evaluator that actually executes the code and produces a score. In other words, it is a division of labor: "creativity comes from the model, the verdict on truth comes from execution results." Thanks to this split, no matter how bold a candidate the model throws out, wrong ones are naturally weeded out by low scores. The key is that the cost of boldness has effectively dropped to near zero.
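A minimal sketch of that division of labor, assuming candidates arrive as Python source strings that define a function named `sort_list` (the function name, the test cases, and the three candidates are hypothetical, and `exec` is used here without any sandboxing, which a real system would need):

```python
def score_candidate(source, cases):
    """Execute candidate source and return the fraction of test cases it passes."""
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: no sandboxing
        fn = namespace["sort_list"]
    except Exception:
        return 0.0  # code that does not even load gets the lowest score
    passed = 0
    for inp, expected in cases:
        try:
            if fn(list(inp)) == expected:
                passed += 1
        except Exception:
            pass  # a crash on a case simply counts as a failure
    return passed / len(cases)


CASES = [([3, 1, 2], [1, 2, 3]), ([], []), ([2, 2, 1], [1, 2, 2])]

candidates = {
    "correct": "def sort_list(xs):\n    return sorted(xs)\n",
    # Looks reasonable, but silently drops duplicates.
    "plausible_but_wrong": "def sort_list(xs):\n    return sorted(set(xs))\n",
    # Raises on every input.
    "crashes": "def sort_list(xs):\n    return xs[0] + sorted(xs)\n",
}

scores = {name: score_candidate(src, CASES) for name, src in candidates.items()}
print(scores)
```

The plausible-looking candidate passes two of the three cases and is outscored by the correct one, without the model ever having to judge its own output.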

Breaking a 56-Year Matrix-Multiplication Record

Its most famous achievement is in matrix multiplication. AlphaEvolve found an algorithm that computes the product of two 4×4 complex-valued matrices using 48 scalar multiplications. It is the first improvement on Strassen's 1969 method, which had been the best known for this setting.

This is a record that had stood unbroken for fully 56 years. Matrix multiplication underlies nearly all computation, from AI training to graphics, so a single improvement can ripple out into wide-ranging efficiency gains.
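To put the figure of 48 in context, the sketch below counts scalar multiplications for the schoolbook method and for Strassen's 2×2 scheme applied recursively, and checks the seven-product scheme on a small complex example. The helper names are made up for illustration, and AlphaEvolve's own 48-multiplication algorithm is not reproduced here.

```python
def naive_mults(n):
    """Scalar multiplications in the schoolbook product of two n x n matrices."""
    return n ** 3


def strassen_mults(n):
    """Scalar multiplications when Strassen's 2x2 scheme is applied
    recursively to n x n matrices (n must be a power of two)."""
    if n == 1:
        return 1
    return 7 * strassen_mults(n // 2)


def strassen_2x2(a, b):
    """Strassen's seven-product scheme for 2x2 matrices (nested lists)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def naive_2x2(a, b):
    """Schoolbook 2x2 product using eight multiplications, for comparison."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


a = [[1 + 2j, 3], [0, 1j]]
b = [[2, 1j], [1 - 1j, 4]]
assert strassen_2x2(a, b) == naive_2x2(a, b)

print(naive_mults(4), strassen_mults(4))  # prints "64 49"
```

Applied one level deep to 2×2 blocks, Strassen's scheme needs 7 × 7 = 49 multiplications for a 4×4 product, against 64 for the schoolbook method. AlphaEvolve's result lowers that count by one.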

How to Read the Results

Before taking the numbers at face value, it helps to separate out the context. What makes "48" striking is not the size of the saving, which is a single multiplication down from the 49 that Strassen's method needs when applied recursively to 4×4 matrices. It is that automated search moved a bound that mathematicians had been unable to improve for more than half a century. In other words, the weight of this result lies not in the size of the efficiency gain but in the proof that "the methodology works."

At the same time, overreading should be resisted. This result is for a specific setting (4×4 complex-valued matrices), and does not mean all matrix multiplication instantly gets faster. Moreover, Strassen's method is more often cited in discussions of theoretical complexity, as an upper bound on the number of multiplications required, than used as-is in practice, so it is hard to claim that immediate execution speed changes dramatically. The real meaning is "showing that search can exceed the ceiling of human experts," and the practical ripple is more direct in the infrastructure results that follow.

Results in Math and Infrastructure

In mathematics, too, AlphaEvolve took on more than 50 open problems across analysis, geometry, combinatorics, and number theory. In about 75% it found solutions matching the best known results, and in about 20% it improved on the best known solution.

Its real-world infrastructure results are also significant. It improved scheduling in Google's data centers to recover computing resources, and it made a key matrix-operation kernel in Gemini's training 23% faster, cutting overall training time by 1%. At the scale of Gemini's training, even 1% represents a large amount of accelerator time.
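The two figures can be related with a back-of-the-envelope calculation. The sketch below assumes that only the kernel changed and that training time is simply the kernel's time plus everything else. Neither assumption comes from the source, and the implied share is an inference, not a reported number. Since "23% faster" can be read two ways, both readings are computed.

```python
def implied_kernel_share(overall_saving, kernel_time_reduction):
    """Fraction of total training time the kernel must occupy for a given
    reduction in kernel time to produce a given overall saving."""
    return overall_saving / kernel_time_reduction


overall_saving = 0.01  # 1% shorter training, as reported

# Reading A: the kernel's run time dropped by 23%.
reduction_a = 0.23
# Reading B: the kernel's throughput rose by 23%, so its time shrank by 1 - 1/1.23.
reduction_b = 1 - 1 / 1.23

share_a = implied_kernel_share(overall_saving, reduction_a)
share_b = implied_kernel_share(overall_saving, reduction_b)
print(f"{share_a:.1%} {share_b:.1%}")  # prints "4.3% 5.3%"
```

Under either reading, the kernel would account for roughly 4 to 5 percent of total training time, which is why a 23% improvement to one component shows up as about 1% overall.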

There is a contrast worth noticing here. In the math problems, the distribution of "75% matching, only 20% improving" shows that this method is not an all-conquering solver but a tool that "confirms already-good solutions and occasionally surpasses them." Conversely, the infrastructure numbers (23% on the kernel, 1% on training) look small, but multiplied by scale they translate directly into money and power. Put another way, the math results are "a proof of possibility," while the infrastructure results are "a profit already being recovered" — a difference in character.

The Connection to RSI — and the Limits

Notably, AlphaEvolve even optimized the training of the very Gemini that underpins it. That comes close to a real-world case of recursive self-improvement (RSI), in which an AI improves the infrastructure of AI.

That said, this "RSI-like character" should be read carefully. AlphaEvolve is not an autonomous loop that decides on its own to strengthen itself without limit. The self-improvement it shows today happens within the bounds of a verifiable goal a human set — "make the training kernel faster." That is, the direction and the scoring criteria of the self-improvement are still held by humans. It is different in kind from the "runaway beyond control" that RSI discourse often paints; it is closer to a controllable form of self-improvement, in which "if humans define the goal well, the AI evolves its own code toward that goal."

So it is no cure-all. AlphaEvolve's power comes from "problems that can be automatically scored," so it doesn't transfer directly to domains where it's hard to verify a correct answer cleanly. For problems like writing, strategy, or design, where a "good answer" is hard to score numerically, the evolutionary loop has no coordinates to steer by in the first place. In other words, designing a "verifiable evaluator" is the true key to this approach, and whether it expands from here ultimately depends on "how many problems can be reshaped into a scorable form."

Implications From a Korean Perspective

Seen from Korea, the message this case sends is clear. First, the center of gravity in competition may shift from "making the model bigger" to "designing the evaluator well." Most of AlphaEvolve's results came not from a new model but from reshaping problems into scorable form and attaching an evolutionary loop. This suggests that even for latecomers at a disadvantage in the race to train ultra-large models, there is room to break in through "narrow domains where automatic scoring is clear."

Second, industrial optimization problems with sharply numeric goals, such as semiconductor, logistics, and energy scheduling, are direct targets for this method. Google's recovery of computing resources through its own data-center scheduling is one example. In Korea too, the capacity to formalize optimization problems in manufacturing, telecom, and logistics as "verifiable evaluators" will be the crux of turning this trend into real cost savings.

References: Google DeepMind — AlphaEvolve · VentureBeat