The big claim turns out to be a little overstated. The claim:

> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

reduces to:

> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices31,32. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA33 grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.

So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.

boole1854 57 days ago [-]

This is not a correct summary of the results of the paper.

First, you cut out the initial part of the sentence about improving on Strassen's two-level algorithm. Here is the complete sentence:

> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

That is, note that the improvement cited in the latter part of the sentence is for the 4 x 4 matrix case.

But then your next quote is not in reference to 4 x 4 matrices. That is, for larger matrix sizes, the best algorithm AlphaTensor discovered had the same theoretical complexity as Strassen-square, but better performance.

EDIT: my next comments are confused and best ignored. See reply from pxx below.

For the 4 x 4 matrix case, the theoretical complexity of AlphaTensor's algorithm was O(N^2.778) compared with the Strassen algorithm's complexity of O(N^2.8074), which is an improvement.

pxx 57 days ago [-]

what, no, you have this all wrong. the complexity of multiplying two 4x4 matrices is clearly constant. the important thing about the constant, however, is that you can use this primitive on the block matrices recursively to possibly improve the asymptotic complexity.

the complexity improvement in this paper is for arithmetic in Z_2 (modular arithmetic over single bits). in standard arithmetic, there is no asymptotic complexity improvement but the developed algorithms involve more efficiently fusible[0] operations, which is useful for the target hardware.

[0] ed: this used to read 'with fewer multiplications,' which is clearly wrong: fewer multiplications would lower complexity. Interestingly, the discovered algorithm seems to have a larger number of additions, but still runs faster.

ColinWright 57 days ago [-]

I thought that savings on the number of multiplications leads to savings in complexity. For example, multiplying 2x2 matrices using 7 multiplications instead of 8 gives an improvement in complexity.

It's possible that my understanding is completely wrong, but your comment is at odds with my other reading, so it would be useful to get some clarification.

sdenton4 57 days ago [-]

This is correct, as I understand it.

Most complexity measures are defined in terms of asymptotic behavior, so any specific finite algorithm has 'constant' complexity. This is an obviously unhelpful bit of pedantry in the theory side.

In fact, it's as you say; we can compute a 'raw' complexity for a finite operation (eg, raw count of arithmetic ops), and then use that operation as a primitive to create an algorithm with an asymptotic complexity which depends directly on the finite operation's 'raw' complexity.

(complexity theory, and specifically for matrix multiplication, is a great example of metrics becoming targets becoming bad incentives, a la Goodhart's Law. The so-called best algorithms are 'galactic' and therefore of no use to anyone. There's some fuzzy-headed hope that one day the people chasing the big-O exponent metric will come up with something practically useful, buuuuuut... you actually gotta work practical problem to solve the practical problem, and that requires different metrics. This seems to be what's motivating the work under discussion here.)

pxx 57 days ago [-]

sorry; whoops, I don't know why I said multiplications there. I somehow thought the "more additions" somehow meant "fewer multiplications," but no, it's just "more additions but faster performance on target." edited.

your other comment notes that for some shapes, even in standard arithmetic, they have found algorithms with fewer multiplications than best-known. but those don't seem to extend to an algorithm with better asymptotic complexity for multiplying general matrices. otherwise I'd assume they'd claim them :)

boole1854 57 days ago [-]

You are correct. I will edit.

a-dub 57 days ago [-]

if you're in gf2 on a traditional cpu and care about wall clock time, i start to wonder how something like simd xors and parity would compare...

Nihmie 57 days ago [-]

To be pedantic about this complexity estimate for the 4 x 4 case, both of them reduce to a O(1). If N == 4, it's a constant, so we have O(4^2.778) == O(4^2.8074) == O(1).

Talking about scaling for a problem that has no scaling factor is a bit odd.

est31 57 days ago [-]

Yeah the paper uses O notation for complexity, but never when N is constant. For constant N and M, the paper uses the number of scalar multiplication operations performed as complexity measure. According to Figure 3, their algorithm has decreased the best known value from 49 to 47 for 4x4 matrices, and from 98 to 96 for 5x5 matrices (as well as some further state of the art improvements).

vlovich123 57 days ago [-]

If you have to process many such matrices, you’re not at O(1). I would imagine that’s what the O is referring to in this case.

cochne 57 days ago [-]

If you look at it that way then any algorithm is just o(n) where n is the number of matrices. O does not care about constant factors

This is for matrix multiplication where elements are themselves 4x4 matrices. So yes, indeed this is about multiplying many many 4x4 matrices where N is the size of the outer matrix.

throwawaymaths 57 days ago [-]

To be pedantic, paper is very clear that O is calculated after decomposing arbitrarily large matrices to 4x4s.

skybrian 57 days ago [-]

Isn’t multiplying two 4x4 matrices a finite problem? What does big O complexity mean in that case?

adgjlsfhk1 57 days ago [-]

any fixed size multiplication algorithm has a corresponding general algorithm where you recursively divide a matrix into blocks and treat each block as an element. for any nxn algorithm that takes k multiplications, this yields a general algorithm with runtime n^log_n(k)

Rioghasarig 57 days ago [-]

My understanding of this situation is that they found a way to multiply 4x4 matrices that requires fewer multiplications than Strassen's . This implies that that can generalize to a way to multiply n x n matrices with a better theoretical complexity than Strassen's algorithm.

dgacmu 57 days ago [-]

In Z_2 - modular matrix multiplication, not conventional matrix multiplication. And in this case, modular over what I assume is GF(2), i.e., single bits.

aaaaaaaaaaab 57 days ago [-]

Modular multiplication over GF(2)… also known as AND?

chabons 57 days ago [-]

Maybe I'm mis-reading the paper, but my interpretation was that the algorithm discovered using a reward which scales with HW performance matches the theoretical complexity of Strassen, but is more performant on the HW.

They also identified algorithms which do fewer matrix multiplications than Strassen, improving the lower bound of matrix multiplies required (They highlight this in Fig 3).

In that light, I thought their claim was fair. They've discovered (different) algorithms which are both theoretically and practically better than Strassen's.

57 days ago [-]

dekhn 57 days ago [-]

That's the DeepMind modus operandi- and why they publish in Nature. It's in their best interest to claim as much covered ground as possible so they can keep working towards their end-goal. But most people who read Nature already know that results in that journal are overhyped. You just have to apply a specific prior/translator when you read their papers.

noobermin 57 days ago [-]

This really is a depressingly true sentiment. Nature is both the most important publication but so often, the big shots publish overhyped stuff there that doesn't deserve it but get away with it since they are bigshots. Underscore here for another piece of evidence of the dysfunction in science.

brutusborn 57 days ago [-]

I don't doubt the sentiment is true, but hasn't this kind of science always been this way? By that I mean authors inflating the importance of their work; everyone wants to be seen as having the biggest breakthroughs.

When I think of dysfunction in 'science' I usually think of unfalsifiable hypothesis, the repeatability crisis in Psychology, p-hacking in Medicine, misuse of statistical methods in Economics and other epistemic issues, but I don't think of exaggerations like this.

dekhn 57 days ago [-]

Scientists have long been self-promoters who desire that their theories become the dominant ones and they use many techniques to achieve this.

However, the trend towards maximizing the predicted outcomes of your research really took off during the human genomics project.

YorkshireSeason 57 days ago [-]

I would imagine that it's the metrics universities and funding agencies apply in promotion decisions. For example, my (well-known) university decided to measure impact, and takes "Twitter engagement" as one proxy metric for impact -- against my explicit recommendations. I'll leave the consequences to everybody's imagination.

dekhn 57 days ago [-]

Yes. In fact many scientists post to twitter but don't look at the replies (Emily Bender is an example) or even block you if you disagree with them. That's not engagement and I wonder what the dean would do about a promotion where the scientist just sort of blathered on twitter and had lots of followers, but wasn't actually providing any real scientific value (again, Emily Bender is an example).

YorkshireSeason 57 days ago [-]

Deans typically lack detailed technical knowledge to evaluate candidates (not to mention time), they just go with the flow, and want to run the department smoothly.

I think the real mechanism is this: those who make the decision (those with the most power, i.e. those who bring in the most funding and can threaten to leave if they don't get their way)
already know whom they want and they cherry-pick data, eg "impact" figures, to bolster their case. That enables the dean then to justify the decision in public with those cherry-picked figures ... (the dean can hardly say
we are hiring X because otherwise top funding getter Y will leave)

I think Twitter is less important in STEM subjects than in social sciences or humanities, as STEM has more clearcut results.

adalacelove 57 days ago [-]

> Scientists have long been self-promoters

The famous ones more than the rest, I guess. To my mind nevertheless comes Cavendish, my hero.

brutusborn 57 days ago [-]

Any more info on the changes from the human genomics project?

atty 57 days ago [-]

On the other hand, real world performance is really the only useful metric for matrix multiplication. I don’t really care about the theoretical performance, or the number of operations. Not disagreeing with your take that the claim is grandiose, just pointing out that finding a generalizable way to automate the improvement of what is almost certainly the most important mathematical operation a computer can do is worth some attention. It also suggests other mathematical operations could be improved this way as well - potentially algorithms that haven’t gotten as much theoretical or practical attention as matrix multiplication.
With all that said, they even point out that other people have worked on this before using other optimization strategies. I’d guess they got into Nature either by using the Deepmind name, because Nature really loves reinforcement learning because it’s a cool topic that draws views, or because they’re the first group to find their algorithm leads to real improvements. (Probably a mixture of all three, I’d guess)

dekhn 57 days ago [-]

Science proceeds at the rate by which linearly larger matrices can be decomposed in less than exponential time.

matheusd 57 days ago [-]

This is excellent! You should make a T-Shirt with this quote on it!

throwawaymaths 57 days ago [-]

> the number of operations

This is a very good proxy for actual real world speed. It's pretty much "as good as it gets" for most straight computational tasks, though sometimes memory movement is your real bottleneck.

sdenton4 57 days ago [-]

This is questionable...

The 'best' algorithms for matrix multiplication are galactic algorithms that provide no actual benefit. Raw operation counts are a good proxy for speed, but the big-O complexity that people actually chase hasn't been especially helpful for this problem in the last twenty+ years.

Probably the most common matrix multiplication is (nx9) x (9xm) (9 = 3x3 cells from a convnet) . If you can optimize the shit out of those you might be in business for something interesting.

Though to be honest the real slow step in machine learning is training and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.

For non-ml/non-GF purposes, you might also worry about numerical stability of these matrix multiplications, which is not addressed in this paper.

a1369209993 57 days ago [-]

> and the slow step in training is the outer product of two matrices.... I don't believe there is an algorithmic way out of that one.

Well, there are several, but the obvious ones tend to require strange or unrealistic assumptions about the hardware. The most obvious such assumption, IMO, being that the hardware is arranged in 3D space in a manner roughly analogous to a human brain, which tends to be at odds with the common practice of mostly-planar photolithography, and with the preference to be able to change the network topology experimentally without building new hardware.

fulafel 57 days ago [-]

It's become a much worse proxy since the memory wall.

psychphysic 57 days ago [-]

Surely theoretical improvement begets real world, especially in the context of highly specialised hardware.

It was with theoretical performance improvement that motivated the creation of SIMD and led to real world speed ups.

fulafel 57 days ago [-]

It's interesting how both SIMD and big-o have had so different reasons for being of limited but still significant relevance.

SIMD, and vector processors like it was called in the 70s, delivered practical speedups in simple benchmarks right away but most applications dont't take advantage because of SW engineering reasons. Whereas big-O improvement ignores important components of performance per unit of time (memory access and constant factors) and is purely theoretical in an essential sense.

taeric 57 days ago [-]

Not necessarily. In particular, this could have slower growth, but still higher cost on normal workloads.

chrchang523 57 days ago [-]

Sometimes. In the case of matrix multiplication, there is a pretty large backlog of "galactic algorithms" going down to ~O(n^2.373) that haven't yet led to real-world improvements. Strassen's ~O(n^2.807) algorithm is only considered over the basic O(n^3) strategy for n>1000.

HarHarVeryFunny 57 days ago [-]

Winograd O(n^2.37) is a win for 3x3 convolutions in cuDNN, so it can be implemented efficiently.

chrchang523 57 days ago [-]

My understanding is that the Winograd minimal filtering algorithms used in cuDNN are different from the O(n^2.37) Coppersmith-Winograd-descended matrix multiplication algorithms. But I acknowledge that these can be considered cousins, produced by the same line of research.

HarHarVeryFunny 57 days ago [-]

I'm pretty sure there's no difference. It does seem to be pretty hard to turn the theoretical win into a practical one - the GPU kernel needs to be coded extremely efficiently to match the underlying hardware. AFAIK it's only a win for 3x3 - maybe for one other size too. Originally Winograd wasn't supported by cuDNN on NVidia's Tensor Cores (matmul-specific hardware on more recentish GPUs), vs CUDA cores, but a Google search seems to indicate it can be done - not sure if that's in cuDNN though.

throwawaymaths 57 days ago [-]

No, they count multiplications, 47 vs 49 in the 4x4 case. This assumes that multiplications are the heavy lift relative to additions, memory moves, etc. this is not a bad assumption, and is hardware independent, although there are some proposed 'exotic' hardware solutions where addition and multiplication are equally difficult (via lookup table, limited to about 8-bit x 8-bit) or where addition is harder than multiplication (log space number systems) but I don't think these have been deployed

lifthrasiir 57 days ago [-]

That assumption didn't age very well. In many if not most modern architectures multiplication still does take more cycles than addition but can have a higher throughput if scheduled well, and fused multiply-add can be as fast as a single multiplication, essentially giving a free addition.

throwawaymaths 57 days ago [-]

> but can have a higher throughput if scheduled well

Unlikely to be true for matrix multiplications, which have well-defined data dependencies.

> and fused multiply-add can be as fast as a single multiplication, essentially giving a free addition.

Yes, this supports the assumption that multiplication is the heavy lift.

tlb 57 days ago [-]

With sufficiently parallel hardware, you can do an entire mat4 * mat4 multiplication in 3 cycles. First do all 64 multiplications in parallel, then do 32 adds, then 16 adds to get the final answer.

For operations in GF(2) where they claim a result, a multiply is just an AND gate and an add is an XOR gate. So the fully parallel hardware version is 64 AND gates and 48 XOR gates, with a total gate delay of 3. This is a trivial amount of hardware and could easily be an instruction in some alternate universe where it was useful.

klyrs 57 days ago [-]

You missed the part where the "4x4 matrix multiplication" refers to multiplication of NxN matrices logically partitioned into sixteen blocks of size (N/4)x(N/4), done recursively.

letitgo12345 57 days ago [-]

I think they find a N^2.778 complexity algorithm (obtained by applying their 4x4 discovered algorithm recursively) so this is not correct.

Ar-Curunir 57 days ago [-]

You’re wrong. They discovered an algorithm with fewer multiplications for 4x4 matrices.

They also discovered better algorithms for other dimensions.

The paper is a great read -- an interesting approach, fun theoretical comparisons, plus benchmarks for optimized algorithms for specific hardware (CPU and GPU, allowing for differing instruction sets).

But it doesn't stop there; from the "Discussion" section:

> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.

mikewarot 57 days ago [-]

I have a gut feeling that there is a faster way to compute logarithms going from least to most significant bit. How would I go about using ML to find it?

[Edit] I think Feynman's algorithm might do it:

"Consider the problem of finding the logarithm of a
fractional number between 1 and 2. (The algorithm can be
generalized without too much difficulty.) Feynman observed
that any such number can be uniquely represented
as a product of numbers of the form 1 + 2^(-k), where k is an
integer. Testing for the presence of each of these factors in
a binary representation is simply a matter of a shift and a
subtraction. Once the factors are determined, the logarithm
can be computed by adding together the precomputed
logarithms of the factors. The algorithm fit the
Connection Machine especially well because the small
table of the logarithms of 1 + 2^(-k) could be shared by all
the processors. The entire computation took less time
than doing a division."

adgjlsfhk1 57 days ago [-]

note that this algorithm will no longer be good. for precisions up to roughly 1000 bits small tables combined with minimax polynomials are optimal and for higher precision, you want to use more complicated methods. if you're interested, the ARB library unlikely the currently fastest known methods.

I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).

Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.

57 days ago [-]

optimalsolver 57 days ago [-]

Now, were these algorithms discovered, or invented?

I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?

MichaelBurge 57 days ago [-]

The dovetail over all possible programs solves every problem with at most a constant slowdown.

So if it was invented, it was invented 100 years ago along with every other algorithm since the inventor of the dovetail incorporated it by reference. And there are no more algorithms to invent.

And if it was discovered, you would want to compare the efficiency of your discovery process with the dovetail.

So I tend to say "discovered with X bits of optimization power", where 0 bits reduces to the dovetail over some enumeration process, infinity bits reduces to "invention"(i.e. you consider one(or zero) object from the stream, constructing it directly), and everything in-between grades the search process.

WFHRenaissance 57 days ago [-]

Discovered. I'd go as far to say that all "invention" is actually just discovery that we find useful.

TheRealNGenius 56 days ago [-]

lawrenceyan 57 days ago [-]

> Leveraging this diversity, we adapted AlphaTensor to specifically find algorithms that are fast on a given hardware, such as Nvidia V100 GPU, and Google TPU v2. These algorithms multiply large matrices 10-20% faster than the commonly used algorithms on the same hardware, which showcases AlphaTensor’s flexibility in optimising arbitrary objectives.

10-20% performance improvement in matrix multiplications is pretty amazing[0]!

That doesn't say exactly what was run and how. From measurements with nvblas, that difference could be dwarfed by the effect of tile size and pinning buffers or not.

a-dub 57 days ago [-]

this is cool. i suppose it's only a matter of time until we see optimizing compilers that use transformer-rl style searches for subsets of their optimization and codegen.

mhh__ 57 days ago [-]

Generating pgo data with graph ML is already a thing

a-dub 57 days ago [-]

i'm guessing most of that stuff is register allocation and cache management.

rl-searches could potentially rewrite entire algorithms.

edit: (more musing) maybe the future of software engineering will be authoring bullet proof acceptance tests.

Psyladine 57 days ago [-]

Nevermind compilers, the possibility of updating architectures & code bases by automation. A true feasible attack on tech debt and legacy code.

danuker 57 days ago [-]

Will also generate some fun bugs!

iKlsR 57 days ago [-]

Whenever I see stuff about automated algorithms it reminds me of a movie, can't remember the name or plot exactly but they create a being or upload someones consciousness and the "ai" is solving all the problems the world faces until it reaches the last one, humans. It then begins to synthesize some airborne virus or nukes that would scour us from the earth. We basically created tech that would eventually optimize us out of existence as a cancer to the earth.

Edit, found it Transcendence (2014), also it seems my brain injected some random thoughts into that plot, it's not about that at all but the point stands!

haolez 57 days ago [-]

Transendence is one of the few movies that present an almighty AI that's benevolent in nature.

scotty79 56 days ago [-]

You might watch "I am mother". Greater good was always an interesting concept.

ASalazarMX 57 days ago [-]

Must have been an embarrassing incident for the engineer that commented out the "Don't kill humans" constraint for a quick debug session.

jeffbee 57 days ago [-]

Superoptimizers are already extremely good at discovering latent flaws in high level source code, and in compilers.

mirekrusin 57 days ago [-]

Unless it provides or simply works within formal verification?

f1shy 57 days ago [-]

Formal verification does not prove lack of bugs. In best case, can only catch one certain type of bugs.

Formal verification is very good at proving that compiler transformations preserve semantics. Programming language semantics are pretty well specified (at least for some programming languages...), so the chance of bugs in the specification are low.

wokkel 57 days ago [-]

It can catch all kinds of bugs, but you have to ask the correct questions. So it comes down to define what a bug is and the assumptions you use for testing. And therein lies the problem: what constitutes a bug? A function that add two numbers but bever terminates might be considered bugfree if forget to include as a bug that not giving an answer before the end of the universe is faulty behaviour. We humans are terrible at writing these kind of specifications, so formal verification as a method just pushes the correctness from code to specification.

mirekrusin 56 days ago [-]

If you navigate within algebraic structures with well known properties (which are also verified for example), formal verification is all you need, you can be certain of being bug free.

hdjjhhvvhga 57 days ago [-]

Relax - we'll soon have self-learning debuggers, too.

ColinWright 57 days ago [-]

Quoting:

> ... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).

> Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.

HarHarVeryFunny 57 days ago [-]

Can someone explain the approach here?

I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game ? Why not just use, for example, an evolution algorithm to find incrementally better decompositions ?

gnufx 56 days ago [-]

There's commentary here which seems to be talking about matrix multiplication in general and ignoring real implementations, which are dominated by considerations of the memory hierarchy. They actually introduce operations, like packing for sufficiently large dimensions but not for smaller, and prefetch. Some of that seems only to be tunable empirically for the micro-architecture too. (I don't know how the considerations vary between CPU and GPU.) Fairly recent work on Strassen implementation, talks about that and the normal Goto-type algorithm: https://jianyuhuang.com/papers/sc16.pdf

zekrioca 56 days ago [-]

Always caution someone who wants to follow in this research:

When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) much necessarily use n^3 scalar multiplications.
The authors will therefore waste their time if they carry out their plans and try to optimize for stability.
The standard algorithm has the nice property and no faster algorithm can have this property.
The question of fast matrix multiplication was raised recently on mathoverflow.net, see
https://mathoverflow.net/q/421304/110176
and the answers given there.

WhitneyLand 57 days ago [-]

In general to speed up matrix operations how much is some sort of result caching applied?

For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?

ubik80 53 days ago [-]

Machine learning for finding matrix multiplication algorithms is not so new. The problem was solved before with other methods. For example:

Can someone knowledgeable tell me if it discovered an algorithm we can understand and re-implement ourself, like is the pseudo-code for it known? Or is it kind of stuck in the infered function of the ML model?

bugzz 57 days ago [-]

yes, they implemented the resulting algorithms and have charts showing real world performance on GPUs and TPUs

didibus 57 days ago [-]

That's really cool!

puffoflogic 57 days ago [-]

Why the heck is a math paper in Nature? I'd put Nature somewhere just below vixra in terms of math credibility.

323 57 days ago [-]

If you could find a way to train cats to solve integrals, wouldn't that be a Nature paper and not a math one?

puffoflogic 57 days ago [-]

Yes, but if those cats discovered a faster way to solve integrals than you initially taught them, their findings would have higher credibility if published in an actual mathematics journal.

Otherwise, you end up with things like the "rediscovery" [0] of the trapezoid rule. That wasn't actually published in Nature, but with the way Nature has gone downhill even in its own sphere, it could probably happen today. There's no reason to believe that someone publishing in a bio/chem journal is remotely capable of making math/CS judgements like "faster matrix multiplication"; nor that the referees (does Nature even bother with those any more? O.o) will recognize any issues.

Language is the real AI and its been improving itself since the first word was used between two people.

wbhart 56 days ago [-]

The new 4x4 matrix multiplication over F_2 has practical applications as many matrix operations over F_2 can be reduced to matrix multiplication.

For anyone looking for the algorithm itself, it is actually given in in one of the extended data sections at the very end of the paper.

KKKKkkkk1 57 days ago [-]

Reinforcement learning is usually introduced as a technique to train an agent in simulation that would then be let loose in a real environment. Interesting that they choose to treat it is a search strategy.

nestorD 57 days ago [-]

To be fair, their work is based on alpha-zero which is itself deeplearning on top of monte-carlo tree search, a search strategy.
Having played with those algorithms, I suspect that one could obtain similar results with a solid Monte-Carlo tree search implementation (a lot of them use ucb which is really suboptimal)

57 days ago [-]

bjourne 57 days ago [-]

Never heard of the "matrix multiplication tensor" before. Is there some digestible blog post or article somewhere explaining what it is and why it is useful?

machinekob 57 days ago [-]

Wasn't this based on some other paper? I have read something like that a year+ (?) ago also about deep learning based matrix multiplication boost.

fatneckbeardz 57 days ago [-]

maybe they can use this thing to figure out universal basic income before we all become homeless

ckrapu 57 days ago [-]

50% fewer operations for skewsymmetric matrix vector product is pretty big, IMO.

MILP 55 days ago [-]

Will this be efficient for sparse matrices ?

loufe 57 days ago [-]

This is completely besides the matter, but reading "provably" in the abstract is a frank reminder of how terrible English spelling/pronounciation is. I can't imagine I'm the only well-read native English speaker who read this as "prov-ably" on first take. I don't know about most languages, but you just don't get nonsense like this in French, at least.

evmar 57 days ago [-]

English is terrible for sure, but this comment is pretty funny to compare it to the language that spells /kɛs kə sɛ/ as "qu'est-ce que c'est".

julienchastang 57 days ago [-]

Or quatre-vingt-dix-huit -> four-twenty-eighteen, for the number 98.

elcomet 57 days ago [-]

I don't understand your point, provably means which can be proven, is there another meaning I'm missing ?

613style 57 days ago [-]

Think "probably" vs "provably" (it's about pronunciation/spelling)

ogogmad 57 days ago [-]

He's talking about the spelling not matching the pronunciation

Edit: Sorry, missed your point on first reading, which perhaps reinforces your point. I read papers like this for a living so the language is unambiguous but I see how it might not be for others, even if English is their first language.

> AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

reduces to:

> AlphaTensor discovers algorithms that outperform the Strassen-square algorithm, which is a fast algorithm for large square matrices31,32. Although the discovered algorithm has the same theoretical complexity as Strassen-square, it outperforms it in practice, as it is optimized for the considered hardware. Interestingly, AlphaTensor finds algorithms with a larger number of additions compared with Strassen-square (or equivalently, denser decompositions), but the discovered algorithms generate individual operations that can be efficiently fused by the specific XLA33 grouping procedure and thus are more tailored towards the compiler stack we use. The algorithms found by AlphaTensor also provide gains on matrix sizes larger than what they were optimized for. Finally, Fig. 5c shows the importance of tailoring to particular hardware, as algorithms optimized for one hardware do not perform as well on other hardware.

So they improved on Strassen’s 50-year-old algorithm by optimizing it for hardware that has only existed for a few years and de-optimizing it for human use? There is so much cool stuff here without claiming to do something you really didn’t.

First, you cut out the initial part of the sentence about improving on Strassen's two-level algorithm. Here is the complete sentence:

> Particularly relevant is the case of 4 × 4 matrices in a finite field, where AlphaTensor’s algorithm improves on Strassen’s two-level algorithm for the first time, to our knowledge, since its discovery 50 years ago.

That is, note that the improvement cited in the latter part of the sentence is for the

4 x 4 matrix case.But then your next quote is not in reference to 4 x 4 matrices. That is, for

larger matrix sizes, the best algorithm AlphaTensor discovered had the same theoretical complexity as Strassen-square, but better performance.EDIT: my next comments are confused and best ignored. See reply from pxx below.For the 4 x 4 matrix case, the theoretical complexity of AlphaTensor's algorithm was O(N^2.778) compared with the Strassen algorithm's complexity of O(N^2.8074), which

isan improvement.the complexity improvement in this paper is for arithmetic in Z_2 (modular arithmetic over single bits). in standard arithmetic, there is no asymptotic complexity improvement but the developed algorithms involve more efficiently fusible[0] operations, which is useful for the target hardware.

[0] ed: this used to read 'with fewer multiplications,' which is clearly wrong: fewer multiplications would lower complexity. Interestingly, the discovered algorithm seems to have a larger number of additions, but still runs faster.

See my comment here: https://news.ycombinator.com/item?id=33098192

It's possible that my understanding is completely wrong, but your comment is at odds with my other reading, so it would be useful to get some clarification.

Most complexity measures are defined in terms of asymptotic behavior, so any specific finite algorithm has 'constant' complexity. This is an obviously unhelpful bit of pedantry in the theory side.

In fact, it's as you say; we can compute a 'raw' complexity for a finite operation (eg, raw count of arithmetic ops), and then use that operation as a primitive to create an algorithm with an asymptotic complexity which depends directly on the finite operation's 'raw' complexity.

(complexity theory, and specifically for matrix multiplication, is a great example of metrics becoming targets becoming bad incentives, a la Goodhart's Law. The so-called best algorithms are 'galactic' and therefore of no use to anyone. There's some fuzzy-headed hope that one day the people chasing the big-O exponent metric will come up with something practically useful, buuuuuut... you actually gotta work practical problem to solve the practical problem, and that requires different metrics. This seems to be what's motivating the work under discussion here.)

your other comment notes that for some shapes, even in standard arithmetic, they have found algorithms with fewer multiplications than best-known. but those don't seem to extend to an algorithm with better asymptotic complexity for multiplying general matrices. otherwise I'd assume they'd claim them :)

Talking about

scalingfor a problem that has no scaling factor is a bit odd.This is for matrix multiplication where elements are themselves 4x4 matrices. So yes, indeed this is about multiplying many many 4x4 matrices where N is the size of the outer matrix.

They

alsoidentified algorithms which do fewer matrix multiplications than Strassen, improving the lower bound of matrix multiplies required (They highlight this in Fig 3).In that light, I thought their claim was fair. They've discovered (different) algorithms which are both theoretically and practically better than Strassen's.

When I think of dysfunction in 'science' I usually think of unfalsifiable hypothesis, the repeatability crisis in Psychology, p-hacking in Medicine, misuse of statistical methods in Economics and other epistemic issues, but I don't think of exaggerations like this.

However, the trend towards maximizing the predicted outcomes of your research really took off during the human genomics project.

I think the real mechanism is this: those who make the decision (those with the most power, i.e. those who bring in the most funding and can threaten to leave if they don't get their way) already know whom they want and they cherry-pick data, eg "impact" figures, to bolster their case. That enables the dean then to justify the decision in public with those cherry-picked figures ... (the dean can hardly say we are hiring X because otherwise top funding getter Y will leave)

I think Twitter is less important in STEM subjects than in social sciences or humanities, as STEM has more clearcut results.

The famous ones more than the rest, I guess. To my mind nevertheless comes Cavendish, my hero.

This is a very good proxy for actual real world speed. It's pretty much "as good as it gets" for most straight computational tasks, though sometimes memory movement is your real bottleneck.

The 'best' algorithms for matrix multiplication are galactic algorithms that provide no actual benefit. Raw operation counts are a good proxy for speed, but the big-O complexity that people actually chase hasn't been especially helpful for this problem in the last twenty+ years.

https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...

Though to be honest the real slow step in machine learning is training and the slow step in training is the

outer productof two matrices.... I don't believe there is an algorithmic way out of that one.For non-ml/non-GF purposes, you might also worry about numerical stability of these matrix multiplications, which is not addressed in this paper.

outer productof two matrices.... I don't believe there is an algorithmic way out of that one.Well, there are several, but the obvious ones tend to require strange or unrealistic assumptions about the hardware. The most obvious such assumption, IMO, being that the hardware is arranged in 3D space in a manner roughly analogous to a human brain, which tends to be at odds with the common practice of mostly-planar photolithography, and with the preference to be able to change the network topology experimentally without building new hardware.

It was with theoretical performance improvement that motivated the creation of SIMD and led to real world speed ups.

SIMD, and vector processors like it was called in the 70s, delivered practical speedups in simple benchmarks right away but most applications dont't take advantage because of SW engineering reasons. Whereas big-O improvement ignores important components of performance per unit of time (memory access and constant factors) and is purely theoretical in an essential sense.

Unlikely to be true for matrix multiplications, which have well-defined data dependencies.

> and fused multiply-add can be as fast as a single multiplication, essentially giving a free addition.

Yes, this supports the assumption that multiplication is the heavy lift.

For operations in GF(2) where they claim a result, a multiply is just an AND gate and an add is an XOR gate. So the fully parallel hardware version is 64 AND gates and 48 XOR gates, with a total gate delay of 3. This is a trivial amount of hardware and could easily be an instruction in some alternate universe where it was useful.

They also discovered better algorithms for other dimensions.

https://news.ycombinator.com/item?id=33102761

But it doesn't stop there; from the "Discussion" section:

> One important strength of AlphaTensor is its flexibility to support complex stochastic and non-differentiable rewards (from the tensor rank to practical efficiency on specific hardware), in addition to finding algorithms for custom operations in a wide variety of spaces (such as finite fields). We believe this will spur applications of AlphaTensor towards designing algorithms that optimize metrics that we did not consider here, such as numerical stability or energy usage.

[Edit] I think Feynman's algorithm might do it:

"Consider the problem of finding the logarithm of a fractional number between 1 and 2. (The algorithm can be generalized without too much difficulty.) Feynman observed that any such number can be uniquely represented as a product of numbers of the form 1 + 2^(-k), where k is an integer. Testing for the presence of each of these factors in a binary representation is simply a matter of a shift and a subtraction. Once the factors are determined, the logarithm can be computed by adding together the precomputed logarithms of the factors. The algorithm fit the Connection Machine especially well because the small table of the logarithms of 1 + 2^(-k) could be shared by all the processors. The entire computation took less time than doing a division."

1. How big is the search space?

2. What analysis approaches are likely to bear fruit for the search space? (theoretical analysis? optimization?)

3. If optimization is called for, what kind?

https://twitter.com/cHHillee/status/1577713102434361344

I'm quite suspicious about their hardware benchmarks. They're not writing custom kernels, they're relying on a graph compiler like XLA to automatically fuse their decomposed matmuls (and my guess is that XLA will not be very good at this).

Moreover, as far as I can tell, they don't report absolute performance numbers anywhere. In other words, I suspect that a naive N^3 matrix multiplication would absolutely smoke them in performance.

I.e., have they always been there, just sitting in Platonic space waiting for a conscious mind to stumble across them, or have they just now popped into existence?

So if it was invented, it was invented 100 years ago along with every other algorithm since the inventor of the dovetail incorporated it by reference. And there are no more algorithms to invent.

And if it was discovered, you would want to compare the efficiency of your discovery process with the dovetail.

So I tend to say "discovered with X bits of optimization power", where 0 bits reduces to the dovetail over some enumeration process, infinity bits reduces to "invention"(i.e. you consider one(or zero) object from the stream, constructing it directly), and everything in-between grades the search process.

10-20% performance improvement in matrix multiplications is pretty amazing[0]!

[0]: https://www.nature.com/articles/s41586-022-05172-4/figures/5

rl-searches could potentially rewrite entire algorithms.

edit: (more musing) maybe the future of software engineering will be authoring bullet proof acceptance tests.

Edit, found it Transcendence (2014), also it seems my brain injected some random thoughts into that plot, it's not about that at all but the point stands!

https://smartech.gatech.edu/handle/1853/62855

>

... AlphaTensor finds an algorithm for multiplying 4×4 matrices using 47 multiplications in Z_2 , thereby outperforming Strassen’s two-level algorithm, which involves 7^2 = 49 multiplications. By applying this algorithm recursively, one obtains a practical matrix multiplication algorithm in Z_2 with complexity O(N^2.778).>

Moreover, AlphaTensor discovers efficient algorithms for multiplying matrices in standard arithmetic; for example, AlphaTensor finds a rank-76 decomposition of T_{4,5,5}, improving over the previous state-of-the-art complexity of 80 multiplications.I understand that they transform the problem into a gamified 3-D matrix decomposition, but what exactly is the motivation for using RL to beat this decomposition game ? Why not just use, for example, an evolution algorithm to find incrementally better decompositions ?

When somebody promotes a fast algorithm there is often a catch. In this case the issue is numerical stability. There is a theorem that states that any algorithm for n-by-n matrix-matrix multiplication that is componentwise forward stable (as good as it gets in this situation) much necessarily use n^3 scalar multiplications. The authors will therefore waste their time if they carry out their plans and try to optimize for stability. The standard algorithm has the nice property and no faster algorithm can have this property. The question of fast matrix multiplication was raised recently on mathoverflow.net, see https://mathoverflow.net/q/421304/110176 and the answers given there.

For example if you profile calculations done when training a model, is there any significant repetition happening that would allow some kind of benefit from table lookups of certain solutions?

https://www.researchgate.net/publication/341942316_Searching...

Otherwise, you end up with things like the "rediscovery" [0] of the trapezoid rule. That wasn't actually published in

Nature, but with the wayNaturehas gone downhill even in its own sphere, it could probably happen today. There's no reason to believe that someone publishing in a bio/chem journal is remotely capable of making math/CS judgements like "faster matrix multiplication"; nor that the referees (doesNatureeven bother with those any more? O.o) will recognize any issues.[0] https://academia.stackexchange.com/questions/9602/rediscover... https://news.ycombinator.com/item?id=26384357 etc.

For anyone looking for the algorithm itself, it is actually given in in one of the extended data sections at the very end of the paper.

Edit: Sorry, missed your point on first reading, which perhaps reinforces your point. I read papers like this for a living so the language is unambiguous but I see how it might not be for others, even if English is their first language.