BTW you may have noticed that the paper multiplication method we all learned in third grade is digit-by-digit convolution, which is an O(n^2) operation. But what does convolution in the time domain become when we switch to the frequency domain? It becomes pointwise multiplication, which is O(n). So now you see why the FFT is essential here. The extra factors of log n and log(log n) come from taking the FFT itself in both the forward and reverse directions.
So fast multiplication is about looking at numbers as signals and doing signal processing on them.
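That equivalence is easy to check numerically. A toy sketch (numpy's floating-point FFT is used purely for illustration; real bignum libraries use exact integer-friendly transforms to avoid rounding):

```python
import numpy as np

# Digit sequences of 123 and 456, most significant digit first.
a = [1, 2, 3]
b = [4, 5, 6]

# Direct digit-by-digit convolution: O(n^2) pairwise products.
direct = np.convolve(a, b)

# Same thing via the frequency domain: pointwise multiplication is O(n),
# plus O(n log n) for the FFTs in the forward and reverse directions.
n = len(a) + len(b) - 1
via_fft = np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(b, n)).real.round().astype(int)
assert (direct == via_fft).all()

# The convolution leaves digits larger than 9; propagate carries
# (least significant digit first) to read off the product.
carry, out = 0, []
for d in direct[::-1]:
    carry, r = divmod(int(d) + carry, 10)
    out.append(r)
while carry:
    carry, r = divmod(carry, 10)
    out.append(r)
print(int("".join(map(str, out[::-1]))))  # 56088, i.e. 123 * 456
```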
Much used in simplifying kernel operations and convolutions (and some other nifty tricks).
Another useful idea is that in the domain of the Fourier transform we have exponentials (Fourier series are series of $c_n e^{inx}$), and when multiplying exponentials we get $e^{ix} \cdot e^{iy} = e^{i(x + y)}$.
Moreover, this is usually coupled with integrating over a periodic signal (so it's integrated from 0 to 2π), and unless the frequencies cancel so that the integrand is $e^{i(x+y)} = e^0 = 1$, the integral comes out to 0 as well.
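A quick numerical sanity check of that cancellation, using nothing beyond the standard library (the step count for the Riemann sum is an arbitrary choice):

```python
import cmath

def integral(m, n, steps=10_000):
    """Numerically integrate e^{i(m+n)x} over [0, 2*pi]."""
    total = 0j
    dx = 2 * cmath.pi / steps
    for k in range(steps):
        total += cmath.exp(1j * (m + n) * k * dx) * dx
    return total

# Cancelling case: m + n = 0, so the integrand is e^0 = 1 and the
# integral over the full period is 2*pi.
print(abs(integral(3, -3)))  # ~6.283

# Any other pair integrates to (numerically) zero over the full period.
print(abs(integral(3, 2)))   # ~0
```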
Thank you for the nice explanation. It clears up an important point that the article was not quite able to. Even though it feels unintuitive to apply a Fourier transform to numbers for multiplication purposes, it works nicely. Fascinating.
> Regardless of what computers look like in the future, Harvey and van der Hoeven’s algorithm will still be the most efficient way to multiply.
This is only true if you just look at the big-O notation. However, if you also look at the constants involved, to get C * n * log(n) + C2 as an upper bound on the number of operations, there can be faster algorithms with lower constants C and C2. And those algorithms will also be of practical interest.
Also, as I understand it, it was not actually proven (yet) that O(n log(n)) is the fastest possible class.
You could also study the average runtime instead of the worst-case upper bound. There are cases where the multiplication can be much more efficient than O(n log(n)). An algorithm could be clever, detect many such cases, and maybe end up in a strictly better average runtime class. Or maybe it has been proven that this cannot happen; I'm not sure.
Similarly, sorting algorithms are still being optimized more and more, even though the worst-case comparison-sort complexity is fixed at Θ(n log n) (ignore integer sorting for now). But average and best-case runtime complexity are relevant. And memory usage complexity as well. And other properties. And of course also those constants, which matter in practice.
A big C2 basically corresponds to the galactic algorithms, mentioned in another thread here.
>This means it will not reach its stated efficiency until the numbers have at least 2^1729^12 bits (at least 10^10^38 digits), which is vastly larger than the number of atoms in the known universe.
"Vastly larger" is one of the biggest understatements I've ever heard. It's roughly 10^10^38 times larger.
I can't think of a better way to write that line. I just found it funny.
> This means it will not reach its stated efficiency until the numbers have at least 2^1729^12 bits (at least 10^10^38 digits), which is larger than the number of atoms in the known universe.
This version might be better? If you like understatement, that is.
"For millennia it was widely assumed that there was no faster way to multiply. Then in 1960, the 23-year-old Russian mathematician Anatoly Karatsuba... found it."
I have a bizarre mental tic wherein I will often read a story like this and then fantasize about going back in time to just before the discovery, and present it myself, in essence stealing it. I'll get quite far along in my daydream before the absurdity collapses the narrative.
My fantasies are a bit more charitable: Taking long dead people for a tour of the present, to see the consequences of their discoveries.
As a kid I used it as a way of analyzing the ramifications of singular ideas, or understanding the complex phenomena that interplay in technology. E.g., I would break down how a computer works (from the plug to the screen) to Benjamin Franklin, who would appreciate why you'd want to arrange silicon to do the thinking for you.
In adulthood they've quietly settled into the occasional "Man, I wish Kepler could play Kerbal Space Program".
That's really funny. I love the weird fantasies we cook up.
I have a variation of this, where I travel back in time to before some long dead person did what made them famous and tell them all about their discoveries and all the wonders that came from it.
It's usually accompanied with a hint that they won't remember our conversation, so it won't affect history, but they'll enjoy marveling at it for a short while.
I do the same, but for things that aren't yet discovered. I try to imagine what the solution might look like, how it would be presented, what the implications of its creation will be, etc.
I really love, for example, the film 'Travelling Salesman' [1]. It explores a possible narrative for discovering the solution to P = NP.
One I like to think about is general human-like AI. How do I test it works correctly? How long did it take to create? What algorithms does it make use of? What are the implications in my social circles? What is the wider impact? What happens if you refuse to release the code?
One film I like that explores some ideas is 'Ex Machina' [2]. In my opinion the AI created in the film is intelligent, but lacks empathy - something I think we should seriously consider in our attempts to create AI.
Probably worth noting in this context, I recently started reading a book called "How to Invent Everything: A Survival Guide for the Stranded Time Traveler" which, basically, says: "If you happen to be in this timeframe, invent this trivial thing. It was invented that many years later by this person, but of course you can keep all the fame to yourself".
I lack the imagination to go back with anything more than some information that will make me rich: lottery numbers, usually, and only back to the previous Saturday.
For anyone else just looking for an outline of the new algorithm: last two paragraphs of p4, first two paragraphs of p5.
Coming from a position of a few abstract algebra classes in college many years ago, all the words and notation are familiar but I am a long way from being able to follow it.
I think I'm misunderstanding something about the article. In what way is this method perfect, as is mentioned multiple times? Is it not just an incremental improvement over standard practice?
In computational complexity, we have some proofs that a certain cost is the absolute bare minimum that an optimal solution must expend. We have the same for some mechanical systems too.
If you get the order down to the proven limit, especially for the first time, people sometimes use the word 'perfect', even if there are potentially other solutions with lower constant overhead, or lower practical overhead (see also Timsort, which uses insertion sort for leaf operations).
> Second, in that same paper Schönhage and Strassen conjectured that there should be an even faster algorithm than the one they found — a method that needs only n × log n single-digit operations — and that such an algorithm would be the fastest possible.
Is this saying that there is a proof in their paper that n log n is the cheapest? If so, it certainly beats around the bush.
A conjecture is a proposition that is suspected to be true. So there would be no proof in the paper, but likely some arguments why the authors expect it to be true.
"Perfect" in this case is a qualitative exaggeration bordering on hyperbole.
> At the end of February, a team of computer scientists at Aarhus University posted a paper arguing that if another unproven conjecture is also true, this is indeed the fastest way multiplication can be done.
You don't know what nobody thinks, because there may be a person who thinks just that. This is an active area of research. Anything quicker than O(n log n) but slower than O(n) is still possible.
It's perfect in the sense that its running time is O(n log n), and there is evidence that this is the fastest running time possible (at least up to a constant factor).
I would say most mathematical conjectures are stronger than most scientific theories. What I mean is that the conjectures are easily falsifiable (i.e., counterexamples, if found, can easily be verified) and have been shown to be true for an enormous number of cases. So unproven mathematical conjectures are not just "wild hypotheses."
According to the article, it rests on a plausible conjecture. I need to read further.
[edit]
According to the abstract of the paper referred to in the article (https://arxiv.org/abs/1902.10935) it rests on a "central conjecture in the area of network coding". Make of that what you will.
Ah but that's not what this game is about. It's about what you can prove. And conjectures are by definition things you can't prove yet. So the longer they stand and the more paths they block the more frustrating it gets.
>In this work, we prove that if a central conjecture in the area of network coding is true, then any constant degree boolean circuit for multiplication must have size Ω(nlgn), thus almost completely settling the complexity of multiplication circuits
So, given an unproven assumption and also given that we are constrained to doing multiplication on a boolean circuit, this may be "perfect" in a sense.
It achieves a theoretical lower bound on the complexity. This is what the article says in like 1000 words because it's written for mathematical illiterates.
> Harvey and van der Hoeven’s algorithm proves that multiplication can be done in n × log n steps. However, it doesn’t prove that there’s no faster way to do it. Establishing that this is the best possible approach is much more difficult. At the end of February, a team of computer scientists at Aarhus University posted a paper arguing that if another unproven conjecture is also true, this is indeed the fastest way multiplication can be done.
Gotta admit, I started chuckling at “We use [the fast Fourier transform] in a much more violent way..." I've known some math geeks in my day, and this sounded like them.
> Two decades ago, computers performed addition much faster than multiplication. The speed gap between multiplication and addition has narrowed considerably over the past 20 years to the point where multiplication can be even faster than addition in some chip architectures. With some hardware, “you could actually do addition faster by telling the computer to do a multiplication problem, which is just insane,” Harvey said.
Is this true for popular architectures such as Intel and ARM?
An Intel Haswell chip, when doing floating point vector operations, can dispatch one addition per cycle, or two fused-multiply-add. The latency is worse for the FMA (5 cycles versus 3 cycles), but if you've got enough parallel accumulations to hide that latency, then using FMAs for addition will be faster than addition.
This is a consequence of Intel optimising for FLOPs (or - equivalently - for matrix multiplication), as an FMA counts as two FLOPs, versus addition counting as just one.
Well, the algorithm for floating-point addition is very complicated, so regardless of whether they were gaming the FLOPs, you would expect floating-point multiplication to be at least as fast as floating-point addition on most architectures.
> the algorithm for floating point addition is very complicated
Ah yes, of course. I think it is slightly misleading that the article is mostly about large integer multiplication, while that remark holds for floating-point operations.
Is that just because multiplication is so commonly needed that there is double hardware allocated for FMA?
Like, if I had a special purpose adding machine and a special purpose multiply machine, and I bought a second multiply machine, then I could multiply large numbers faster than adding, but that's not insane, that's resource allocation.
0. Extract mantissas and exponents from the inputs [basically free in a HW implementation].
1. Compare the two exponents and shift the mantissas so that they share the same scale.
2. Add the two mantissas together.
3. Normalize the result back into a floating-point number.
Now look at the process for a floating-point multiply operation:
0. Extract mantissas and exponents from the inputs [basically free in a HW implementation].
1. Add the two exponents together.
2. Multiply the two mantissas together.
3. Normalize the result back into a floating-point number.
Steps 0 and 3 are identical in both cases. I'm not really knowledgeable about hardware multiplier implementations, but doing an additional "add" operation can be pretty close to free in some implementations. In any case, adding in an extra adder to the process is going to be quite cheap, even if not free.
What this means is that there is not much extra hardware to turn a FMUL ALU op into an FMA ALU op... and if you make just a single FMA ALU op, you can use that hardware to implement FADD, FMUL, and FMA with nothing more than microcode.
In other words, instead of thinking of FMA as something worth pouring more resources into, think of it as FADD as not worth enough to be made its own independent unit.
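The two pipelines sketched above can be mimicked in software. Here is a toy sketch using Python's math.frexp/math.ldexp to stand in for the hardware's mantissa/exponent extraction and normalization (real hardware works on raw bit fields and handles rounding, infinities, and NaNs, all ignored here):

```python
import math

def fp_add(a, b):
    # 0. Extract mantissas and exponents (free in hardware; frexp here).
    ma, ea = math.frexp(a)   # a == ma * 2**ea, with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    # 1. Shift the smaller operand's mantissa so both share one scale.
    if ea < eb:
        ma, ea, mb, eb = mb, eb, ma, ea
    mb = mb / (2 ** (ea - eb))
    # 2. Add the aligned mantissas.
    # 3. Normalize the result back into a floating-point number.
    return math.ldexp(ma + mb, ea)

def fp_mul(a, b):
    # 0. Extract mantissas and exponents.
    ma, ea = math.frexp(a)
    mb, eb = math.frexp(b)
    # 1. Add the exponents.  2. Multiply the mantissas.  3. Normalize.
    return math.ldexp(ma * mb, ea + eb)

print(fp_add(1.5, 0.375))  # 1.875
print(fp_mul(1.5, 0.375))  # 0.5625
```

The alignment shift in step 1 of the add is exactly the extra work that makes a standalone FADD unit less of a bargain than it first appears.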
Many important numerical algorithms like dot products, matrix multiplication, FFT can be implemented in terms of FMA. So yes, it makes sense for processor designers to devote resources to FMA units.
Also, FMA units can be used for pure adds or multiplies as well. For instruction sets that have separate instructions for addition and multiplication in addition to FMA (like x86), I'm not sure whether they are all executed on the same functional units or on separate ones. And if on the same units, what's the limitation preventing more than one add per cycle vs. two FMAs, as on Haswell?
An integer multiplication always needs a "carry propagate add" step at the end. Consequently, addition is always faster because that's the final step of a multiplication. (Floating point is a little different, but not significantly so).
IF you have a chain of additions AND the multiplier has a way to forward the "carry save" representation (ie. the step right before the carry propagate add) AND your carry propagate add takes longer than a single clock cycle, you could theoretically get better performance by feeding the adds into the multiplier accumulation matrix (which is just a big matrix for doing a whole bunch of adds simultaneously).
That's a lot of "and" and "if". Nobody does it that way. The best you get is "fused multiply add".
Not really. AFAIK integer multiplication still usually has lower per-cycle instruction throughput than addition, and obviously even if it were faster, that would only apply to 32-bit, 64-bit, etc. integers, not big-integer multiplication.
This new method will be marginally faster if you ever need to multiply two numbers on a scale where writing down the bits fills every Planck-length cube in the known universe.
Otherwise the “practical” benefit is limited to saving a few symbols when writing abstract math papers.
> This new method will be marginally faster if you ever need to multiply two numbers on a scale where writing down the bits fills every Planck-length cube in the known universe.
If we're talking about physical implementations, then you would also have to take the delay between those Planck volumes into account.
From last time this came up, the number of bits in the numbers involved is 2^(9^12). Storing a count of the number of bits takes about 930 billion digits. Actually storing the bits or digits of the numbers per se takes a whole lot more, to say the least.
Yes, this result will be used to save a few printed symbols each time the topic comes up in future mathematics papers, and the beauty of the slightly simpler mathematical expressions will spark joy in mathematicians reading them.
Karatsuba is a beautiful algorithm and as good as it gets for "small" numbers (on the order of a few thousand bits or less). But when you get bigger than that you need the FFT methods, which are impractical for "small" numbers. So the two algorithms complement each other nicely.
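For reference, a minimal recursive Karatsuba sketch (splitting on decimal digits for readability; real implementations split on machine words and fall back to schoolbook multiplication below a tuned threshold):

```python
def karatsuba(x, y):
    # Base case: small enough for the hardware multiplier.
    if x < 10 or y < 10:
        return x * y
    # Split both numbers at half the digit count of the larger one.
    m = max(len(str(x)), len(str(y))) // 2
    high_x, low_x = divmod(x, 10 ** m)
    high_y, low_y = divmod(y, 10 ** m)
    # Karatsuba's trick: three recursive multiplications instead of four.
    z0 = karatsuba(low_x, low_y)
    z2 = karatsuba(high_x, high_y)
    z1 = karatsuba(low_x + high_x, low_y + high_y) - z0 - z2
    return z2 * 10 ** (2 * m) + z1 * 10 ** m + z0

print(karatsuba(1234, 5678))  # 7006652
```

Three multiplications of half-size numbers instead of four is what drops the exponent from 2 to log2(3) ≈ 1.585.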
Indeed—the advantage of doing it as you describe is that each new multiplication gives you a new digit that can be recorded immediately, rather than having to record the '2' from the 2 x 6 multiplication and later amend it to '5'.
> Over the past decade, mathematicians have found successively faster multiplication algorithms, each of which has inched closer to n × log n, without quite reaching it. Then last month, Harvey and van der Hoeven got there.
> Their method is a refinement of the major work that came before them. It splits up digits, uses an improved version of the fast Fourier transform, and takes advantage of other advances made over the past forty years. “We use [the fast Fourier transform] in a much more violent way, use it several times instead of a single time, and replace even more multiplications with additions and subtractions,” van der Hoeven said.
> Harvey and van der Hoeven’s algorithm proves that multiplication can be done in n × log n steps. However, it doesn’t prove that there’s no faster way to do it.
Yeah, a buddy of mine works at D-Wave. They held the world record for the longest quantum state (or maybe still do, I think it was like 15 mins or something) and I'd joke that they couldn't get past 15 mins because they'd lift the lid to check if the "quantum state was still okay".
But seriously, D-Wave is doing some pretty amazing stuff. The Lower Mainland, in general, is an awesome place to be career-wise. Just a shame about the housing situation.
This is amazing to me:
> "Their conjecture was based on a hunch that an operation as fundamental as multiplication must have a limit more elegant than n × log n × log(log n)"
Only in maths (and sometimes in its "implementation"/ expression in nature in conjunction with physical systems) can such a subjective viewpoint be so powerful.
Also blows me away, at least at first, that multiplying using an FFT would be the fastest method so far. (I guess I have an older mindset based on how slow FFTs used to be.) I wonder at how many digits, in actual implementations, the perf curves break even compared to other methods.
I completely missed this when I first looked at the link, but based on another comment, the break even point is considerably beyond the size of any number that would fit in the universe as we know it. Or any number you could write down if you had a universe for each atom in our universe, etc.
Does this apply to division too, division being multiplication by the inverse? I wonder because:
> And while the new algorithm is important theoretically, in practice it won’t change much, since it’s only marginally better than the algorithms already being used. “The best we can hope for is we’re three times faster,” van der Hoeven said. “It won’t be spectacular.”
wow, what an understatement! 3x is huge.
so if it speeds up factorization, even by 1.5x, then that's significant for RSA, isn't it?
There's a constraint that I as a civilian am missing. Could someone please put me out of my misery:
n*m (in integers) can be computed by starting with zero and adding m to it, n times. That's n operations. Errrrr, cough, assuming that the computer can deal with arbitrarily large n and m.
Numbers that exceed the machine's word size have a common representation whose length is O(lg N). So a simplistic upper bound for computing N×M would be O(min(N,M) × lg(N×M))
How much does each operation cost? Adding together two 5 digit numbers and two 5,000 digit numbers should be counted differently, don’t you agree? So perhaps we should count the number of single-digit operations required.
In that case, adding two five digit numbers will require 5 (or so) additions, and two 5,000 digit numbers will require 5,000 (or so) additions.
What about a 5,000 digit * 5,000 digit multiplication?
Just the very first “big” addition in your scheme will require 5,000 single-digit additions. But that result might now be 5,001 digits long (think of how 99+99 results in a 3 digit sum). Do it again, and if you have carries, it might be 5,002 digits long.
So you’ll have 5,000 + 5,001 + ... = 5000N + (1 + 2 + ... + N) operations. How many times does this repeat, that is, what is N? Well, however big your multiplier is! Your multiplier is around N = 10^5000. So you’ll need a total number of additions approximately equal to
5000 * 10^5000 + 1 + 2 + ... + 10^5000
And what is 1 + 2 + ... + N? It’s N(N+1)/2. So we have
5000 * 10^5000 + 10^5000*(10^5000 + 1) / 2
which we might as well estimate to be
(10^5000)^2.
That’s a _lot_ of operations. And that’s an _under_ estimate! Even grade school multiplication is a lot simpler (if you count single-digit multiplication to have a cost of 1): it only requires 5000^2 elementary operations.
The goal of these funny multiplication algorithms is to beat (number of digits)^2 single-digit operations. What the article shows is that if you have enough digits, you can do only
(number of digits) * (number of digits in the number of digits)
operations to do the multiplication, perhaps off by some fixed, constant factor.
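The comparison is easy to make concrete by tallying single-digit operations for a small example (toy estimates in the spirit of the counting above; the repeated-addition figure comes from the formula rather than from actually performing the additions, which is the whole point):

```python
def digit_ops_schoolbook(x, y):
    """Rough count of single-digit multiplies in grade-school
    multiplication: one per pair of digits."""
    return len(str(x)) * len(str(y))

def digit_ops_repeated_addition(x, y):
    """Rough count of single-digit adds when computing x*y as
    y + y + ... (x times): each of the x additions touches roughly
    as many digits as the running sum, which grows to len(str(x*y))."""
    return x * len(str(x * y))

x, y = 97, 89
print(digit_ops_schoolbook(x, y))         # 4
print(digit_ops_repeated_addition(x, y))  # 388
```

Even at two digits, repeated addition costs ~100x more; for d-digit numbers the gap is roughly 10^d versus d^2, which is why nobody multiplies by counting.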
Do it on paper using the method you learned in third grade. How many times above do you combine a single pair of digits? On the left it's 4 (0x0, 0x1, 1x0, and 1x1). Which is O(n^2), with n = 2 above.
On the right it's 9 pairwise operations for the ones column plus 9 for the tens column (but it would usually be 10 if we had any carries) for a total of 18. A lot more than 4.
Oh, but you say "the computer doesn't have to do those adds digit by digit. It can do all the digits in parallel." That's only true when the numbers fit into a register. If your numbers are not 64 bits wide but 64 thousand bits wide, the computer effectively has to do the addition digit by digit. Which is why the left side above is quicker than the right, and the O(n log n) method in the article is better still.
Call 's' the number of bits of the larger of the two numbers, 'n' and 'm', that you're trying to multiply (that is, s = max(lg(n), lg(m))).
Standard "long multiplication" works in O(s^2) time [1].
To speed things up, consider the univariate polynomial whose coefficients are the digits of each number in some base (for example 10 = p(2) where p(x) = 0 + 1*x^1 + 0*x^2 + 1*x^3). Multiplying two polynomials algebraically (h(x) = p(x)*g(x) = sum over i+j=k of p_i * g_j * x^k) and then evaluating at some point (e.g. '2') is the same as evaluating each polynomial at 's' different points, multiplying those evaluations pairwise, interpolating the resulting points back into the product polynomial, and then evaluating at the same "base" point (e.g. '2').
That is, take 's' points w_k, evaluate p(x) and g(x) at each of them, multiply each pair of evaluations together (that is, p(w_k) * g(w_k)), then interpolate back to get the product polynomial h(x). Evaluate h(x) at the original base (e.g. '2') to get the resulting number.
Choose the w_k's to have some symmetry, say the roots of unity, and you can exploit various tricks (the Fourier transform, or variants of it) to go back and forth between the coefficient representation and the 'point representation' in O(s lg(s)) time.
The bit complexity of the evaluated points, along with a host of other factors, meant that the algorithms discovered were O(s lg(s) lg(lg(s))) or nearby. I would recommend "Fundamental Problems of Algorithmic Algebra" for a good introduction [2]. The article mentions the various methods, but they go by Karatsuba [3], Schönhage and Strassen [4], among others.
I haven't read the paper, but it looks like they're exploiting some 'symmetry' of multivariate rings that are each "simple" and of the form (x^r - 1), where r divides some exact power of 2, and doing some Fourier-transform-like tricks in each of those rings.
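The evaluate / pointwise-multiply / interpolate pipeline described above can be sketched with numpy's FFT standing in for evaluation at the roots of unity (a demonstration only: the function name is mine, and floating-point rounding limits this to modest sizes, which is why real implementations use number-theoretic transforms):

```python
import numpy as np

def multiply_via_roots_of_unity(n, m, base=2):
    # Coefficient vectors of p and g: the base-`base` digits,
    # least significant first, so n == p(base) and m == g(base).
    def digits(k):
        out = []
        while k:
            k, r = divmod(k, base)
            out.append(r)
        return out or [0]
    p, g = digits(n), digits(m)
    size = len(p) + len(g) - 1
    # Evaluate p and g at the size-th roots of unity (the FFT)...
    P = np.fft.fft(p, size)
    G = np.fft.fft(g, size)
    # ...multiply pointwise, then interpolate back (the inverse FFT).
    h = np.fft.ifft(P * G).real.round().astype(int)
    # h is now the coefficient vector of h(x) = p(x) * g(x);
    # evaluating h at the base recovers the product n * m.
    return sum(int(c) * base ** k for k, c in enumerate(h))

print(multiply_via_roots_of_unity(10, 6))  # 60
```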
I can't believe this is a thing. I showed this to my grade school teacher and she told me I was doing it WRONG. I just knew it worked, so I kept using it.