
An article from Ars Technica [1] on this topic makes the following point:

"Unlike most Azure compute resources, which are typically shared between customers, the Cray supercomputers will be dedicated resources. This suggests that Microsoft won't be offering a way of timesharing or temporarily borrowing a supercomputer. Rather, it's a way for existing supercomputer users to colocate their systems with Azure to get the lowest latency, highest bandwidth connection to Azure's computing capabilities rather than having to have them on premises."

Somewhat interestingly, this sounds a bit like hybrid cloud, except that it's hosted entirely in Azure datacenters rather than partially on-premises.

[1] https://arstechnica.com/gadgets/2017/10/cray-supercomputers-...



With distributed computing and cloud hosting allowing for thousands of instances, each with huge resources such as obscene amounts of RAM, is there still a need for supercomputers? They are, effectively, the same thing, right?

I am guessing that I'm not understanding something fully. I don't really see the benefit anymore, now that you can lease thousands of cores, petabytes of disk, and multiple terabytes of RAM.

What's the benefit? What am I missing? Google is none too helpful.


TL;DR - It does come down mainly to the network, but in far more interesting ways than is apparent from some of the answers here - and also to the nature of the HPC software ecosystem that co-evolved with supercomputing over the last 30+ years. This community has pioneered several key ideas in large-scale computing that seem to be at risk in the world of cheap, leasable compute.

In scientific computing (usually where you see them), the primary workload is simulation/modeling of natural phenomena. The nature of this workload is that the more parallelism that is available, the bigger/more fine-grained a simulation can run, and hence the better it can approximate reality (as defined by the scientific models which are being simulated). Examples of this are fluid dynamics, multi-particle physics, molecular dynamics, etc.

The big push with these types of workloads is to be able to get efficient parallel performance at scale - so it isn't just about the # of cores, PB of disk, or TB of DRAM, but whether the software and underlying hardware work well together at scale to exploit the available aggregate compute.

So the network matters, not just raw bandwidth but things like latency of remote memory access and the topology itself - for example, the Cray XCs going to Azure support a programming model (PGAS) that provides large, scalable global memory views, where a program can treat the total memory of a set of nodes as a single address space. Underneath, the hardware and software work together to bound latency, do adaptive per-packet routing and ensure reliability - all at the level of 10s of thousands of nodes. In a real sense, the network is the (super)computer - the old Sun slogan.
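
To give a rough flavor of that "single address space" idea, here is a minimal sketch using MPI-3 one-sided operations rather than a true PGAS language (UPC, Fortran coarrays, SHMEM); the array size and values are made up for illustration:

    /* Sketch only: a "global array" spread across ranks, read remotely as if
     * it were local. Real PGAS languages express this more directly. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each rank contributes a slice; together they form one logical array. */
        const int chunk = 1024;
        double *local;
        MPI_Win win;
        MPI_Win_allocate(chunk * sizeof(double), sizeof(double),
                         MPI_INFO_NULL, MPI_COMM_WORLD, &local, &win);
        for (int i = 0; i < chunk; i++) local[i] = rank;  /* fill with owner id */

        MPI_Win_fence(0, win);  /* make local fills visible to remote readers */

        /* Read element 7 of the next rank's slice as if it were local memory.
         * The interconnect (latency, adaptive routing) decides how fast this is. */
        double remote_val;
        int target = (rank + 1) % nranks;
        MPI_Get(&remote_val, 1, MPI_DOUBLE, target, 7, 1, MPI_DOUBLE, win);

        MPI_Win_fence(0, win);  /* complete the outstanding gets */
        printf("rank %d read %.0f from rank %d\n", rank, remote_val, target);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }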

Where else is this useful? Well, look at deep learning - the new hotness in parallel computing these days. Everyone is realizing that it's amazing to run on GPUs, but once you have large enough problems (which the big guys do), you end up having to figure out how to get a bunch of GPUs to communicate efficiently during data-parallel training (that efficient parallelism thing). This happens to map to a relatively simple set of communication patterns (e.g. AllReduce) that is a small subset of the patterns the HPC community has already solved for - so it's interesting that many deep learning engineers are starting to see the value of things like RDMA and frameworks like MPI (Baidu, Uber, MSFT and Amazon for starters).
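
For the curious, the AllReduce step in data-parallel training is essentially this (a toy sketch, not any framework's actual code; the gradient array and sizes are invented for the example):

    /* Every worker averages its local gradient with everyone else's. Ring/tree
     * allreduce libraries (e.g. what Horovod builds on) implement this pattern. */
    #include <mpi.h>
    #include <stdio.h>

    #define NPARAMS 4   /* tiny "model" just for illustration */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Pretend each worker computed a gradient on its own shard of data. */
        float grad[NPARAMS];
        for (int i = 0; i < NPARAMS; i++) grad[i] = (float)rank;

        /* Sum gradients across all workers in place, then average. The speed of
         * this step is bounded by the interconnect, not the GPUs. */
        MPI_Allreduce(MPI_IN_PLACE, grad, NPARAMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
        for (int i = 0; i < NPARAMS; i++) grad[i] /= nranks;

        if (rank == 0)
            printf("averaged grad[0] = %f\n", grad[0]);
        MPI_Finalize();
        return 0;
    }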

Interestingly though, the word supercomputing is being co-opted by the very companies that you're positioning as the alternative - the Google TPU Cloud is a specialized incarnation of a typical supercomputing architecture. Sundar Pichai refers to Google as being a 'supercomputer in your pocket'.


Alas, HN only allows me to vote your comment up once.

I really love the detailed and expansive responses that some questions generate. Hopefully, they are of interest to more than just me. I've been retired since 2007, so I'm living vicariously through you folks, as I absolutely don't have to make these choices anymore.

A part of me wants to take on some big project, just to get back into it. I actually miss working. Go figure?

Thanks!


Personally I think of Google as being Orwell's 1984 'telescreen in your pocket'.


Do you get MPI ping-pong latencies on the order of a microsecond on a "normal" public cloud?

No? Well, MPI applications that are sensitive to latency are one use case where a "real" supercomputer can be useful.
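
For reference, a ping-pong latency measurement is about as simple as this (a minimal sketch in the spirit of the standard microbenchmarks, assuming exactly two active ranks):

    /* Bounce one byte back and forth and report the average one-way latency.
     * Real benchmarks (e.g. the HPCC suite) sweep message sizes and are more careful. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int iters = 10000;
        char byte = 0;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        /* One-way latency = half the round trip, averaged over iterations. */
        if (rank == 0)
            printf("one-way latency: %.2f us\n", (t1 - t0) / (2.0 * iters) * 1e6);

        MPI_Finalize();
        return 0;
    }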


Ah! This is even better. It led me to finding this:

http://icl.cs.utk.edu/hpcc/hpcc_results_lat_band.cgi

I do believe I get it. Now to find out what kinds of applications benefit greatly from this.

Thanks HN! You always make me dig in and learn new things!

Edit: It was "MPI latency" that led me to that result, by the way.


> Now to find out what kinds of applications benefit greatly from this.

The common applications are usually scientific computing; here's a quick overview[1] of the kinds of algorithms run on supercomputers: PDEs, ODEs, FFTs, sparse and dense linear algebra, etc.

These are usually used for scientific applications like weather forecasting, where you need the result on time (i.e. before the hurricane reaches the coast!).

[1] https://www.nap.edu/read/11148/chapter/7#125
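
To make the latency-sensitivity concrete, here is a toy sketch of the communication pattern behind many PDE solvers (a 1-D heat-equation stencil with a halo exchange; the cell counts and coefficients are invented for illustration):

    /* Each rank owns a strip of the domain and swaps one "ghost" cell with each
     * neighbor every time step. Thousands of steps => thousands of tiny,
     * latency-bound messages. Illustrative only. */
    #include <mpi.h>
    #include <stdlib.h>

    #define N     1000    /* cells per rank (hypothetical) */
    #define STEPS 10000

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, nranks;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

        /* u[0] and u[N+1] are ghost cells holding the neighbors' boundary values. */
        double *u    = calloc(N + 2, sizeof(double));
        double *unew = calloc(N + 2, sizeof(double));
        if (rank == 0) u[1] = 100.0;   /* arbitrary boundary condition */

        for (int step = 0; step < STEPS; step++) {
            /* Halo exchange: one double each way per neighbor, every step. */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Local stencil update: cheap compared to the message latency above. */
            for (int i = 1; i <= N; i++)
                unew[i] = u[i] + 0.1 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
            double *tmp = u; u = unew; unew = tmp;
        }

        free(u); free(unew);
        MPI_Finalize();
        return 0;
    }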


I'll scrape the whole book and read it. Thanks! I know weather models still run on supercomputers, but I understood they currently have plenty.

I look forward to reading the book.


The public cloud may not be that far away. LinkedIn's new data centers have sub-400ns switching and 100G interconnects. https://engineering.linkedin.com/blog/2016/03/project-altair...


Yes, big hyperscalers are adopting Clos networks, as used in HPC for decades. 400ns switching, per se, is not that bad (IIRC 100G IB switches have around 100ns switching cost). Though I wonder, when they do ECMP at the L3 level, what the latency is when crossing multiple switches (for comparison, IB does ECMP at the L2 level). You also need software to take advantage of this. MPI stacks tend to use RDMA with user-space networking. For Ethernet, there is RoCE v1/v2, which is comparable to IB RDMA, though I'm not sure how mature that is at this point.

Certainly it's true that Ethernet has stepped up its game in recent years, while at the same time IB has more or less become a single-vendor play. So it'll be interesting to see what the future holds.

Further, proprietary supercomputer networks have things like adaptive non-minimal routing, which enables efficient use of network topologies that are more affordable at scale, such as flattened butterfly, dragonfly, etc. AFAIK neither IB nor Ethernet + IP-level ECMP supports anything like that.


For special topologies, one can put multiple IB NICs in a single host. Nvidia has support for doing DMA transfers directly into GPU memory from 3rd-party PCIe devices [0].

[0] http://docs.nvidia.com/cuda/gpudirect-rdma/index.html


Supercomputers do optimize for different things, e.g. they have faster and lower latency network interconnects. But over time the differentiation will diminish as the public cloud providers invest more.

You can't beat economics, and the public cloud market will grow to be much larger than the supercomputing market. This is similar to why supercomputers switched from bespoke processors to commodity x86.

AWS already has several instance types with 25Gbit ethernet, for instance: http://www.ec2instances.info/?cost_duration=monthly&reserved...


> they have faster and lower latency network interconnects

It will not be possible to replicate Aries - which is itself a moving target - for general workloads and still be competitive on price.


Just wanted to point out that Azure recently announced several VM instances with 30Gbps Ethernet: https://azure.microsoft.com/en-us/blog/azure-networking-anno...

These include D64v3, Ds64v3, E64v3, Es64v3, and M128ms VMs.


Parallel processing is about network latency and bandwidth for the types of algorithms that don't divide into small computable/parallelizable bits easily. For those tasks, supercomputer interconnects are unmatched.


That makes some sense, so I'll just post my thanks in this one reply so I don't have to thank everyone individually.

Much appreciated.

It does lead me to one additional question - is the need for additional speed great enough to justify this? I don't know how much faster it would be, and I tried Google, but it wasn't even remotely helpful. I may just be using the wrong query phrases.


> is the need for additional speed great enough to justify this?

Yep. I did molecular dynamics simulations on the Titan supercomputer, and also tried some on Azure's cloud platform (using Infiniband). The results weren't even close.


When you say the results weren't even close, and if you have time and don't mind, could you share some numbers/elaborate on that?

My experience with HPC is fairly limited compared to what I think you're discussing. In my case, it was things like blade servers, which was a cost decision. We also didn't have the kind of connectivity and speeds that you have available today.

(I modeled traffic at rather deep/precise levels.)

So, if you have some experience numbers AND you have the free time, I'd love to learn more about the differences in the results. Were the benefits strictly time? If so, how much time are we talking about? If you had to personally pay the difference, which one would you select?

Thanks for giving me some of your time. I absolutely appreciate it.


The problem with molecular modeling / protein folding algorithms is that each molecule interacts with every other molecule. So you have an n^2 problem. Sure, heuristics get this way down to n log n, but what fundamentally slows it is not the growth in computation but the growth in the data being pushed around. Each node needs the data from every node's last time step. For these problems, doubling the nodes of a cluster computer might not even noticeably improve the speed of the algorithm. When I was helping people run these, they were looking at a few milliseconds of simulated time for a small number of molecules, which took a few weeks of supercomputer time. So lots and lots of computing generations were/are needed before we get anywhere close to what they want to model.
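
To show where the n^2 comes from, here's a toy serial version of the all-pairs interaction loop (all sizes and the force law are made up for illustration; in a parallel run, every node would also need every other node's updated positions each time step, which is the data motion described above):

    /* O(n^2) pairwise force accumulation: N*(N-1)/2 interactions per time step. */
    #include <math.h>
    #include <stdio.h>

    #define N 1000   /* particle count (tiny by MD standards) */

    int main(void) {
        static double x[N], y[N], z[N], fx[N], fy[N], fz[N];
        for (int i = 0; i < N; i++) { x[i] = i * 0.1; y[i] = i * 0.2; z[i] = i * 0.3; }

        for (int i = 0; i < N; i++) {
            for (int j = i + 1; j < N; j++) {
                double dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
                double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  /* avoid divide-by-zero */
                double f  = 1.0 / (r2 * sqrt(r2));                /* ~inverse-square magnitude */
                fx[i] += f * dx; fy[i] += f * dy; fz[i] += f * dz;
                fx[j] -= f * dx; fy[j] -= f * dy; fz[j] -= f * dz;
            }
        }
        printf("fx[0] = %g\n", fx[0]);
        return 0;
    }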


seriously.. take a 1/2 second to think.

(compute-chunk-time * latency * nchunks * ncomms) / n-nodes

obviously this oversimplifies things, but generally as an approximation, there you go.

then merge this in with your cost/time equation, and make the call.


seriously.. take a 1/2 second to think.

Don't do this on HN please.


I wonder why they thought I actually wanted the oversimplified stuff. I thought I'd made it clear that I wanted the technical details from their experience. Ah well... Your response is better than mine would have been.


And Azure's MPI/IB support is the best of all the cloud providers.


Mainly network, as others have said. But there are also scientific applications that are more appropriate for distributed computing (don't need the fast interconnect), but get run on supercomputers anyway because of data locality (post-processing/analysis) or grant structures.

The cool thing here (hopefully) is that it makes it easy to have both: HPC/supercomputer for jobs that need that and cheaper/easier cloud resources for jobs that don't.


Fast, low-latency interconnects, and scale-up versions of the resources (e.g. lots of GPUs per box) to minimise communication overhead.


I think the key is in your parent comment: the focus on increasing overall performance through decreasing latency (as opposed to increasing, say, parallelism).


Internode latency, when you have one big problem and communication is the bottleneck.


They could still be shared on a job-by-job basis. For jobs that need the ridiculously fast interconnects (large datasets to be worked on by thousands of cores) you go Cray; when they don't, you go commodity Azure.

Think of it as a rack-sized GPU.


It's interesting to think about which tasks require both a supercomputer and normal systems, tightly connected. Unless, of course, it's consultant BS/upsell trying to cover for / work around performance issues elsewhere in the stack.



