I don't see it. Any non-trivial analog computation involves a very large circuit, which has the problems of normal programming (bugs) and graphical programming (write-only code), but with the extra pitfalls of electronics (resistance, delay, and the resulting oscillations). And then you have to read all the outputs. That's going to be slow and expensive to build.
In what concrete problems do you (or Veritasium) think analog computing could beat a GPU?
Author here -- I don't disagree! I actually noted this in the article:
> Well, it turns out that LLMs are also pretty valuable when it comes to chips for lucrative markets -- but they won’t be doing most of the design work. LLM copilots for Verilog are, at best, mediocre. But leveraging an LLM to write small snippets of simple code can still save engineers time, and ultimately save their employers money.
I think designers getting 2x faster is probably optimistic, but I also could be wrong about that! Most of my chip design experience has been at smaller companies, with good documentation, where I've been focused on datapath architecture & design, so maybe I'm underestimating how much boilerplate the average engineer deals with.
Regardless, I don't think LLMs will be designing high-performance datapath or networking Verilog anytime soon.
At large companies with many designers, a lot of time is spent coordinating and planning. LLMs can already help with that.
As far as design/copilot goes, I think there are reasons to be much more optimistic. Existing models haven't seen much Verilog. With better training data it's reasonable to expect that they will improve to perform at least as well on Verilog as they do on Python. But even if there's only a 10% chance of that, it's reasonable for VCs to invest in these companies.
> With better training data it's reasonable to expect that they will improve to perform at least as well on Verilog as they do on Python.
There simply isn't enough of that code in existence.
Writing Verilog code is about mapping the constructs onto your theory of mind about the underlying hardware. If that were easy, so many engineers wouldn't have so much trouble writing Verilog code that doesn't have faults. You can't write Verilog code just by pasting together Stack Overflow snippets.
Look at the confusion that happens when programmers take their "for-loop" understanding into the world of GPU shaders or HDLs (hardware description languages), where "for-loops" map to hardware and are suddenly both finite and fixed. LLMs exhibit the exact same confusion, only worse.
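To make that concrete, here's a minimal sketch of my own (not from the parent comment): in synthesizable Verilog, a for-loop is unrolled at elaboration time into parallel hardware, so its bound must be a compile-time constant.

    // Counts the set bits of an 8-bit input.
    // The loop is unrolled into 8 adders at synthesis; it never
    // "iterates" at run time, and a data-dependent bound would
    // not synthesize at all.
    module popcount8 (
        input  wire [7:0] bits,
        output reg  [3:0] count
    );
        integer i;
        always @* begin
            count = 0;
            for (i = 0; i < 8; i = i + 1)
                count = count + bits[i];
        end
    endmodule

A model that treats that loop like a Python loop (say, by giving it a data-dependent bound) tends to produce code that simulates fine but won't synthesize.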
I’m actually curious if there even is a large enough corpus of Verilog out there. I have noticed that even tools like Copilot tend to perform poorly when working with DSLs that are majority open source code (on GitHub no less!) where the practical application is niche. To put this in other terms, Copilot appears to _specialize_ on languages, libraries and design patterns that have wide adoption, but does not appear to be able to _generalize_ well to previously unseen or rarely seen languages, libraries, or design patterns.
Anyway that’s largely anecdata/sample size of 1, and it could very well be a case of me holding the tool wrong, but that’s what I observed.
I didn't get into this in the article, but one of the major challenges with achieving superhuman performance on Verilog is the lack of high-quality training data. Most professional-quality Verilog is closed source, so LLMs are generally much worse at writing Verilog than, say, Python. And even still, LLMs are pretty bad at Python!
That’s what your VC investment would be buying; the model of “pay experts to create a private training set for fine tuning” is an obvious new business model that is probably under-appreciated.
If that’s the biggest gap, then YC is correct that it’s a good area for a startup to tackle.
It would be hard to find any experts that could be paid "to create a private training set for fine tuning".
The reason is that those experts do not own the code that they have written.
The code is owned by big companies like NVIDIA, AMD, Intel, Samsung and so on.
It is unlikely that these companies would be willing to provide the code for training, except for some custom LLM to be used internally by them, in which case the amount of code that they could provide for training might not be very impressive.
Even a designer who works at those companies may have great difficulty getting to see significant quantities of archived Verilog/VHDL code, though one can hope that it still exists somewhere.
When I say “pay to create” I generally mean authoring new material, distilling your career’s expertise.
Not my field of expertise but there seem to be experts founding startups etc in the ASIC space, and Bitcoin miners were designed and built without any of the big companies participating. So I’m not following why we need Intel to be involved.
An obvious way to set up the flywheel here is to hire experts to do professional services or consulting on customer-submitted designs while you build up your corpus. While I said “fine-tuning”, there is probably a lot of agent scaffolding to be built too, which disproportionately helps bigger companies with more work throughput. (You can also acquire a company with the expertise and tooling, as Apple did with PA Semi in ~2008, though obviously $100m order of magnitude is out of reach for a startup. https://www.forbes.com/2008/04/23/apple-buys-pasemi-tech-ebi...)
I doubt any real expert would be tempted by an offer to author new material, because it cannot be done well.
One could author some projects that can be implemented in FPGAs, but those do not provide good training material for generating code that could be used to implement a project in an ASIC, because the constraints of the design are very different.
Designing an ASIC is a year-long process and it is never completed before testing some prototypes, whose manufacture may cost millions. Authoring some Verilog or VHDL code for an imaginary product that cannot be tested on real hardware prototypes could result only in garbage training material, like the code of a program that has never been tested to see if it actually works as intended.
Learning to design an ASIC is not very difficult for a human, because a human does not need a huge number of examples the way ML/AI does. Humans learn the rules, and a few examples are enough for them. I have worked at a few companies designing ASICs. While those companies had some internal training courses for their designers, those courses only taught their design methodologies, with practically no code examples from older projects, so they were very unlike how an LLM would have to be trained.
I would imagine it is a reasonably straightforward thing to create a simulator that generates arbitrary chip designs and the corresponding Verilog that can be used as training data. It would be much like how AlphaFold was trained. The chip designs don't need to be good, or even useful; they just need to be valid so the LLM can learn the underlying relationships.
I have never heard of any company, no matter how big and experienced, where it is possible to decide that an ASIC design is valid by any means other than paying for a set of masks to be made and for some prototypes to be manufactured, then tested in the lab.
This validation costs millions, which is why it is hard to enter this field, even as a fabless designer.
Many design errors are not caught even during hardware testing, but only after mass production, like the ugly MONITOR/MWAIT bug of Intel Lunar Lake.
Randomly generated HDL code, even if it does not have syntax errors, and even if some testbench for it does not identify deviations from its specification, is no more likely to be valid when implemented in hardware than the proverbial output of a typewriting monkey.
Validating an arbitrary design is hard. It's equivalent to the halting problem. Working backwards using specific rules that guarantee validity is much easier. Again, the point is not to produce useful designs. The generated model doesn't need to be perfect, indeed it can't be, it just needs to be able to avoid the same issues that humans are looking for.
I know just enough about chips to be suspicious of "valid". The right solution for a chip at the HDL layer depends on your fab, the process you're targeting, what % of physical space on the chip you want it to take up, and how much you're willing to put into power optimization.
The goal is not to produce the right, or even a good solution. The point is to create a large library of highly variable solutions so the trained model can pick up on underlying patterns. You want it to spit out lots of crap.
That's probably where there's a big advantage to being a company like Nvidia, which has both the proprietary chip design knowledge/data and the resources/money and AI/LLM expertise to work on something specialized like this.
I strongly doubt this - they don't have enough training data either - you are confusing (I think) the scale of their success with the amount of Verilog they possess.
I.e., I think you are wildly underestimating the scale of training data needed, and wildly overestimating the amount of Verilog code Nvidia possesses.
GPUs work by having moderate-complexity cores (in the scheme of things) that are replicated 8000 times or whatever.
That does not require having 8000 times as much useful verilog, of course.
The folks who have 8000 different chips, or 100 chips that each do 1000 things, would probably have orders of magnitude more Verilog to use for training.
If they're doing inference on edge devices, one challenge I see is protecting model weights. If you want to deploy a proprietary model on an edge AI chip, the weights can get stolen via side-channel attacks [1]. Obviously this isn't a concern for open models, but I doubt Apple would go the open models route.
I'm curious what the motivation is here -- unfortunately, the dev blog is all in Chinese and I can't read it. If it's mostly to show a proof-of-concept of LLMs on a FPGA, that's awesome!
But if this is targeting real-world applications, I'd have concerns about price-to-performance. High-level synthesis tools often result in fairly poor performance compared to writing Verilog or SystemVerilog. Also, AI-focused SoCs like the Nvidia Jetson usually offer better price-to-performance and performance-per-watt than FPGA systems like the KV260.
Potentially focusing on specialized transformer architectures with high sparsity or significant quantization could give FPGAs an advantage over AI chips, though.
Not to toot my own horn, but I wrote up a piece on open-source FPGA development recently going a bit deeper into some of these insights, and why AI might not be the best use-case for open-source FPGA applications: https://www.zach.be/p/how-to-build-a-commercial-open-source
AMD hasn't shipped their "high compute" SOMs, so there is little point in building inference around them. Using programmable logic for machine learning is a complete waste, since Xilinx never shied away from sprinkling lots of "AI Engines" on their bigger FPGAs, to the point where buying the FPGA just for the AI Engines might be worth it, because hundreds of VLIW cores pack a serious punch for running numerical simulations.
There have been some recent investigations into bitnets (1- or 2-bit weights for NNs, including LLMs) showing that a 1.58-bit weight (with values -1, 0, 1) can achieve very good results. Effectively that's 2 bits. The problem is that doing 2-bit math on a CPU or GPU isn't going to be very efficient (lots of shifting & masking). But doing 2-bit math on an FPGA is really easy and space-efficient. Another bonus is that many of the matrix multiplications are replaced by additions. Right now, if you want to investigate these smaller weight sizes, FPGAs are probably the best option.
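As a rough illustration of why this maps well to an FPGA (my own sketch; the 2-bit weight encoding here is just an assumption), a ternary "multiply" reduces to add/subtract/skip, so no DSP multiplier is needed:

    // One ternary multiply-accumulate step.
    // Assumed weight encoding: 2'b01 = +1, 2'b11 = -1, anything else = 0.
    module ternary_mac (
        input  wire signed [7:0]  x,        // activation
        input  wire        [1:0]  w,        // ternary weight, 2 bits
        input  wire signed [15:0] acc_in,
        output reg  signed [15:0] acc_out
    );
        always @* begin
            case (w)
                2'b01:   acc_out = acc_in + x;  // weight +1
                2'b11:   acc_out = acc_in - x;  // weight -1
                default: acc_out = acc_in;      // weight  0
            endcase
        end
    endmodule

That's just a few LUTs and a small adder per weight, which is exactly the kind of structure FPGA fabric handles cheaply.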
> High-level synthesis tools often result in fairly poor performance compared to writing Verilog or SystemVerilog.
I'm curious, do you have any intuition for what percent of the time is spent shifting & masking vs. adding & subtracting (int32s I think)? Probably about the same?
The big challenge when it comes to using FPGAs for deep learning is pretty simple: all of that reprogrammability comes at a performance cost. If you're doing something highly specific that conventional GPUs are bad at, like genomics research [1] or high-frequency trading [2], the performance tradeoff is worth it. But for deep learning, GPUs and AI ASICs are highly optimized for most of these computations, and an FPGA won't offer huge performance increases.
The main advantage FPGAs offer is being able to take advantage of new model optimizations much earlier than ASIC implementations could. Those proposed ternary LLMs could potentially run much faster on FPGAs, because the hardware could be optimized for exclusively ternary ops. [3]
If I remember correctly, about 80% of a modern FPGA's silicon is used for connections. FPGAs have their uses, and very often a big part of that is the field programmability. If that is not required, there is no good reason another solution (ASIC, GPU, etc.) couldn't beat the FPGA in theory. Now, in practice there are some niches where this is not absolutely true, but I agree with GP that I see challenges for deep learning.
An ASIC will always have better performance than an FPGA, but it will have an acceptable cost only if it is produced in large enough numbers. You will always want an ASIC, but only seldom will you be able to afford it.
So the decision of ASIC vs. FPGA is trivial, it is always based on the estimated price of the ASIC, based on the number of ASICs that would be needed.
The decision between off-the-shelf components, i.e. GPUs and FPGAs, is done based on performance per dollar and performance per W and it depends very strongly on the intended application. If the application must compute many operations with bigger numbers, e.g. FP32 or FP16, then it is unlikely that an FPGA can compete with a GPU. When arithmetic computations do not form the bulk of an algorithm, then an FPGA may be competitive, but a detailed analysis must be made for any specific application.
I'm definitely not! I'm a hardware designer and I work with FPGAs all the time, both for work and for personal projects. Like with all things, there's a right tool for every job, and I think for modern DL algorithms like Transformers, GPUs and AI ASICs are the better tools. For rapid hardware prototyping, or for implementing specialized architectures, FPGAs are far better.
Large, fast FPGAs are great but very expensive; small, slow FPGAs are not practical for most solutions, where significantly cheaper ARM controllers are used instead.
500 GB/s is going to limit it to at best 1/4 the DL performance of an Nvidia GPU. I'm not sure what the floating-point perf of these FPGAs is, but I imagine that also might set a fundamental performance limit at a small fraction of a GPU.
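Back-of-the-envelope (my numbers, assuming token generation is memory-bandwidth-bound, a 7B-parameter FP16 model with ~14 GB of weights, and something like an A100's ~2 TB/s of HBM on the GPU side):

    tokens/s ≈ bandwidth / bytes of weights read per token
    GPU  at ~2000 GB/s: 2000 / 14 ≈ 140 tokens/s
    FPGA at  ~500 GB/s:  500 / 14 ≈  35 tokens/s  (about 1/4)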
Well, I keep seeing all models quantized, and for 2-bit, 4-bit and 1-bit quantizations I had very good inference performance (either throughput or latency) on CNNs and some RNNs on Alveo boards using FINN (so mostly high-level synthesis and very little actual FPGA wrangling). No idea about the current status of all these; will read the paper though :-)
1. They are hard to use (program). If you're a regular ML engineer, there will be a steep learning curve with Verilog/VHDL and the specifics of the chip you choose, especially if you want to squeeze all the performance out of it. For most researchers it's just not worth it. And for production deployment it's not worth the risk of investing into an unproven platform. Microsoft tried it many years ago to accelerate their search/whatever, and I think they abandoned it.
2. Cost. High performance FPGA chips are expensive. Like A100 to H100 price range. Very few people would be willing to spend this much to accelerate their DL models unless the speedup is > 2x compared to GPUs.
FPGAs are also reasonably good at breadboarding modules to be added to ASICs. You scale down the timing and you can run the same HDL and perform software integration at the same time as the HDL is optimized.
Much cheaper and faster than gate level simulation.
Every couple of years I revisit the FPGA topic, eager to build something exciting. I always end up with a ton of research, where I learn a lot but ultimately shy away from building something.
This is because I cannot find a project that is doable and affordable for a hobbyist but at the same time requires an FPGA in some sense. To put it bluntly: I can blink a LED for a fiver with a micro instead of spending hundreds for an FPGA.
So, assuming I am reasonably experienced in software development and electronics and I have 1000 USD and a week to spend.
What could I build that shows off the capabilities of an FPGA?
I work at one of the big 3 FPGA companies, so I can give you an idea of where our teams spend most of their time, and you can translate that into a hobbyist project as you will.
1. Video and Broadcast. Lots of things to be done here. New protocols are being introduced every year by IEEE for sending video between systems. Most cutting-edge cameras have some sort of FPGA inside doing niche image processing. You can get a sensor and build yourself your own Camera-on-Chip. It's a fantastic way to lose a year or two (I can attest to that). Some good material on the matter here: https://www.mathworks.com/discovery/fpga-image-processing.ht...
2. Compute Acceleration. This is more data centre-specific. SmartNICs, IPUs and the like. Hard to make a dent unless you want to spend 200k on a DevKit, but you could prototype one on a small scale. Some sort of smart FPGA switch that redirects Ethernet traffic between a bunch of Raspberry Pis depending on one factor or another. One company that comes to mind is Napatech. They make a bunch of really interesting FPGA server systems: https://www.napatech.com/products/nt200a02-smartnic-capture/
3. Robotics and Computer Vision. Plenty of low-hanging fruit to be plucked here. A ridiculous amount of IO, all needed to work in near real time. Hardware acceleration kernels on top of open standards like ROS 2. I always point people in the direction of Acceleration Robotics, a startup in Barcelona, for this. They're epic: https://github.com/ros-acceleration
4. Telecommunications. This is a bit of a dark-art area for me, where the RF engineers get involved. From what my colleagues tell me, FPGAs are good for this because no other device can service the massive MIMO antenna arrays short of building custom ASICs, and the rate of innovation in this area means an ASIC made one year is redundant the next. Software-defined radios are the current trend. You could have fun making your own radio using an FPGA: https://github.com/dawsonjon/FPGA-radio
Reasonably experienced and 'a week' can mean vastly different things... It's certainly easier to keep the cost down with longer time-frames.
For a focus on electronics rather than implementing some kind of toy 'algorithm accelerator', I find the low-hanging/interesting projects are those where the combination of requirements exceeds a micro's peripheral capabilities - i.e. multiple input/output/processing tasks which could be performed on a micro individually, but adding synchronisation or latency requirements makes them rather non-trivial.
- Very wide/parallel input/output tasks: ADC/DACs for higher samplerate/bitdepth/channel count than typically accessible with even high-end micros
- Implementing unique/specialised protocols which would have required bit-banging, abuse of timer/other peripherals on a micro (i.e. interesting things people achieve with PIO blocks on RP2040 etc)
- Signal processing: digital filters and control systems are great because you can see/hear/interact with the output, which can help build a sense of achievement (tiny sketch below).
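For that last bullet, here's a toy example of my own of the sort of thing I mean: a 4-tap moving-average filter you can drop between an ADC and a DAC and actually hear working.

    // Averages the last 4 samples of a 12-bit ADC stream.
    module mov_avg4 (
        input  wire               clk,
        input  wire signed [11:0] sample_in,
        output reg  signed [11:0] sample_out
    );
        reg signed [11:0] taps [0:3];
        wire signed [13:0] sum = taps[0] + taps[1] + taps[2] + taps[3];
        integer i;
        always @(posedge clk) begin
            taps[0] <= sample_in;
            for (i = 1; i < 4; i = i + 1)
                taps[i] <= taps[i-1];
            sample_out <= sum >>> 2;   // divide by 4
        end
    endmodule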
When starting out, it's also less overwhelming to start with smaller parts and allocate the budget to the rest of the electronics. They're still incredibly capable and won't seem as under-utilised. Some random project ideas:
- Find a sensing application that's interesting and then take it to the logical extreme - arrays of photo/hall-effect sensors sampled at high speed and displayed, accelerometers/IMU sensor fusion
- Laser galvanometers and piezo actuators are getting more accessible
- Small but precise/fast motion stages for positioning or sensing might present a good combination of input, output, filtering and control systems.
- With more time/experience you could branch into more interesting (IMO) areas like RF or imaging systems.
With more info about your interest areas I can give more specific suggestions.
Good list, thanks. I have a couple of years professional experience as a software dev and worked in the embedded space too. Nowadays I am in security and that is definitely an area of interest.
I only dabble with recreationally reverse engineering industrial/consumer grade HW and following blogs/conferences, so I can only provide a rough shotgun of search terms to try and hit something you're interested in:
- The Glasgow interface explorer is an example of a smaller FPGA making interface level RE tooling more accessible.
- The ChipWhisperer hardware has a focus on power-supply glitching, side-channel attacks and general hardware security education/testing.
- There's a handful of FPGA-based implementations intended for high-speed protocol sniffing/MITM (TCP/IP, USB and CAN bus are all pretty common) on GitHub etc.; Cynthion is one example.
- Some recent projects have been trying to implement and improve the FOSS ARM Cortex programming and trace experience, Orbuculum ORBTrace probe is an example though the benefits aren't fully realised yet.
- In an odd use-case for an FPGA, I've personally seen hardware that enforces brutal/paranoid DRM/licensing via customised downloaded bitstreams to guard against reverse-engineering/copy efforts, all to most likely run a soft-CPU. I've read (unsubstantiated) that this approach appears on some military hardware.
- Slightly adjacent to specific FPGA projects, but the SDR tooling ecosystem has lots of cool stuff to play with for wireless signal identification/spoofing/re-implementation: HackRF, LimeSDR, GNU Radio, etc. If you want to go deep, there's lots of overlap with custom FPGA implementations.
I was part of a startup that did ternary CNNs on FPGAs in 2017. It involved a ton of nitty-gritty work and a massive loss of generality, and in the end a Raspberry Pi could solve the same problem faster and cheaper.
Um, no? The actual problem is that most FPGAs already have DPUs for machine learning integrated on them. Some Xilinx FPGAs have 400 "AI Engines" which provide significantly more compute than the programmable logic, the almost 2000 DSP slices or the ARM cores. This means that the problem with FPGAs is primarily lack of SRAM and limited memory bandwidth.
I think one major challenge they'll face is that their architecture is incredibly fast at running the ~10-100B parameter open-source models, but starts hitting scaling issues with state-of-the-art models. They need 10k+ chips for a GPT-4-class model, but their optical interconnect only supports a few hundred chips.