Ahh, that makes a lot more sense. Peer review was probably one of the best things I did at my workplaces: helping to point out the thorns in my eye, and vice versa. There could be a few too many LGTM comments, but I always welcomed having a second set of eyes.
It can also help me scope commits. I definitely had a habit early on of bundling maybe 4-5 commits' worth of code into one review; I figured it would waste the reviewer's time a lot less. Fortunately I was taught early on why that's a bad practice, for multiple reasons.
> But when developers put AI in consumer products, people expect it to behave like software, which means that it needs to work deterministically. If your AI travel agent books vacations to the correct destination only 90% of the time, it won’t be successful.
This is the fundamental problem that prevents generative AI from becoming a "foundational building block" for most products. Even with rigorous safety measures in place, there are few guarantees about its output. AI is about as solid as sand when it comes to determinism, which is great if you're trying to sell sand, but not so great if you're trying to build a huge structure on top of it.
I've made this point a bunch of times elsewhere: the reason AI software is always "AI software" and not just a useful product is that AI is fallible.
The reason we can build such deep and complex software systems is that each layer can assume the one below it will "just work". If each layer only worked 99% of the time, we'd all still be interfacing with assembly, because we'd have to be aware of the mistakes made below us and deal with them; otherwise the errors would compound until software was useless.
Until AI achieves the level of determinism we have with other software, it'll have to stay at the surface.
Recent work from Meta uses AI to automatically increase test coverage with zero human checking of AI outputs. They do this with a strong oracle for AI outputs: whether the AI-generated test compiles, runs, and hits yet-unhit lines of code in the tested codebase.
We probably need a lot more work along this dimension of finding use cases where strong automatic verification of AI outputs is possible.
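For what it's worth, here's a minimal sketch of what that kind of automatic oracle could look like. It assumes a Python codebase with pytest and coverage.py, and the function names are mine; it's an illustration of the idea, not Meta's actual TestGen-LLM pipeline.

```python
import subprocess

def coverage_percent(test_paths: list[str]) -> float | None:
    """Run the given tests under coverage.py; return total line coverage, or None if they fail."""
    result = subprocess.run(["coverage", "run", "-m", "pytest", *test_paths], capture_output=True)
    if result.returncode != 0:  # the candidate didn't even run cleanly
        return None
    report = subprocess.run(["coverage", "report", "--format=total"], capture_output=True, text=True)
    return float(report.stdout.strip())

def accept_generated_test(existing_tests: list[str], candidate: str) -> bool:
    """Keep an LLM-generated test only if it passes and covers lines the old suite didn't."""
    baseline = coverage_percent(existing_tests)
    with_candidate = coverage_percent(existing_tests + [candidate])
    return baseline is not None and with_candidate is not None and with_candidate > baseline
```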
It can be hard enough for humans to just look at some (already consistently passing) tests and think, "is X actually the expected behavior or should it have been Y instead?"
I think you should have a look at the abstract, especially this quote:
> 75% of TestGen-LLM's test cases built correctly, 57% passed reliably, and 25% increased coverage. During Meta's Instagram and Facebook test-a-thons, it improved 11.5% of all classes to which it was applied, with 73% of its recommendations being accepted for production deployment by Meta software engineers
This tool sounds awesome in that it generated real tests that engineers liked! "zero human checking of AI outputs" is very different though, and "this test passes" is very different from "this is a good test"
Good points regarding test quality. One takeaway for me from this paper is that you can increase code coverage with LLMs without any human checking of LLM outputs, because it's easy to build a fully automated checker. Pure coverage may not be the most exciting metric, but it's still fairly interesting and nontrivial. LLM-based applications that run fully autonomously, without bubbling hallucinations up to users, seem elusive, but this is an example.
You hit the nail on the head. It's been almost tragically funny watching people frantically juggle five bars of wet soap over the past two years, solving problems that (from what I've seen so far) had already been solved in a (boring) deterministic way using far fewer resources.
Going further: our predecessors put so much work into wrangling non-deterministic electronics into a stable and _correct_ platform that it looks ridiculous to squeeze another layer of non-determinism back in to solve the same classes of problems.
The irony here is that many domains use statistical methods and successfully bound their complexity and failure modes. A lot of people struggle with statistics, but in domains where the glove fits, I think AI will slot in really nicely all across the stack.
But software only works 99% of the time, for some definition of "works": 99% of the days it's run, 99% of clicks, 99% of CPU time in a given component, 99% of versions released and linked into some business's production binary, 99% of GitHub tags, 99% of commits, 99% of the software that that one guy says is battle-tested.
If twenty components each work 99% of the time, then they only have a 0.99^20 ≈ 82% chance of working as a collective.
If your 5.1 GHz CPU (roughly 5.1 billion instructions per second) had a 0.00000001% chance of failing on any given instruction, you'd have a ~40% chance of a crash every second.
If a flight had a 1% chance of killing everyone aboard, then with 10 million passengers a day, 10 million * 1% = 100,000 people would die in plane crashes every day.
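If anyone wants to check the arithmetic, the back-of-the-envelope numbers above work out (the failure rates are the hypothetical ones from this comment, not measurements):

```python
# Twenty 99%-reliable components in series
print(0.99 ** 20)  # ~0.82

# A ~5.1 billion instruction/second CPU failing on 0.00000001% (1e-10) of instructions
p_crash_per_second = 1 - (1 - 1e-10) ** 5.1e9
print(p_crash_per_second)  # ~0.40

# 10 million passengers/day with a 1% chance of dying per flight
print(10_000_000 * 0.01)  # 100,000 deaths per day
```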
Software works so much more than 99% of the time that it's a rather deliberate strawman to claim otherwise.
Newly-"AI"-branded things that I have touched work substantially less than 90% of the time. There are like 3 orders of magnitude difference, even people who aren't paying any attention at all are noticing it.
It’s all about limits and edge cases. a+b may “fail” at INT_MAX and at 0.1+0.2. You don’t `==` your doubles, you don’t (a+b)/2 your mid, and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”. You don’t care about determinism because it’s a statistical instrument.
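To make those edge cases concrete, a quick Python sketch (the midpoint overflow applies to fixed-width-integer languages rather than Python's arbitrary-precision ints):

```python
import math

print(0.1 + 0.2 == 0.3)              # False: binary doubles can't represent these exactly
print(math.isclose(0.1 + 0.2, 0.3))  # True: compare doubles with a tolerance, not ==

def midpoint(lo: int, hi: int) -> int:
    # In C/Java-style ints, (lo + hi) / 2 can overflow near INT_MAX;
    # lo + (hi - lo) / 2 never forms the oversized sum.
    return lo + (hi - lo) // 2
```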
> and you don’t ask ai to just book you vacation. You ask it to “collect average sentiment from `these_5k_reviews()` ignoring apparently fake ones, which are defined as <…>”.
That's exactly my point. You have to interact directly with the A.I. and be aware of what it's doing.
Structured outputs help; paired with regular old systems design, I think you can get pretty far. It really depends on what you're building though.
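As a rough sketch of what I mean by structured outputs (assuming pydantic v2 here; the `TripSuggestion` fields and the retry-on-failure idea are just illustrative):

```python
from pydantic import BaseModel, ValidationError

class TripSuggestion(BaseModel):
    destination: str
    nights: int
    estimated_price_usd: float

def parse_suggestion(raw_json: str) -> TripSuggestion | None:
    """Validate the model's JSON before anything downstream touches it."""
    try:
        return TripSuggestion.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject or retry instead of passing junk to the booking system
```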
>If your AI travel agent books vacations to the correct destination only 90% of the time
that would be using the wrong tool for the job. an AI travel agent would be very useful for making suggestions, either for destinations or giving a list of suggested flights, hotels etc, and then hand off to your standard systems to complete the transaction.
there are also a lot of systems that tolerate "faults" just fine such as image/video/audio gen
We have lists with shallowly gamed results all over the place, and they work in the owners'/bots' favor, not yours. You can't expect something not running on your device (or on a GPU rented from a third party) to work in your interest.
And hopefully a real recommendation engine won't be weirdly biased towards different answers depending on the exact phrasing, tone, and idiom of the request.
I 100% agree. People get so caught up trying to do everything 90% right with AI, but they forget there's a reason most websites offer at least two 9s of uptime.
I'm not really sure what your stance is here, because you say you agree with the GP but then throw out figures that clearly disagree with the author's point (99% uptime is vastly greater than 90% accuracy).
> If your AI travel agent books vacations to the correct destination only 90% of the time, it won’t be successful.
Well, I don't agree. I think there are ways to make this successful, but you have to be honest about the limitations you're working with and play to your strengths.
How about an AI travel agent that gets your itineraries at a discount with the caveat that you be ready for anything. Like old, cheap standby tickets where you just went wherever there was an empty seat that day.
Or how about an AI Spotify for way less money than current Spotify. It's not competing on quality, it can't. Occasionally you'll hear weird artifacts, but hey it's way cheaper.
We've had good, free (non ai) media recommendation tools in the past and they got killed by licensing agreements.
AI is creating a post-scarcity content economy where quality is going to be the only driver of value.
If you are the rights holder of any premium human created media content you are not going to let a 'cheap' AI tool get access to recommend it out to people.
The AI travel agent is trivial to solve though. It's the same as the human travel agent. Put the plan and pricing together, then give it to the user to sign and accept. Do it in an app, do it in an email, do it on a piece of paper, whatever floats your boat, but give them something they can review and accept instead of trying to do everything verbally or in a basic chat interface.
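A minimal sketch of that propose-then-accept flow (the type and field names are illustrative, not any particular product's API):

```python
from dataclasses import dataclass

@dataclass
class ProposedTrip:
    destination: str
    dates: str
    total_price_usd: float

def book_trip(proposal: ProposedTrip, user_accepted: bool) -> str:
    # The AI only drafts the proposal; nothing is purchased until the user signs off.
    if not user_accepted:
        return "Proposal discarded; nothing was booked."
    return f"Booked {proposal.destination} for {proposal.dates} at ${proposal.total_price_usd:,.2f}"
```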
I'm not disagreeing with the "needs to work deterministically" -- there is a need for that, but this is a poor example. "Hey robot, plan a trip to Mexico" might still save me time overall if done right, and that has value.
I have a question for folks working heavily with AI blackboxes related to this - what are methods that companies use to test the quality of outputs? Testing the integration itself can be treated pretty much the same as testing around any third-party service, but what I've seen are some teams using models to test the output quality of models... which doesn't seem great instinctively
Take this with a grain of salt because I haven't done it myself, but I would treat this the same as testing anything that involves an element of randomness.
If you're writing a random number generator that produces numbers between 0 and 100, how would you test it? Throw your hands up in the air and say nope, can't test it, it's not deterministic? Or maybe you just run it 1000 times and make sure all the numbers are indeed between 0 and 100. Maybe count up the number frequencies and verify they're roughly uniform. There are lots of things you can check for.
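Something like this, for instance (a toy sketch; `my_rng` stands in for whatever generator you're actually testing):

```python
import random
from collections import Counter

def my_rng() -> int:
    return random.randint(0, 100)  # stand-in for the generator under test

def test_rng_properties():
    samples = [my_rng() for _ in range(1000)]
    assert all(0 <= n <= 100 for n in samples)  # the range invariant must always hold
    counts = Counter(samples)
    assert max(counts.values()) < 50  # crude uniformity check: no single value dominates
```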
So do the same with your LLMs. Test it on your specific use-cases. Do some basic smoke tests. Are you asking it yes or no questions? Is it responding with yes or no? Try some of your prompts on it, get a feel for what it outputs, write some regexes to verify the outputs stay sane when there's a model upgrade.
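For the yes/no case that might look something like this (with a placeholder `ask_model` standing in for your real LLM client):

```python
import re

def ask_model(prompt: str) -> str:
    return "Yes."  # placeholder: swap in your actual LLM client call

YES_NO = re.compile(r"^\s*(yes|no)\b", re.IGNORECASE)

def test_yes_no_prompts_keep_their_shape():
    prompts = [
        "Answer yes or no: is Paris in France?",
        "Answer yes or no: is 2 + 2 equal to 5?",
    ]
    for prompt in prompts:
        reply = ask_model(prompt)
        assert YES_NO.match(reply), f"unexpected reply shape: {reply!r}"
```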
For "quality" I don't think there's a substitute than humans. Just try it. If the outputs feel good, add your unit tests. If you want to get scientific, do blind tests with different models and have humans rate them.
But a knowledgeable human can take the itinerary and run with it. I know I've done that enough with AI-generated code; it's basically boilerplate. You still run it through the same tests, reviews, and verification as you would have had to do anyway.
And yet, generative AI also seems to be poor at randomness. When I asked Google Gemini for a list of 50 random words, it gave me a list of 18 unique words, with 16 of them repeated exactly 3 times.
Randomness is difficult. I wouldn't expect any LLM to be able to reliably produce random anything, except in the cases where they have access to tools (ChatGPT Code Interpreter could use Python's random.random() for example).
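Roughly what the tool-use path looks like: the model just has to call ordinary code for the random part (the word list here is a stand-in):

```python
import random

WORDS = ["apple", "harbor", "copper", "lantern", "meadow", "quartz", "thistle", "violet"]

def fifty_random_words() -> list[str]:
    # choices() samples with replacement; random.sample() would give unique picks
    # if the word list had at least 50 entries.
    return random.choices(WORDS, k=50)
```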
Nowhere near as good as ChatGPT 4o or Claude (not once has it outperformed those other two), but at least it can do math and data science correctly most of the time compared to the regular model.
I use it as a secondary when the other two are chewing on other tasks already.
I only own it as I am an outrageously heavy consumer of LLMs for all sorts of little projects at once and they all seem to pause one window if you use another.
I hand-write a work journal. Just an A5 notebook and a few pens of different colors. Definitely an essential piece of my dev toolkit. I've especially come to love the free-form nature of hand-writing, which allows me to visualize more of my thoughts than a digital text editor.
The journal has served two main purposes. One, I can write and annotate free-form pseudocode at exactly the level of abstraction I need without getting distracted by the errors produced by the code editor. It's really helped me work through the difficult parts of coding puzzles before I ever touch the keyboard to implement.
Two, I have a scientific notebook for debugging. I write down a hypothesis, design a small experiment, document the steps and complications as I go, and write down what the actual result was; then repeat the cycle. Putting it all in writing keeps it straight so I don't chase my tail, and I have something to look back on if I need to explain the bug and how it was solved to my coworkers.
I also took the class that uses this OS at MIT. Absolutely fantastic. I was just browsing the class website today actually, and you can totally kinda take the class yourself. The site has all the lecture notes, the labs, and even a version of xv6 in a repo with branches for all the labs, along with instructions for getting it all working yourself. It's kinda amazing how open it is.
Logical properties have been amazing to work with. They're meant to make CSS rules generalize across languages with different text directions, but thinking in terms of block, inline, start, and end naturally guides me into styles that are more responsive. I think it's given me a much better conceptual model and understanding of CSS.
The site actually has some tutorials for creating these sorts of animations with a specific focus on perfectly looping gifs [1]. Looks like it's all done with Processing [2].
Wow, this is surprisingly accessible! How are people incorporating this aside from screensavers? That high-contrast LCD screen from the Playdate would make a brilliant frame for these animations.
I think this basically has the same effect as the blog post produces, but it fits in my head better when the dynamic part is expressed as the equation for a line.
What kind of glitches would you experience in this context without rotation minimizing frames? If the missiles are a non-symmetrical shape, I would guess that you could see some "snapping" in its rotation, but if the missile is something like a centered sphere, my intuition is that the snapping wouldn't be perceptible.