Author here. Yes, I think the original GitHub Copilot autocomplete UI is (ironically) a good example of a HUD! Tab autocomplete just becomes part of your mental flow.
Recent coding interfaces are all trending towards chat agents though.
It’s interesting to consider what a “tab autocomplete” UI for coding might look like at a higher level of abstraction, letting you mold code in a direct-feeling way without being bogged down in details.
If that's what you think a HUD is, then a HUD is definitely way, way worse. Rather than a copilot sitting next to you, that's someone grabbing your hands and doing things with them while you're at the controls.
But if I invoke the death of the author and pretend HUD meant HUD, then it's a good point: tools are things you can form a cybernetic system with, classic examples being things like hand tools or cars, and you can't form a cybernetic system with something trying to be an "agent". To be in a cybernetic system with something you need predictable control and fast feedback, roughly.
I take "HUD" here to just mean "in your line of vision" or "in the context of your actual task" or minimizing any context switch to another interaction (chat window).
Rather I think most implementations of HUD AI interactions so far have been quite poor because the interaction model itself is perhaps immature and no one has quite hit the sweet spot yet (that I know of). Tab autocompletion is a simple gesture, but trades off too much control for more complex scenarios and is too easy to accidentally activate. Inline chat is still a context switch and also not quite right.
Neat -- Scrappy looks like a lovely prototype! As the creators say in their writeup, it fits nicely into the lineage of HyperCard-style “media with optional scripting” editors, which provide a gentle slope into programming.
In the section on dynamic documents towards the end of our essay, we show several of our lab’s own takes on this category of tool, including an example of integrating AI as an optional layer over a live programmable document.
Actually, Patchwork has surprisingly few features! Think of it more like an OS than a product. The goal is a small set of composable primitives that let you build many things - documents, tools, branching/diffs, plugins…
To answer your question: although we use Patchwork every day, it’s currently very rough around the edges. The SDK for building stuff needs refinement (and SDKs are hard to change later…) Reliability and performance need improvement, in coordination with work on Automerge. We also plan to have more alpha users outside our lab before a broader release, to work through some of these issues.
In short, we feel that it’s promising and headed in a good direction, but it’s not there yet.
> I agree, I feel like the authors are underestimating the effect the new AI is already having on the concept of local software crafting
Coauthor here -- did you catch our section on AI? [1]
We emphatically agree with you that AI is already enabling new kinds of local software crafting. That's one reason we are excited about doing this work now!
At the same time, AI code generation doesn't solve the structural problems -- our whole software world was built assuming people can't code! We think things will really take off once we reorient the OS around personal tools, not prefabricated apps. That's what the rest of the essay is about.
Yes, but I think we have a somewhat different idea about the market forces. My impression from your essay is that you believe app developers will add APIs that enable personal tools, and only then will local software crafting take off.
My belief is that it is happening already: local software crafting is happening now, before the tools are ready. People aren't going to wait for good APIs to exist; people will MacGyver things together. They'll scrape screens (sometimes with OCR), run emulated devices in the cloud, and call APIs incorrectly and abusively until they get what they need. They won't ask for permission.
A lot of software developers may transition from building to cleaning up knots.
Yes, I think atproto is a great example of the “shared data” pattern for composable tools! Especially since it handles public social scale, which is not addressed by the other systems we mention.
AFAIK, atproto is primarily designed to support multiple distinct clients over shared data, but I also wonder if it could help with composing more granular views within a client. I previously worked on a browser extension for Twitter, where data scraping was a major challenge; that seems easier when building on an open protocol like atproto.
Sorry we didn’t mention it — it’s on our radar, but we ran out of space and had to omit lots of good prior art.
I should also mention, btw, that Bluesky’s user-configurable feeds are a perfect example of a gentle slope from user to creator!
You make a fair point! Ease of use matters. We all want premade experiences some of the time. The problem is that even in those (perhaps rare!) cases where we want to tweak something, even a tiny thing, we’re out of luck.
An analogy: we all want to order a pizza sometimes. But at the same time, a world with only food courts and no kitchens wouldn’t be ideal. That’s how software feels today -- the “kitchen” is missing.
Also, you may be right in the short term. But in the long run, our tools also shape our culture. If software makes people feel more empowered, I believe that’ll eventually change people’s preferences.
Well, if I may continue my pessimistic outlook, I would simply say that anyone can cook, but not everyone can cook. Programmers are chefs - we take ingredients called SDKs and serve them up into meals called custom software. Anyone who isn't a chef, might need to buy the packaged cake mix at Walmart.
For something as complex as software, it's sad, but it's almost... okay? Every industry has gone through this; there was a time when cars were experimental and hand-assembled. Imagine if Henry Ford in the 1920s had focused on democratizing car parts so anyone can build their own car with thousands of potential combinations; I don't think it would have worked out. It is still true that you can, technically speaking, build your own car; but nobody pretends that we can turn everyone into personalized car builders if we just try hard enough.
I gotta say I don’t understand your point about cooking — billions of people who aren’t professional chefs cook meals every day! These meals may not live up to restaurant standards but they have different virtues — like making it taste just the way you like it, or carrying on a family tradition.
On that note, Robin Sloan has a beautiful post about software as a home cooked meal…
That said, I think talking about cars may be stronger ground for the argument you’re making. Mass production is incredible at making cheap uniform goods. This applies even more in software, where marginal costs are so low.
The point of our essay, though, is that the uniformity of mass produced goods can hinder people when there’s no ability to tweak or customize at all. I’m not a car guy, but it seems like cars have reasonably modular parts you can replace (like the tires) and I believe some people do deeper aftermarket mods as well. In software, too often you can’t even make the tiniest change. It’s as if everyone had to agree on the same tires, and you needed to ask the original manufacturer to change the tires for you!
First, thanks for the original article; it’s great to know a team is going deep on this.
I am a bit fed up with software, less because of malleability and more because of the cloud walled gardens. I can't open my Google Doc in something else the way I can open a PDF in different programs. Not without exporting it.
This got me interested, and I found remotestorage.io, which looks very promising. I like the idea of buying my 100GB of cloud storage from wherever and then composing the apps I want to use around it.
I hadn't thought of malleable software... that's a whole other dimension! Thanks for introducing this as a concept worth talking about. Of course I have heard of elisp and used excel but haven't thought of it front and centre.
In terms of cooking ... I feel like cooking is potentially easier because, for the most part (with some exceptions), once I know food hygiene and how to cook things, it is an additive process. Chicken plus curry plus rice. Software is like this too, until it isn't. An Excel sheet does a great simple budget, but not a full accounting suite; with the latter you get bogged down fixing bugs in the sheet as you try to use it.
I think it is good you are researching this, as these are probably solvable problems for many cases.
Something I have always thought is that sometimes it matters less whether the software is open source than whether the file format is. Then people can extend things by building more around the file format. A tool might work on part of the format where an app works on all of it. I use free tools to sign PDFs, for example.
Also adding that software being inflexible only because it's mass-produced describes the pre-enshittification era, which we have already left behind.
For at least the last decade, software has often been designed as an explicit means of power over users, and applications are made deliberately inflexible to, e.g., coerce users into watching ads, purchasing goods or services, or simply staying at the screen longer than intended.
(Even before that, this was already the case in niches, especially "shareware". But in a sense, all commercial software is shareware now.)
> But in the long run, our tools also shape our culture. If software makes people feel more empowered, I believe that’ll eventually change people’s preferences.
I'm really curious to see how the overlap with BABLR plays out. In many ways we're doing the same experiments in parallel: we're both working on systems that have a natural tendency to become their own version control, and which try to say what the data is without prejudice as to how it might be presented.
In particular BABLR thinks it can narrow and close the ease-of-use gap between "wire up blocks" style programming and "write syntax out left to right" style programming by making a programming environment that lets you wire up syntax tree nodes as blocks.
It's still quite rough, but we have a demo that shows off how we can simplify the code editing UX down to the point where you can do it on a phone screen:
Try tapping a syntax node in the example code to select that node. Then you can tap-drag the selected (blue) node and drop it into any gap (gray square). The intent is to ensure that you can construct incomplete structures, but never outright invalid ones.
> That’s how software feels today -- the “kitchen” is missing.
I believe you'll want to read this essay which appeared in the Spring 1990 issue of Market Process, a publication of the Center for the Study of Market Processes at George Mason University ...
"An Inquiry into the Nature and Causes of the Wealth of Kitchens"
by Phil Salin
Having worked for him, I'd say his Wikipedia entry doesn't do him justice, but is a good start if you're curious -- like your Ink & Switch group he spent many years trying to create a world-changing software/platform [AMIX, sister co. to Xanadu, both funded in the 1990s by Autodesk].
Look at HyperCard (more or less dead, regrettably) or Excel and you'll see many useful "applications" created by non-programmers over the years.
People want to create, but need tools to make this easier / more abstract than regular programming. Most companies want to get them into their walled gardens instead, especially web-based companies today.
We're happy to share the Embark prototype with anyone who wants to try it out - just email me at [email protected] and I can share a link with you.
A couple reasons we've decided not to share the demo widely: 1) it's research software, not developed to the quality standards of a commercial product, so we don't want people to get confused or disappointed by that, 2) the prototype heavily uses the paid Google Maps API.
We've also publicly released demos of some related work, like Potluck, an interactive medium built on text notes:
I tried one task head-to-head with Codex o4-mini vs Claude Code: writing documentation for a tricky area of a medium-sized codebase.
Claude Code did great and wrote pretty decent docs.
Codex didn't do well. It hallucinated a bunch of stuff that wasn't in the code, and completely misrepresented the architecture - it started talking about server backends and REST APIs in an app that doesn't have any of that.
I'm curious what went so wrong - feels like possibly an issue with loading in the right context and attending to it correctly? That seems like an area that Claude Code has really optimized for.
I have high hopes for o3 and o4-mini as models so I hope that other tests show better results! Also curious to see how Cursor etc. incorporate o3.
Claude Code still feels superior. o4-mini has all sorts of issues. o3 is better but at that point, you aren't saving money so who cares.
I feel like people are sleeping on Claude Code for one reason or another. It's not cheap, but it's by far the best, most consistent experience I have had.
These days I’m using Amazon Q Pro on the CLI. Very similar experience to Claude Code minus a few batteries. But it’s capped at $20/mo and won’t set my credit card on fire.
Seems 4x costlier than my Aider+Openrouter. Since I'm less about vibes or huge refactoring, my (first and only) bill is <5 usd with Gemini. These models will halve that.
No, Amazon Q is using Amazon Q. You can't change the model, it's calling itself "Q" and it's capped to $20 (Q Developer Pro plan). There is also a free tier available - https://aws.amazon.com/q/developer/
It's very much a "Claude Code" in the sense that you have a "q chat" command line command that can do everything from changing files, running shell commands, reading and researching, etc. So I can say "q chat" and then tell it "read this repo and create a README" or whatever else Claude Code can do. It does everything by itself in an agentic way. (I didn't want to say like 'Aider' because the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change)
(It's calling itself Q but from my testing it's pretty clear that it's a variant of Claude hosted through AWS which makes sense considering how much money Amazon pumped into Anthropic)
> the entire appeal of Claude Code is that it does everything itself, like figuring out what files to read/change
how is this appealing? I think I must be getting old because the idea of letting a language model run wild and run commands on my system -- that's unsanitized input! -- horrifies me! What do you mean, just let it change random files??
It shows you the diff and you confirm it, asks you before running commands, and doesn't allow accessing files outside the current dir. You can also tell it to not ask again and let it go wild, I've built full features this way and then just go through and clean it up a bit after.
In the OpenAI demo of Codex they said that it’s sandboxed.
It only has access to files within the directory it’s run from, even if it calls tools that could theoretically access files anywhere on your system. Networking is also blocked, in a similarly sandboxed fashion, so that things like curl don’t work either.
I wasn’t particularly impressed with my short test of Codex yesterday. Just the fact that it managed to make any decent changes at all was good, but when it messed up the code it took a long time and a lot of tokens to figure out.
I think we need fine tuned models that are good at different tasks. A specific fine tune for fixing syntax errors in Java would be a good start.
In general it also needs to be more proactive in writing and running tests.
I don’t know what Amazon did, but I use Aider+Openrouter with Gemini 2.5 Pro and it costs 1/6 of what Sonnet 3.7 does. The Aider leaderboard https://aider.chat/docs/leaderboards/ includes relative pricing these days.
> Upgrade apps in a fraction of the time with the Amazon Q Developer Agent for code transformation (limit 4,000 lines of submitted code per month)
4k loc per month seems terribly low? Any request I make could easily go over that. I feel like I'm completely misunderstanding (their fault though) what they actually meant.
Edit: No I don't think I'm misunderstanding, if you want to go over this they direct you to a pay-per-request plan and you are not capped at $20 anymore
You are confusing Amazon Q in the editor (like "transform"), and Amazon Q on the CLI. The editor thing has some stuff that costs extra after exceeding the limit, but the CLI tool (that acts similar to Claude Code) is a separate feature that doesn't have this restriction. See https://aws.amazon.com/q/developer/pricing/?p=qdev&z=subnav&..., under "Console" see "Chat". The list is pretty accurate with what's "included" and what costs extra.
I've been running this almost daily for the past months without any issues or extra cost. Still just paying $20
Do try! The free tier doesn't cost anything and is enough to tinker around with. You don't even need an AWS account for it, it'll prompt you to create a new separate account specifically for Q
Compared to Cline, Aider had no chance the last time I tried it (4 months ago). Has it really changed? I always thought Cline was superior because it focuses on Sonnet with all its bells and whistles, while Aider tries to be a universal IDE coding agent that works well with all models.
When I try Gemini 2.5 Pro exp with Cline it does very well, but it often fails to use the tools Cline provides; it's way less expensive, but it fails random basic tasks Sonnet does in its sleep. I pay the extra to save the time.
Don't get me wrong; maybe I am totally outdated in my opinion. It is hard to keep up these days.
I tried Cline, but I work faster using the command line style of Aider.
Having the /run command to execute a script, with the console output added to the prompt, makes fixing bugs very fast.
speaking for myself, I am happy to make that trade. As long as I get unrestricted access to latest one. Heck, most of my code now is written by gemini anyway haha.
Once you get the hang of controlling costs, it's much cheaper. If you're exhausting the context window, I'm not surprised you're seeing high costs.
Be aware of the "cache".
Tell it to read specific files; never use /compact (that busts the cache — if you feel you need it, you're going back and forth too much or using too many files at once).
Never edit files manually during a session (that'll bust cache). THIS INCLUDES LINT.
Have a clear goal in mind and keep sessions to as few messages as possible.
Write / generate markdown files with needed documentation using claude.ai, and save those as files in the repo and tell it to read that file as part of a question.
I'm at about ~$0.5-0.75 for most "tasks" I give it. I'm not a super heavy user, but it definitely helps me (it's like having a super focused smart intern that makes dumb mistakes).
If i need to feed it a ton of docs etc. for some task, it'll be more in the few $, rather than < $1. But I really only do this to try some prototype with a library claude doesn't know about (or is outdated).
For hobby stuff, it adds up - totally.
For a company, massively worth it. Insanely cheap productivity boost (if developers are responsible / don't get lazy / don't misuse it).
Sure, it might cost a few dollars here and there. But what I've personally been getting from it, for that cost, is so far away from "expensive" it's laughable.
Not only does it do things I don't want to do, in a _super_ efficient manner. It does things I don't know how to do - contextually, within my own project, such that when it's done I _do_ know how to do it.
Like others have said - if you're exhausting the context window, the problem is you, not the tool.
Example, I have a project where I've been particularly lazy and there's a handful of models that are _huge_. I know better than to have Claude read those models into context - that would be stupid. Rather - I tell it specifically what I want to do within those models, give it specific method names and tell it not to read the whole file, rather search for and read the area around the method definition.
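The "read only the area around the method" idea is easy to sketch outside the tool, too. Here's a toy Python illustration of the kind of excerpting I mean (this is my own sketch, not a Claude Code feature; the function name is made up):

```python
def region_around(source: str, needle: str, context: int = 20) -> str:
    """Return only the lines surrounding the first line containing `needle`,
    so a huge file can be handed to an assistant without loading it all."""
    lines = source.splitlines()
    for i, line in enumerate(lines):
        if needle in line:
            start = max(0, i - context)          # clamp at top of file
            end = min(len(lines), i + context + 1)  # clamp at bottom
            return "\n".join(lines[start:end])
    return ""  # needle not found; share nothing rather than everything
```

Pipe the result into your prompt instead of the whole model file and the context stays small.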
If you _do_ need it to work with very large files - they probably shouldn't be that large and you're likely better off refactoring those files (with Claude, of course) to abstract out where you can and reduce the line count. Or, if anything, literally just temporarily remove a bunch of code from the huge files that isn't relevant to the task so that when it reads it it doesn't have to pull all of that into context. (ie: Copy/paste the file into a backup location, delete a bunch of unrelated stuff in the working file, do your work with claude then 'merge' the changes to the backup file and copy it back)
If a few dollars here and there for getting tasks done is "too expensive" you're using it wrong. The amount of time I'm saving for those dollars is worth many times the cost and the number of times that I've gotten unsatisfactory results from that spending has been less than 5.
I see the same replies to these same complaints everywhere - people complaining about how it's too expensive or becomes useless with a full context. Those replies all state the same thing - if you're filling the context, you've already screwed it up. (And also, that's why it's so expensive)
I'll agree with sibling commenters - have claude build documentation within the project as you go. Try to keep tasks silo'd - get in, get the thing done, document it and get out. Start a new task. (This is dependent on context - if you have to load up the context to get the task done, you're incentivized to keep going rather than dump and reload with a new task/session, thus paying the context tax again - but you also are going to get less great results... so, lesson here... minimize context.)
100% of the time that I've gotten bad results/gone in circles/gotten hallucinations was when I loaded up the context or got lazy and didn't want to start new sessions after finishing a task and just kept moving into new tasks. If I even _see_ that little indicator on the bottom right about how much context is available before auto-compact I know I'm getting less-good functionality and I need to be careful about what I even trust it's saying.
It's not going to build your entire app in a single session/context window. Cut down your tasks into smaller pieces, be concise.
If this is truly your perspective, you've already lost the plot.
It's almost always the user's fault when it comes to tools. If you're using a tool and it's not doing its job well, it's more likely that you're using it wrong than that it's a bad tool. Almost universally.
Right tool for the job, etc etc. Also important that you're using it right, for the right job.
Claude Code isn't meant to refactor entire projects. If you're trying to load up 100k token "whole projects" into it - you're using it wrong. Just a fact. That's not what this tool is designed to do. Sure.. maybe it "works" or gets close enough to make people think that is what it's designed for, but it's not.
Detailed, specific work... it excels, so wildly, that it's astonishing to me that these takes exist.
In saying all of that, there _are_ times I dump huge amounts of context into it (Claude, projects, not Claude Code - cause that's not what it's designed for) and I don't have "conversations" with it in that manner. I load it up with a bunch of context, ask my question/give it a task and that first response is all you need. If it doesn't solve your concern, it should shine enough light that you now know how you want to address it in a more granular fashion.
The unpredictable non-deterministic black box with an unknown training set, weights and biases is behaving contrary to how it's advertised? The fault lies with the user, surely.
A junior developer is skilled too, but still requires a senior’s guidance to keep them focused and on track. Just because a tool has built in intelligence doesn’t mean it can read your intentions from nothing if you fail to communicate to it well.
How can one develop this skill via trial and error if the cost is unknowably high? Before reasoning models, this mattered less because tokens were cheap; but with mixing models, some models being expensive to use, and reasoning blowing up the cost, having to pay even five bucks to make a mistake makes the cost seem higher than the value.
A little predictability here would go a long way in growing the use of these capabilities, and so one should wonder why cost predictability doesn’t seem to be important to the vendors - maybe the value isn’t there, or is only there for the select few that can intuit how to use the tech effectively.
Thanks for sharing. Are you able to control the context when using Claude Code, or are you using other tools that give you greater control over what context to provide? I haven't used Claude Code enough to understand how smart it is at deciding what context to load itself and if you can/need to explicitly manage it yourself.
I like the scribe analogy. And, just like a scribe, my primary complaint with claude code isn't the cost or the context - but the speed. It's just so slow :D
True. Matches my experience. It takes a lot of effort to get really proficient with AI. It's like learning to ride a wild horse: your senior dev skills will come in handy on this ride, but don't expect it to work like some Google query.
Sometimes I see areas where AI/LLMs are absolutely crushing jobs: a whole category will be gone in the next 5 to 10 years, as they are already at the 80-90% mark. They just need another 5-10%, as they continue to improve, and they are already cheaper per task.
Sometimes I see an area of AI/LLM where I think even a 10x efficiency improvement with 10x the hardware resources (100x in aggregate) would still be nowhere near good enough.
The truth is probably somewhere in the middle, which is why I don't believe AGI will be here any time soon. But Assisted Intelligence is no doubt in its iPhone moment, and it will continue for another 10 years before, hopefully, another breakthrough.
There was one post that detailed how those OpenAI models hallucinate and double down on their mistakes by "lying"; it speculated on a bunch of interesting reasons why this may be the case.
> I no longer have the “real” prime I generated during that earlier session... I produced it in a throw‑away Python process, verified it, copied it to the clipboard, and then closed the interpreter.
AGI may well be on its way, as the model is mastering the fine art of bullshitting.