Totally agree. I've found that in many cases it's easier to roll your own software, or patch existing software, with AI than to open an issue, submit a PR, get it reviewed/merged, etc. Let alone buying software.
Yes, but this is the honeymoon period. A year from now, when you want to make three of the tools talk to each other and they're in three different languages, two of which you don't know, and there's no common interface or good place to put one, well, here's hoping you hung onto the design documents.
Maybe I'm just naive, but I've been making lots of my 'vibe-coded' tools interoperable already.
My assumption is that eventually the VC-backed gravy train of low-cost, good-quality LLM compute is going to dry up, and I'm going to have to make do with what I've gotten out of them.
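One cheap way to get the interoperability the parent describes, sketched here as an assumption rather than anything they say they actually do: treat every vibe-coded tool as a black-box process that speaks JSON over stdin/stdout, so the implementation language stops mattering. A minimal Python driver might look like this (the `echo_tool` one-liner stands in for any external tool and is purely illustrative):

```python
import json
import subprocess
import sys

def call_tool(cmd, payload):
    """Invoke a tool (written in any language) as a subprocess,
    sending a JSON request on stdin and parsing a JSON response
    from its stdout. The only "interface" is the JSON contract."""
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload).encode(),
        stdout=subprocess.PIPE,
        check=True,
    )
    return json.loads(proc.stdout)

# Stand-in "tool" (hypothetical): a one-liner that echoes the request
# back with a field added. In practice this could be a Go binary, a
# Node script, or anything else that honors the same contract.
echo_tool = [
    sys.executable, "-c",
    "import json,sys; d=json.load(sys.stdin); d['ok']=True; print(json.dumps(d))",
]

result = call_tool(echo_tool, {"task": "summarize", "text": "hello"})
print(result)
```

The design choice here is that each tool stays a standalone executable, so there's no shared runtime to maintain when the compute gravy train ends.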
The finding that self-generated skills actually hurt (-1.3pp) while curated skills give +16.2pp is the most interesting result here imo. It's a big discrepancy, but it makes sense: it aligns with the idea that LLMs are better consumers of procedural knowledge than producers of it.
+4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.
> +4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare.
This stood out for me as well. I do think that LLMs have a lot of training data on software engineering topics and that perhaps explains the large discrepancy. My experience has been that if I am working with a software library or tool that is very new or not commonly used, skills really shine there. Example: Adobe React Spectrum UI library. Without skills, Opus 4.6 produces utter garbage when trying to use this library. With properly curated/created skills, it shines. Massive difference.
I feel similarly... OpenClaw has lots of vulnerabilities and it's very messy, but it also brought self-hosted, cron-based agentic workflows to your favorite messaging channel (iMessage, Telegram, Slack, WhatsApp, etc.), which shouldn't be overlooked.
Agreed, my experience and code quality with Claude Code and agentic workflows have dramatically improved since I invested in learning how to use these tools properly. Ralph Wiggum-based approaches and HumanLayer's agents/commands (in their .claude/) have boosted my productivity the most.
https://github.com/snwfdhmp/awesome-ralph
https://github.com/humanlayer
Built this because I wanted Claude Code to run untrusted snippets without touching my system, but Docker felt heavy. Uses jail.nix (bubblewrap) for isolation. Currently supports Python, Node, Bash with persistent REPL sessions.
Would love feedback on the interface design.
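For anyone unfamiliar with bubblewrap, here is a rough sketch of the kind of invocation a sandboxed REPL runner might build. Every flag below is a real `bwrap` option, but the specific policy (which paths are bound, the `/nix` mount, the `/work` session directory) is my assumption, not jail.nix's actual configuration:

```python
# Sketch (assumption, not the project's actual config): building a
# bubblewrap argv for running an untrusted interpreter.
def bwrap_argv(interpreter, workdir):
    return [
        "bwrap",
        "--unshare-all",             # fresh namespaces: no host network, pids, ipc
        "--die-with-parent",         # tear down the sandbox if the runner dies
        "--ro-bind", "/usr", "/usr", # read-only toolchain
        "--ro-bind", "/nix", "/nix", # Nix store, if present (assumption)
        "--proc", "/proc",
        "--dev", "/dev",
        "--tmpfs", "/tmp",
        "--bind", workdir, "/work",  # only the session dir is writable
        "--chdir", "/work",
        "--clearenv",                # don't leak the host environment
        interpreter,
    ]

argv = bwrap_argv("python3", "/var/lib/sandbox/session-1")
print(" ".join(argv))
```

Persistent REPL sessions would then just mean keeping the spawned process (and its `/work` bind) alive between snippets rather than re-exec'ing per call.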