Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.
> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.
Sure, you could make multiple LLM invocations (different temporature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?
[1]: I suppose you could shard that out for as much compute you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
I have attempted to replicate the "workflow" LLM process where several LLMs come up with different variations of a way to solve a problem and a "judge" LLM reviews them and the go through different verification processes to see if this workflow increased the accuracy of the LLM's ability to solve the problem. For me, in my experiments, it didn't really make much difference but at the time I was using LLMs significantly dumber than current frontier models. HOWEVER...When I enable "Thinking Mode" on frontier LLM's like ChatGPT it DOES tend to solve problems that the non-thinking mode isn't able to solve so perhaps it's just a matter of throwing enough iterations at it for the LLM to be able to solve a particular complex problem.
You need human alignment on what constitutes a "good" comment. That means consistent rules.
Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.
> But how does one separate the good comments from the bad comments?
If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.
I once tried learning how to RE with radare2 but got very frustrated by frequent project file corruption (meaning radare2 could no longer open it). The way these project files work(ed?) in radare2 at the time was that it just saved all the commands you executed, instead of the state. This was brittle, in my experience.
I don't have a lot of free time, so I have to leave projects for long periods of time, not being able to restart from a previous checkpoints meant I never actually got further.
IIUC, one of the first things Rizin did was focus on saving the actual state, and backwards/forwards-compatibility. This fact alone made me switch to Rizin. To its credit, my 3-year old project file still works!
Now for the downside: there is apparently a gap in Windows (32-bit) PE support, causing stack variables to be poorly discovered: https://github.com/rizinorg/rizin/issues/4608. I tested this on radare2, which does not have this bug. I'm hoping this gets fixed in Rizin at some point, at which point I'll continue my RE adventure. Or maybe I should give an AI reverse engineer a try... (https://news.ycombinator.com/item?id=46846101).
I tried radare2 with the official GUI Iaito. Iaito saves the project in a git repo, so whenever I got corruption (and I got it a lot, like every 4-5 saves) I was just a `git reset --hard` away from restoring a good state. Not the most efficient way of operation, but for me it was better this than tolerating Ghidra's tiny Courier New font.
Your corruption frequency anecdote matches mine. I don't have the mental werewithal to deal with that. I won't go back to radare2 until they change their project file stability somehow.
LiteBox is a sandboxing library OS that drastically cuts down the interface to the host, thereby reducing attack surface. It focuses on easy interop of various "North" shims and "South" platforms. LiteBox is designed for usage in both kernel and non-kernel scenarios.
LiteBox exposes a Rust-y nix/rustix-inspired "North" interface when it is provided a Platform interface at its "South". These interfaces allow for a wide variety of use-cases, easily allowing for connection between any of the North--South pairs.
Example use cases include:
- Running unmodified Linux programs on Windows
- Sandboxing Linux applications on Linux
- Run programs on top of SEV SNP
- Running OP-TEE programs on Linux
- Running on LVBS
This might actually be my favourite use: I always thought WSL2 was a kludge, and WSL1 to be somewhat the fulfilment of the "personality modules" promise of Windows NT.
Yup WSL feels closer to the Services for Unix which has been around since NT 4/5.
It was sad to see WSL2 taking the path of least resistance, that decision has always felt TPM driven ("we got unexpected success with WSL and people are asking for more, deliver xxx by Q4! No I don't care _how_ you do it!")
The amount of techno jargon marketing speak in this readme is impressive. I’m pretty well versed in most things computers, but it took me a long time to figure out what the heck this thing is good for. Leave it to Microsoft to try to rename lots of existing ideas and try to claim they’ve invented something amazing when it’s IMHO not all that useful.
Do you have a writeup of how you did it? Both (regular) tooling (radare2? rizin? IDA? ...) and how the LLM did (or did not) use it?
In the little spare time I have, I've been able to reverse engineer the "compressed" file format (ended up being basically a XOR'ed zlib-compressed TAR-like archive), but not much else. I have not used LLMs to help me.
> I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass.
Can you detail this a bit more? Do you put the actual contents of the file in the system prompt? Forever?
That matches what I recall too, back when I ran a very cheap integrated intel (at least that's what I recall) card on my underpowered laptop. I posted a few days ago with screenshots of my 2009 setup with awesome+xcompmgr, and I remember it being very snappy (much more so than my tuned Windows XP install at the time). https://news.ycombinator.com/item?id=46717701
57MiB with a couple of applications open. From memory: urxvt running htop, thunar (XFCE file manager) and the Mirage image viewer (which is Python, not otherwise known for efficiency). Screenshot: https://github.com/aktau/awesome/blob/master/screenshots/200...
- 32-bit to 64-bit means all points are double the size. That would account for something.
- Wayland vs X11. I should compare Sway versus X.org+i3.
- General library and system daemon bloat.
> They were using tori primitives for the chain mail, and curve primitives for the clothing. (I.e., the clothing was actually woven out of curve primitives for the threads.)
That sounds mind-blowing. Is this documented anywhere?
> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.
Sure, you could make multiple LLM invocations (different temporature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who summarizes the approach?
[1]: I suppose you could shard that out for as much compute you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.
reply