Hacker News | aktau's comments

Any LLM-based code review tooling I've tried has been lackluster (most comments not too helpful). Prose review is usually better.

> So we run dozens of parallel CLI agents that can review the code in excruciating detail. This has completely replaced human code review for anything that isn't functional correctness but is near the same order of magnitude of price. Much better than humans and beats every commercial tool.

Sure, you could make multiple LLM invocations (different temperature, different prompts, ...). But how does one separate the good comments from the bad comments? Another meta-LLM? [1] Do you know of anyone who has written up the approach?

[1]: I suppose you could shard that out for as much compute as you want to spend, with one LLM invocation judging/collating the results of (say) 10 child reviewers.


I have attempted to replicate the "workflow" LLM process, where several LLMs come up with different variations of a solution and a "judge" LLM reviews them, then ran the results through different verification processes to see whether this workflow increased the LLM's accuracy at solving the problem. For me, in my experiments, it didn't make much difference, but at the time I was using LLMs significantly dumber than current frontier models. HOWEVER... when I enable "Thinking Mode" on frontier LLMs like ChatGPT, it DOES tend to solve problems that the non-thinking mode can't, so perhaps it's just a matter of throwing enough iterations at it for the LLM to solve a particular complex problem.

> But how does one separate the good comments from the bad comments?

One thing that works very well for me (in a different context) is to ask it to return two lists:

- Things that I must absolutely fix (bugs, typos, logic mistakes, etc.)

- Lesser fixes and other stylistic improvements

Then I look only at the first list.


You need human alignment on what constitutes a "good" comment. That means consistent rules.

Otherwise, some people feel review is too harsh, other people feel it is not harsh enough. AI does not fix inconsistent expectations.

> But how does one separate the good comments from the bad comments?

If the AI took a valid interpretation of the coding guidelines, it is a legitimate comment. If the AI is being overly pedantic, it is a documentation bug and we change the rules.


For Go, there is https://pkg.go.dev/gvisor.dev/gvisor/tools/checklocks. It is missing some things that C++'s Thread Safety annotations have, but those could be added.
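For reference, the checklocks annotations are plain comments on guarded fields; a minimal sketch (assuming the analyzer is run over the package, e.g. as a `go vet` tool):

```go
package main

import (
	"fmt"
	"sync"
)

// counter guards n with mu. The +checklocks annotation tells gvisor's
// checklocks analyzer to statically verify that mu is held wherever
// n is read or written.
type counter struct {
	mu sync.Mutex
	// +checklocks:mu
	n int
}

func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++ // OK: mu is held here
}

func (c *counter) get() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

func main() {
	c := &counter{}
	c.inc()
	fmt.Println(c.get())
}
```

An access to `c.n` outside a locked section would be flagged by the analyzer at build time, not at run time.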

+1

I once tried learning how to RE with radare2 but got very frustrated by frequent project file corruption (meaning radare2 could no longer open it). The way these project files work(ed?) in radare2 at the time was that it just saved all the commands you executed, instead of the state. This was brittle, in my experience.

I don't have a lot of free time, so I have to leave projects for long periods of time; not being able to restart from a previous checkpoint meant I never actually got further.

IIUC, one of the first things Rizin did was focus on saving the actual state, and on backwards/forwards compatibility. This fact alone made me switch to Rizin. To its credit, my 3-year-old project file still works!

Now for the downside: there is apparently a gap in Windows (32-bit) PE support, causing stack variables to be poorly discovered: https://github.com/rizinorg/rizin/issues/4608. I tested this on radare2, which does not have this bug. I'm hoping this gets fixed in Rizin eventually, at which point I'll continue my RE adventure. Or maybe I should give an AI reverse engineer a try... (https://news.ycombinator.com/item?id=46846101).


Yes, we are working on completely rewriting the analysis[1][2], which should fix your issue along with many others.

[1] https://github.com/rizinorg/rizin/pull/5505

[2] https://github.com/rizinorg/rizin/issues/4736


Can't wait! Do you have any idea how far along this is? Is it likely to be months, quarters, years?

(Funny expression, that. I'll wait, of course. It'll be a happy day when this works again and I can slowly make progress RE'ing again.)


Months.


I tried radare2 with the official GUI Iaito. Iaito saves the project in a git repo, so whenever I got corruption (and I got it a lot, like every 4-5 saves) I was just a `git reset --hard` away from restoring a good state. Not the most efficient way of operating, but for me it was better than tolerating Ghidra's tiny Courier New font.


Thanks for the note.

Your corruption-frequency anecdote matches mine. I don't have the mental wherewithal to deal with that. I won't go back to radare2 until they somehow improve project file stability.


From the GitHub page:

LiteBox is a sandboxing library OS that drastically cuts down the interface to the host, thereby reducing attack surface. It focuses on easy interop of various "North" shims and "South" platforms. LiteBox is designed for usage in both kernel and non-kernel scenarios.

LiteBox exposes a Rust-y nix/rustix-inspired "North" interface when it is provided a Platform interface at its "South". These interfaces allow for a wide variety of use-cases, easily allowing for connection between any of the North--South pairs.

Example use cases include:

  - Running unmodified Linux programs on Windows
  - Sandboxing Linux applications on Linux
  - Running programs on top of SEV-SNP
  - Running OP-TEE programs on Linux
  - Running on LVBS


More links with discussion:

Reddit discussion: https://www.reddit.com/r/linux/comments/1qw4r71/microsofts_n...

Project lead James Morris announcing it on social.kernel.org: https://social.kernel.org/notice/B2xBkzWsBX0NerohSC


FYI, I am not the project lead for Litebox. It is led by Microsoft Research.


Sorry about that, I can no longer edit my comment.

Do you have any relation with the project apart from working at the same company?


> - Running unmodified Linux programs on Windows

This might actually be my favourite use: I always thought WSL2 was a kludge, and WSL1 somewhat the fulfilment of the "personality modules" promise of Windows NT.


Yup, WSL feels closer to Services for Unix, which has been around since NT 4/5.

It was sad to see WSL2 take the path of least resistance; that decision has always felt TPM-driven ("we got unexpected success with WSL and people are asking for more, deliver xxx by Q4! No, I don't care _how_ you do it!").


The OS "personality" concept goes back further than NT, I believe. But my memory is fuzzy on this.

Edit! Memory unfuzzed: It was Workplace OS, https://en.wikipedia.org/wiki/Workplace_OS


I know that, but Windows NT actually succeeded.


Is this WSLv1.2 (WSLv1 redux), now a more general cross-platform library-firewall type of thing?


The amount of techno jargon marketing speak in this readme is impressive. I’m pretty well versed in most things computers, but it took me a long time to figure out what the heck this thing is good for. Leave it to Microsoft to try to rename lots of existing ideas and try to claim they’ve invented something amazing when it’s IMHO not all that useful.


This is great. I'd love to do something similar for Ground Control (2000, https://en.wikipedia.org/wiki/Ground_Control_(video_game)).

Do you have a writeup of how you did it? Both (regular) tooling (radare2? rizin? IDA? ...) and how the LLM did (or did not) use it?

In the little spare time I have, I've been able to reverse engineer the "compressed" file format (which ended up being basically an XOR'ed, zlib-compressed, TAR-like archive), but not much else. I have not used LLMs to help me.


> I changed read/write_file to put contents in the state and presented in the system prompt, same for the agents.md, now working on evals to show how much better this is, because anecdotally, it kicks ass.

Can you detail this a bit more? Do you put the actual contents of the file in the system prompt? Forever?


That matches what I recall too, back when I ran a very cheap integrated Intel card (at least, that's what I recall) on my underpowered laptop. I posted a few days ago with screenshots of my 2009 setup with awesome+xcompmgr, and I remember it being very snappy (much more so than my tuned Windows XP install at the time). https://news.ycombinator.com/item?id=46717701



I still have some screenshots in my GitHub repository of what my ArchLinux with AwesomeWM (X11) looked like in 2009.

Those screenshots also contain the RSS, as luck would have it.

34MiB when on the desktop (clean), running X.org, AwesomeWM and xcompmgr (for compositing). Screenshot: https://github.com/aktau/awesome/blob/master/screenshots/200...

57MiB with a couple of applications open. From memory: urxvt running htop, thunar (XFCE file manager) and the Mirage image viewer (which is Python, not otherwise known for efficiency). Screenshot: https://github.com/aktau/awesome/blob/master/screenshots/200...

Nowadays, even with a tiling WM that's supposed to be lightweight (say, Sway), the minimum appears to be well over 300MiB (see https://www.reddit.com/r/linux/comments/1njecy5/wayland_comp...). GNOME 49 took up around 1GiB the last time I tried it (NixOS). Interestingly, https://www.reddit.com/r/swaywm/comments/oghner/how_does_the... from 5 years ago mentions Sway using only 115MiB. What happened?

Theories I have:

  - 32-bit to 64-bit means all pointers are double the size. That would account for something.
  - Wayland vs X11. I should compare Sway versus X.org+i3.
  - General library and system daemon bloat.


> They were using tori primitives for the chain mail, and curve primitives for the clothing. (I.e., the clothing was actually woven out of curve primitives for the threads.)

That sounds mind-blowing. Is this documented anywhere?

