
davidsong · 2026-05-23T11:22:31 1779535351

Thanks. I'm downloading all these now and will do a proper pass and compare outputs for correctness once complete :)

davidsong · 2026-05-14T03:51:03 1778730663

The proper way to work with Claude or Codex is, IMO, to load up the context with a discussion about what you're doing and why. You go back and forth, pushing back on its opinions and shaping the context until the tokens are ready to flow into the right shape. Every angle you miss is an opportunity for them to slop out all over the place, and, until Codex was mature, the longer you ran the task for, the more it'd spread out and lose shape.

Re-shaping the context sometimes involves severe pressures like "wtf is this ugly crap?" or "did I just spot you laying a turd in my codebase again?" and other strong forms of disapproval, mixed with "hmm not sure I like the sound of that"s, to "yea that's much better" to pull it back in the other direction.

The trick is to shape the flow before the tide comes in and you end up like King Canute.

davidsong · 2026-05-14T03:34:51 1778729691

I didn't actually read any code. I generated spec documents using Claude, then later on used Codex to generate from the spec docs. Are the specs tainted? If someone else independently develops from my spec, is that also tainted? What if they hear it second hand? It's an interesting legal situation for sure.

davidsong · 2026-05-14T03:29:23 1778729363

Yeah the main things are DoS attacks and path traversal issues. I intentionally guarded against these with resource limits and checks, but I can't guarantee that it's safe. I mean, basically anyone who carefully reads it knows more about it than me - you play the AI slot machine at this scale and who knows what prizes you'll win!
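For anyone wondering what that kind of guard can look like, here is a minimal sketch in Python. It is not the project's actual code: `safe_join`, `ExtractionBudget`, and both limits are hypothetical names and values chosen for illustration. The idea is to resolve every member name against the destination directory and reject anything that lands outside it, and to count entries and output bytes so a hostile archive cannot expand without bound.

```python
import os

MAX_TOTAL_BYTES = 1 << 30  # hypothetical cap on total extracted bytes (1 GiB)
MAX_ENTRIES = 100_000      # hypothetical cap on the number of archive members


def safe_join(dest_dir, member_name):
    """Return the absolute output path for member_name, or raise ValueError
    if it would land outside dest_dir (path traversal)."""
    if "\x00" in member_name:
        raise ValueError("NUL byte in member name")
    # Archives created on Windows may use backslashes as separators.
    name = member_name.replace("\\", "/")
    dest_root = os.path.realpath(dest_dir)
    # realpath collapses ".." components and resolves existing symlinks.
    target = os.path.realpath(os.path.join(dest_root, name))
    if target == dest_root or os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError(f"unsafe member path: {member_name!r}")
    return target


class ExtractionBudget:
    """Tracks how much an extraction has produced and aborts past fixed caps."""

    def __init__(self, max_bytes=MAX_TOTAL_BYTES, max_entries=MAX_ENTRIES):
        self.max_bytes = max_bytes
        self.max_entries = max_entries
        self.bytes_written = 0
        self.entries = 0

    def start_entry(self):
        """Call once per archive member before writing it."""
        self.entries += 1
        if self.entries > self.max_entries:
            raise ValueError("too many entries in archive")

    def charge(self, nbytes):
        """Call before writing nbytes of decompressed output."""
        self.bytes_written += nbytes
        if self.bytes_written > self.max_bytes:
            raise ValueError("extracted data exceeds size limit")
```

Charging the budget per decompressed chunk, rather than trusting the size declared in the header, is what stops a decompression bomb whose header lies about its size.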

davidsong · 2026-05-14T03:23:50 1778729030

Thanks!

I guess you could save the state to a file on SIGINT, flush what's been written and pick it back up again if the state file exists when you restart, and use the CRCs of the files to abort if things have changed. I don't fancy doing that for so many versions of RAR, but it would be a cool feature to add it to an `xz` fork. I like the idea.
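Roughly, that idea could be sketched like this in Python. This is only an illustration under loud assumptions: a plain chunked copy stands in for the decompressor, the "state" is just a byte offset (a real RAR or xz decoder would also have to serialise its dictionary window and model state), and `STATE_PATH`, `copy_resumable`, and the other names are hypothetical.

```python
import json
import os
import signal
import zlib

STATE_PATH = "extract.state.json"  # hypothetical state-file name


def crc32_of(path, chunk_size=1 << 16):
    """CRC-32 of a file, read in chunks so large inputs are not loaded whole."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def save_state(state_path, src, offset):
    """Write the checkpoint atomically: temp file first, then rename over it."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"src_crc": crc32_of(src), "offset": offset}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, state_path)


def load_state(state_path, src):
    """Return the saved offset (0 if no checkpoint). Abort if src has changed."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        state = json.load(f)
    if state["src_crc"] != crc32_of(src):
        raise RuntimeError("input changed since checkpoint; refusing to resume")
    return state["offset"]


def sigint_flag():
    """Install a SIGINT handler that only sets a flag; return a poll function."""
    hit = {"stop": False}
    signal.signal(signal.SIGINT, lambda signum, frame: hit.update(stop=True))
    return lambda: hit["stop"]


def copy_resumable(src, dst, should_stop, state_path=STATE_PATH, chunk_size=1 << 16):
    """Copy src to dst, checkpointing when should_stop() turns true.
    Returns True when the copy finished, False when it was interrupted."""
    offset = load_state(state_path, src)
    if not os.path.exists(dst):
        offset = 0  # nothing to resume into, so start over
    with open(src, "rb") as fin, open(dst, "r+b" if offset else "wb") as fout:
        fin.seek(offset)
        fout.seek(offset)
        fout.truncate()  # drop anything written after the checkpoint
        finished = False
        while not should_stop():
            chunk = fin.read(chunk_size)
            if not chunk:
                finished = True
                break
            fout.write(chunk)
            offset += len(chunk)
        fout.flush()
        os.fsync(fout.fileno())  # flush what's been written before saving state
    if finished:
        if os.path.exists(state_path):
            os.remove(state_path)
        return True
    save_state(state_path, src, offset)
    return False
```

A caller would run `copy_resumable(src, dst, sigint_flag())` from the main thread. Re-running the same command after Ctrl-C picks up from the saved offset, and raises instead of resuming if the input's CRC no longer matches.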

Imustaskforhelp · 2026-05-14T07:18:16 1778743096

Thanks. It would still be interesting to see this added to xz, but given that LLMs were able to create the rars project, I suppose it might not be that difficult to add it to the RAR format eventually. Starting with xz might make the most sense if you like the idea, though.

Another idea for the RAR format that I'd love to hear your opinion on: sometimes an archive is split into multiple volumes, .part01, .part02, .part03 and so on.

I have found that when you try to unrar it, it requires all of the parts to be present at once.

It would be really beneficial, IMO, if it were possible to unrar just .part01 without requiring the contents of .part02, .part03, etc.

But from my very limited understanding, you also need some data (I think the last contents) from all of the files for the decompression to work.

Would it be possible to do something like this, so that you don't need all the parts themselves but just something like a patch from the end? I'm not sure enough about compression algorithms to know whether that's possible, but it felt like something that might be doable, albeit difficult, with the RAR format.

I would be curious to hear your opinions on it. Thanks for responding, and I would be really interested in seeing the xz fork that you mentioned!

davidsong · 2026-05-14T03:19:31 1778728771

I compressed thousands of files and went through libarchive's and Sembiance's test data, at least for the decompressor side. I recompressed the files and round-tripped them against 7zip, unrar, and every later version of WinRAR. It failed a lot at the start, and Codex burned a lot of tokens instrumenting the binaries and dividing and conquering until things settled down and round-trips worked properly.

I can't really say it works in every case, as I honestly didn't spend that much time on it. But it works in the majority of cases. There are likely some nasty bugs hiding in there.

davidsong · 2026-05-14T03:12:17 1778728337

I generally don't anyway. Since the WTFPL came out I've been licensing under that with a warranty clause (don't blame me).

My main goal here was an experiment to see how far I could push the technique, and learn things along the way. Regardless of whether people dare to use it commercially or not, we have interoperability for the foreseeable future. As an archivist/computing historian I think that's important.

davidsong · 2026-05-14T03:08:07 1778728087

Well, it is every version of RAR. Documenting the quirks of RAR 1.4, 1.5, 2.0, 2.9, 3.0, 4.0, 5.0 and 7.0, multiple compression strategies, PPMd, RARVM, compression levels, encryption, multi-volume support, a huge test corpus, round trips for compatibility... The spec docs are linked.

xphos · 2026-05-15T12:30:28 1778848228

You're probably right, I should read the spec, but I've worked with so many brilliant engineers and I am still pretty young, so there must be even more. I just think the overselling of complexity is usually what makes things enterprise grade :) (if you know you know)

davidsong · on June 6, 2019

Thanks for doing this.

It's a pretty big file, so I compressed it down to 11 MB:

https://www.docdroid.net/A5g0tzk/les-animaux-compressed.pdf

davidsong · on June 6, 2019

When I was six I learned that variables are little boxes, strings are made of bunting and there was some kind of octopus called INKEY$ that grabbed keys from the keyboard.

I think the most important thing to learn is enthusiasm; if you've not got that, then proper methods are of no use. This book looks like it'll be a great tool for nurturing that.