Hacker News | new | past | comments | ask | show | jobs | submit | xorvoid's comments

Thank you Michael Rabin for your excellent work. Rest in Peace.

Rabin Fingerprinting is one of my favorites of his contributions. It's a "rolling hash" that allows you to quickly compute a 32-bit (or larger) hash at *every* byte offset of a file. It is used most notably to do file block matching/deduplication when those matching blocks can be at any offset. It's tragically underappreciated.
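The rolling idea can be sketched in a few lines. This toy version (names like `rolling_hashes` are mine, not from any library) does its arithmetic with integers modulo a prime; true Rabin fingerprints work in GF(2)[x] modulo an irreducible polynomial, but the rolling update has the same shape:

```python
# Toy rolling hash in the spirit of Rabin fingerprinting. Real Rabin
# fingerprints do this arithmetic in GF(2)[x] modulo an irreducible
# polynomial; integers mod a prime give the same rolling structure.

BASE = 256
MOD = (1 << 31) - 1  # a Mersenne prime

def rolling_hashes(data: bytes, window: int):
    """Yield (offset, hash) for every window-sized span of data."""
    if len(data) < window:
        return
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:window]:
        h = (h * BASE + b) % MOD
    yield 0, h
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + data[i]) % MOD          # shift in the newest byte
        yield i - window + 1, h
```

Each update is O(1), so hashing every byte offset of a file is linear overall, and equal windows anywhere in the file (or across files) get equal hashes, which is what makes offset-independent block deduplication cheap.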

I've been meaning to write up a tutorial as part of my Galois Field series. Someday...

Thank you again!


I recently found his fingerprint algorithm and wrote a utility that uses it to find duplicate MIPS code for decompilation[0] and build unique identifiers that can be used to find duplicates without sharing any potentially copyrighted data[1].

This replaced some O(n²) searches through ASCII text, reducing search time from dozens of seconds to fractions of a second.

0 - https://github.com/ttkb-oss/mipsmatch
1 - https://github.com/ttkb-oss/mipsmatch/wiki/Identifiers


Important to note that FastCDC is about an order of magnitude faster for block deduplication and is generally considered the state of the art for such an approach (speed of computing the hash is more important than an absolutely optimal distribution of hashes).
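The core cut-point idea is simple to sketch. Below is a gear-style content-defined chunker in the family that FastCDC refines; the gear table, mask, and function name are illustrative, not FastCDC's actual constants (FastCDC adds normalized chunk sizes and other refinements on top of this):

```python
import random

# Deterministic 256-entry "gear" table of pseudorandom 32-bit values.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def chunk(data: bytes, mask: int = 0x1FFF) -> list[bytes]:
    """Split data at content-defined boundaries (~mask-sized chunks on average)."""
    out, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if h & mask == 0:            # low bits hit the pattern: cut here
            out.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        out.append(data[start:])
    return out
```

Because each boundary depends only on a small window of recent bytes (old bytes shift out of `h` after 32 steps), an edit near the front of a file only disturbs the chunks around the edit; everything downstream re-synchronizes and dedupes against the old chunks.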

That's where I knew the name from. Thank you!

I wrote a Rabin-Karp implementation in ~2006 as part of the spam and threat scanning stack for the MX Logic mail service. It was incredibly performant, letting us test n bytes against an essentially unlimited number of string signatures in O(n) time.
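That trick still holds up. Here's a bare-bones sketch of multi-pattern Rabin-Karp (my own toy code, not the MX Logic implementation); it assumes equal-length patterns for simplicity, where a real signature scanner would bucket signatures by length:

```python
def rabin_karp_multi(text: bytes, patterns: set[bytes]) -> list[int]:
    """Return offsets in text where any of the (equal-length) patterns occur."""
    base, mod = 256, (1 << 61) - 1
    m = len(next(iter(patterns)))
    assert all(len(p) == m for p in patterns)

    def h(s: bytes) -> int:
        v = 0
        for b in s:
            v = (v * base + b) % mod
        return v

    # Hash every pattern once, up front.
    table: dict[int, list[bytes]] = {}
    for p in patterns:
        table.setdefault(h(p), []).append(p)

    hits: list[int] = []
    if len(text) < m:
        return hits
    top = pow(base, m - 1, mod)
    v = h(text[:m])
    for i in range(len(text) - m + 1):
        # Verify on a hash hit to rule out collisions.
        if v in table and text[i:i + m] in table[v]:
            hits.append(i)
        if i + m < len(text):
            v = ((v - text[i] * top) * base + text[i + m]) % mod
    return hits
```

The rolling update and the table lookup are both O(1) per offset regardless of how many signatures are loaded, which is why the pattern count effectively drops out of the running time.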


I'm working on a data annotation system based around Rabin fingerprints. They're a really neat idea.

I especially like how, if you end up with hash characteristics that you don't like, you can just select a different irreducible Galois polynomial and now you've got a whole new hash algorithm. It's like tuning to a different frequency.

For me it means I don't have to worry about cases where there aren't enough nearby fingerprints for the annotation to adhere to; I can just add or remove polynomials until I get a good density.


Could you send a link to the Galois Field series, please?

I live in a town in the Midwest that just voted down a data center project.

Personally I think it's mostly a proxy vote against bigtech/social-media. People are pretty fed up with their practices but don't have power to act at a national level. But, they DO have power at the local level to show up to town council and talk directly (in-person) to their representatives.

I think the other side of this is that there's this old idea (mostly correct) that municipalities partnering with businesses is good for the community because it brings positive side-effects: jobs, more cashflow in the local economy, etc. This is much less true for data centers. A data center is just a building that uses power and produces heat and by-products, and employment gains are generally tiny compared with the old "automaker" labor model of the 1960s-1980s.

People recognize this and they're not happy. They don't think it's a good deal for their communities.


One argument proponents of these datacenters make is that they bring in construction and electrical engineering jobs, even if temporary. That argument falls apart quickly when it's pointed out that, because of the specialization involved, many owners of these facilities hire specific firms and contractors willing to travel across the nation for the work rather than a local construction company or an on-site engineer. Some locations are managed entirely remotely, with one engineer handling multiple sites; the only people actually there are security, maintenance, and cleaning staff, who might themselves be traveling contractors moving between locations in a circuit. That's like six people for something with a footprint the size of fifty family homes.

Even if it were true, I don't understand how this argument makes sense. Assume the construction were entirely done by local businesses/hires. That gives a short-term employment boost while the facility is being built. And then what? The construction is finished, the employment dries up again, but the data center is still there, with all the downsides.

Also, you could get the exact same short-term boost by building something else, e.g. housing.

Offsetting long-term costs with short-term benefits doesn't seem like a good strategy.


This is a good take.

Put a different way, some companies have made a lot of money with business models that hinge on victims never being able to reach a human.

Those same companies want to set up phone centers in the neighborhoods of the people they’ve neglected that also will not take their calls.

Town hall it is.


If your local governments are funded by property taxes, data centers can bring in new revenue while requiring few services. That revenue could be used to shore up stressed public budgets or fund other economic development activities that bring in jobs.

These projects can be developed and located responsibly, and every project is different. I don't think a blanket ban is good policy.


"The person seems to have low self-esteem, displays introversion, poor honesty, low emotional stability, very little adventurousness and poor self-control hence we can target them with both niche and common products and services."

Amusing to me how wrong this is... I don't know how you could determine such characteristics from a photo in any direction. I will admit that my appearance tends to throw mixed and incorrect signals (not an accident). I find the entire concept of appearance signaling pretty off-putting, so I guess this is a great result.

The only things Google Lens has succeeded at for me are age, race, and location. Basically everything else has been very wrong.


The point isn't to get it right. It's about what people with more power and capital than you will do when the AI makes the wrong inferences about you.

Maybe you need to spell out what you're implying because it's not very clear. I don't understand what's so bad about wrong inferences. I've been living with wrong inferences my whole life - it's usually others that are made the fool.

In this particular case it means that you end up with really bad ad targeting. I'm happy with that, it's much easier to dismiss and roll your eyes at scam junk ads than ones that actually know how to manipulate you...


this

Strong disagree. Markdown is great. And the perfect format simply doesn't exist. You cannot be all things to all people. shrug


What if some of these ancient mysteries simply weren't logical? Investigations always assume that there was some very rational reason, but even in our modern society we have exuberance and economic bubbles. The phenomenon is well-documented. What if it was something like that? What if the hole digging just got out of control?


I generally agree with this, but I think the small internet hasn't succeeded in building social replacements for the "centralized systems". The internet is a social technology. So for this to be viable, the small internet needs an answer.

Occasionally, someone mentions RSS as a solution. That's only a small component of the solution.


Strongly disagree. This is just increasing the "political temperature". The only true solution is people learning to talk and disagree with each other without "taking their ball and going home". Sorting ourselves into ideological camps on everything is destructive.


I was thinking about doing the same. Build a clone with AI custom tailored for my own quirks. And not bothering to open source it because it's too bespoke for anyone else. How hard was this? Can you share any advice?


It turned out to be pretty hard in some places. I'm using CodeMirror as the basic building block, which is great, but it does not support WYSIWYG table editing out of the box. Getting that to work requires one to use a separate CodeMirror instance for the cell editor, which makes things rather complicated. For the LLM as well :)

I think I've spent ~20 hours and a couple hundred dollars of Claude Opus tokens in Cursor. So it's not cheaper or easier, but the frustration saved by having proper Emacs keybindings might delay catastrophic global warming by a few days.

Oh, and of course I'm not compatible with all the Obsidian extensions, nor do I have proper hosting for server-side sync yet. All in all, a fool's errand, but I'm having fun.


I'm doing the exact same thing, but I'm building my Obsidian clone with Rust and gpui, and primarily with Codex. So far I estimate I've been solely vibe coding it for ~15 hours now with only one small change made by hand. I'd be interested in comparing notes/our different approaches to this. Feel free to shoot me an email at jerlendds at osintbuddy dot com if you want to chat.

I have a small demo video of yesterday's work here: https://github.com/jerlendds/mdi

There've since been many additions; I'll update the video tonight.


Thank you! Re extensions: my thinking was that if you build a clone, extensions become irrelevant. Just build what you need directly into the software. Extension systems always seemed to me to be a second-class citizen. I think I read an old story of Linus Torvalds using an old fork of MicroEMACS: whenever he disliked something, he would just go tweak its C code (e.g. key bindings). I'm kind of thinking that, but done with an LLM. Software could in theory be smaller and more bespoke. If you want it to work differently, you just prompt an LLM to change the actual source code. Then you don't need higher-level configuration/customization interfaces. Simpler software.


This is an interesting topic. I always loved the idea of extensions, for multiple reasons. But they do have their disadvantages, and I'm eager to find out how extension systems will hold up in the time of LLMs.

A major advantage of (certain) extension mechanisms is that you can update them in real-time. For example, in Emacs you can change functions without losing the current state of the application. In Processing or live coding environments, you can even update functions that affect real-time animation or audio.

Another advantage is that they can present a very nice API that allows other people to learn an abstraction of the core application. If you are the sole developer, and if you can spend the time to keep an active memory of the core application, this does not help much. But it can certainly help others build upon your foundation. Gimp and Emacs are great examples of this.

A disadvantage is that you have to keep supporting the extension mechanism, or otherwise extensions will break. That makes an ecosystem somewhat slower to adapt. Emacs is the prime example here. We're still stuck with single-threaded text mode :)


This makes a lot of sense. It makes me think of Go's approach of blurring the stack/heap distinction by treating it as an escape analysis problem that drives the allocation choice: if a value provably doesn't escape, optimize by allocating it on the stack; otherwise fall back to the heap.

The distinction of stack vs heap objects is an old distinction that is deeply encoded in the semantics of C. It's not obvious that's the right choice.

It's worth pointing out, however, that you do sometimes want control. When you're coding for performance, it can be very important to control exactly where objects live (e.g. this must be on the stack with a certain layout). I feel like it's sometimes underappreciated in modern PL design that low-level coding needs this kind of control.

I think there exists a happy medium solution ultimately though.


> The distinction of stack vs heap objects is an old distinction that is deeply encoded in the semantics of C.

It's the most flexible way to implement recursive functions, and it's encoded in the semantics of the opcodes of every modern processor. They are way more deeply entrenched in our tech than just in C.

But it may make sense to mix them in some ways. None of that detracts from the point.


That's a good point. It is pretty intertwined with ISAs. But I think you could successfully argue it's just C semantics leaking into the ISA. C was so incredibly successful that it's hard to appreciate sometimes that all the systems (abstractions above and below) that touch it came to embrace and conform to its semantics.


In a lot of cases, we could argue that ISAs have been shackled by catering to the specific details of C. But if we're just referring to the CALL instruction (and its equivalents), that's not a reaction to C, it's a reaction to structured programming, which was a good thing.


Whether a CALL instruction (by whatever name) does anything with the stack is ISA dependent. On ARM, the BL instruction saves the return address into the LR register and performs a branch, much like the B instruction.


A linear stack, distinct from a heap, is not required for recursion. It's also not required for most of the local state: the minimal requirement is to keep track of the minimal context information in order to resume the suspended caller when the callee terminates.
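As a toy illustration of this point, here is deep mutual recursion where the "frames" are heap-allocated closures driven by a trampoline, so no linear call stack ever grows (this is the tail-call-shaped special case; the general technique keeps richer continuation objects, and all names here are mine):

```python
def trampoline(f, *args):
    """Drive a computation that returns either a final result or a thunk to call next."""
    r = f(*args)
    while callable(r):
        r = r()  # each thunk is a heap-allocated closure, not a machine stack frame
    return r

def is_even(n: int):
    # Instead of calling is_odd directly (which would grow the native
    # stack), return a closure capturing the minimal resume context: n - 1.
    return True if n == 0 else (lambda: is_odd(n - 1))

def is_odd(n: int):
    return False if n == 0 else (lambda: is_even(n - 1))
```

`trampoline(is_even, 100000)` runs fine even though native recursion in CPython would hit the default recursion limit around a thousand frames deep; the suspended caller's context is exactly the captured `n - 1` in each closure.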


> The distinction of stack vs heap objects is an old distinction that is deeply encoded in the semantics of C. It's not obvious that's the right choice.

Nothing about C requires a contiguous stack, and there are perfectly standard C environments where the stack isn't contiguous, where call frames are allocated dynamically and managed (singularly or in groups) as a linked-list, e.g. some mainframe environments, gcc's segmented stacks, etc. C's automatic ("stack") variables are defined in terms of their lifetime, which is basically lexical.


I don't think the OP assumed that stacks were contiguous. The main distinction from the heap is that stack management is automatic in C.


I'm pretty conflicted on this comment section. A lot of people are expressing a lot of fear of C++ bloat. I get that.

I'm not sure what the right answer for Rust is, but I'm fairly convinced that these type system ideas are the future of programming languages.

Perhaps it can be added to Rust in a reasonable and consistent way that doesn't ultimately feel like a kludgy post-hoc bolt-on. Time will tell. There is a serious risk of getting it wrong and making the language simply more complicated for no gain.

But, these ideas are really not obscure academic stuff. This is where programming language design is at. This moment is like talking about sum-types in the 2010s. These days that concept is normalized and expected in modern programming languages. But, that's a fairly recent development.

I suspect that Linear types, refinement types, etc will follow a similar trajectory. Whether new ideas like this can be reasonably added to existing languages in a good way is the age old question.

Hopefully Rust makes good choices on that road.


> There is a serious risk to getting it wrong and making the language simply more complicated for no gain.

I think the Rust team/community is well-aware of this. Which is why Rust has such a well-defined RFC life-cycle.

At the other end, one of the biggest complaints about Rust is that many features seem eternally locked behind nightly feature gates.


I feel the same conflict about this organizational policy. It has been refreshing that they haven't just jammed half-baked ideas into stable like C++ has done for decades. But, yes, it's frustrating to bump into an issue and discover that a fix has been proposed and implemented but has never been moved to stable in 5-10 years. Some things feel like they're languishing in nightly forever.

I don't personally have a solution to propose to this problem. I generally appreciate their caution and long-term consideration. It's refreshing coming from C++. I suppose one could argue that they've overcorrected in the other direction. Unclear.

Deeper than that, I think there's a philosophical dispute over whether languages should even evolve. There are people with C-stability-type thinking who would argue that long-term stability is so important that we should stop making changes and etch things into stone. There is some merit to that (a lot of unhelpful churn in modern programming), but failure to modernize is eventually death IMHO. I think C is slowly dying because of exactly this. It will take quite a while because it is critical computing infrastructure, but few people remain who defend its viability. The arguments that remain are of the form "we simply don't have a viable replacement yet".

Perhaps you can even take the view that this is the lifecycle of programming languages. They're not supposed to live forever. That could be a reasonable take. But then you really have to confront the problem of code migration from old languages to new languages. That is a very very hard unsolved problem (e.g. see: COBOL).

Language evolution is foundationally a hard problem. And I'm not unhappy with Rust's approach. I think no one has managed to find an ideal approach.

