
Does finding the number of unique elements in a set actually require comparison of each element with everything else? Can't you use a hashtable? For every element, add it to the table (ignore if already exists), and finally, take a count of keys.
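In Python the hashtable approach described above is just a set (a minimal sketch, not tied to any particular implementation in the article):

```python
# Insert every element into a hash-based set (duplicates are ignored),
# then count the keys. No pairwise comparison of all elements needed.
def count_distinct(elements):
    seen = set()          # Python's set is a hash table of the elements
    for x in elements:
        seen.add(x)       # no-op if x is already present
    return len(seen)
```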


> Does finding the number of unique elements in a set

That's easy, that's all of them. Sorry, could not resist.

Yes, hashing is the usual method. In a sorted list you can compare to the following element.
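The sorted-list trick mentioned above, sketched in Python: each element only needs to be compared with its immediate successor, so no extra memory beyond the list itself is needed.

```python
# Count distinct values in an already-sorted list by comparing
# each element to the one that follows it.
def count_distinct_sorted(sorted_values):
    if not sorted_values:
        return 0
    count = 1
    for prev, cur in zip(sorted_values, sorted_values[1:]):
        if cur != prev:   # a new value starts a new run
            count += 1
    return count
```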


Imagine 1PB of data where you expect 30% of it to be unique. Storing the unique elements needs around 300TB, since the keys in a hash table are the elements themselves, so a hashtable of perhaps 300TB. Doing that without that much RAM, even swapping to disk, can be tough.


Using a hashtable is the "normal" approach mentioned in the article. It works of course, but requires memory to store each unique element (or their hashes). If you have less memory available, the described algorithm can still give a very good approximation.


Using a hashtable is effective because you only compare elements within their hash buckets, not the entire set. However, they can become inefficient with very large datasets due to memory usage and processing time, which is where approximate counts shine.


This algorithm still generates a lot of random numbers, though. I would guess that this is much less overhead than hashing, but it still seems like it could be significant.


That is fine when you have say 1 million values and only 1000 are unique. But when you have 1 million values and about 900 thousand are unique you are putting more or less the whole data set into memory.


Imagine a million elements. How big must your hashtable be? The article explains it very well, did you miss it? It's a way to save memory.

But to be honest, I implemented it, ran it on Hamlet, and it's very wrong. It's barely useful, but maybe if you just need a vague idea...


How big was your thresh? I found it pretty accurate: https://news.ycombinator.com/item?id=40388878
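For context, the algorithm under discussion (the CVM distinct-elements estimator, where `thresh` is the buffer size) can be sketched roughly like this; details of the published version differ slightly, so treat this as an approximation of the idea:

```python
import random

# Keep at most `thresh` sampled elements; whenever the buffer fills,
# evict each survivor with probability 1/2 and halve the sampling
# probability p. The estimate of the distinct count is |buffer| / p.
def cvm_estimate(stream, thresh):
    p = 1.0
    sample = set()
    for x in stream:
        sample.discard(x)             # re-decide x's membership on every occurrence
        if random.random() < p:
            sample.add(x)
        if len(sample) >= thresh:     # buffer full: thin it out
            sample = {y for y in sample if random.random() < 0.5}
            p /= 2.0
    return len(sample) / p
```

Larger `thresh` means more memory but a tighter estimate; if `thresh` exceeds the true distinct count, the answer is exact because p never drops below 1.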


That part of the man page sounds confusing. Is it unconditional and enabled by default?


You also have the

     RuntimeMaxUse=, RuntimeKeepFree=, RuntimeMaxFileSize=, RuntimeMaxFiles= 

options if you are concerned about journald overusing tmpfs. But yes, it seems unconditional in that it always writes the logs somewhere, which is not an unreasonable design imho.
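For example, in journald.conf (the values here are purely illustrative, not recommendations):

```ini
# /etc/systemd/journald.conf — limits for the volatile (tmpfs-backed) journal
[Journal]
RuntimeMaxUse=64M        # cap total journal size on tmpfs
RuntimeKeepFree=256M     # leave at least this much tmpfs free
RuntimeMaxFileSize=16M   # rotate individual journal files at this size
RuntimeMaxFiles=8        # keep at most this many journal files
```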


Yes - unconditional by default, to be clear. I think if you set Storage=none it will switch it off completely.


NASA knew that people knew that planetariums existed, so they didn't paint a realistic sky, because people would think they used a planetarium. Game theory!


Wait until you switch to unions in rust and ask yourself whether it is a union or a struct.


Oh dear


Wikipedia has something interesting on this (how unions can be implemented using "class hierarchy in object-oriented programming"): https://en.wikipedia.org/wiki/Tagged_union#Class_hierarchies...
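A toy illustration of that wiki section's point, in Python: a tagged union like Shape = Circle(r) | Rect(w, h) expressed as a class hierarchy, where the dynamic type plays the role of the tag and pattern matching becomes dynamic dispatch. (The names here are made up for the example.)

```python
class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):          # one "variant" of the union
    def __init__(self, r):
        self.r = r
    def area(self):
        return 3.141592653589793 * self.r ** 2

class Rect(Shape):            # another "variant"
    def __init__(self, w, h):
        self.w = w
        self.h = h
    def area(self):
        return self.w * self.h

# Instead of matching on a tag, we dispatch on the dynamic type:
shapes = [Circle(1.0), Rect(2.0, 3.0)]
areas = [s.area() for s in shapes]
```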

There is a lengthy blog post about the same stuff, except that the author doesn't seem to have come across the said wiki section yet: https://nandakumar.org/blog/2023/12/paradigms-in-disguise.ht...

Kudos to the dev of datatype99 for showing the problem with such ad-hoc methods in the readme right away.


Is RISC-V technically superior or is it just about the license?


It's simpler than x86/arm, while having a couple improvements compared to minimal alternatives like MIPS based on hindsight & modern chips

See chapter 2: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-...


It's designed with the luxury of hindsight on long-established ISAs, avoiding many of their pitfalls, while not attempting to innovate in ways that may or may not work out.

Also the base ISA is very implementer-friendly. As in: requiring few transistors / FPGA LEs, (relatively) easy to write a compiler or emulator for, etc. But that is hardly unique.

The 32b and 64b flavours are very similar. Oh and... it's modular.

That doesn't make it 'better' though. Eg. x86 has a looott of legacy cruft. But also a looott of high-quality software for it. RISC-V: many of those tools are still being written / adapted / optimized. Likewise, x86 & ARM have many high-performance, efficient and/or low-cost implementations. RISC-V is catching up quickly, but not (yet) head-to-head with those.


ISAs are complex things so you can't say one is technically superior to the other.

I would say you're right though in that RISCV enjoys the success it's seeing due to the open specification and licensing model. People generally aren't drawn to RISCV because of technical innovation.


The base specification (IMAFDC) has little to no innovation, simply avoiding the mistakes of the past. We've got 60 years of experience with RISC-style instruction sets, so that's about consolidation not innovation.

However RISC-V is an excellent base upon which to innovate. You can see that in things such as the Vector extension, the memory model developed by industry and academic experts world-wide, and CHERI fine-grained memory-protection.


CHERI was originally based on MIPS, then ported to ARM and later RISC-V.


The place you can get a real physical commercially available CHERI implementation is RISC-V:

https://codasip.com/press-release/2023/10/31/codasip-deliver...

That's largely because if you base a product on Arm or MIPS you have the choice of getting them to actively invest in and support you, or getting sued into oblivion by them.

THAT is why RISC-V is the most friendly ISA to innovation and where most future innovation will happen. Because innovation comes not only from internally inside Intel or Arm or MIPS (who have switched to RISC-V now anyway) but from a myriad of possible sources.


The base RISC-V tends to avoid innovation, because we've been burned in the past (register windows, branch delay slots etc)


This article's just about the technical writing of the manual. It's nothing to do with the ISA itself.


Superior in what way? Is English superior to French? I think you can have equally good implementations of modern CPUs regardless of the ISA. The end result is billions of logic gates that, working together, store, add, subtract and multiply billions of times per second. Logic gates don't care what language you speak to them, as long as they understand it.


It's waaay more modern than other ISAs in its design. It can scale from microcontroller to simple CPU to GPU-like vector processing to very powerful CPUs without having to add thousands of CISC instructions.

E.g. The ISA is modular. You can use the RV64GC set of instructions to implement a very basic Linux-capable CPU that executes one instruction at a time.

Then you can build an advanced CPU that does OOO and instruction compression and run the same binary *efficiently*.


Might sound surprising until you remember there used to be Lisp Machines.


But some random word in its response can trigger an idea in your mind. Getting an idea from a conversation is not always about getting it directly. It's already in you and you just wanted a trigger.


Rubber-ducking is useful, but nobody gives the rubber duck anywhere near as much credit as AI enthusiasts give to chatbots.


You underestimate how much credit I give to rubber ducks ;)


No pipe organ in any of these?


Maybe a stupid question, but how do you compare this against GPT-based search engines?


GPT-based search engines usually use some sort of a database to retrieve context for the LLM to summarize first. This is what people refer to as RAG these days: https://blogs.nvidia.com/blog/what-is-retrieval-augmented-ge....

Some of these GPT engines maintain their own vector DB to do semantic search, others are directly hooked into Bing / Google. So pubmedisearch.com would be one component of a GPT-based engine. We actually have a GPT-based engine here: https://medisearch.io/.
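The RAG flow described above, as a very rough sketch. The retriever here is a toy word-overlap ranker standing in for a real vector DB or search API, and `answer` just returns the retrieved context where a real system would call an LLM; all names are hypothetical.

```python
def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    qwords = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(qwords & set(d.lower().split())))
    return scored[:k]

def answer(query, corpus):
    context = retrieve(query, corpus)
    # A real RAG system would send `context` + `query` to an LLM here;
    # we just return the context to show the data flow.
    return "\n".join(context)
```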

