Hacker News | thecopy's comments

>$39 Github Pro+ to keep using Opus,

For what it's worth, I have been paying for Pro+ and I still got locked out of Opus. I only have access to Opus 4.7 at 7.5x.


I have Copilot Pro+ and discovered today that I cannot use Opus anymore! Are we reaching the end of VC-funded productivity?

If you’re a paying customer, it’s paying customer funded, not VC funded.

That is not necessarily true.

Gatana: https://www.gatana.ai/

An extremely flexible and configurable MCP gateway. Target users are enterprises, companies, and organizations that want secure, managed MCP within their company. Supports both cloud and on-premise deployment.


Stupid question: can I run this on my 64GB/1TB Mac somehow easily, or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App


Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
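As a rough sanity check on what fits in 64GB, you can estimate the resident size from parameter count and bits per weight. The 15% overhead factor for KV cache and runtime buffers is an assumption for illustration, not a measured number:

```python
# Back-of-envelope: bytes ≈ params * bits_per_weight / 8, plus overhead
# for KV cache and runtime buffers (the 1.15 factor is an assumption).
def quant_gib(params_b: float, bits: float, overhead: float = 1.15) -> float:
    """Approximate resident size in GiB for a quantized model."""
    return params_b * 1e9 * bits / 8 * overhead / 2**30

for bits in (4, 5, 8):
    print(f"35B at Q{bits}: ~{quant_gib(35, bits):.0f} GiB")
```

A 35B model at Q5 lands comfortably under 64GB by this estimate, while a 400B-class model does not fit at any common quantization.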


>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants, such as Unsloth's, which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others, and smaller models are more sensitive than larger ones. There are also different quantization algorithms with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella; it all depends.


I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.


I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.


Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
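The mmap approach mentioned above can be sketched with Python's standard library. The file name and float layout are illustrative; real frameworks map multi-gigabyte tensor files the same way:

```python
import mmap
import struct

# Write a small demo "weight" file of little-endian float32 values.
with open("weights.bin", "wb") as f:
    f.write(struct.pack("<8f", *range(8)))

# Memory-map it read-only: the OS pages data in on first access and can
# simply drop clean pages under memory pressure, instead of copying the
# whole file into anonymous memory that would have to be swapped out.
with open("weights.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read weight #5 without loading the rest of the file.
    w5 = struct.unpack_from("<f", mm, 5 * 4)[0]
    print(w5)  # 5.0
    mm.close()
```

Because the file itself is the backing store, an evicted page costs one re-read from SSD rather than a round trip through swap.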


Yes, though SSD speed is critical. The repo has macOS builds for CLI and Desktop. It's early stages, though: an M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too, and a lot of the code is PoC level.


I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also, I wouldn't trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF one! You'll get a bit more speed out of it.


Looks interesting. But how do you explore, test, or use it? The product page (https://mistral.ai/products/forge) also doesn't contain anything useful, just "Contact us".

Disappointing.


Shameless plug: I'm working on a product that aims to solve this: https://www.gatana.ai/


Who isn't?


Building Gatana, a platform for securely connecting an organization's agents to their services, with very flexible credential management and federated IdP trust.

Currently my mini-projects include:

* 0% USA dependency; the aim is 100% EU. Currently still using AWS SES for email sending and GCP KMS for envelope encryption of customer data keys.

* Tool output compression, inspired by https://news.ycombinator.com/item?id=47193064. Added semantic search on top of this using a local model running on Hetzner. Next phase is making the entire chain envelope-encrypted.

* "Firewall" for tool calls

* AI sandboxes ("OpenClaw but secure") with the credential integration mentioned above
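The tool-call firewall idea can be sketched as a policy check that sits between the agent and the real tool. The allowlist and blocked patterns below are hypothetical examples, not the product's actual rules:

```python
# Hypothetical sketch of a tool-call "firewall": every call an agent
# makes is checked against a policy before it reaches the real tool.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}
BLOCKED_ARG_PATTERNS = ("DROP TABLE", "rm -rf")

def firewall(tool: str, args: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool '{tool}' not on allowlist"
    if any(p in args for p in BLOCKED_ARG_PATTERNS):
        return False, "argument matched a blocked pattern"
    return True, "ok"

print(firewall("read_ticket", "TICKET-42"))
print(firewall("delete_repo", "org/app"))
```

A real implementation would also log denials and support per-user or per-agent policies, but the shape is the same: deny by default, allow by policy.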

https://www.gatana.ai/


I use Ergotron, super happy.


Air power alone has _never_ achieved regime change.


Libya begs to differ


What do you mean? Libya happened two days after France met with Libyan rebel leaders and one of Ghadafi's sons, and the first strikes targeted ground installations so that the rebels could take over.

It was carefully planned for a swift takeover, way, way more than what is happening there, and it still ended up being a clusterfuck. The rebels were the fucking ground troops.

Here, it will probably be Iraqis, like during the first Gulf War. Hopefully fewer people will die, but clearly this is a terrible decision.


I implemented this as well, successfully. Re structured data: I transformed it from JSON into more "natural language". Also ended up using MiniLM-L6-v2. Will post the GitHub link when I have packaged it independently (currently in the main app code; I want to extract it into an independent micro-service).
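The JSON-to-natural-language step could look something like this recursive flattener. The exact phrasing ("key is value") and the sample document are illustrative, not the commenter's actual code:

```python
import json

# Hypothetical sketch: flatten JSON into short "key is value" sentences
# before embedding, so the encoder sees prose-like text instead of
# braces, quotes, and colons.
def json_to_sentences(obj, path=""):
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from json_to_sentences(v, f"{path} {k}".strip())
    elif isinstance(obj, list):
        for item in obj:
            yield from json_to_sentences(item, path)
    else:
        yield f"{path} is {obj}"

doc = json.loads('{"review": {"required": true, "approvers": 2}}')
print(list(json_to_sentences(doc)))
# ['review required is True', 'review approvers is 2']
```

Each sentence then becomes one embedding unit, which tends to work better with sentence encoders like MiniLM than raw JSON syntax does.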

You wrote:

>A search for “review configuration” matches every JSON file with a review key.

It's a good point. Not sure how to de-rank the keys or encode the "commonness" of those words.


IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.
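The down-weighting falls out of the IDF formula itself. A minimal sketch with a toy corpus (documents and the +0.5 smoothing follow the standard BM25 IDF; the term lists are made up):

```python
import math

# Toy corpus: "id" and "status" appear in every document, "review" in one.
docs = [
    ["id", "status", "review", "config"],
    ["id", "status", "user"],
    ["id", "status", "deploy"],
]
N = len(docs)

def idf(term: str) -> float:
    """Standard BM25 IDF with +0.5 smoothing."""
    n = sum(term in d for d in docs)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

print(f"id:     {idf('id'):.3f}")      # in every doc -> near zero
print(f"review: {idf('review'):.3f}")  # rare -> high weight
```

A ubiquitous key like "id" scores close to zero while the rare "review" term keeps a meaningful weight, with no manual de-ranking needed.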

For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight, and the value goes into the main content field. So a search for "review configuration" matches on the value side, not just because "configuration" appeared as a JSON key in 500 files.
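The field-weighting idea can be sketched with a toy scorer. The weights, documents, and term-count scoring below are illustrative stand-ins for a real BM25 engine's per-field boosts:

```python
# Sketch of field-weighted scoring: key-paths and values are indexed as
# separate fields, with a lower weight on the key-path field. The 0.3/1.0
# weights are assumptions for illustration.
KEY_WEIGHT, VALUE_WEIGHT = 0.3, 1.0

chunks = [
    {"keys": "review configuration approvers", "value": "require two approvers"},
    {"keys": "review", "value": "nightly build configuration for ci"},
]

def score(chunk: dict, query: str) -> float:
    """Weighted term-hit count across the two fields (toy stand-in for BM25)."""
    terms = query.lower().split()
    hits_k = sum(t in chunk["keys"].split() for t in terms)
    hits_v = sum(t in chunk["value"].split() for t in terms)
    return KEY_WEIGHT * hits_k + VALUE_WEIGHT * hits_v

for c in chunks:
    print(score(c, "review configuration"))
```

The chunk whose *value* mentions "configuration" outranks the chunk that only matches on key names, which is exactly the de-ranking effect described above.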

MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff. 50-500x faster on CPU, 89% of MiniLM quality on MTEB. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.

