Hacker News | thecopy's comments

>$39 Github Pro+ to keep using Opus,

For what it's worth, I have been paying for Pro+ and I still got locked out of Opus. I only have access to Opus 4.7 at 7.5x.


I have Copilot Pro+ and discovered today that I cannot use Opus anymore! Are we reaching the end of VC-funded productivity?

If you’re a paying customer, it’s paying customer funded, not VC funded.

That is not necessarily true.

Gatana: https://www.gatana.ai/

An extremely flexible and configurable MCP gateway. Target users are enterprises, companies, and organizations that want secure, managed MCP within their company. Supports both cloud and on-premise deployment.


Stupid question: can I run this on my 64GB/1TB Mac somehow easily, or does this require custom coding? 4-bit is ~200GB.

EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App


Running larger-than-RAM LLMs is an interesting trick, but it's not practical. The output would be extremely slow and your computer would be burning a lot of power to get there. The heavy quantizations and other tricks (like reducing the number of active experts) used in these demos severely degrade the quality.

With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.
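As a rough sanity check on what fits in 64GB, you can estimate the resident size from parameter count and bits per weight. The 15% overhead factor for KV cache and runtime buffers is an assumption for illustration, not a measured number:

```python
# Back-of-envelope: bytes ≈ params * bits_per_weight / 8, plus overhead
# for KV cache and runtime buffers (the 1.15 factor is an assumption).
def quant_gib(params_b: float, bits: float, overhead: float = 1.15) -> float:
    """Approximate resident size in GiB for a quantized model."""
    return params_b * 1e9 * bits / 8 * overhead / 2**30

for bits in (4, 5, 8):
    print(f"35B at Q{bits}: ~{quant_gib(35, bits):.0f} GiB")
```

A 35B model at Q5 lands comfortably under 64GB by this estimate, while a 400B-class model does not fit at any common quantization.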


>I suggest Q5 quantization at most from my experience. Q4 works on short responses but gets weird in longer conversations.

There are dynamic quants, such as Unsloth's, which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others, and smaller models are more sensitive than larger ones. There are also different quantization algorithms with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella; it all depends.


I should clarify that I'm referring generically to the types of quantizations used in local LLM inference, including those from Unsloth.

Nobody actually quantizes every layer to Q4 in a Q4 quant.


I've tried a number of experiments, and agree completely. If it doesn't fit in RAM, it's so slow as to be impractical and almost useless. If you're running things overnight, then maybe, but expect to wait a very long time for any answers.


Current local-AI frameworks do a bad job of supporting the doesn't-fit-in-RAM case, though. Especially when running combined CPU+GPU inference. If you aren't very careful about how you run these experiments, the framework loads all weights from disk into RAM only for the OS to swap them all out (instead of mmap-ing the weights in from an existing file, or doing something morally equivalent as with the original MacBook Pro experiment) which is quite wasteful!

This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
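The mmap approach mentioned above can be sketched with Python's standard library. The file name and float layout are illustrative; real frameworks map multi-gigabyte tensor files the same way:

```python
import mmap
import struct

# Write a small demo "weight" file of little-endian float32 values.
with open("weights.bin", "wb") as f:
    f.write(struct.pack("<8f", *range(8)))

# Memory-map it read-only: the OS pages data in on first access and can
# simply drop clean pages under memory pressure, instead of copying the
# whole file into anonymous memory that would have to be swapped out.
with open("weights.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read weight #5 without loading the rest of the file.
    w5 = struct.unpack_from("<f", mm, 5 * 4)[0]
    print(w5)  # 5.0
    mm.close()
```

Because the file itself is the backing store, an evicted page costs one re-read from SSD rather than a round trip through swap.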


Yes, though SSD speed is critical. The repo has macOS builds for CLI and Desktop. It's early stages, though: an M4 Max gets 10-15 TPS on 400B depending on quantization. Compute is an issue too, and a lot of the code is PoC level.


I have a 64G/1T Studio with an M1 Ultra. You can probably run this model to say you’ve done it but it wouldn’t be very practical.

Also, I wouldn't trust 3-bit quantization for anything real. I run a 5-bit qwen3.5-35b-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).

If you decide to give it a go, make sure to use the MLX version over the GGUF one! You'll get a bit more speed out of it.


Looks interesting. But how do you explore, test, or use it? The product page (https://mistral.ai/products/forge) also doesn't contain anything useful, just "Contact us".

Disappointing.


Shameless plug: I'm working on a product that aims to solve this: https://www.gatana.ai/


Who isn't?


Building Gatana, a platform for securely connecting an organization's agents to their services, with very flexible credential management and federated IdP trust.

Currently my mini-projects include:

* 0% USA dependency; the aim is 100% EU. Currently still using AWS SES for email sending and GCP KMS for envelope encryption of customer data keys.

* Tool output compression, inspired by https://news.ycombinator.com/item?id=47193064. Added semantic search on top of this using a local model running on Hetzner. Next phase is making the entire chain envelope-encrypted.

* "Firewall" for tool calls

* AI sandboxes ("OpenClaw but secure") with the credential integration mentioned above
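The tool-call firewall idea can be sketched as a policy check that sits between the agent and the real tool. The allowlist and blocked patterns below are hypothetical examples, not the product's actual rules:

```python
# Hypothetical sketch of a tool-call "firewall": every call an agent
# makes is checked against a policy before it reaches the real tool.
ALLOWED_TOOLS = {"search_docs", "read_ticket"}
BLOCKED_ARG_PATTERNS = ("DROP TABLE", "rm -rf")

def firewall(tool: str, args: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if tool not in ALLOWED_TOOLS:
        return False, f"tool '{tool}' not on allowlist"
    if any(p in args for p in BLOCKED_ARG_PATTERNS):
        return False, "argument matched a blocked pattern"
    return True, "ok"

print(firewall("read_ticket", "TICKET-42"))
print(firewall("delete_repo", "org/app"))
```

A real implementation would also log denials and support per-user or per-agent policies, but the shape is the same: deny by default, allow by policy.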

https://www.gatana.ai/


I use Ergotron, super happy.


Air power alone has _never_ achieved regime change.


Libya begs to differ


What do you mean? Libya happened two days after France met with Libyan rebel leaders and one of Ghadafi's sons, and the first strikes targeted ground installations so that the rebels could take over.

It was carefully planned for a swift takeover, way, way more than what is happening there, and it still ended up being a clusterfuck. The rebels were the fucking ground troops.

Here, it will probably be Iraqis, like during the first Gulf War. Hopefully fewer people will die, but clearly this is a terrible decision.


I implemented this as well, successfully. Re structured data: I transformed it from JSON into more "natural language". Also ended up using MiniLM-L6-v2. Will post the GitHub link when I have packaged it independently (currently in the main app code; I want to extract it into an independent micro-service).
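The JSON-to-natural-language step could look something like this recursive flattener. The exact phrasing ("key is value") and the sample document are illustrative, not the commenter's actual code:

```python
import json

# Hypothetical sketch: flatten JSON into short "key is value" sentences
# before embedding, so the encoder sees prose-like text instead of
# braces, quotes, and colons.
def json_to_sentences(obj, path=""):
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from json_to_sentences(v, f"{path} {k}".strip())
    elif isinstance(obj, list):
        for item in obj:
            yield from json_to_sentences(item, path)
    else:
        yield f"{path} is {obj}"

doc = json.loads('{"review": {"required": true, "approvers": 2}}')
print(list(json_to_sentences(doc)))
# ['review required is True', 'review approvers is 2']
```

Each sentence then becomes one embedding unit, which tends to work better with sentence encoders like MiniLM than raw JSON syntax does.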

You wrote:

>A search for “review configuration” matches every JSON file with a review key.

It's a good point. Not sure how to de-rank the keys or encode the "commonness" of those words.


IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.
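The down-weighting falls out of the IDF formula itself. A minimal sketch with a toy corpus (documents and the +0.5 smoothing follow the standard BM25 IDF; the term lists are made up):

```python
import math

# Toy corpus: "id" and "status" appear in every document, "review" in one.
docs = [
    ["id", "status", "review", "config"],
    ["id", "status", "user"],
    ["id", "status", "deploy"],
]
N = len(docs)

def idf(term: str) -> float:
    """Standard BM25 IDF with +0.5 smoothing."""
    n = sum(term in d for d in docs)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

print(f"id:     {idf('id'):.3f}")      # in every doc -> near zero
print(f"review: {idf('review'):.3f}")  # rare -> high weight
```

A ubiquitous key like "id" scores close to zero while the rare "review" term keeps a meaningful weight, with no manual de-ranking needed.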

For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight, and the value goes into the main content field. So a search for "review configuration" matches on the value side, not just because "configuration" appeared as a JSON key in 500 files.
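The field-weighting idea can be sketched with a toy scorer. The weights, documents, and term-count scoring below are illustrative stand-ins for a real BM25 engine's per-field boosts:

```python
# Sketch of field-weighted scoring: key-paths and values are indexed as
# separate fields, with a lower weight on the key-path field. The 0.3/1.0
# weights are assumptions for illustration.
KEY_WEIGHT, VALUE_WEIGHT = 0.3, 1.0

chunks = [
    {"keys": "review configuration approvers", "value": "require two approvers"},
    {"keys": "review", "value": "nightly build configuration for ci"},
]

def score(chunk: dict, query: str) -> float:
    """Weighted term-hit count across the two fields (toy stand-in for BM25)."""
    terms = query.lower().split()
    hits_k = sum(t in chunk["keys"].split() for t in terms)
    hits_v = sum(t in chunk["value"].split() for t in terms)
    return KEY_WEIGHT * hits_k + VALUE_WEIGHT * hits_v

for c in chunks:
    print(score(c, "review configuration"))
```

The chunk whose *value* mentions "configuration" outranks the chunk that only matches on key names, which is exactly the de-ranking effect described above.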

MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff. 50-500x faster on CPU, 89% of MiniLM quality on MTEB. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.

