isusmelj's comments

Are there any benchmarks? I didn’t find any. It would be the first model update without proof that it’s better.


Very proud as a Swiss that Soumith has a .ch domain!


Probably because his first name is Chintala


That'd be his last name


true haha


Is the price here correct? https://openrouter.ai/moonshotai/kimi-k2-thinking It would be $0.60 for input and $2.50 per 1 million output tokens. If the model is really that good, it's 4x cheaper than comparable models. Is it hosted at a loss, or do the others have a huge margin? I might be missing something here. Would love some expert opinion :)

FYI: the non-thinking variant has the same price.


In short, the others have a huge margin if you ignore training costs. See https://martinalderson.com/posts/are-openai-and-anthropic-re... for details.


Somehow that article totally ignored the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90~95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802


It uses 75% linear attention layers, so it is inherently lower cost. And it is MoE, so the active parameter count is far lower.


Yes, you can assume that open-source models hosted on OpenRouter are priced at roughly bare hardware cost; in practice some providers there may even run on subsidized hardware, so there is still money to be made.


I can only agree with your experience in Europe. I do not get how they do it, but Tesla Superchargers are more reliable. The occupancy information works better, they are easier to use, and they almost always offer a more competitive price. I often see other chargers that are 50 to 100 percent more expensive, and only very rarely do I see offers within 10 to 50 percent of Tesla's price.

What strikes me is that this difference can make EVs more expensive per kilometer if you only compare energy cost with fuel cost.

Here is the math with numbers. Tesla chargers in Switzerland and Germany are usually at most CHF 0.50 or EUR 0.60 per kilowatt hour at the more expensive locations, along highways for example. They offer fast charging of 150 kW or more. Alternative providers often start at around CHF 0.75 for 50 kW or CHF 1.00 for more than 250 kW fast charging. If your electric car consumes 20 kWh (Model 3 is at around 15 I think) per 100 km you end up with costs of CHF 10.00, CHF 15.00, or CHF 20.00 per 100 km at CHF 0.50, CHF 0.75, or CHF 1.00 per kilowatt hour. If you drive a petrol car that uses 8 l per 100 km and the cost per liter is CHF 1.70 you pay CHF 13.60 per 100 km.
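If you want to check or adjust those numbers yourself, here is a minimal sketch of the same arithmetic in Python; the consumption and price figures are the assumptions from above, not measurements:

    # Energy cost per 100 km, using the assumed figures from the comment above.
    def ev_cost_per_100km(kwh_per_100km, price_per_kwh):
        return kwh_per_100km * price_per_kwh

    def petrol_cost_per_100km(liters_per_100km, price_per_liter):
        return liters_per_100km * price_per_liter

    for price in (0.50, 0.75, 1.00):  # CHF per kWh
        print(f"EV at CHF {price:.2f}/kWh: CHF {ev_cost_per_100km(20, price):.2f} per 100 km")
    print(f"Petrol at CHF 1.70/l: CHF {petrol_cost_per_100km(8, 1.70):.2f} per 100 km")

This prints CHF 10.00, 15.00, and 20.00 per 100 km for the EV at the three charging prices, versus CHF 13.60 per 100 km for the petrol car.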


The Tesla site with 12 stalls at 250 kW in my town in the south of France (Albi) has one of the lowest prices in the area at EUR 0.23/kWh (all taxes included).

It is under a solar-covered parking canopy.


In Slovakia, Superchargers cost around 0.30-0.37 €/kWh, while the competitors are priced around 0.45-0.60 €/kWh, so yes, there is a major price difference as well.

To be fair, the others offer subscription plans which lower the price, but such plans don't suit me, so I pay the full price.


Ze devil iz in ze detailz!

Try this one https://www.tesla.com/de_DE/findus/location/supercharger/Ham...

While the site says they are available 24/7, they are in the gated parking area of a shopping mall, closed between 8 PM and 9 AM and on Sundays. Fun to see Danish tourists desperately try to reach them while very low on juice.

Extreme second-hand embarrassment (Fremdschämen) / facepalming!


Is there any news about power consumption? I didn't even see a TDP or similar mentioned.


One of the first things I looked at too...


From my comment elsewhere in this thread (https://news.ycombinator.com/item?id=45048078): "up to 170W" was the quote from March.


Demand > Supply?


I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.


No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

Source: I'm part of the training team


If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)


Absolutely! I sent you a LinkedIn message last week, but here seems to work much better. Thanks a lot!


Oh sorry I might have missed it! I think you or your colleague emailed me (I think?) My email is daniel @ unsloth.ai if that helps :)


Hey, really cool project, I'm excited to see the outcome. Is there a blog / paper summarizing how you are doing it? Also, which research group is currently working on it at ETH?


L3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb


Wait, the whole (English-language) web content dataset is ~50 TB?


Yes, if we take the filtered and deduplicated HTMLs of CommonCrawl. I've made a video on this topic recently: https://www.youtube.com/watch?v=8yH3rY1fZEA


Fun presentation, thanks! 72 min ingestion time for ~81 TB of data is ~1 TB/min or ~19 GB/s. Distributed or single-node? Shards? I see 50 jobs are used for parallel ingestion, and I wonder how ~19 GB/s was achieved, since ingestion rates were far below that figure last time I played around with ClickHouse performance. Granted, that was some years ago.


Distributed across 20 replicas.
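For what it's worth, a quick back-of-the-envelope check of those throughput figures; the even split across the 20 replicas is just an assumption for illustration:

    # Rough throughput check: ~81 TB ingested in 72 minutes.
    total_tb = 81
    minutes = 72
    tb_per_min = total_tb / minutes                 # ~1.13 TB/min
    gb_per_s = total_tb * 1000 / (minutes * 60)     # ~18.75 GB/s aggregate
    gb_per_s_per_replica = gb_per_s / 20            # ~0.94 GB/s if spread evenly over 20 replicas
    print(f"{tb_per_min:.2f} TB/min, {gb_per_s:.1f} GB/s total, {gb_per_s_per_replica:.2f} GB/s per replica")

Under 1 GB/s per replica is a much less surprising per-node ingestion rate than 19 GB/s on a single box.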


So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to LLaMa and other well-known models, it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!


Not sure about the Swiss laws, but the EU AI Act and Directive 2019/790 (the Digital Single Market copyright directive) it piggybacks on for this topic do allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained while respecting those mechanisms (and, as linked elsewhere, they didn't find any practical difference in performance; note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, so they could make that comparison).


That is not correct. The EU AI Act has no such provision, and the data mining exemption does not apply, as the EU has made clear. As for Switzerland, copyrighted material cannot be used unless licensed.


Thanks for clarifying! I wish you all the best luck!


Are you using dbpedia?


No. The main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq.


Thanks for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. fineweb2 has 1800+ languages, but some with very little actual representation, while fineweb2-hq has just 20, but each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages.)


We kept all 1800+ (script/language) pairs, not only the quality-filtered ones. Whether mixing quality-filtered and unfiltered languages affects the data mix is still an open question. Preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361) indicates that quality filtering can mitigate the curse of multilinguality to some degree, and thus facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale.


Imo, a lot of the magic is also dataset driven, specifically the SFT and other fine tuning / RLHF data they have. That's what has separated the models people actually use from the also-rans.

I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.


When I read "from scratch", I assume they are doing pre-training, not just finetuning; do you have a different take? Or do you mean it's the normal Llama architecture they're using? I'm curious about the benchmarks!


The infra does become pretty complex to get a SOTA LLM trained. People assume it's as simple as loading up the architecture and a dataset + using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors, etc.

But it's good to have more and more players in this space.


I'd be more concerned about the size used being 70B (DeepSeek R1 has 671B), which makes catching up with SOTA more difficult to begin with.


SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full fat Deepseek R1.
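As a rough illustration of why the parameter count matters for local use, a naive memory estimate is just parameters times bits per parameter (this ignores KV cache and activation overhead, which add more on top):

    # Naive weight-memory estimate: params x bits per param / 8 (ignores KV cache, activations).
    def weights_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8

    for name, params in (("70B", 70), ("671B (DeepSeek R1)", 671)):
        for bits in (16, 4):
            print(f"{name} at {bits}-bit: ~{weights_gb(params, bits):.0f} GB")

So a 70B model fits in roughly 35 GB at 4-bit quantization, while the full R1 needs on the order of 335 GB even at 4-bit.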


Also, isn't DeepSeek's a Mixture of Experts? Meaning not all params ever get activated in one forward pass?

70B feels like the best balance between being usable locally and decent for regular use.

Maybe not SOTA, but a great first step.


Is there anything like this also supporting other GPUs? Thinking of Apple Silicon or embedded ones in phones etc.


As someone in Europe, I sometimes wonder what’s worse: letting US companies use my data to target ads, or handing it to Chinese companies where I have no clue what’s being done with it. With one I at least get an open source model. The other is a big black box.


Both are bad. If Europe does not develop local alternatives to ChatGPT or DeepSeek, it will (slowly) lose its sovereignty.


Europe is developing local alternatives such as Mistral.


They're not open source. It's nice of Meta and Deepseek to offer up their models for download, but that doesn't make them open source.


Hard to be fully open source if you train on copyrighted material.

Anyway, DeepSeek is the most open of the SOTA models.


Did they open their datasets already? It would be nice to have the 'thinking' part.


Isn't this a bit of semantic lawyering? Open model weights are not the same as open source in a literal sense, but I'd go so far as to suggest that open model weights fulfill much of the intent / "soul" of the open source movement. Would you disagree with that notion?


> open model weights fulfill much of the intent / "soul" of the open source movement

Absolutely not. The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

A binary (and that's arguably what weights are) you can semi-freely download and distribute is just shareware – that's several steps away from actual open source.

There's nothing wrong with shareware, but calling it open source, or even just "source available" (i.e. open source with licensing/usage restrictions), when it isn't, is disingenuous.


> The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

That's not enough. The key point was trust: an executable can be verified by independent review and rebuild. If it cannot be rebuilt, it could be a virus, trojan, backdoor, etc. For LLMs there is no way to reproduce them, and thus no way to verify them. So they cannot be trusted on their own; we have to trust the producers. It's not that important when models are just talking, but with tool use they can do real damage.


Hm, I wouldn't say that that's the key point of open software. There are many open source projects that don't have reproducible builds (some don't even offer any binary builds), and conversely there is "source available" software with deterministic builds that's not freely licensed.

On top of that, I don't think it works quite that way for ML models. Even their creators, with access to all training data and training steps, are having a very hard time reasoning about what these things will do exactly for a given input without trying it out.

"Reproducible training runs" could at least show that there's not been any active adversarial RHLF, but seem prohibitively expensive in terms of resources.


Well, 'open source' is interpreted in different ways. I think the core idea is that it can be trusted. You can get a Linux distribution and recompile every component except the proprietary drivers. With that being done by independent groups, you can trust it enough to run a bank's systems. The other option is something like Windows, where you have to trust Microsoft and their supply chain.

There are different variations, of course. Mostly related to the rights and permissions.

As for big models, even their owners, with all the hardware, training data, and code, cannot reproduce them. A model may have some undocumented functionality pretrained or added in post-processing, and it's almost impossible to detect without knowing the key phrase. It could be a harmless watermark or something else.


But there is also no publicly known way to implant unwanted telemetry, backdoors, or malware into modern model formats either (which hasn't always been true of older LLM model formats), which mitigates at least one functional concern about trust in this case, no?

It's not quite like executing a binary in userland - you're not really granting code execution to anyone with the model, right? Perhaps there is some undisclosed vulnerability in one or more of the runtimes, like llama.cpp, but that's a separate discussion.


The biggest problem is arguably at a different layer: These models are often used to write code, and if they write code containing vulnerabilities, they don't need any special permissions to do a lot of damage.

It's "reflections on trusting trust" all the way down.


If people who cannot read code well enough to evaluate whether or not it is secure are using LLMs to generate code, no amount of model transparency will solve the resulting problems. At least not while LLMs still suffer from the major problems they have, like hallucinations, or being wrong (just like humans!).

Whether the model is open source, open weight, both, or neither has essentially zero impact on this.


I've seen the argument that source code is the preferred form for making changes and modifications to software, but in the case of these large models, the weights themselves are that preferred form.

It's much easier and cheaper to make a finetune or LoRA than to train from scratch to adapt a model to your use case, so it's not quite like source vs. binary in software.
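As a concrete sketch of what adapting open weights via LoRA looks like with Hugging Face's PEFT library; the checkpoint name and hyperparameters below are placeholders, not recommendations:

    # Minimal LoRA fine-tuning sketch using Hugging Face PEFT.
    # Model name and hyperparameters are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.1-8B"  # placeholder: any causal-LM checkpoint with open weights
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    config = LoraConfig(
        r=16,                                  # rank of the low-rank update matrices
        lora_alpha=32,                         # scaling factor for the update
        target_modules=["q_proj", "v_proj"],   # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of the full model
    # ...then train with your usual loop/Trainer; only the adapter weights are updated.

The point is that you never need the original training pipeline to do this; the weights alone are enough to adapt the model.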


Meta's models do not; they have use restrictions. At least DeepSeek's do not.


It does not, and I totally disagree with that. Unless we can see the code that goes into the model to stop it from telling me how to make cocaine, it's not the same sort of soul.


> With one I at least get an open source model. The other is a big black box.

It doesn't matter much, as in both cases the provider has access to your inputs and outputs. The only question is whether you trust the company operating the model. (Yes, you can run a local model, but it's not that capable.)


The US is a capitalistic liberal democracy and China is a one-party capitalistic dictatorship. Make your choice.


The US tends towards dictatorship; due process is an afterthought, people disappearing off the streets, citizens getting arrested at the border for nothing, tourists getting deported over minute issues such as an iffy hotel booking, and that's just off the top of my head from the last 2 days.


You make it seem so binary. If you do enough research on the US you might change your mind. YES, I would still choose the US.


Thanks for the kind words, joelio182! Glad you see the value in making SSL more practical for real-world domain shift issues.

As liopeer mentioned, we have results for medical (DeepLesion) and agriculture (DeepWeeds) in the blog post. We haven't published specific benchmarks on satellite or industrial inspection data yet, but those are definitely the kinds of niche domains where pretraining on specific unlabeled data should yield significant benefits. We're keen to explore more areas like these.

Our goal is exactly what you pointed out - bridging the gap between SSL research and practical application where labels are scarce. Appreciate the encouragement!

