isusmelj's comments

Are there any benchmarks? I didn’t find any. It would be the first model update without proof that it’s better.


Very proud as a Swiss that Soumith has a .ch domain!


Probably because his first name is Chintala


That'd be his last name


true haha


Is the price here correct? https://openrouter.ai/moonshotai/kimi-k2-thinking It would be $0.60 for input and $2.50 per 1 million output tokens. If the model is really that good, it's 4x cheaper than comparable models. Is it hosted at a loss, or do the others have a huge margin? I might be missing something here. Would love some expert opinion :)

FYI: the non-thinking variant has the same price.


In short, the others have a huge margin if you ignore training costs. See https://martinalderson.com/posts/are-openai-and-anthropic-re... for details.


Somehow that article totally ignored the insane pricing of cached input tokens set by Anthropic and OpenAI. For agentic coding, typically 90~95% of the inference cost is attributed to cached input tokens, and a scrappy Chinese company can do it almost for free: https://api-docs.deepseek.com/news/news0802


It uses 75% linear attention layers, so it is inherently lower cost. And it is MoE, so the active parameter count is far lower.


Yes, you can assume that open-source models hosted on OpenRouter are priced at roughly bare hardware cost; in practice some providers there may even run on subsidized hardware, so there is still money to be made.


I can only agree with your experience in Europe. I do not get how they do it, but Tesla Superchargers are more reliable. The occupancy information works better, they are easier to use, and they almost always offer a more competitive price. I often see other chargers that are 50 to 100 percent more expensive, and only very rarely do I see offers within 10 to 50 percent of Tesla's price.

What strikes me is that this difference can make EVs more expensive per kilometer if you only compare energy cost with fuel cost.

Here is the math with numbers. Tesla chargers in Switzerland and Germany are usually at most CHF 0.50 or EUR 0.60 per kilowatt hour at the more expensive locations, along highways for example. They offer fast charging of 150 kW or more. Alternative providers often start at around CHF 0.75 for 50 kW or CHF 1.00 for more than 250 kW fast charging. If your electric car consumes 20 kWh (Model 3 is at around 15 I think) per 100 km you end up with costs of CHF 10.00, CHF 15.00, or CHF 20.00 per 100 km at CHF 0.50, CHF 0.75, or CHF 1.00 per kilowatt hour. If you drive a petrol car that uses 8 l per 100 km and the cost per liter is CHF 1.70 you pay CHF 13.60 per 100 km.
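If you want to check or adjust those numbers yourself, here is a minimal sketch of the same arithmetic in Python; the consumption and price figures are the assumptions from above, not measurements:

    # Energy cost per 100 km, using the assumed figures from the comment above.
    def ev_cost_per_100km(kwh_per_100km, price_per_kwh):
        return kwh_per_100km * price_per_kwh

    def petrol_cost_per_100km(liters_per_100km, price_per_liter):
        return liters_per_100km * price_per_liter

    for price in (0.50, 0.75, 1.00):  # CHF per kWh
        print(f"EV at CHF {price:.2f}/kWh: CHF {ev_cost_per_100km(20, price):.2f} per 100 km")
    print(f"Petrol at CHF 1.70/l: CHF {petrol_cost_per_100km(8, 1.70):.2f} per 100 km")

This prints CHF 10.00, 15.00, and 20.00 per 100 km for the EV at the three charging prices, versus CHF 13.60 per 100 km for the petrol car.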


The Tesla site with 12 stalls at 250 kW in my town in the south of France (Albi) has one of the lowest prices in the area at EUR 0.23/kWh (all taxes included).

It is under a solar-covered parking canopy.


In Slovakia, Superchargers cost around 0.30-0.37 €/kWh, while the competitors are priced around 0.45-0.60 €/kWh, so yes, there is a major price difference as well.

To be fair, the others offer subscription plans which lower the price, but such plans don't suit me, so I pay the full price.


Ze devil iz in ze detailz!

Try this one https://www.tesla.com/de_DE/findus/location/supercharger/Ham...

While the site says they are available 24/7, they are in the gated parking area of a shopping mall, closed between 8 PM and 9 AM and on Sundays. Fun to see Danish tourists desperately try to reach them while very low on juice.

Extreme second-hand embarrassment (Fremdschämen) / facepalming!


Is there any news about power consumption? I didn't even see a TDP or similar mentioned.


One of the first things I looked at too...


From my comment elsewhere in this thread (https://news.ycombinator.com/item?id=45048078): "up to 170W" was the quote from March.


Demand > Supply?


I hope they do well. AFAIK they’re training or finetuning an older LLaMA model, so performance might lag behind SOTA. But what really matters is that ETH and EPFL get hands-on experience training at scale. From what I’ve heard, the new AI cluster still has teething problems. A lot of people underestimate how tough it is to train models at this scale, especially on your own infra.

Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.


No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

Source: I'm part of the training team


If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)


Absolutely! I sent you a LinkedIn message last week, but here seems to work much better. Thanks a lot!


Oh sorry I might have missed it! I think you or your colleague emailed me (I think?) My email is daniel @ unsloth.ai if that helps :)


Hey, really cool project, I'm excited to see the outcome. Is there a blog / paper summarizing how you are doing it? Also, which research group is currently working on it at ETH?


L3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb


Wait, the whole (English-language) web content dataset is ~50 TB?


Yes, if we take the filtered and deduplicated HTMLs of CommonCrawl. I've made a video on this topic recently: https://www.youtube.com/watch?v=8yH3rY1fZEA


Fun presentation, thanks! 72 min ingestion time for ~81 TB of data is ~1 TB/min or ~19 GB/s. Distributed or single-node? Shards? I see 50 jobs are used for parallel ingestion, and I wonder how ~19 GB/s was achieved, since ingestion rates were far below that figure last time I played around with ClickHouse performance. Granted, that was some years ago.


Distributed across 20 replicas.
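For what it's worth, a quick back-of-the-envelope check of those throughput figures; the even split across the 20 replicas is just an assumption for illustration:

    # Rough throughput check: ~81 TB ingested in 72 minutes.
    total_tb = 81
    minutes = 72
    tb_per_min = total_tb / minutes                 # ~1.13 TB/min
    gb_per_s = total_tb * 1000 / (minutes * 60)     # ~18.75 GB/s aggregate
    gb_per_s_per_replica = gb_per_s / 20            # ~0.94 GB/s if spread evenly over 20 replicas
    print(f"{tb_per_min:.2f} TB/min, {gb_per_s:.1f} GB/s total, {gb_per_s_per_replica:.2f} GB/s per replica")

Under 1 GB/s per replica is a much less surprising per-node ingestion rate than 19 GB/s on a single box.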


So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to LLaMa and other well-known models, it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!


Not sure about the Swiss laws, but the EU AI Act and Directive 2019/790 (the Digital Single Market copyright directive) it piggybacks on for this topic do allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained while respecting those mechanisms (and, as linked elsewhere, they didn't find any practical difference in performance; note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, so they could make that comparison).


That is not correct. The EU AI Act has no such provision, and the data mining exemption does not apply, as the EU has made clear. As for Switzerland, copyrighted material cannot be used unless licensed.


Thanks for clarifying! I wish you all the best luck!


Are you using dbpedia?


No. The main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq.


Thanks for engaging here.

Can you comment on how the filtering impacted language coverage? E.g. fineweb2 has 1800+ languages, but some with very little actual representation, while fineweb2-hq has just 20, but each with a substantial data set.

(I'm personally most interested in covering the 24 official EU languages.)


We kept all 1800+ (script/language) pairs, not only the quality-filtered ones. Whether mixing quality-filtered and unfiltered languages affects the data mix is still an open question. Preliminary research (Section 4.2.7 of https://arxiv.org/abs/2502.10361) indicates that quality filtering can mitigate the curse of multilinguality to some degree, and thus facilitate cross-lingual generalization, but it remains to be seen how strong this effect is at larger scale.


Imo, a lot of the magic is also dataset driven, specifically the SFT and other fine tuning / RLHF data they have. That's what has separated the models people actually use from the also-rans.

I agree with everything you say about getting the experience, the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model will be useful.


When I read "from scratch", I assume they are doing pre-training, not just finetuning; do you have a different take? Or do you mean it's the normal Llama architecture they're using? I'm curious about the benchmarks!


The infra does become pretty complex to get a SOTA LLM trained. People assume it's as simple as loading up the architecture and a dataset + using something like Ray. There's a lot that goes into designing the dataset, the eval pipelines, the training approach, maximizing the use of your hardware, dealing with cross-node latency, recovering from errors, etc.

But it's good to have more and more players in this space.


I'd be more concerned about the size used being 70B (DeepSeek R1 has 671B), which makes catching up with SOTA more difficult to begin with.


SOTA performance is relative to model size. If it performs better than other models in the 70B range (e.g. Llama 3.3) then it could be quite useful. Not everyone has the VRAM to run the full fat Deepseek R1.
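As a rough illustration of why the parameter count matters for local use, a naive memory estimate is just parameters times bits per parameter (this ignores KV cache and activation overhead, which add more on top):

    # Naive weight-memory estimate: params x bits per param / 8 (ignores KV cache, activations).
    def weights_gb(params_billion, bits_per_param):
        return params_billion * bits_per_param / 8

    for name, params in (("70B", 70), ("671B (DeepSeek R1)", 671)):
        for bits in (16, 4):
            print(f"{name} at {bits}-bit: ~{weights_gb(params, bits):.0f} GB")

So a 70B model fits in roughly 35 GB at 4-bit quantization, while the full R1 needs on the order of 335 GB even at 4-bit.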


Also, isn't DeepSeek's a Mixture of Experts? Meaning not all params ever get activated in one forward pass?

70B feels like the best balance between being usable locally and decent for regular use.

Maybe not SOTA, but a great first step.


Is there anything like this also supporting other GPUs? Thinking of Apple Silicon or embedded ones in phones etc.


As someone in Europe, I sometimes wonder what’s worse: letting US companies use my data to target ads, or handing it to Chinese companies where I have no clue what’s being done with it. With one I at least get an open source model. The other is a big black box.


Both are bad. If Europe does not develop local alternatives to ChatGPT or DeepSeek, it will (slowly) lose its sovereignty.


Europe is developing local alternatives such as Mistral.


They're not open source. It's nice of Meta and Deepseek to offer up their models for download, but that doesn't make them open source.


Hard to be fully open source if you train on copyrighted material.

Anyway, DeepSeek is the most open of the SOTA models.


Did they open their datasets already? It would be nice to have the 'thinking' part.


Isn't this a bit of semantic lawyering? Open model weights are not the same as open source in a literal sense, but I'd go so far as to suggest that open model weights fulfill much of the intent / "soul" of the open source movement. Would you disagree with that notion?


> open model weights fulfill much of the intent / "soul" of the open source movement

Absolutely not. The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

A binary (and that's arguably what weights are) you can semi-freely download and distribute is just shareware – that's several steps away from actual open source.

There's nothing wrong with shareware, but calling it open source, or even just "source available" (i.e. open source with licensing/usage restrictions), when it isn't, is disingenuous.


> The intent of the open source movement is sharing methods, not just artifacts, and that would require training code and methodology.

That's not enough. The key point was trust: an executable can be verified by independent review and rebuild. If it cannot be rebuilt, it could be a virus, trojan, backdoor, etc. For LLMs there is no way to reproduce them, and thus no way to verify them. So they cannot be trusted on their own; we have to trust the producers. It's not that important when models are just talking, but with tool use they can do real damage.


Hm, I wouldn't say that that's the key point of open software. There are many open source projects that don't have reproducible builds (some don't even offer any binary builds), and conversely there is "source available" software with deterministic builds that's not freely licensed.

On top of that, I don't think it works quite that way for ML models. Even their creators, with access to all training data and training steps, are having a very hard time reasoning about what these things will do exactly for a given input without trying it out.

"Reproducible training runs" could at least show that there's not been any active adversarial RHLF, but seem prohibitively expensive in terms of resources.


Well, 'open source' is interpreted in different ways. I think the core idea is that it can be trusted. You can get a Linux distribution and recompile every component except the proprietary drivers. With that being done by independent groups, you can trust it enough to run a bank's systems. The other option is something like Windows, where you have to trust Microsoft and their supply chain.

There are different variations, of course. Mostly related to the rights and permissions.

As for big models, even their owners, with all the hardware, training data, and code, cannot reproduce them. A model may have some undocumented functionality pretrained or added in post-processing, and it's almost impossible to detect without knowing the key phrase. It could be a harmless watermark or something else.


But there is also no publicly known way to implant unwanted telemetry, backdoors, or malware into modern model formats either (which hasn't always been true of older LLM model formats), which mitigates at least one functional concern about trust in this case, no?

It's not quite like executing a binary in userland - you're not really granting code execution to anyone with the model, right? Perhaps there is some undisclosed vulnerability in one or more of the runtimes, like llama.cpp, but that's a separate discussion.


The biggest problem is arguably at a different layer: These models are often used to write code, and if they write code containing vulnerabilities, they don't need any special permissions to do a lot of damage.

It's "reflections on trusting trust" all the way down.


If people who cannot read code well enough to evaluate whether or not it is secure are using LLMs to generate code, no amount of model transparency will solve the resulting problems. At least not while LLMs still suffer from the major problems they have, like hallucinations, or being wrong (just like humans!).

Whether the model is open source, open weight, both, or neither has essentially zero impact on this.


I've seen the argument that source code is the preferred form for making changes and modifications to software, but in the case of these large models, the weights themselves are that preferred form.

It's much easier and cheaper to make a finetune or LoRA than to train from scratch to adapt a model to your use case, so it's not quite like source vs. binary in software.
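As a concrete sketch of what adapting open weights via LoRA looks like with Hugging Face's PEFT library; the checkpoint name and hyperparameters below are placeholders, not recommendations:

    # Minimal LoRA fine-tuning sketch using Hugging Face PEFT.
    # Model name and hyperparameters are placeholders.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.1-8B"  # placeholder: any causal-LM checkpoint with open weights
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    config = LoraConfig(
        r=16,                                  # rank of the low-rank update matrices
        lora_alpha=32,                         # scaling factor for the update
        target_modules=["q_proj", "v_proj"],   # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well under 1% of the full model
    # ...then train with your usual loop/Trainer; only the adapter weights are updated.

The point is that you never need the original training pipeline to do this; the weights alone are enough to adapt the model.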


Meta's models do not; they have use restrictions. At least DeepSeek's do not.


It does not, and I totally disagree with that. Unless we can see the code that goes into the model to stop it from telling me how to make cocaine, it's not the same sort of soul.


> With one I at least get an open source model. The other is a big black box.

It doesn't matter much, as in both cases the provider has access to your inputs and outputs. The only question is whether you trust the company operating the model. (Yes, you can run a local model, but it's not that capable.)


The US is a capitalistic liberal democracy and China is a one-party capitalistic dictatorship. Make your choice.


The US tends towards dictatorship; due process is an afterthought, people disappearing off the streets, citizens getting arrested at the border for nothing, tourists getting deported over minute issues such as an iffy hotel booking, and that's just off the top of my head from the last 2 days.


You make it seem so binary. If you do enough research on the US you might change your mind. YES, I would still choose the US.


Thanks for the kind words, joelio182! Glad you see the value in making SSL more practical for real-world domain shift issues.

As liopeer mentioned, we have results for medical (DeepLesion) and agriculture (DeepWeeds) in the blog post. We haven't published specific benchmarks on satellite or industrial inspection data yet, but those are definitely the kinds of niche domains where pretraining on specific unlabeled data should yield significant benefits. We're keen to explore more areas like these.

Our goal is exactly what you pointed out - bridging the gap between SSL research and practical application where labels are scarce. Appreciate the encouragement!

