
This is largely a side effect of mimicking the distribution of internet text via pretraining.

It's a good basis for setting up a model of the world since we have so much data and it's free.

Post-training techniques like DPO and RLHF are then about using minimal hand-curated data (expensive!) to shift that distribution closer to standard human / desired behavior.

It will continue to get better -- early versions of ChatGPT were taught to say "I don't know" with something like 20 training examples, and the model got substantially better from just those. As the number of training examples grows with the amount of capital invested, more patterns will get latched onto and expressed by attention in these models.
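
To make that concrete, here's a hypothetical sketch of DPO-style preference pairs for calibrated refusal. The (prompt, chosen, rejected) triple is the standard DPO data format, but the examples and their contents are made up for illustration, not OpenAI's actual data:

    # Hypothetical DPO preference pairs teaching calibrated "I don't know".
    # Training nudges the model toward "chosen" and away from "rejected"
    # relative to a frozen reference model. Illustrative only.
    preference_pairs = [
        {
            "prompt": "What was the population of Atlantis in 300 BC?",
            "chosen": "I don't know -- Atlantis is legendary; no census exists.",
            "rejected": "Atlantis had roughly 1.2 million residents in 300 BC.",
        },
        {
            "prompt": "What will the S&P 500 close at next Tuesday?",
            "chosen": "I can't know that; markets aren't predictable to the cent.",
            "rejected": "It will close at exactly 5,842.13.",
        },
    ]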

----

It will take time, but they'll get pretty robust. Models will still be susceptible to Dunning-Kruger-style ignorance -- they aren't perfect, and those failure modes are in the training data they copy from us humans.


A single synchronous request is not a good way to understand cost here unless your workload really is singular tiny requests. ChatGPT handles many requests in parallel, and this article's 4-GPU setup can certainly handle more too.

It is miraculous that the cost comparison isn't worse given how adversarial this test is.

Larger requests, concurrent requests, and request queueing will drastically reduce cost here.
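
To illustrate, here's a minimal harness for measuring amortized cost under concurrency. It assumes an OpenAI-compatible server (e.g. vLLM) running locally; the URL and model name are placeholders:

    import asyncio, time
    import httpx

    async def one(client, prompt):
        # one completion request against a local OpenAI-compatible endpoint
        r = await client.post(
            "http://localhost:8000/v1/completions",
            json={"model": "my-model", "prompt": prompt, "max_tokens": 128},
            timeout=120.0,
        )
        return r.json()

    async def main(n=64):
        async with httpx.AsyncClient() as client:
            t0 = time.time()
            await asyncio.gather(*(one(client, "Hello") for _ in range(n)))
            dt = time.time() - t0
            # with continuous batching, dt grows far slower than n
            print(f"{n} requests in {dt:.1f}s -> {dt/n:.2f}s amortized each")

    asyncio.run(main())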


Of course, we'd already need to be working in spaghetti ints, or prepping them will be as complex as a linear scan.

Can't wait for spaghetti arithmetic.

Do we have a better algo than log(n) for locating the min?

I'm thinking: max-align them, lay the bundle across an arm at the midway point, sweep the shorts that fell, and repeat until all remaining strands are the same length.
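
A toy simulation of that arm sweep, on the reading that the fallen shorts are kept (the min is among them) and each physical pass costs O(1); all names here are made up for illustration:

    import random

    def spaghetti_min(strands):
        # support the max-aligned bundle at the midpoint of the plausible
        # length range; strands shorter than the support fall off the arm
        lo, hi = 0, max(strands)
        candidates = list(strands)
        while len(set(candidates)) > 1:
            mid = (lo + hi) / 2
            fallen = [s for s in candidates if s < mid]
            if fallen:
                candidates, hi = fallen, mid   # min is among the fallen shorts
            else:
                lo = mid                       # nobody fell; raise the arm
        return candidates[0]

    strands = [random.randint(1, 100) for _ in range(1000)]
    assert spaghetti_min(strands) == min(strands)

That's O(log(range)) physical passes, though the simulation above still touches every strand per pass.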


A lot of the YOLO stuff from Ultralytics is AGPL-3.0, FYI. Recommend caution depending on what code or models / model lineage you use.


Thanks for the suggestion. We train the model using YOLO, but during inference the model is converted to ONNX and we use ONNXRuntime. As a result, YOLO itself is not included in the software package. We will open-source the training code in the repo soon.
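
For anyone curious, that inference path looks roughly like this -- the model path and input shape are placeholders (YOLO-family exports typically take an NCHW float tensor):

    import numpy as np
    import onnxruntime as ort

    # load the exported model; no ultralytics code is needed at this point
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name

    dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)  # batch of 1 image
    outputs = sess.run(None, {input_name: dummy})
    print([o.shape for o in outputs])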


In some places, it's "gay" if two guys hold hands. On the racist bit: it can be hard to disambiguate why someone treats you in an unfavorable way -- it could be because they're actually racist or they could just be a pissy person who hates their job.

Just because a person or group of people classifies you a certain way doesn't make it universally true.


> In some places, it's "gay" if two guys hold hands.

Sure, but that doesn't mean those two guys are gay.

> On the racist bit: it can be hard to disambiguate why someone treats you in an unfavorable way -- it could be because they're actually racist or they could just be a pissy person who hates their job. Just because a person or group of people classifies you a certain way doesn't make it universally true.

It might not be universally true, but it is subjectively true. That's the difference: even if you think you're not being racist, you might be racist from the PoV of someone you were racist towards.


Then it's an arguably arbitrary classification, right? The borders seem potentially ill-defined and/or social in nature.


"Subjective" and "arbitrary" are not the same thing.


You can get most products on-prem at a certain price & size. A lot of companies will apply resistance, though, unless the contract size is right, because on-prem contracts tend to be less unit-profitable, riskier, and come with unique, annoying terms or constraints.

If you *need* it, you find a human to talk to (sales, or a connection from your network).


Is there a design paper somewhere? Curious how this was accomplished (and what trade-offs / failure modes it has): "System uses a consistency sharding algorithm, lock-free design, task scheduling is accurate down to the second, supporting lightweight distributed computing and unlimited horizontal scaling"


I think I have this: https://www.nationwidechildrens.org/conditions/auditory-proc...

Might be the same thing depending on your full set of symptoms.


Yeah, they tested for APD (two different approaches). I found it really difficult, though enjoyable in terms of being able to self-analyse what it was that made me struggle. That said, my results were apparently normal, and everyone finds it difficult.


You might not see it available via self-serve interfaces, but I think H100 cards can do 256 GPUs in a single NVLink domain.


You're right! You can NVLink 32 systems with 8 H100s each -- that's 20 terabytes of VRAM total!
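
The arithmetic, assuming 80 GB of HBM per H100:

    systems, gpus_per_system, gb_per_gpu = 32, 8, 80
    print(systems * gpus_per_system * gb_per_gpu / 1000, "TB")  # 20.48 TB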


I think people said the same thing about NNs in general before we hit a scale where they started performing magic.

There could be exponential or quadratic scaling laws with any of these black boxes that make one approach suddenly extremely viable or even dominant.


> There could be exponential or quadratic scaling laws with any of these black boxes that make one approach suddenly extremely viable or even dominant.

The reason I like the CPU approach is that the memory scaling is bonkers compared to GPU. You can buy a server with 12TB of DRAM (in stock right now) for the cost of one of those H100 GPU systems. That's enough memory to hold over 3 trillion parameters at full 32-bit FP precision. Employ some downsampling and you could get even more ridiculous.

If 12TB isn't enough, you can always reach for things like RDMA and high-speed interconnects. You could probably get 100 trillion parameters into 1 rack. At some point you'll need to add hierarchy to the SNN so that multiple racks & datacenters can work together.
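
Back-of-envelope for those two numbers, assuming 4 bytes per fp32 parameter:

    dram = 12 * 1024**4        # 12 TiB per box, in bytes
    print(dram / 4 / 1e12)     # ~3.3 trillion fp32 params per box
    print(100e12 * 4 / dram)   # ~30 such boxes for 100T params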

Imagine the power savings... It's not exactly a walk in the park, but those DIMMs are very eco-friendly compared to GPUs. You don't need a whole lot of CPU cores in my proposal either: 8-16 very fast cores per box would probably be more than enough, looking at how fintech does things. One thread actually runs the entire show in my current prototype; the other threads are for spike timers & managing other external signals.
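
Not the author's code, but a minimal single-threaded sketch of what an event-driven loop over DRAM-resident weights can look like (leaky integrate-and-fire; every name and constant here is illustrative):

    import heapq, math

    TAU, THRESHOLD, DELAY = 10.0, 1.0, 1.0

    def run(events, weights, potential, last_seen, horizon):
        # events: heap of (time, neuron_id); weights: dict src -> {dst: w}
        while events and events[0][0] < horizon:
            t, src = heapq.heappop(events)
            for dst, w in weights.get(src, {}).items():
                # decay the membrane potential for the elapsed time, then integrate
                potential[dst] *= math.exp(-(t - last_seen[dst]) / TAU)
                potential[dst] += w
                last_seen[dst] = t
                if potential[dst] >= THRESHOLD:
                    potential[dst] = 0.0                      # reset after firing
                    heapq.heappush(events, (t + DELAY, dst))  # schedule downstream spike

    # toy usage: neuron 0 spikes at t=0 and excites neuron 1 past threshold
    run([(0.0, 0)], {0: {1: 1.5}, 1: {}}, {0: 0.0, 1: 0.0}, {0: 0.0, 1: 0.0}, horizon=5.0)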


Is your current prototype open source?


Not the TS, but that's actually the same goal I have in mind with the project at [0].

Right now I'm building my homelab server, aiming to fit 1 TB of RAM and 2 CPUs with ~100 cores total.

It will cost something like 0.1% of what I'd pay for a GPU cluster with the same memory size :)

[0] https://github.com/gotzmann/llama.go/

