
GPT-OSS-20B-Vision: First community VLM for GPT-OSS, trained on a single DGX Spark

A couple of weeks ago I shipped an MCP server (noapi-google-search-mcp), and people in the community challenged me to do something harder: build a VLM. So I bought a DGX Spark, flew to Dubai, and built the first vision-language model for GPT-OSS from a hotel room. Just a Spark, hotel Wi-Fi, and stubbornness.

This is an early proof of concept at 22% of the planned training run. I shipped it to show what's possible and to find compute partners to finish the job.

What it does: Adds vision to GPT-OSS-20B. It takes an image plus a text prompt and generates coherent descriptions, identifying objects, scenes, and spatial relationships. Vision was trained directly into the model through QLoRA adaptation: the LLM learned to see, not just pass through visual tokens. All original text capabilities are fully preserved. Hallucinations are present, which is expected at this training stage.
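
For readers who want to try it, a minimal inference sketch, assuming a standard transformers-style multimodal interface with custom modeling code. The repo id is truncated in the model-card link below, so MODEL_ID here is a placeholder, and the exact processor API may differ from this:

    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    # Placeholder id: the full repo id is truncated in the link below;
    # see the Hugging Face model card for the exact id and loading steps.
    MODEL_ID = "vincentkaufmann/gpt-oss-20b-vision-..."

    # trust_remote_code is assumed here, since the vision wiring is custom.
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, device_map="auto"
    )

    image = Image.open("photo.jpg")
    inputs = processor(text="Describe this image.", images=image,
                       return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output_ids[0], skip_special_tokens=True))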

How it works: A SigLIP vision encoder feeds into the 20B MoE language model through a method I call PseudoDeepStack, which extracts visual features from multiple encoder depths instead of just the final layer. That gives richer visual representations at near-zero additional inference cost, since the intermediate activations are computed during the encoder's forward pass anyway.
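
A minimal sketch of the multi-depth idea. The tap layers, the fusion by concatenation, and the SigLIP width of 1152 are illustrative assumptions on my part, not details from the model card; 2880 is GPT-OSS's hidden size, which the visual tokens get projected into:

    import torch
    import torch.nn as nn

    # Sketch of the "PseudoDeepStack" idea: instead of projecting only the
    # vision encoder's final layer, tap hidden states at several depths,
    # concatenate them per patch token, and project once into the LLM
    # embedding space. Layer indices and dimensions are illustrative guesses.
    class PseudoDeepStackProjector(nn.Module):
        def __init__(self, vision_dim=1152, llm_dim=2880, tap_layers=(7, 15, 23)):
            super().__init__()
            self.tap_layers = tap_layers
            # One linear projector over the concatenated multi-depth features.
            self.proj = nn.Linear(vision_dim * len(tap_layers), llm_dim)

        def forward(self, hidden_states):
            # hidden_states: tuple of per-layer activations, each of shape
            # (B, N, vision_dim), e.g. a SigLIP encoder's `hidden_states`
            # output when run with output_hidden_states=True.
            feats = [hidden_states[i] for i in self.tap_layers]
            fused = torch.cat(feats, dim=-1)  # (B, N, vision_dim * num_taps)
            return self.proj(fused)           # (B, N, llm_dim) visual tokens

    # Shape check with dummy activations standing in for a 24-layer encoder.
    dummy = tuple(torch.randn(1, 196, 1152) for _ in range(25))
    tokens = PseudoDeepStackProjector()(dummy)
    print(tokens.shape)  # torch.Size([1, 196, 2880])

The intermediate layers cost nothing extra to read because the encoder produces them anyway; the only added work is the wider projector input.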

Key finding: Projector-only training, the standard approach for dense VLMs, fails completely on MoE architectures, because the expert routing can't handle visual tokens it has never seen. QLoRA adaptation of the language model solves this.
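
For illustration, here's what generic QLoRA adaptation of the base LLM looks like with peft and bitsandbytes. This is the standard recipe, not the author's exact configuration; in particular, GPT-OSS checkpoints ship MXFP4-quantized, so the real quantization path may differ, and the target module names below are assumptions:

    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Standard QLoRA recipe: load the frozen base in 4-bit NF4, then train
    # small LoRA adapters on top. An illustrative sketch, not the exact setup.
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        "openai/gpt-oss-20b",        # upstream base model on Hugging Face
        quantization_config=bnb,
        device_map="auto",
    )

    # Adapting the LLM itself (here: attention projections; module names are
    # assumptions) is what lets the model learn to cope with visual tokens,
    # which a frozen-LLM, projector-only setup never gives it a chance to do.
    lora = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # only the adapters are trainable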

The setup: a single NVIDIA DGX Spark (GB10), a hotel room in Dubai, and Domino's pizza. No cluster, no team. Roughly 3.5 days of training to reach this checkpoint.

What's next: finishing training with new hyperparameters based on what we learned from this run, scaling to GPT-OSS-120B (the same projector works, since the two models share hidden dimensions), and benchmarking. Need compute to get there.

Model + code + full model card: https://huggingface.co/vincentkaufmann/gpt-oss-20b-vision-pr...


About to release GPT-OSS-120B-Vision and GPT-OSS-20B-Vision, how about that! :D

It's meant to be super lightweight for people who run 1B, 3B, 8B, or 20B models on skinny devices: one pip install with high impact :D

Coolest thing about it is that it's one pip install to give your local model the ability to see, do Google searches, and use News, Shopping, Scholar, Maps, Finance, Weather, Flights, Hotels, Translate, Images, Trends, etc.

It's the easiest and fastest way to do it, and the impact is massive.


GPT-OSS-120B runs like hell on my DGX Spark

The MXFP4 variant, I suppose? My setup (RTX Pro 6000) does around 140 tok/s with llama.cpp and around 160 tok/s with vLLM.

Yep, MXFP4, really fast :D

too slow bro

It might be slower, but then it can take the actual image as input, not just some description of it.

Thought this was "Hacker News", bro
