Ey7NFZ3P0nzAe 38 days ago | on: Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolv...

In my case, I was also running an ASR model and a TTS model, so it was a bit much for my RTX 3090. I opted to offload about 5 layers to the CPU while adding GPU-only speculative decoding with their 0.8B model. Working well so far.
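
A rough sketch of that kind of setup, assuming a llama.cpp `llama-server` runtime (the runtime is not stated above, so this is only one way it could look). The file names and the total layer count are hypothetical placeholders. The helper `build_server_args` is made up for illustration, and it just assembles the command line: keep all but a few layers of the main model on the GPU and load the small draft model fully on the GPU for speculative decoding.

```python
def build_server_args(model_path, draft_path, total_layers, cpu_layers=5):
    """Assemble a llama-server command line (assumed runtime: llama.cpp).

    total_layers is the main model's layer count; it is a placeholder here
    and has to be read from the actual model you load.
    """
    if not 0 <= cpu_layers <= total_layers:
        raise ValueError("cpu_layers must be between 0 and total_layers")

    # Layers not kept on the GPU stay on the CPU.
    gpu_layers = total_layers - cpu_layers

    return [
        "llama-server",
        "--model", model_path,
        "--n-gpu-layers", str(gpu_layers),
        # Draft model for speculative decoding, kept entirely on the GPU
        # by requesting more layers than it could possibly have.
        "--model-draft", draft_path,
        "--gpu-layers-draft", "999",
    ]


if __name__ == "__main__":
    # Hypothetical paths and layer count, for illustration only.
    args = build_server_args("main-model.gguf", "draft-0.8b.gguf", total_layers=64)
    print(" ".join(args))
```

How many layers need to move to the CPU depends on how much VRAM the ASR and TTS models already take, so the `cpu_layers` value is something to tune rather than a fixed number.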