You need to set sampling parameters for the LLM. Had the same issue with Qwen3.5 when I first started. You can usually grab them off the model card page.
From the Qwen3.6 page:
Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Thinking mode for precise coding tasks (e.g. WebDev): temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0
Instruct (or non-thinking) mode: temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0
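If it helps, this is roughly how you'd pass the first preset into vLLM's offline API (a minimal sketch; the model name is just a placeholder for whatever Qwen3.6 checkpoint you're actually running, and max_tokens is my own addition, not from the card):

```python
from vllm import LLM, SamplingParams

# "Thinking mode, general tasks" settings quoted above.
# Model name and max_tokens are placeholders, not from the model card.
llm = LLM(model="Qwen/Qwen3.6")
params = SamplingParams(
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    min_p=0.0,
    presence_penalty=0.0,
    repetition_penalty=1.0,
    max_tokens=4096,
)
outputs = llm.generate(["Explain what min_p sampling does."], params)
print(outputs[0].outputs[0].text)
```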
Yes, I have tried all of these (as per the docs). Have you actually tried them? Because I have tried all three configurations you mentioned with agentic coding and get the same result: loops.
I've only used Qwen3.5 for work so far and was, after initial struggles, successful with a GPU setup, no MLX. Ngl, the fact that they use `presence_penalty: 0` and no `max_tokens` is weird after that exact setup caused my "initial struggles", but I've now set up a simple docker-compose with vLLM and Qwen3.6 to test it out, and it worked perfectly fine for me.
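For reference, this is roughly how I'm hitting the vLLM container from Python (compose file omitted; the port, model name, and prompt are just what I happened to use, so treat them as assumptions - the sampling values are the "precise coding" preset quoted above):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server; base_url and model name depend on how
# the container was launched. top_k/min_p/repetition_penalty go through
# extra_body because the standard OpenAI fields don't cover them.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6",
    messages=[{"role": "user", "content": "Write a function that deduplicates a list."}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20, "min_p": 0.0, "repetition_penalty": 1.0},
)
print(resp.choices[0].message.content)
```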
min_p author here. min_p is strictly better than top_p and top_k. The big labs don't know shit about sampling, and give absolutely nuts recommendations like this.
Set min_p to something like 0.3, ignore top_p and top_k, and you'll be fine.
There are better samplers now, like top N sigma, top-h, P-less decoding, etc., but they're often not available in your LLM inference engine (e.g. vLLM).
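In vLLM terms, that suggestion looks something like this (a rough sketch; top_p=1.0 and top_k=-1 are how you leave those filters effectively off in SamplingParams, and the 0.3 is the value from my comment above, not any official default):

```python
from vllm import SamplingParams

# min_p only: drop every token whose probability is below
# 0.3 * (probability of the most likely token); nucleus/top-k left off.
params = SamplingParams(
    temperature=1.0,
    top_p=1.0,   # disabled: keep the full nucleus
    top_k=-1,    # disabled: consider all tokens
    min_p=0.3,
)
```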
I’m wondering though, what does extra creativity in code generation actually look like? How is the creativity expressed in code? Does the LLM reach for Bubble Sort instead of Quicksort? Maybe it decides that sorting only the first 10 elements of an array is enough? Funny variable names? Cursing in comments?
To be clear, in my post above claiming min_p is strictly better than top_p, we are not arguing that min_p is better for "creative code" (you really don't want high temperature anywhere near your code generation, despite the "turning up the heat" framing of our paper).
We are instead arguing that min_p is better at truncating the tokens most likely to lead to degeneration/looping, because it is partially distribution aware. Fully distribution aware samplers like the ones I mentioned above (e.g. P-less decoding) are strictly superior, since they use the whole distribution to decide the truncation at every time step.
Code hallucinations, like many LLM hallucinations, can be seen as the accumulation of many small "sampling errors".
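For anyone curious what "partially distribution aware" means concretely, here's min_p in a few lines of numpy (a toy sketch of the truncation step, not any particular engine's implementation): the cutoff is relative to the top token's probability, so a confident distribution gets pruned hard while a flat one keeps more candidates.

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float = 0.1) -> np.ndarray:
    """Zero out tokens with probability below min_p * max(probs), then renormalize."""
    threshold = min_p * probs.max()
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# Peaked distribution: only the dominant token survives the cutoff.
print(min_p_filter(np.array([0.90, 0.05, 0.03, 0.02])))  # -> [1. 0. 0. 0.]

# Flat distribution: the same min_p keeps all four plausible tokens.
print(min_p_filter(np.array([0.30, 0.28, 0.22, 0.20])))  # -> all kept
```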