codingmoh's comments

In theory, it seemed perfect for flexible manufacturing: same machine, same material, endless outputs. But in practice, it hit limits in speed, material properties, and post-processing. You still can’t print a high-tolerance metal part at scale and cost-effectively replace traditional machining. It’s amazing for prototyping or niche parts.


"You still can't print a high-tolerance metal part at scale and cost-effectively..."

Dan Gelbart has a response (with caveats)

https://www.youtube.com/watch?v=kLgPW2672s4


Oh wow - that's cool! Thanks so much for sharing!


Hey, that is a very good question - I've answered it before, so I hope you don't mind if I simply copy-paste my previous answer:

Technically you can use the original Codex CLI with a local LLM - provided your inference provider implements the OpenAI Chat Completions API, including function calling.
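
For context, here's a minimal sketch of what that compatibility requirement looks like in practice - assuming a local server such as Ollama exposing an OpenAI-compatible endpoint at http://localhost:11434/v1; the model name and the tool definition are placeholders, not anything Codex CLI or open-codex actually ships with:

    # Sketch: what "Chat Completions API with function calling" means for a
    # local provider. Assumes an OpenAI-compatible server (e.g. Ollama) on
    # localhost:11434; the model and tool names are illustrative only.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

    tools = [{
        "type": "function",
        "function": {
            "name": "run_shell",  # hypothetical tool
            "description": "Run a shell command and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="qwen2.5-coder",  # any local model whose backend supports tools
        messages=[{"role": "user", "content": "List the files in this directory."}],
        tools=tools,
    )
    print(resp.choices[0].message.tool_calls)  # None if the model ignored the tool

If the local backend rejects the tools field (as in the phi4/Ollama error reported further down), this is exactly the kind of call that fails.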

But based on what I had in mind - the idea that small models can be really useful if optimized for very specific use cases - I figured the current architecture of Codex CLI wasn't the best fit for that. So instead of forking it, I started from scratch.

Here's the rough thinking behind it:

   1. You still have to manually set up and run your own inference server (e.g., with ollama, lmstudio, vllm, etc.).
   2. You need to ensure that the model you choose works well with Codex's pre-defined prompt setup and configuration.
   3. Prompting patterns for small open-source models (like phi-4-mini) often need to be very different - they don't generalize as well.
   4. The function calling format (or structured output) might not even be supported by your local inference provider.

Codex CLI's implementation and prompts seem tailored for a specific class of hosted, large-scale models (e.g. GPT, Gemini, Grok). But if you want to get good results with small, local models, everything - prompting, reasoning chains, output structure - often needs to be different.

So I built this with a few assumptions in mind:

   - Write the tool specifically to run _locally_ out of the box, no inference API server required.
   - Use the model directly (currently phi-4-mini via llama-cpp-python) - see the sketch below.
   - Optimize the prompt and execution logic _per model_ to get the best performance.

Instead of forcing small models into a system meant for large, general-purpose APIs, I wanted to explore a local-first, model-specific alternative that's easy to install and extend - and free to run.
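
To make "use the model directly" concrete, here's a minimal sketch of loading a quantized phi-4-mini GGUF with llama-cpp-python and asking it for a shell command. The model path, prompt, and settings are illustrative assumptions, not open-codex's actual configuration:

    # Sketch: direct local inference via llama-cpp-python, no API server.
    # The GGUF path and the system prompt are placeholders.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/phi-4-mini-q4_k_m.gguf",  # hypothetical local file
        n_ctx=4096,
        verbose=False,
    )

    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a CLI assistant. Reply with a single shell command, nothing else."},
            {"role": "user", "content": "Show the 10 largest files in this directory."},
        ],
        temperature=0.2,
        max_tokens=128,
    )
    print(response["choices"][0]["message"]["content"])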


Thanks for bringing that up - it's exactly why I approached it this way from the start.

Technically you can use the original Codex CLI with a local LLM - provided your inference provider implements the OpenAI Chat Completions API, including function calling.

But based on what I had in mind - the idea that small models can be really useful if optimized for very specific use cases - I figured the current architecture of Codex CLI wasn't the best fit for that. So instead of forking it, I started from scratch.

Here's the rough thinking behind it:

   1. You still have to manually set up and run your own inference server (e.g., with ollama, lmstudio, vllm, etc.).
   2. You need to ensure that the model you choose works well with Codex's pre-defined prompt setup and configuration.
   3. Prompting patterns for small open-source models (like phi-4-mini) often need to be very different - they don't generalize as well.
   4. The function calling format (or structured output) might not even be supported by your local inference provider.

Codex CLI's implementation and prompts seem tailored for a specific class of hosted, large-scale models (e.g. GPT, Gemini, Grok). But if you want to get good results with small, local models, everything - prompting, reasoning chains, output structure - often needs to be different.

So I built this with a few assumptions in mind:

   - Write the tool specifically to run _locally_ out of the box, no inference API server required.
   - Use the model directly (currently phi-4-mini via llama-cpp-python).
   - Optimize the prompt and execution logic _per model_ to get the best performance - see the sketch below.

Instead of forcing small models into a system meant for large, general-purpose APIs, I wanted to explore a local-first, model-specific alternative that's easy to install and extend - and free to run.
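
As a rough illustration of the per-model idea - the names, prompts, and numbers below are made up for the example, not open-codex's internals - the execution logic can key prompts and decoding settings off the model:

    # Sketch: a hypothetical per-model prompt/config registry. It only
    # illustrates "optimize per model"; none of this is open-codex's code.
    from dataclasses import dataclass

    @dataclass
    class ModelProfile:
        system_prompt: str
        temperature: float
        max_tokens: int

    PROFILES = {
        # Small local models tend to need short, rigid instructions.
        "phi-4-mini": ModelProfile(
            system_prompt="Reply with exactly one shell command. No prose, no markdown.",
            temperature=0.1,
            max_tokens=128,
        ),
        # A large hosted model can take a looser, more general prompt.
        "gpt-4o": ModelProfile(
            system_prompt="You are a helpful coding assistant. Prefer concise shell commands.",
            temperature=0.7,
            max_tokens=512,
        ),
    }

    def build_messages(model: str, user_request: str) -> list[dict]:
        profile = PROFILES[model]
        return [
            {"role": "system", "content": profile.system_prompt},
            {"role": "user", "content": user_request},
        ]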


I want to add support for Qwen 2.5 next.


QwQ-32 might also be worth looking into, as a high-level planning tool.


Thank you so much!


Hopefully Qwen 3 - and, if we're lucky, maybe Qwen 3 Coder - will be out this week too.



Thanks, I'll have a look


Thanks so much!

Was the model too big to run locally?

That’s one of the reasons I went with phi-4-mini - surprisingly high quality for its size and speed. It handled multi-step reasoning, math, structured data extraction, and code pretty well, all on modest hardware. Quantized Phi-1.5 / Phi-2 builds even run on a Raspberry Pi, as others have demonstrated.


The models work fine with "ollama run" locally.

When trying out "phi4" locally with:

open-codex --provider ollama --full-auto --project-doc README.md --model phi4:latest

I get this error:

    OpenAI rejected the request. Error details: Status: 400, Code: unknown, Type: api_error, Message: 400
    registry.ollama.ai/library/phi4:latest does not support tools. Please verify your settings and try again.


Hey, this is really cool! Curious how good the multi-language support is. Also - pretty wild that you trained the whole thing yourselves, especially without prior experience in speech models.

Might actually be helpful for others if you ever feel like documenting how you got started and what the process looked like. I’ve never worked with TTS models myself, and honestly wouldn’t know where to begin. Either way, awesome work. Big respect.


Thank you so much for the kind words :) We only support English at the moment, but hopefully we can add more languages in the future. We are planning to release a technical report on some of the details, so stay tuned for that!


I'd also love to peek behind the curtains, if only to satisfy my own curiosity. Looking forward to the technical report, well done!


I saw pretty good reasoning quality with phi-4-mini. But alright - I’ll still run some tests with qwen2.5-coder and plan to add support for it next. Would be great to compare them side by side in practical shell tasks. Thanks so much for the pointer!


Fair jab, haha - if we’re gonna go small, might as well go fully local and open. At least with phi-4-mini you don’t need an API key, and you can tweak or replace the model easily.


I went with Phi as the default model because, after some testing, I was honestly surprised by how high the quality was relative to its size and speed. The responses felt better on some reasoning tasks - while running on way less hardware.

What really convinced me, though, was the focus on the kinds of tasks I actually care about: multi-step reasoning, math, structured data extraction, and code understanding. There’s a great Microsoft paper on this, "Textbooks Are All You Need", and solid follow-ups with Phi-2 and Phi-3.


I know exactly what you mean - it’s the same cycle I’ve seen in countless projects, each one spiraling into a brittle pipeline that’s hard to maintain. Despite all the promising ideas, we never seem to nail down a universal solution that’s both simple and robust. There must be a way, though, and sooner or later it will be figured out.

