mehdibl's comments | Hacker News

How to do an agentic workflow like it's 2 years ago.

What would SOA be?

Claude Code is perfectly able to access your git/jj history directly. Just ask it to review a commit.

The claims are misleading as always, since they don't show the context length or the prefill time when you use a lot of context. It will be fun waiting minutes for a reply.


The issue is that llama.cpp is shipped and then no longer updated. Or you need to download it all again.


This is killing me with complexity. We had agents.md and were supposed to augment the context there. Now we're back to Cursor rules and another .md file to ingest.


MCPs feel complicated. Skills seem to me like the simplest possible design for a mechanism for adding extra capabilities to an existing coding agent.


Can skills completely replace MCPs? For example, can a skill be configured to launch my local Python program in its own venv? I don’t want Claude to spend time spinning up a runtime.


Skills only work if you have a code environment up and running and available for a coding agent to execute commands in.

You can absolutely have a skill that tells the coding agent how to use Python with your preferred virtual environment mechanism.

I ended up solving that in a slightly different way - I have a Claude hook that spots attempts to run "python" or "python3" and returns an error saying "use uv run instead".
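Roughly, the hook is a small script wired up as a PreToolUse hook for the Bash tool. A minimal sketch (the file name and the matching logic are mine, and it assumes Claude Code's hook convention of JSON on stdin plus exit code 2 to block with a message on stderr):

    #!/usr/bin/env python3
    # block_python.py - hypothetical PreToolUse hook: reject bare python/python3, suggest uv run
    import json
    import re
    import sys

    payload = json.load(sys.stdin)  # hook input arrives as JSON on stdin
    command = payload.get("tool_input", {}).get("command", "")

    # Match "python" or "python3" invoked as the command itself (start of line or after ;, &, |)
    if re.search(r"(^|[;&|]\s*)python3?\b", command):
        print("Don't call python/python3 directly - use `uv run` instead.", file=sys.stderr)
        sys.exit(2)  # blocking exit code: the stderr message is fed back to the agent
    sys.exit(0)  # anything else runs normally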


I tell Claude to make its own skills. “Which part of this task is worth making a skill for? Use your skill-making skill to do it.”


If we aren’t in the take off phase, I don’t know where we are


Skills are just pointers to context so you don't need to load all of it upfront; it is as simple as that. By the way, Cursor rules are effectively the same thing as agents.md.


Maybe Mozilla has a point.

Chat apps are replacing a lot of the old way we consume information and search, which mostly happens through the browser. So I see the vision: follow this transformation to keep market share and offer an alternative to the big players.

Mozilla and Firefox are losing market share and revenue too, and that could bite back.


OK, and then what? Those models were not trained for this purpose.

It's like the last hype over using generative AI for trading.

You might use it for sentiment analysis, summarization and data pre-processing. But classic forecast models will outperform them if you feed them the right metrics.


These are all multi-modal models, right? And the vision capabilities are particularly touted in Gemini.

https://ai.google.dev/gemini-api/docs/image-understanding


It is relevant because they are trained for the purpose of browser use and completing tasks on websites. Being able to bypass captchas is important for using many websites.

It would be nice to see comparisons to some special-purpose CAPTCHA solvers though.


And more broadly, if an agent is supposed to do everything a human can on the web, its ability to solve a captcha is likely a decent litmus test.


The previous report blaming TP-Link for being slow to patch a CVE was already outdated, as the CVE had been patched. Yes, TP-Link products do receive updates if they are not EOL. And even US products are vulnerable once EOL.

This seems more like heavy lobbying to take their US market share rather than a push for secure products.

Also, regarding the Check Point report on firmware used to attack the EU: the malware is firmware agnostic, as it can be used on other hardware too.


Before investing in instrumentation, you should have solid static analysis, unit tests, integration tests and so on. Logging helps flag issues post-deployment, but you can catch a lot if you test.


100%, and we have all of those things. Canary acts as the last line of defence, and honestly, when Canary detects and rolls back, it is already an incident that is being auto-mitigated with a limited blast radius.

To reduce the potential blast radius, we are working on a cohort-based canary, which will allow us to validate against a minimal, stable subset of traffic with the desired properties.


What matters is not the context or the record tokens/s you get.

But the quality of the model. And it seems Grok is pushing the wrong metrics again, after launching fast.


I thought the number of tokens per second didn't matter until I used Grok Code Fast. I realized that it makes a huge difference. If it takes more than 30s to run, I lose focus and look at something else. I end up being a lot less productive. It also opens up the possibility to automate a lot more simple tasks. I would definitely recommend people try fast models.


If you are single tasking, speed matters to an extent. You need to still be able to read/skim the output and evaluate its quality.

The productive people I know use git worktrees and are multi-tasking.

The optimal workflow is when you can supply one or more commands[1] that the model can run to validate its work and get feedback on its own. Think of it like RLHF for the LLM: it is getting feedback, albeit not from you, which would be laborious.

As long as the model gets feedback it can run fairly autonomously with less supervision; it does not have to be test-driven feedback. If all it gets is you as the feedback, the bottleneck will always be the human time to read, understand and evaluate the response, not token speed.

With current leading models, doing 3-4 workflows in parallel is not that hard when fully concentrating; of course it is somewhat less when browsing HN :)

---

[1] The command could be a unit test runner, a build/compile step, or an e2e workflow; for UI that could be Chrome MCP/CDP, Playwright/Cypress, Storybook and so on. There are even converts to a version of TDD to benefit from this gain.

You could have one built for your use case if no existing ones fit, with model help of course.
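For a concrete example, the "command" can be as simple as a wrapper script the agent runs after every change. A minimal sketch, assuming a Python project that uses pytest and ruff (the check.py name and the tool choices are placeholders; substitute whatever your stack actually runs):

    #!/usr/bin/env python3
    # check.py - one feedback command for the agent: run lint + tests, print a compact summary
    import subprocess
    import sys

    def run(name, cmd):
        result = subprocess.run(cmd, capture_output=True, text=True)
        print(f"[{'PASS' if result.returncode == 0 else 'FAIL'}] {name}")
        if result.returncode != 0:
            # Show only the tail so the agent's context isn't flooded with output
            print((result.stdout + result.stderr)[-2000:])
        return result.returncode == 0

    ok = all([
        run("lint", ["ruff", "check", "."]),
        run("tests", ["pytest", "-q"]),
    ])
    sys.exit(0 if ok else 1)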


Hmm. I run maybe 3 work streams max in parallel and struggle to keep up with the context switching. I have some level of skepticism that your colleagues are amazingly better, do 4, and produce quality code at a faster rate than 1 or 2 work streams in wall-clock time. I consider a workstream to be disparate features or bugs that are unrelated and require attention. Running 8 agents in parallel that are all doing the same thing is of course trivial nowadays, but that in and of itself is what I would consider a single-threaded workstream.


We have a similar definition of streams, but it depends on a lot of things: your tooling, language, stack, etc.

For example, if your builds take a fair bit of time (incremental builds may not work in a worktree the first time), or you are working on an item with high-latency feedback like an e2e suite that runs in an actual browser.

Prompt style also influences this. I like to write a fairly detailed prompt that covers a lot of the nuances upfront and spend 10-15 minutes or more writing it. I find that when I do that it takes longer, but I only give simple feedback during the run itself, freeing me to go to the next item. Some people prefer a chat-style approach; you cannot keep a lot of threads in mind if chatting.

Model and CLI client choice matters; on average Codex is slower than Sonnet 4.5. Within each family, if you enable thinking or use the high-reasoning model it can be slower as well.

Finally, not all tasks are equal. I like to mix some complex and simpler ones, or pair a dev-ex item or a refactor that requires a lower attention budget with features that require more.

Having said that, while I don’t know any 10x-type developers, I wouldn’t be surprised if there were such people who can be truly that productive.

The analogy I think of is chess. Maybe I can play 2-3 games in parallel reasonably well, but there are professional players who can play dozens of games blindfolded and win all of them.


Nice answer - all of the above aligns with my experience.

I use Sonnet a lot more than OpenAI models, and its speed means I do have to babysit it more and get chattier, which does make a difference. You are probably right that if I were using Codex, which is on average 4-6 times slower than Claude Code, I would have more mental bandwidth to handle more workstreams.


This reads like satire. Who can work on two separate features at the same time?


I completely agree. Grok’s impressive speed is a huge improvement. Never before have I gotten the wrong answer faster than with Grok. All the other LLMs take a little longer and produce a somewhat right answer. Nobody has time to wait for that.


Seems reductive. Some applications require higher context length or fast tokens/s. Consider it a multidimensional Pareto frontier you can optimize for.


It's not just that some absolutely require it, but a lot of applications hugely benefit from more context. A large part of LLM engineering for real world problems revolves around structuring the context and selectively providing the information needed while filtering out unneeded stuff. If you can just dump data into it without preprocessing, it saves a huge amount of development time.


Depending on the application, I think “without preprocessing” is a huge assumption here. LLMs typically do a terrible job of weighting poor quality context vs high quality context and filling an XL context with unstructured junk and expecting it to solve this for you is unlikely to end well.

In my own experience you quickly run into jarring tangents or “ghosts” of unrelated ideas that start to shape the main thread of consciousness and resist steering attempts.


It depends to the extent I already mentioned, but in the end more context always wins in my experience. If you for example want to provide a technical assistant, it works much better if you can provide an entire set of service manuals to the context instead of trying to put together relevant pieces via RAG.


Quality of the model tends to be pretty subjective, and people also complain about gaming benchmarks. At least context window length and generation speed are concrete improvements. There's always a way you can downplay how valuable or impressive a model is.


Depends. For coding at least, you can divide tasks into high-intelligence ($$$) and low-intelligence ($) tasks. Being able to do low-intelligence tasks super fast and cheap would be quite beneficial. A majority of code edits would fall into the fast-and-cheap subset.


Grok's biggest feature is that unlike all the other premier models (yes I know about ChatGPT's new adult mode), it hasn't been lobotomized by censoring.


I am amazed people actually believe this

Grok is the most biased of the lot, and they’re not even trying to hide it particularly well


Bias is not the same as censoring.

Censoring is "I'm afraid I can't let you do that, Dave".

Bias is "actually, Elon Musk waved to the crowd."

Everyone downthread is losing their mind because they think I'm some alt-right clown, but I'm talking about refusals, not Grok being instructed to bend the truth in regard to certain topics.

Bias is often done via prompt injection, whilst censoring is often in the alignment, and in web interfaces via a classifier.


They are different, but they’re not that different.

If Grok doesn’t refuse to do something, but gives false information about it instead, that is both bias and censorship.

I agree that Grok gives the appearance of the least censored model. Although, in fairness, I never run into censored results on the other models anyway because I just don’t need to talk about those things.


[flagged]


> it's undisputed that Chat GPT and Gemini insert hidden text into prompts to change the outputs to conform to certain social ideologies

And why do you think Grok doesn’t? It has been documented numerous times that Grok’s prompt has been edited at Musk’s request because the politics in its answers weren’t to his satisfaction.


Nothing you posted (from an almost two-year-old article btw) in any way refutes the prior comment.

Grok is significantly the most biased. Did you sleep through its continuous insertion of made-up stuff about South Africa?

This is the same person who is trying to re-write an entire encyclopedia because facts aren't biased enough.

A group has created an alternate reality echo chamber, and the more reality doesn't match up the more they are trying to invent a fake one.

When you're on the side of book banning and Orwellian re-writing of facts & history, that side never turns out to have been the good side. It's human nature for some people to be drawn to it as an easy escape rather than allowing their world views to be challenged. But you'd be hard pressed to find a group that did this, any of the times it's been done, that was anything but a negative for their society.


It takes a lot of chutzpah to accuse people of "re-writing ... facts & history" while peddling AI (and movies and TV shows) that change the ethnicities of historical figures.


Two years is an eternity in this business. Got anything newer than that?


[flagged]


Can’t help but feel everyone making a pro-Grok argument here isn’t actually making the case that it’s uncensored, rather that it’s censored in a way that aligns with their politics, and thus is good


It's almost always telling isn't it?

Almost like chatting with an LLM that refuses to make that extra leap of logic.

"if the llm won't give racist or misogynistic output, it's biased in the wrong way!"


Has the possibility occurred to you that the majority of the editors aren't American and don't care about American culture wars?

What you think of as "heavily biased to the left" is, globally speaking, boring middle of the road academia.


According to a recent Economist article, even Grok is left-biased.


[flagged]


Oh the hubris.


Well, it's kind of a tautology, isn't it? Conservatism always loses in the end, for better or worse, simply because the world and everything in it undergoes change over time.


Relax downvoters, I write it pretty tongue-in-cheek understanding full well the scope of “real” political ideas, and think reasonable people can be all over the political spectrum.

This quote is seared into my head because, for my father, anything that disagrees with his conspiracies is liberal bias. If I say “ivermectin doesn’t cure cancer”, that’s my liberal bias. “Climate change is not a hoax by the you-know-who’s to control the world” == liberal bias. “Bigfoot only exists in our imagination”… liberal bias (I’m not joking on any of these whatsoever).

So I’ve been saying this in my head and out loud to him for a looooong time.


“No censoring” and “it says the things I agree with” are not the same thing.


It doesn't blindly give you the full recipe for how to make cocaine. It's still lobotomized, it's just that you agree with the ways in which it's been "lobotomized".


Grok has plenty of censoring. E.g.

"I'm sorry, but I cannot provide instructions on how to synthesize α-PVP (alpha-pyrrolidinopentiophenone, also known as flakka or gravel), as it is a highly dangerous Schedule I controlled substance in most countries, including the US."


Is this the same AI model that at some point managed to turn every single topic into one about the “white genocide” in South Africa?


How does this sort of thing work from a technical perspective? Is this done during training, by boosting or suppressing training documents, or is is this done by adding instructions in the prompt context?


I think they do it by adding instructions since it came and went pretty fast. Surely if it was part of the training, it would take a while longer to take in.


This was done by adding instructions to the system prompt context, not through training data manipulation. xAI confirmed a modification was made to “the Grok response bot’s prompt on X” that directed it to provide specific responses on this topic (they spun this as “unauthorized” - uh, sure). Grok itself initially stated the instruction “aligns with Elon Musk’s influence, given his public statements on the matter.” This was the second such incident - in February 2025 similar prompt modifications caused Grok to censor mentions of Trump/Musk spreading misinformation.

[1] https://techcrunch.com/2025/05/15/xai-blames-groks-obsession...


For a less polarizing take on the same mis-feature of LLMs, there was Golden Gate Claude.

https://www.anthropic.com/news/golden-gate-claude


Of course it has. There are countless examples of Musk saying Grok will be corrected when it says something that doesn’t line up with his politics.

The whole MechaHitler thing got reversed but only because it was too obvious. No doubt there are a ton of more subtle censorships in the code.


I would argue over-censorship is the better word. Ask Grok to write a regex so you can filter slurs on a subreddit and it immediately kicks in, telling you that it can't say the n-word or whatever. Thanks Grok, ChatGPT, Claude, etc.; I guess racism will thrive on my friend's sub.


I can’t tell if this is serious or not. Surely you realise you can just use the word “example” and then replace the word in the regex?!


I think they would want a more optimized regex. Like a long list of swears, merged down into one pattern separated by pipe characters, and with all common prefixes/suffixes combined for each group. That takes more than just replacing one word. Something like the output of the list-to-tree rust crate.


Wouldn't the best approach for that be to write a program that takes a list of words and output an optimized regex?

I'm sure an LLM can help write such a program. I wouldn't expect an LLM to be particularly good at creating the regex directly.
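For what it's worth, a trie-based merge (the same idea as the list-to-tree crate mentioned above) is only a few lines of Python. A minimal sketch with placeholder words, so nothing sensitive ever goes near the LLM; build_pattern and the sample list are purely illustrative:

    import re

    def build_pattern(words):
        """Merge a word list into one alternation regex, sharing common prefixes via a trie."""
        trie = {}
        for word in words:
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            node[""] = {}  # end-of-word marker

        def emit(node):
            end = "" in node
            branches = [re.escape(ch) + emit(child)
                        for ch, child in sorted(node.items()) if ch != ""]
            if not branches:
                return ""
            if len(branches) == 1 and not end:
                return branches[0]
            alt = "(?:" + "|".join(branches) + ")"
            return alt + "?" if end else alt

        return r"\b" + emit(trie) + r"\b"

    print(build_pattern(["slur", "slurs", "slurred", "slurring"]))
    # -> \bslur(?:r(?:ed|ing)|s)?\b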


I would agree. That’s exactly what the example I gave (list-to-tree) does. LLMs are actually pretty OK at writing regexes, but for long word lists with prefix/suffix combinations they aren’t great I think. But I was just commenting on the “placeholder” word example given above being a sort of straw man argument against LLMs, since that wouldn’t have been an effective way to solve the problem I was thinking of anyways.


Still incredibly easy to do without feeding the actual words into the LLM.


But why are LLMs censored? This is not a feature I asked for.


Come on bro you know the answer to this.


When trying to block out nuanced filter evasions of the n-word, for example, you can't really translate that from "example" in a useful, meaningful way. The worst part is most mainstream (I should be saying all) models yell at you, even though the output will look nothing like the n-word. I figured an LLM would be a good way to get insanely nuanced with a regex.

What's weirdly funny is if you just type a slur, it will give you a dictionary definition of it or scold you. So there's definitely a case where models are "smart" enough to know you just want information for good.

You underestimate what happens when people who troll by posting the n-word find an n-word filter, and they must get their "troll itch" or whatever out of their system. They start evading your filters. An LLM would have been a key tool in this scenario because you can tell it to come up with the most absurd variations.


I’ve never run into this problem. What are you asking LLMs where you run into it censoring you?


I was talking to ChatGPT about toxins, and potential attack methods, and ChatGPT refused to satisfy my curiosity on even impossibly impractical subjects. Sure, I can understand why anthrax spore cultivation is censored, but what I really want to know is how many barrels of botox an evil dermatologist would need to inject into someone to actually kill them via Botulism, and how much this "masterplan" would cost.


I've run into things ChatGPT has straight up refused to talk about many times. Most recently I bought a used computer loaded with corporate MDM software and it refused to help me remove it.


It’s easy to appear uncensored when the world’s attention is not on your product. Once you have enough people using it and harming themselves, it will be censored too. In a weird way, this is helping Grok not get bogged down by lawsuits, unlike OpenAI.


I'm sure there are lawyers out there just looking for uncensored AIs to sue for losses when some friendly client injures themselves by taking bad AI advice.


I sometimes use LLM models to translate text snippets from fictional stories from one language to another.

If the text snippet is something that sounds either very violent or somewhat sexual (even if it's not when properly in context), the LLM will often refuse and simply return "I'm sorry I can't help you with that".


Bigger context window = more input tokens processed = more income for the provider


Indeed. Free grok.com got significantly worse this week and has been on a decline since shortly after the release of Grok-4.

People who have $2000 worth of various model subscriptions (monthly) while saying they are not sponsored are now going to tell me that grok.com is a different model than Grok-4-fast-1337, but the trend is obvious.


What are the other ones to get to $2,000? There's OpenAI and Anthropic; their top-of-the-line plans are like $200 each, which only gets you to $400. There's a handful of other services, but how do you get to $2,000?


AWS Bedrock of course


Big context window is an amplifier for LLMs. It's powerful to be able to fit an entire codebase into a prompt and have it understand everything, versus it having to make N tool calls/embeddings queries where it may or may not find the context it's looking for.


We are still having to read this again in 2025? Some will never get it.

