I've already made my arguments against AI on a financial level and on a technical level. Now they've started suggesting that you combine these catastrophes into one big, uber-catastrophe: don't even write prompts any more, just write loops!
The hubris!
AI is not ready. Period.
I'm NOT saying that AI will NEVER be ready. I'm saying that, at the current moment, we're not there. Not even close.
If you thought the crazy AWS bills of the past were something, wait until the first engineering teams start taking this advice. I guarantee you that it will be epic.
A little part of me actually wants companies to just take this advice and let Claude run without any token caps. Why? Not because I want to see those companies fail, but because I want to see how fast Anthropic can destroy itself when these companies suddenly can't afford the bills that Anthropic needs them to pay to keep the lights on. It would be short-term pain for long-term gain in the software industry. A mercy killing.
It is only a little part of me though. Because we really shouldn't even be here. And because the company I work for is a likely candidate for taking the bait.
The problem, as I see it, is the same as the financial and technical problems. On the technical side of things, I have no doubt that this approach could bash out some impressive-looking initial versions of software at a "reasonable" price increase over dev-assisted, prompt-driven development. I say "reasonable" because loops are basically just additional agents automatically looping over the responses from one or more other agents doing the actual work until the whole thing reaches some objective.
This means one or more agents writing the code, another automatically reviewing it and submitting suggestions to the main working agents to fix/improve the code, some number of agents running builds, some number running tests, and then perhaps even some number running deployments.
Now, not all of these need to be the most expensive/powerful agents. But, arguably the main working agents should be. And the review agents should probably be of a similar level. And testing agents should be at least in the same ballpark.
All of this sounds good. Except when you start factoring in the costs of even the happy path scenario and then start considering the unhappy path.
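To put a rough shape on the happy-path cost, here is a minimal sketch of what one trip around such a loop might cost. Every number in it is a placeholder I made up for illustration (the roles, the tokens per pass, the prices per million tokens); swap in whatever your provider actually charges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    role: str
    tokens_per_pass: int            # hypothetical tokens burned per loop iteration
    usd_per_million_tokens: float   # hypothetical blended price, not a real quote


def pass_cost(agents):
    """USD cost of one full trip around the loop (every agent runs once)."""
    return sum(a.tokens_per_pass * a.usd_per_million_tokens / 1_000_000
               for a in agents)


def loop_cost(agents, iterations):
    """USD cost if the loop needs `iterations` trips to reach its objective."""
    return pass_cost(agents) * iterations


# Placeholder pipeline: expensive coder and reviewer, cheaper test/build agents.
PIPELINE = [
    Agent("coder", 40_000, 30.0),
    Agent("reviewer", 25_000, 30.0),
    Agent("tester", 15_000, 10.0),
    Agent("builder", 5_000, 2.0),
]

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n:>3} iterations: ${loop_cost(PIPELINE, n):,.2f}")
```

The point is not the totals, which are invented, but that the cost scales linearly with the iteration count, and the iteration count is the one thing nobody controls.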
My latest Claude Code debacle was trying to get it to help with formatting a .docx file. Not even a coding task. And arguably not a super complicated task. It was just a quick guide for testers on a very small API. Writing docs isn't my strength and they generally come out looking ugly. So how does this go for me?
- Prompt the tool to review and improve my docx file.
- It runs for about 15s, Claude stops thinking, document is unmodified.
- Prompt it again, it hallucinates a reason for the failure, tries again and fails.
- Prompt it once more, wording things a bit differently.
- Exact same failure.
- Review the output it was getting. Suggest to Claude that it may not have permissions or the required tools installed.
- It notices "oh yeah, I've been trying to use this tool via NPM but it either doesn't appear to be installed or isn't working".
- Prompt it to get its shit together and it fixes what it was missing. Leaves the document alone.
- Prompt it one last time to fix the document. Finally runs.
I want to point out that I didn't search for anything. I just looked at the Claude output. Stuff that was already in Claude's context. And suggested something obvious from the errors.
My issue is NOT that it made a mistake. It is that Claude Opus on medium made a rather basic mistake 3 times in a row, and how long it would have taken Claude to fix itself is indeterminate. It may have gotten it right the next time around, or it may have kept making the same mistake indefinitely.
I chose this example for a VERY simple reason. A junior dev making a similar mistake is typically going to fail forward and learn from his mistakes. That junior dev's salary is also a known quantity. Whether it takes him 5 tries to get it right or 200, you're not paying him any more. You WILL pay your AI provider for EVERY mistake it makes.
Now, you're going to point out that this is actually an argument in favor of loops. You're going to tell me how a supervising loop will be running a different model or a different session and is less likely to make the same mistake and may have thought to re-prompt faster than I would have. And you're partially right. I *thought* to look at the output on the first failure, but I'm also genuinely interested to see how far along these models have come, so I often let them stumble a bit on purpose.
BUT... you're also making my argument for me. Yes, many times, perhaps even most times, having these additional supervisors running WILL catch and correct these issues (or so I assume). But, they aren't infallible. And the more such loops you run, the more likely you are to finally hit the perfect storm where none of the agents involved at a certain level are making the necessary leap to fix the problem.
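The "perfect storm" argument is just compounding probability. As a sketch, assume (and this is purely my simplifying assumption) that each loop run independently has some small chance `p_escape` of hitting a failure that none of the supervising agents recover from. Then the chance of at least one such incident over many runs is:

```python
def chance_of_unrecovered_failure(p_escape: float, runs: int) -> float:
    """Probability that at least one of `runs` loop runs hits a failure
    no supervising agent corrects.

    Assumes runs are independent and share the same escape probability,
    which is a simplification, not a measured property of any real system.
    """
    if not 0.0 <= p_escape <= 1.0:
        raise ValueError("p_escape must be between 0 and 1")
    if runs < 0:
        raise ValueError("runs must be non-negative")
    return 1.0 - (1.0 - p_escape) ** runs


if __name__ == "__main__":
    # Hypothetical escape rate of 1 in 1,000 runs.
    for runs in (10, 100, 1_000, 10_000):
        p = chance_of_unrecovered_failure(0.001, runs)
        print(f"{runs:>6} runs: {p:.1%} chance of at least one runaway")
```

With a made-up one-in-a-thousand escape rate, a thousand runs already puts you at roughly a 63% chance of having hit it at least once.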
How many of these incidents do you think are necessary to break a company? Well, if the usage is uncapped? Then just 1. Remember, these things are just looping indefinitely and working (and thus failing) faster than a human. You will LONG for a stupidly large AWS bill once you see your AI bill.
How about if the token usage is capped? It still could be 1. Let's say you've got a critical make-or-break deadline. Your loop just ate through all of your tokens. You either need to uncap it and hope it finishes the work, or you fail to meet this make-or-break deadline.
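The capped case can be sketched the same way. This is a toy model, not how any real agent framework works: `attempt` is a hypothetical callback standing in for "one trip around the loop", and the token figures are invented.

```python
def run_capped_loop(attempt, token_cap, tokens_per_attempt):
    """Retry `attempt` until it succeeds or the next try would exceed the cap.

    `attempt` is a hypothetical callable taking the attempt number (1-based)
    and returning True on success. You pay for every attempt, failed or not.
    """
    spent = 0
    attempts = 0
    while spent + tokens_per_attempt <= token_cap:
        spent += tokens_per_attempt
        attempts += 1
        if attempt(attempts):
            return {"done": True, "attempts": attempts, "spent": spent}
    # Budget exhausted with the work unfinished: the tokens are gone anyway.
    return {"done": False, "attempts": attempts, "spent": spent}


if __name__ == "__main__":
    # An agent that keeps making the same mistake never returns True.
    stuck = run_capped_loop(lambda n: False, token_cap=100_000,
                            tokens_per_attempt=30_000)
    print(stuck)
```

The cap limits the bill, but it does nothing for the deadline: when the stuck loop stops, you have paid for every failed attempt and still have no finished work.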
I also want to point out that none of these companies are telling you how many tokens you're spending on failed attempts. In my experience, almost every prompt has at least one failed attempt. A missed package import, a bad namespace, code based on an earlier version of an API, etc. Many times it is more. We tend to ignore it so long as the prompt finishes fast and the token usage isn't egregious.
And that brings up my next suggestion: the next time you prompt an AI agent to write some code for you, if it allows you to view the details of what it is attempting, then take the time to do so. See how many times it fails and try to guesstimate how many tokens it wasted. See how many times it makes variations of the same mistake. Estimate that token usage as well.
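If you want to keep score while you watch, something like the following would do. It is only a sketch: the `Attempt` record, the mistake labels, and the token counts are all things you would jot down by hand from the agent's visible output, since (as noted above) nobody hands you this breakdown.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Attempt(NamedTuple):
    mistake: Optional[str]  # your own label for what went wrong; None = it worked
    tokens: int             # your rough estimate of tokens spent on this attempt


def summarize(attempts):
    """Tally failed attempts, estimated wasted tokens, and repeated mistakes."""
    failed = [a for a in attempts if a.mistake is not None]
    wasted = sum(a.tokens for a in failed)
    total = sum(a.tokens for a in attempts)
    counts = Counter(a.mistake for a in failed)
    return {
        "failed_attempts": len(failed),
        "wasted_tokens": wasted,
        "wasted_share": wasted / total if total else 0.0,
        # Only mistakes the agent made more than once.
        "repeat_mistakes": {m: c for m, c in counts.items() if c > 1},
    }


if __name__ == "__main__":
    # Hand-recorded, made-up log of one prompt.
    log = [
        Attempt("missing import", 1_200),
        Attempt("missing import", 1_100),
        Attempt("stale API version", 2_000),
        Attempt(None, 3_000),
    ]
    print(summarize(log))
```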
I think you'll be surprised at how many times an agent can do something astronomically stupid in 5-10 minutes.
And this last one happens less often, but still a non-zero amount of the time: update the agent's .MD file with some specific advice or instructions and find out how many times it doesn't follow them. Estimate the token usage wasted there as well.