AI Loops?

I've already made my arguments against AI on a financial level and on a technical level. Now they've started suggesting that you combine these catastrophes into one big, uber-catastrophe: don't even write prompts any more, just write loops!

The hubris!

AI is not ready. Period.

I'm NOT saying that AI will NEVER be ready. I'm saying that, at the current moment, we're not there. Not even close.

If you thought the crazy AWS bills of the past were something, wait until the first engineering teams start taking this advice. I guarantee you that it will be epic.

A little part of me actually wants companies to just take this advice and let Claude run without any token caps. Why? Not because I want to see those companies fail, but because I want to see how fast Anthropic can destroy itself when these companies suddenly can't afford the bills that Anthropic needs them to pay to keep the lights on. It would be short-term pain for long-term gain in the software industry. A mercy killing.

It is only a little part of me though. Because we really shouldn't even be here. And because the company I work for is a likely candidate for taking the bait.

The problem, as I see it, is the same as the financial and technical problems. On the technical side of things, I have no doubt that this approach could bash out some impressive-looking initial versions of software at a "reasonable" price increase over a dev-assisted, prompt-driven piece of development. And by a reasonable price increase, I mean that loops are basically just additional agents automatically looping over responses from one or more other agents performing the actual work until they reach some objective.

This means one or more agents writing the code, another automatically code reviewing it and submitting suggestions to the main working agents to fix/improve the code, some number of agents running builds, some number of agents running tests and then perhaps even some number of agents running deployments.

Now, not all of these need to be the most expensive/powerful agents. But, arguably the main working agents should be. And the review agents should probably be of a similar level. And testing agents should be at least in the same ballpark.

All of this sounds good. Except when you start factoring in the costs of even the happy path scenario and then start considering the unhappy path.

My latest Claude Code debacle was trying to get it to help with formatting a .docx file. Not even a coding task. And arguably not a super complicated task. It was just a quick guide for testers on a very small API. Writing docs isn't my strength and they generally come out looking ugly. So how does this go for me?

  1. Prompt the tool to review and improve my docx file.
  2. It runs for about 15s, Claude stops thinking, document is unmodified.
  3. Prompt it again, it hallucinates a reason for the failure, tries again and fails.
  4. Prompt it once more, wording things a bit differently.
  5. Exact same failure.
  6. Review the output it was getting. Suggest to Claude that it may not have permissions or the required tools installed.
  7. It notices "oh yeah, I've been trying to use this tool via NPM but it either doesn't appear to be installed or not working". 
  8. Prompt it to get its shit together and it fixes what it was missing. Leaves the document alone.
  9. Prompt it one last time to fix the document. Finally runs.
I want to point out that I didn't search for anything. I just looked at the Claude output. Stuff that was already in Claude's context. And I suggested something that was obvious from the errors.

My issue is NOT that it made a mistake. It is that Claude Opus on medium made a rather basic mistake 3 times in a row, and how long it would have taken Claude to fix itself is indeterminate. It may have gotten it right the next time around, or it may have kept making the same mistake indefinitely.

I chose this example for a VERY simple reason. A junior dev making a similar mistake is typically going to fail forward and learn from his mistakes. That junior dev's salary is also a known quantity. Whether it takes him 5 tries to get it right or 200, you're not paying him any more. You WILL pay your AI provider for EVERY mistake it makes.

Now, you're going to point out that this is actually an argument in favor of loops. You're going to tell me how a supervising loop will be running a different model or a different session and is less likely to make the same mistake and may have thought to re-prompt faster than I would have. And you're partially right. I *thought* to look at the output on the first failure, but I'm also genuinely interested to see how far along these models have come, so I often let them stumble a bit on purpose.

BUT... you're also making my argument for me. Yes, many times, perhaps even most times, having these additional supervisors running WILL catch and correct these issues (or so I assume). But, they aren't infallible. And the more such loops you run, the more likely you are to finally hit the perfect storm where none of the agents involved at a certain level are making the necessary leap to fix the problem.
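To put a rough shape on that "perfect storm" intuition, here is a minimal sketch. It assumes each loop run has some small, independent chance of getting stuck in a way no supervising agent recovers from. Both the independence and the 0.1% figure are my assumptions for illustration, not measurements from any provider.

```python
def p_at_least_one_stuck(p_stuck: float, runs: int) -> float:
    """Chance that at least one of `runs` independent loop runs gets stuck.

    p_stuck is a hypothetical per-run probability that no agent in the
    loop makes the leap needed to fix the problem.
    """
    return 1.0 - (1.0 - p_stuck) ** runs


if __name__ == "__main__":
    # Hypothetical 0.1% chance per run; watch how it compounds with volume.
    for runs in (10, 100, 1_000, 10_000):
        print(f"{runs:>6} runs -> {p_at_least_one_stuck(0.001, runs):.3f}")
```

The exact numbers don't matter. What matters is that a failure mode rare enough to ignore on any single run becomes close to a certainty once you run enough loops.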

How many of these incidents do you think are necessary to break a company? Well, if the usage is uncapped? Then just 1. Remember, these things are just looping indefinitely and working (and thus failing) faster than a human. You will LONG for a stupidly large AWS bill once you see your AI bill. 

How about if the token usage is capped? It still could be 1. Let's say you've got a critical make or break deadline. Your loop just ate through all of your tokens. You either need to uncap it and hope it finishes the work or you need to fail to meet this make or break deadline.
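For what it's worth, this is roughly what a cap looks like in practice. It is only a sketch: `run_capped_loop`, `LoopResult` and the `attempt` callback are hypothetical names of mine, not any vendor's API, and the token counts are made up. The loop stops when the budget is gone or when the same error shows up three times in a row.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# One attempt returns (succeeded, tokens_used, error_message_or_None).
# This callback stands in for "ask the agent to try again"; it is hypothetical.
Attempt = Callable[[int], Tuple[bool, int, Optional[str]]]


@dataclass
class LoopResult:
    succeeded: bool
    attempts: int
    tokens_spent: int
    stop_reason: str


def run_capped_loop(attempt: Attempt, token_budget: int,
                    max_same_error: int = 3) -> LoopResult:
    """Retry until success, budget exhaustion, or a repeated identical error."""
    tokens_spent = 0
    attempts = 0
    last_error: Optional[str] = None
    same_error_streak = 0
    while True:
        ok, used, error = attempt(attempts)
        attempts += 1
        tokens_spent += used  # failed attempts are billed too
        if ok:
            return LoopResult(True, attempts, tokens_spent, "done")
        same_error_streak = same_error_streak + 1 if error == last_error else 1
        last_error = error
        if same_error_streak >= max_same_error:
            return LoopResult(False, attempts, tokens_spent, "repeated_error")
        if tokens_spent >= token_budget:
            return LoopResult(False, attempts, tokens_spent, "budget_exhausted")


def stuck_agent(attempt_number: int) -> Tuple[bool, int, Optional[str]]:
    # Hypothetical agent that keeps hitting the same missing-tool error.
    return (False, 2000, "npm tool not installed")


if __name__ == "__main__":
    print(run_capped_loop(stuck_agent, token_budget=50_000))
```

Notice what the cap does and doesn't buy you. The loop stops, but the tokens are already spent and the work still isn't done, which is exactly the deadline problem above.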

I also want to point out that none of these companies are telling you how many tokens you're spending on failed attempts. In my experience, almost every prompt has at least one failed attempt. A missed package import, a bad namespace, code based on an earlier version of an API, etc. Many times it is more. We tend to ignore it so long as the prompt finishes fast and the token usage isn't egregious.

And that brings up my next suggestion: the next time you prompt an AI agent to write some code for you, if it allows you to view the details of what it is attempting, then take the time to do so. See how many times it fails and try to guesstimate how many tokens it wasted. See how many times it makes variations of the same mistake. Estimate that token usage as well.
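If you want to do that guesstimate with slightly more rigor than a napkin, here is a small sketch. The attempt log is entirely hypothetical; you would fill it in by hand from whatever the tool shows you, since I'm not aware of a provider that reports failed-attempt tokens separately.

```python
from collections import Counter

# Hypothetical, hand-collected log of one prompt's attempts.
attempt_log = [
    {"tokens": 1800, "ok": False, "error": "missing import"},
    {"tokens": 2100, "ok": False, "error": "missing import"},
    {"tokens": 1500, "ok": False, "error": "old API version"},
    {"tokens": 2600, "ok": True, "error": None},
]


def summarize(attempts):
    """Total tokens, tokens spent on failures, and which errors repeated."""
    total = sum(a["tokens"] for a in attempts)
    wasted = sum(a["tokens"] for a in attempts if not a["ok"])
    error_counts = Counter(a["error"] for a in attempts if not a["ok"])
    return {
        "total_tokens": total,
        "wasted_tokens": wasted,
        "wasted_share": wasted / total if total else 0.0,
        "repeated_errors": {e: n for e, n in error_counts.items() if n > 1},
    }


if __name__ == "__main__":
    print(summarize(attempt_log))
```

Even a crude tally like this makes the share of spend going to failures, and to repeats of the same failure, hard to ignore.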

I think you'll be surprised at how many times an agent can do something astronomically stupid in 5-10 minutes.

And this last one happens less often, but still a non-zero amount of the time: update the agent's .MD file with some specific advice or instructions and find out how many times it doesn't follow them. Estimate the token usage wasted there as well.
