Large(r) Scale AI Development

I’ve been doing a lot of coding with AI over the holidays and since. Many have discovered the joy of creating a basic web app in one shot with modern models. And few-shot attempts are good enough for simple programs with well-defined rules (e.g. Chess). But larger projects are out of reach for methods that rely on the context of a single conversation. I’m pretty late to this game, but there are now a variety of tools and techniques for building larger applications with our currently imperfect models.

I thought I'd share some of my experience and observations working on a game in Godot that is entirely AI coded. Aside: I’m avoiding “vibe-coded” here because I feel it has a connotation of “prompt and pray,” whereas what I’m doing now is decidedly more structured. I recently heard the term pilot-coding, which I think is a great fit, though I am also fond of sloperator. 😎

My setup is VSCode with Copilot, mostly in agent mode. I'm using Claude Opus 4.5 and have concluded, unscientifically, that it is the best available model for coding, particularly when a lot of tool use is involved. I have my source in GitHub. I started the project from a multiplayer game sample. I'm actively using three MCP servers: one for Godot, one for GitHub, and a custom one hosted by the game itself. I maintain a list of features in a FEATURES.md with an organized TODO list, a copilot_instructions.md that VSCode agent mode reads automatically, and an Architecture.md file that is referenced by the copilot instructions. I start each conversation with FEATURES.md in context, whereupon I can ask it to work on a specific feature or have it select one. I had learned these techniques over time, but I hadn’t had a chance to put them all together on a single project with 100+ commits before. This workflow is working pretty well for me.
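To make the "have it select one" step concrete, here is a minimal sketch of picking the next open item out of a feature list. The layout of FEATURES.md is an assumption on my part for illustration (task-list checkboxes under `## ` section headings), and the sample entries and the names `parse_features` and `next_open_feature` are hypothetical, not taken from the real project.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical FEATURES.md content. The real file's layout may differ;
# this sketch assumes "- [ ]" / "- [x]" task items under "## " headings.
SAMPLE_FEATURES_MD = """\
# Features

## Networking
- [x] Lobby browser
- [ ] Reconnect after disconnect

## UI
- [ ] Settings menu
"""

# Matches "- [ ] title" or "* [x] title", capturing the checkbox and the title.
TASK_RE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*\S)\s*$")


@dataclass
class Feature:
    section: str
    title: str
    done: bool


def parse_features(text: str) -> List[Feature]:
    """Collect every task-list item, remembering the section it sits under."""
    features: List[Feature] = []
    section = ""
    for line in text.splitlines():
        if line.startswith("## "):
            section = line[3:].strip()
            continue
        match = TASK_RE.match(line)
        if match:
            done = match.group(1).lower() == "x"
            features.append(Feature(section, match.group(2), done))
    return features


def next_open_feature(text: str) -> Optional[Feature]:
    """Return the first unchecked item in file order, or None if all are done."""
    for feature in parse_features(text):
        if not feature.done:
            return feature
    return None


if __name__ == "__main__":
    print(next_open_feature(SAMPLE_FEATURES_MD))
```

In practice the agent reads the file directly, so a script like this is mostly useful for automation around the agent, such as reporting how many items remain.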

The TLDR includes these major take-aways:

A new developer for every feature - The best practices that enable a project to scale to a human team of varying skill levels and varying levels of familiarity with the codebase are the same things needed to scale AI. Some examples of these best practices include: a well-articulated architecture, coding standards, linting tools, tests, great documentation, and CI workflows. The need for each of these will present itself in nearly every conversation. By starting a new conversation with every feature, you’re basically hiring a new developer for every feature with only these best practices to guide them.
Close the loop - The more you can close the loop for what the AI can do, the more it will be able to iterate to success. A simple example is test cases. If you don’t explicitly require test cases and get picky about the coverage, you will find yourself being the full-time tester, where every turn of the crank results in you performing a full test pass. By contrast, if the AI can run the compiler, run the linter, run the tests, and take a screenshot, all of these actions will enable it to iterate when the result is not quite right.
Post-mortem - Everyone has a feature they’re trying to implement that has gone off the rails. In fact, from a brief poll of my peers, this is where the majority of your time pilot-coding will go. So when something goes off the rails, use the opportunity to interrogate the agent about what went wrong, why it occurred, and what can be done to avoid it in future iterations.
Problems that are worth solving - AI doesn’t just make development faster, it reshapes what problems are worth solving. I have recently authored several tools that I wouldn’t have embarked on at all if it weren’t for AI speeding up the nitty-gritty. Applying a difficult algorithm to a problem? Give it a try! Not sure if the feature you have a vague shape of is the right one? Try implementing it and throw it away if it isn’t. The shift in where time is spent in development changes the way I look at what problems are worth solving.
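As a rough illustration of the "close the loop" point above, here is a minimal sketch of a single entry point an agent could run after each change to get a pass/fail report it can act on. The check names and commands are placeholders I made up so the example runs anywhere; a real project would substitute its own linter and test commands.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class CheckResult:
    name: str
    ok: bool
    output: str


def run_check(name: str, command: Sequence[str], timeout_s: float = 300.0) -> CheckResult:
    """Run one command and capture the tail of its combined output."""
    try:
        proc = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout_s
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing executable or a hang is reported as a failed check.
        return CheckResult(name, False, f"{type(exc).__name__}: {exc}")
    # Keep only the last 2000 characters so the report stays small.
    tail = (proc.stdout + proc.stderr).strip()[-2000:]
    return CheckResult(name, proc.returncode == 0, tail)


def run_all(checks: Sequence[Tuple[str, Sequence[str]]]) -> List[CheckResult]:
    """Run every check, even after a failure, so the report is complete."""
    return [run_check(name, command) for name, command in checks]


def format_report(results: Sequence[CheckResult]) -> str:
    """One status line per check, followed by the output of failing checks."""
    lines: List[str] = []
    for result in results:
        status = "PASS" if result.ok else "FAIL"
        lines.append(f"[{status}] {result.name}")
        if not result.ok and result.output:
            lines.append(result.output)
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder commands (assumptions): swap in the project's real tools.
    demo_checks = [
        ("lint", [sys.executable, "-c", "print('lint ok')"]),
        ("tests", [sys.executable, "-c", "import sys; print('1 test failed'); sys.exit(1)"]),
    ]
    print(format_report(run_all(demo_checks)))
```

Running every check instead of stopping at the first failure is a deliberate choice here: it gives the agent the full picture in one turn.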

And my laundry list of other observations:

The paperclip problem - AI agents suffer from a form of the paperclip problem in that they will ruthlessly optimize to complete your query. This will frequently result in cutting corners to finish the work especially if your request wasn't clear on the finer details. I've many times had it say "this would be the right way, but it's too big of a change." It hilariously feels like working with a very junior developer who feels under the gun to complete the work as fast as possible. I find that most of my time goes into coaching against this and trying to invent systems so the wrong thing is less likely to happen.
Not relegated to coding - AI agents are just as happy to do non-coding tasks, from automation to data analysis to repo management to documentation. I have embraced this to create a fully automated system, not just a codebase. These investments help to close the loop, as I mentioned earlier.
Keeping a clean house - You have to clean up after it. Especially after big refactors, it's likely to leave dead code everywhere. However, when specifically prompted to address this kind of problem it's really good. I'm on about a weekly cadence to ask it to do a whole codebase audit for dead code. It takes it a long time, but it's almost entirely non-interactive and pretty good at it. But just like when you do this manually, it can sometimes take multiple passes as you remove stuff that makes it clearer that you can remove other stuff.
Where’s the debugger? I find myself wishing my agents could just debug the running process. They’re super good at reasoning out timing issues by just reading the code, but it takes a lot of time and sometimes many turns of manual testing. I'm wondering whether anyone has created some kind of tool or MCP server for this?
Going faster - AI is unquestionably speeding me up, particularly in fast prototypes and in domains where I don't have a lot of knowledge (automation and DSLs, for example). It has also introduced me to approaches and algorithms that I was unaware of and have enjoyed reading about. However, dedicating about 3 hours a day for a week, I found that each day I implement 1-3 forward-progress features and 1-3 refactors to keep the codebase clean. When you one-shot an entire tool it can be a 10,000% speed improvement, but on this larger project I'm probably getting about 400%. On top of that, I've got a lot of architectural patterns codified far more solidly than I would if I were soloing it.
Getting medieval - While there is little more than cathartic value in taking out your frustration on an AI agent, what is valuable is being very direct. I have found it much easier to hold to my principles with an AI agent, where my feedback to a human would be softer and more deferential. Related: after the code is working, I am much more likely to give the feedback “works but now redo it the right way” than I would with a human who had toiled on the problem for days. This is a good thing when it comes to getting the best results.
The why in software development - Last, but certainly not least, I'm really having fun learning and executing on projects with AI! It's delightful to get a great feature, and it's equally delightful to me when a system I put in place makes the AI more successful.

That’s it for now. Does my experience resonate with you? I’d love to learn from you as well!
