ClashPilot: can an LLM actually play Clash of Clans?

A GPU-less potato PC, a 40–50 FPS custom frame grabber, a hand-labeled object detector, and Gemini 2.0 Flash deploying troops off a text snapshot of the screen. What worked at Town Hall 6, and why it hallucinated on everything harder.

Every few weeks a new post goes around showing an LLM playing tic-tac-toe, or clearing a Codeforces problem, or grinding LeetCode. Impressive, but those are clean, turn-based, fully-observable little worlds. I wanted to point a model at something messier: a real-time strategy game with a hundred moving pieces, no turn structure, and a board it had never seen. So about a year and a half ago I tried to get a large language model to play Clash of Clans. I called it ClashPilot.

The honest headline up front: it sort of works, and where it breaks is more interesting than where it succeeds. This is a retrospective — most of the code and the trained model went down with an old laptop — but the surviving scripts and a demo video are enough to tell the real story.

1. Turning a live game into something an LLM can read

An LLM can't see. It reads text. So the whole problem is turning a live, animated game screen into a compact description the model can reason over, and turning its answer back into clicks. The loop looks like this:

```mermaid
flowchart TD
    S([live game screen]) --> C["custom Win-API capture<br/>~40-50 FPS via BitBlt"]
    C --> Y["YOLO detector:<br/>buildings + in-game buttons"]
    Y --> ST["game state as plain text<br/>one line per detection"]
    ST --> G{{"Gemini 2.0 Flash<br/>plans a Barch attack"}}
    G --> P["deploy plan:<br/>a troop + an x,y per drop"]
    P --> A["PyAutoGUI:<br/>click army slot, tap the tile"]
    A --> S
```

A YOLO object detector I trained on my own labeled screenshots finds every building and every in-game button on the current frame. Each detection becomes one line of a plain-text game state: (GoldStorage, x=412, y=233), (Cannon, x=…, y=…), …. That string is the entire world, as far as the model is concerned.
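The original code is gone, so here is a minimal sketch of what that serialization step could look like, not what actually ran. The `Detection` class, the `to_game_state` helper, the confidence cutoff, and the `ArcherTower` coordinates are all made up for illustration; only the `(Label, x=…, y=…)` line format comes from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detector hit (hypothetical structure, not the original code)."""
    label: str         # class name from the detector, e.g. "GoldStorage"
    x: int             # centre of the bounding box, in screen pixels
    y: int
    confidence: float  # detector score in [0, 1]


def to_game_state(detections, min_confidence=0.5):
    """Render detections as text, one '(Label, x=.., y=..)' line each."""
    # Drop weak detections so the model is not asked to reason about noise.
    kept = [d for d in detections if d.confidence >= min_confidence]
    # Sort so the same frame always produces the same text.
    kept.sort(key=lambda d: (d.label, d.x, d.y))
    return "\n".join(f"({d.label}, x={d.x}, y={d.y})" for d in kept)


# Illustrative detections; the ArcherTower values are invented.
detections = [
    Detection("GoldStorage", 412, 233, 0.91),
    Detection("ArcherTower", 300, 180, 0.88),
    Detection("ArcherTower", 520, 410, 0.32),  # below the cutoff, dropped
]
state_text = to_game_state(detections)
print(state_text)
```

Sorting is a small design choice: it keeps the text stable from frame to frame, which makes it easier to compare what the model saw across runs.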

Then Gemini 2.0 Flash gets a system prompt that says, roughly: "You're playing Clash of Clans. Here are the building positions. You have 100 barbarians, 100 archers, 2 rage spells, 1 heal, and the Barbarian King (a standard 'Barch' army). Return only an array of deploy positions." It answers with a list of [troop, x, y] entries, and the runtime clicks the matching army slot and taps the map. No human in the loop.
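Turning that reply back into clicks starts with parsing it defensively. The sketch below is an illustration under assumptions, not the lost runtime: it assumes the reply contains a JSON array of `[troop, x, y]` entries, and the troop names in `KNOWN_TROOPS` and the helper `parse_deploy_plan` are hypothetical.

```python
import json

# Assumed slot names for the Barch army; the real prompt's names are not known.
KNOWN_TROOPS = {"barbarian", "archer", "rage", "heal", "king"}


def parse_deploy_plan(reply_text):
    """Extract a list of (troop, x, y) drops from the model's reply text."""
    # Models often wrap the array in prose, so take the outermost [...] span.
    start = reply_text.find("[")
    end = reply_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no array found in reply")
    raw = json.loads(reply_text[start:end + 1])

    drops = []
    for entry in raw:
        # Keep only entries shaped exactly like [troop, x, y].
        if not (isinstance(entry, list) and len(entry) == 3):
            continue
        troop, x, y = entry
        if not isinstance(troop, str) or troop.lower() not in KNOWN_TROOPS:
            continue
        coords_ok = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (x, y)
        )
        if not coords_ok:
            continue
        drops.append((troop.lower(), int(x), int(y)))
    return drops


reply = (
    'Here is the plan: [["barbarian", 120, 340], ["archer", 140, 360], '
    '["dragon", 1, 2], ["rage", "a", 5]] Good luck.'
)
plan = parse_deploy_plan(reply)
print(plan)
```

Here the `dragon` entry is discarded because it is not in the assumed army, and the malformed `rage` entry is discarded because its x value is not a number. Each surviving tuple would then be handed to the clicking code.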

2. Making it run on a GPU-less potato

I built this on a laptop with no GPU, so most of the engineering was just fighting for frames. Three things mattered:

A custom screen grabber. The usual libraries — MSS, PyAutoGUI, D3DShot — gave me 5–6 FPS, which is useless when the detector has to keep up with a live game. I wrote a capture path straight against the Windows API (win32gui + BitBlt) and got 40–50 FPS. Clash also blocks screen capture of its own window, so the workaround was to grab the top-left region of the whole desktop instead.
Threads. The capture-and-detect loop can't freeze for the ~1–2 seconds it takes to round-trip to Gemini. So both the button clicks and the LLM call run on background threads while the main loop keeps grabbing and annotating frames in real time.
Detection over template matching. I started with OpenCV template matching for the UI buttons, but it fell apart the moment you zoomed the game in or out — the templates only matched at one scale. Training an object detector on a hand-labeled dataset fixed it. Labeling those screenshots was hours of mind-numbing work, and it's the first thing I'd automate if I did it again.
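The threading point is the easiest of the three to show in isolation. This is a sketch of the general pattern rather than the original code: `slow_planner` is a hypothetical stand-in that sleeps instead of calling Gemini, and the frame loop only counts iterations where the real one would grab, detect, and annotate.

```python
import queue
import threading
import time


def slow_planner(state_text):
    """Stand-in for the LLM round trip (hypothetical; sleeps, calls no API)."""
    time.sleep(0.2)
    return [("barbarian", 120, 340)]


def planner_worker(pending, results):
    """Background thread: wait for a game state, post the plan back."""
    while True:
        state_text = pending.get()
        if state_text is None:  # sentinel value: shut down
            break
        results.put(slow_planner(state_text))


pending = queue.Queue(maxsize=1)
results = queue.Queue()
worker = threading.Thread(
    target=planner_worker, args=(pending, results), daemon=True
)
worker.start()

pending.put("(GoldStorage, x=412, y=233)")  # request one plan

frames = 0
plan = None
deadline = time.monotonic() + 5.0  # safety net so the loop cannot hang
while plan is None and time.monotonic() < deadline:
    frames += 1       # stand-in for grab + detect + annotate
    time.sleep(0.01)
    try:
        plan = results.get_nowait()  # non-blocking: the frame loop never stalls
    except queue.Empty:
        pass

pending.put(None)
worker.join(timeout=1.0)
print(f"processed {frames} frames while waiting; plan = {plan}")
```

The key detail is `get_nowait()`: the main loop polls for a finished plan on every frame instead of waiting for one, so a slow reply costs latency on the attack but not frames on the capture.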

I also trained a small segmentation model to mark the deployable grass around a base — troops can only land on open ground, never on buildings — so the system had at least some notion of where a legal drop actually was.
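One way to put a mask like that to use is to snap each requested drop to the nearest legal cell before clicking. To be clear, this is an assumption about how the mask could be consumed, not a description of what ClashPilot did; the coarse boolean grid and the helper `nearest_deployable` are hypothetical.

```python
from collections import deque


def nearest_deployable(mask, x, y):
    """Return the closest (x, y) cell where mask[y][x] is True, or None.

    Uses breadth-first search over 4-connected neighbours, so "closest"
    means fewest grid steps rather than straight-line distance.
    """
    height, width = len(mask), len(mask[0])
    if not (0 <= x < width and 0 <= y < height):
        return None
    seen = {(x, y)}
    frontier = deque([(x, y)])
    while frontier:
        cx, cy = frontier.popleft()
        if mask[cy][cx]:
            return (cx, cy)
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < width and 0 <= ny < height and (nx, ny) not in seen:
                seen.add((nx, ny))
                frontier.append((nx, ny))
    return None


# Toy 5x5 mask: G = open grass, B = building or other blocked cell.
G, B = True, False
mask = [
    [G, G, G, G, G],
    [G, B, B, B, G],
    [G, B, B, B, G],
    [G, B, B, B, G],
    [G, G, G, G, G],
]

snapped = nearest_deployable(mask, 2, 2)  # the model asked for the base centre
print(snapped)
```

A drop that is already legal comes back unchanged, and a drop in the middle of the base is moved to the edge of the grass instead of being clicked onto a building.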

3. Where it works, and where it hallucinates

For small bases (around Town Hall 5–6) and simple strategies like Barch or an all-dragon spam, it genuinely plays. Few buildings, few troop types, a short list of positions — the model keeps it straight and executes a real attack.

Complex bases are where it comes apart, and always in the same way: it hallucinates. It deploys troops onto undeployable tiles. It calls for troops it doesn't have, or ones it already spent. The more variables on the board — more buildings, more troop types, more state to track — the more confidently it invents things that aren't there.

The failure was never strategy. It was bookkeeping. The model would plan a perfectly reasonable attack and then reach for a troop that no longer existed.

That points straight at the real gap. The model had no trustworthy feedback loop about its own resources: it was handed a fixed army in a prompt and never actually knew what was left after each drop. A human tracks that effortlessly. A stateless text snapshot doesn't. The strategic reasoning held up fine at small scale; the part that broke was tracking a changing world, which is exactly what a single one-shot request over one screenshot is worst at.
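The bookkeeping itself is trivial once it lives in the runtime instead of in the model's head. Here is a minimal sketch of that idea, with the caveat that ClashPilot had nothing like it: `ArmyLedger` and `filter_plan` are hypothetical names, and the troop keys are assumed labels for the Barch army from the prompt.

```python
class ArmyLedger:
    """Tracks what is left to deploy; the runtime owns the counts, not the model."""

    def __init__(self, army):
        self.remaining = dict(army)

    def spend(self, troop, count=1):
        """Deduct a drop, or raise if the troop is unknown or used up."""
        left = self.remaining.get(troop)
        if left is None:
            raise KeyError(f"{troop!r} is not in this army")
        if count > left:
            raise ValueError(f"only {left} {troop!r} left, asked for {count}")
        self.remaining[troop] = left - count


def filter_plan(ledger, plan):
    """Apply drops in order; return (accepted, rejected-with-reason)."""
    accepted, rejected = [], []
    for troop, x, y in plan:
        try:
            ledger.spend(troop)
        except (KeyError, ValueError) as err:
            rejected.append((troop, x, y, str(err)))
        else:
            accepted.append((troop, x, y))
    return accepted, rejected


# Army sizes from the prompt described earlier; key names are assumptions.
ledger = ArmyLedger(
    {"barbarian": 100, "archer": 100, "rage": 2, "heal": 1, "king": 1}
)
# A plan with the two failure modes: a troop already spent, and one never owned.
plan = [("king", 50, 60), ("king", 70, 80), ("dragon", 10, 10), ("rage", 200, 200)]
accepted, rejected = filter_plan(ledger, plan)
print(accepted)
print(rejected)
```

The second king and the dragon are rejected with a reason string, which is exactly the kind of feedback that could be sent back to the model instead of silently clicking an empty slot.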

What I'd do differently

If I rebuilt it today I'd stop treating every frame as an isolated question. I'd give the model state it can trust — troops remaining, spells left, which drops already happened — and let it act one step at a time with feedback after each move, instead of demanding the whole plan up front from a single screenshot. Half the hallucinations were really just the model guessing at things it was never told.
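A rough sketch of what that step-at-a-time loop might look like follows. It is a design illustration, not a tested rebuild: the state format in `render_state` is an assumption, `choose_next_drop` is a hypothetical callable standing in for the LLM call, and `scripted_policy` is a toy replacement so the loop can run without a model.

```python
def render_state(buildings, remaining, history):
    """Build the per-step text the model would see (format is an assumption)."""
    lines = ["Buildings:"]
    lines += [f"  ({name}, x={x}, y={y})" for name, x, y in buildings]
    lines.append("Troops remaining:")
    lines += [f"  {troop}: {n}" for troop, n in sorted(remaining.items()) if n > 0]
    lines.append("Drops so far:")
    lines += [f"  {troop} at ({x}, {y})" for troop, x, y in history] or ["  none"]
    return "\n".join(lines)


def run_attack(buildings, army, choose_next_drop, max_steps=50):
    """Ask for one drop at a time, updating trusted state after each move."""
    remaining = dict(army)
    history = []
    for _ in range(max_steps):
        if not any(remaining.values()):
            break  # nothing left to deploy
        drop = choose_next_drop(render_state(buildings, remaining, history))
        if drop is None:  # the policy chose to stop
            break
        troop, x, y = drop
        if remaining.get(troop, 0) <= 0:
            continue  # invalid move: skip it, state stays unchanged
        remaining[troop] -= 1
        history.append((troop, x, y))
    return history, remaining


def scripted_policy(state_text):
    """Toy stand-in for the LLM: decides only from the state text it is given."""
    if "barbarian:" in state_text:
        return ("barbarian", 100, 100)
    if "archer:" in state_text:
        return ("archer", 110, 110)
    return None


history, remaining = run_attack(
    buildings=[("GoldStorage", 412, 233)],
    army={"barbarian": 2, "archer": 1},
    choose_next_drop=scripted_policy,
)
print(history)
print(remaining)
```

The point of the structure is that the counts in `remaining` are updated by the runtime after every accepted drop, so each request to the model starts from facts rather than from whatever it remembers about its own earlier plan.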

Even so: a general-purpose LLM, handed a noisy scrape of a game it had never trained on, executing a coherent basic attack with zero fine-tuning — that was a genuinely fun thing to watch work at all. Here's the demo.