ClashPilot: can an LLM actually play Clash of Clans?
A GPU-less potato PC, a 40–50 FPS custom frame grabber, a hand-labeled object detector, and Gemini 2.0 Flash deploying troops off a text snapshot of the screen. What worked at Town Hall 6, and why it hallucinated on everything harder.
Every few weeks a new post goes around showing an LLM playing tic-tac-toe, or clearing a Codeforces problem, or grinding LeetCode. Impressive, but those are clean, turn-based, fully-observable little worlds. I wanted to point a model at something messier: a real-time strategy game with a hundred moving pieces, no turn structure, and a board it had never seen. So about a year and a half ago I tried to get a large language model to play Clash of Clans. I called it ClashPilot.
The honest headline up front: it sort of works, and where it breaks is more interesting than where it succeeds. This is a retrospective — most of the code and the trained model went down with an old laptop — but the surviving scripts and a demo video are enough to tell the real story.
1. Turning a live game into something an LLM can read
An LLM can't see. It reads text. So the whole problem is turning a live, animated game screen into a compact description the model can reason over, and turning its answer back into clicks. The loop looks like this:
flowchart TD
S([live game screen]) --> C["custom Win-API capture<br/>~40-50 FPS via BitBlt"]
C --> Y["YOLO detector:<br/>buildings + in-game buttons"]
Y --> ST["game state as plain text<br/>one line per detection"]
ST --> G{{"Gemini 2.0 Flash<br/>plans a Barch attack"}}
G --> P["deploy plan:<br/>a troop + an x,y per drop"]
P --> A["PyAutoGUI:<br/>click army slot, tap the tile"]
A --> S
A YOLO object detector I trained on my own labeled screenshots finds every building and every in-game button on the current frame. Each detection becomes one line of a plain-text game state: (GoldStorage, x=412, y=233), (Cannon, x=…, y=…), …. That string is the entire world, as far as the model is concerned.
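A minimal sketch of that serialization step, not the original code. The `Detection` class, the `to_game_state` helper, and the 0.5 confidence cutoff are all assumptions made for illustration; the only thing taken from the project is the `(Label, x=..., y=...)` line format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detector hit (hypothetical structure, not the original)."""
    label: str    # class name, e.g. "GoldStorage"
    x: int        # box centre in screen pixels
    y: int
    conf: float   # detector confidence in [0, 1]


def to_game_state(detections, min_conf=0.5):
    """Render detections as text, one '(Label, x=..., y=...)' line each."""
    # Drop weak detections so the model is not told about phantom buildings.
    kept = [d for d in detections if d.conf >= min_conf]
    # Sort so the same frame always produces the same text.
    kept.sort(key=lambda d: (d.label, d.y, d.x))
    return "\n".join(f"({d.label}, x={d.x}, y={d.y})" for d in kept)


frame = [
    Detection("GoldStorage", 412, 233, 0.91),
    Detection("Cannon", 530, 310, 0.84),
    Detection("Cannon", 120, 95, 0.32),  # below the cutoff, so it is dropped
]
print(to_game_state(frame))
```

Sorting is a design choice rather than a requirement: a stable ordering makes two snapshots of the same base easy to diff when debugging.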
Then Gemini 2.0 Flash gets a system prompt that says, roughly: you're playing Clash of Clans, here are the building positions, you have 100 barbarians, 100 archers, 2 rage spells, 1 heal, and the barbarian king — a standard "Barch" attack — return only an array of deploy positions. It answers with a list of [troop, x, y] entries, and the runtime clicks the matching army slot and taps the map. No human in the loop.
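The reply still has to be turned into clicks safely. The sketch below assumes the model answers with a JSON array of `[troop, x, y]` entries, possibly wrapped in extra prose; `parse_deploy_plan` and its bounds check are hypothetical helpers, not the project's actual parser.

```python
import json


def parse_deploy_plan(reply, width, height):
    """Return a list of (troop, x, y) tuples, or raise ValueError."""
    # Models often add prose around the array, so keep only the
    # outermost [...] span before handing it to the JSON parser.
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array in reply")
    try:
        raw = json.loads(reply[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc

    plan = []
    for entry in raw:
        if not (isinstance(entry, list) and len(entry) == 3):
            raise ValueError(f"bad entry: {entry!r}")
        troop, x, y = entry
        if not isinstance(troop, str):
            raise ValueError(f"troop must be a string: {entry!r}")
        for value in (x, y):
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"coordinates must be numbers: {entry!r}")
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"drop is off screen: {entry!r}")
        plan.append((troop, int(x), int(y)))
    return plan


reply = 'Sure, here is the plan: [["barbarian", 100, 200], ["archer", 150, 220]]'
print(parse_deploy_plan(reply, width=1280, height=720))
```

Rejecting a malformed plan outright, rather than clicking the parts that happen to parse, keeps a bad reply from turning into a half-executed attack.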
2. Making it run on a GPU-less potato
I built this on a laptop with no GPU, so most of the engineering was just fighting for frames. Three things mattered:
- A custom screen grabber. The usual libraries (MSS, PyAutoGUI, D3DShot) gave me 5–6 FPS, which is useless when the detector has to keep up with a live game. I wrote a capture path straight against the Windows API (win32gui + BitBlt) and got 40–50 FPS. Clash also blocks screen capture of its own window, so the workaround was to grab the top-left region of the whole desktop instead.
- Threads. The capture-and-detect loop can't freeze for the ~1–2 seconds it takes to round-trip to Gemini. So the button clicks and the LLM call all run on background threads while the main loop keeps grabbing and annotating frames in real time.
- Detection over template matching. I started with OpenCV template matching for the UI buttons, but it fell apart the moment you zoomed the game in or out — the templates only matched at one scale. Training an object detector on a hand-labeled dataset fixed it. Labeling those screenshots was hours of mind-numbing work, and it's the first thing I'd automate if I did it again.
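The threading point is the easiest of the three to show in isolation. This is a sketch using only the standard library, with `time.sleep` standing in for both the capture-and-detect step and the Gemini round trip; `slow_planner`, `planner_worker`, and `run_loop` are hypothetical names, and the original loop may have been organised differently.

```python
import queue
import threading
import time


def slow_planner(state):
    """Stand-in for the LLM round trip (sleeps instead of calling an API)."""
    time.sleep(0.2)
    return f"plan for {state}"


def planner_worker(requests, results, plan_fn):
    """Background thread: turn queued states into plans until told to stop."""
    while True:
        state = requests.get()
        if state is None:  # sentinel value: shut down
            break
        results.put(plan_fn(state))


def run_loop(n_frames, frame_time=0.01, plan_fn=slow_planner):
    """Process n_frames without ever blocking on the planner."""
    requests = queue.Queue(maxsize=1)  # at most one snapshot waits in line
    results = queue.Queue()
    worker = threading.Thread(
        target=planner_worker, args=(requests, results, plan_fn), daemon=True
    )
    worker.start()

    plans, frames_done = [], 0
    for i in range(n_frames):
        time.sleep(frame_time)  # stand-in for capture + detection
        frames_done += 1
        try:
            requests.put_nowait(f"frame {i}")
        except queue.Full:
            pass  # planner is busy: drop this snapshot, keep grabbing frames
        try:
            plans.append(results.get_nowait())
        except queue.Empty:
            pass  # no finished plan yet

    requests.put(None)  # ask the worker to stop, then collect what is left
    worker.join()
    while not results.empty():
        plans.append(results.get_nowait())
    return frames_done, plans


frames_done, plans = run_loop(20)
print(frames_done, "frames processed,", len(plans), "plans received")
```

The bounded request queue is the design choice worth noting: when the planner falls behind, stale snapshots are discarded instead of piling up, so any plan that does come back is based on a reasonably recent frame.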
I also trained a small segmentation model to mark the deployable grass around a base — troops can only land on open ground, never on buildings — so the system had at least some notion of where a legal drop actually was.
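One way to use such a mask is to check each proposed drop and snap illegal ones to the nearest open cell. The sketch below assumes the segmentation output has already been reduced to a non-empty grid of 0s and 1s (1 meaning deployable); the grid, `is_deployable`, and `nearest_deployable` are illustrative, not the original code.

```python
from collections import deque


def is_deployable(mask, x, y):
    """True if (x, y) is inside the grid and marked as open ground."""
    return 0 <= y < len(mask) and 0 <= x < len(mask[0]) and mask[y][x] == 1


def nearest_deployable(mask, x, y):
    """Breadth-first search outward from (x, y).

    Returns the closest deployable cell by Manhattan distance,
    or None if the start is off the grid or nothing is deployable.
    """
    h, w = len(mask), len(mask[0])
    if not (0 <= x < w and 0 <= y < h):
        return None
    seen = {(x, y)}
    todo = deque([(x, y)])
    while todo:
        cx, cy = todo.popleft()
        if mask[cy][cx] == 1:
            return (cx, cy)
        for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
            if 0 <= nx < w and 0 <= ny < h and (nx, ny) not in seen:
                seen.add((nx, ny))
                todo.append((nx, ny))
    return None


# Toy base: a 3x3 block of buildings surrounded by a ring of grass.
mask = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
print(is_deployable(mask, 2, 2))       # centre of the base
print(nearest_deployable(mask, 2, 2))  # some grass cell two steps away
```

Snapping is only one policy. Rejecting the drop and telling the model why is the stricter alternative, and it fits better with a step-by-step feedback loop.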
3. Where it works, and where it hallucinates
For small bases (around Town Hall 5–6) and simple strategies like Barch or an all-dragon spam, it genuinely plays. Few buildings, few troop types, a short list of positions — the model keeps it straight and executes a real attack.
Complex bases are where it comes apart, and always in the same way: it hallucinates. It deploys troops onto undeployable tiles. It calls for troops it doesn't have, or ones it already spent. The more variables on the board — more buildings, more troop types, more state to track — the more confidently it invents things that aren't there.
The failure was never strategy. It was bookkeeping. The model would plan a perfectly reasonable attack and then reach for a troop that no longer existed.
That points straight at the real gap. The model had no trustworthy feedback loop about its own resources: it was handed a fixed army in a prompt and never actually knew what was left after each drop. A human tracks that effortlessly. A stateless text snapshot doesn't. The strategic reasoning held up fine at small scale; the part that broke was tracking a changing world, which is exactly what a single stateless call over one text snapshot is worst at.
What I'd do differently
If I rebuilt it today I'd stop treating every frame as an isolated question. I'd give the model state it can trust — troops remaining, spells left, which drops already happened — and let it act one step at a time with feedback after each move, instead of demanding the whole plan up front from a single screenshot. Half the hallucinations were really just the model guessing at things it was never told.
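As a rough illustration of "state it can trust", the runtime could own the bookkeeping and report it back after every move. `ArmyLedger` below is a hypothetical sketch of that idea, not something from the original project: it refuses drops the army can't pay for and produces a summary line to include in the next prompt.

```python
class ArmyLedger:
    """Tracks what is left so the model never has to remember it."""

    def __init__(self, army):
        self.remaining = dict(army)  # troop name -> count still available
        self.history = []            # (troop, x, y) drops that actually happened

    def try_deploy(self, troop, x, y):
        """Record one drop; return (ok, message) to feed back to the model."""
        left = self.remaining.get(troop, 0)
        if left <= 0:
            return False, f"rejected: no {troop} left"
        self.remaining[troop] = left - 1
        self.history.append((troop, x, y))
        return True, f"ok: {troop} at ({x}, {y})"

    def summary(self):
        """One line of ground truth for the next prompt."""
        parts = [f"{name}={n}" for name, n in sorted(self.remaining.items())]
        return "remaining: " + ", ".join(parts)


ledger = ArmyLedger({"barbarian": 2, "archer": 1})
print(ledger.try_deploy("archer", 10, 20))
print(ledger.try_deploy("archer", 30, 40))  # already spent
print(ledger.try_deploy("dragon", 50, 60))  # never had one
print(ledger.summary())
```

With something like this in the loop, a hallucinated troop becomes a rejected move with an explanation instead of a misclick.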
Even so: a general-purpose LLM, handed a noisy scrape of a game it had never trained on, executing a coherent basic attack with zero fine-tuning — that was a genuinely fun thing to watch work at all. Here's the demo.