Notes on vibe coding

For some time, I’d been underlining passages in books on my Kobo. Small observations, ideas I wanted to return to, quotes I planned to do something with. The problem was that those highlights lived on the device. Extracting them meant typing them out by hand (which I never did) or paying for a subscription service that felt like renting access to my own reading. The highlights accumulated, unread, in a drawer I never opened.

So I vibe coded an app to fix it.

Khi is a macOS application that connects to a Kobo e-reader, extracts your highlights, and exports them as Markdown files. It’s local-only, free, and does exactly one thing well. I built it over several days using AI as my co-developer: Claude Code for most of the implementation (including an initial interview that produced a detailed SPEC.md), with Kimi 2.5 during some early sprints (since it was free through Opencode for a week!) and Gemini for stretches when API limits on Opencode got in the way.

I’m not a programmer. I understand HTML and CSS reasonably well and have some familiarity with Svelte, but I’d never written a line of Rust in my life and had no experience with Tauri, a framework for building lightweight native apps that I chose specifically because it’s less resource-intensive than Electron. The architecture (a Rust backend parsing Kobo’s SQLite database, a Svelte frontend, communication through Tauri’s IPC bridge) wasn’t something I designed. I described the problem and the stack I wanted; the AI proposed the architecture and implemented it.
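
To make the shape of that bridge concrete, here is a minimal sketch of a Tauri command, the mechanism a Rust backend uses to expose functions to a Svelte frontend. It isn’t Khi’s actual code: the command name, the Highlight struct, and the placeholder body are mine, and it assumes a standard Tauri project scaffold.

```rust
// A minimal Tauri IPC sketch (not Khi's actual code): the frontend calls a
// named command over the bridge; the Rust side does the work and returns a
// serializable value.
use serde::Serialize;

#[derive(Serialize)]
struct Highlight {
    book: String,
    chapter: String,
    text: String,
}

// Callable from the Svelte side as invoke("load_highlights", { dbPath }).
#[tauri::command]
fn load_highlights(db_path: String) -> Result<Vec<Highlight>, String> {
    // In the real app, this is where the Kobo SQLite database would be
    // parsed; a placeholder keeps the sketch self-contained.
    Ok(vec![Highlight {
        book: "Example Book".into(),
        chapter: "Chapter 1".into(),
        text: format!("(highlights would be read from {})", db_path),
    }])
}

fn main() {
    tauri::Builder::default()
        .invoke_handler(tauri::generate_handler![load_highlights])
        .run(tauri::generate_context!())
        .expect("error while running the app");
}
```

Everything that touches the filesystem stays on the Rust side; the frontend only ever sees the command name and its serialized return value.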

The code side

The technical side delivered. The Rust code that reads the Kobo database, queries the SQLite tables, extracts highlights with their chapter metadata — none of this was something I could have written or even reviewed at a detailed level. It worked. There were refactors, moments where early architectural decisions needed revisiting. But that’s true of any project, with or without AI. As someone who couldn’t have written a line of this code alone, the moment that landed was simple: the first time I saw my own highlights appear in the interface, pulled from the database, chapter names intact. I hadn’t expected to feel much. I did.
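
For the curious, the database-reading step looks roughly like the sketch below, using the rusqlite crate. This isn’t Khi’s code either: the table and column names reflect my understanding of Kobo’s KoboReader.sqlite schema (highlights in a Bookmark table, with book and chapter titles joined in from the content table), and the app’s real queries may differ.

```rust
// A hedged sketch of pulling highlights out of a Kobo database with rusqlite.
// Table and column names are assumptions about the KoboReader.sqlite schema,
// not Khi's actual queries.
use rusqlite::{Connection, Result};

struct Highlight {
    book_title: String,
    chapter_title: String,
    text: String,
}

fn read_highlights(db_path: &str) -> Result<Vec<Highlight>> {
    let conn = Connection::open(db_path)?;

    // Highlights are assumed to live in the Bookmark table, with titles
    // resolved through the content table.
    let mut stmt = conn.prepare(
        "SELECT book.Title, chapter.Title, b.Text
         FROM Bookmark b
         JOIN content book    ON book.ContentID = b.VolumeID
         JOIN content chapter ON chapter.ContentID = b.ContentID
         WHERE b.Text IS NOT NULL",
    )?;

    let rows = stmt.query_map([], |row| {
        Ok(Highlight {
            book_title: row.get(0)?,
            chapter_title: row.get(1)?,
            text: row.get(2)?,
        })
    })?;

    rows.collect()
}
```

From there, turning each row into a Markdown file is mostly string formatting.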

That fluency has a straightforward explanation. Code has clear evaluation criteria. A function either produces the correct output or it doesn’t. Unit tests pass or fail. Error messages point to line numbers. The model has been trained on enormous amounts of annotated code: billions of files across dozens of languages, bug fixes, code reviews, explanatory pairs. When it writes Rust to parse SQLite, it’s operating in a domain where correctness is verifiable and the feedback loops that shaped its training were precise.

Sometimes, when working on more difficult bugs, I had to remember that fluency and accuracy are different things, and that nothing in the model’s training gave it a reliable way to tell them apart. It would loop through approaches, each slightly reframed, each delivered in the same assured tone, apparently unable to flag that it had reached a limit. That recognition was mine to supply, whether by suggesting it consult the official documentation or by pointing it toward how a similar project had solved the same problem.

The design wall

The visual implementation was a different story.

I had no experience with AI-assisted UI design, so I decided to try it out. I used Figma Make to produce a simple but effective visual design, then exported everything into a specification document: design tokens, CSS variables, color palettes, typography scales, and component specifications down to padding values, with React reference files showing exactly how each element should look. The spec was precise.

Generating a “good enough” design was rather trivial. Getting the AI to implement it was slower, more iterative, and more frustrating than I’d expected.

The core problem is the translation layer. When you describe a visual error in text, the model has to guess which element, which property, and which value you mean.

Tools like Agentation address this differently. Rather than approximating the problem in language, they extract precise CSS selectors and computed styles directly from the interface — exact coordinates instead of a verbal description. The model still can’t see the layout, but it doesn’t need to: “this selector, this computed value” is a different kind of input than “the button looks slightly off.” A more precise dialect.

Unfortunately, Agentation only works with React projects. For a Svelte app, that bridge didn’t exist. Screenshots and approximations were what remained.

Give the model a screenshot and you gain something, but less than you’d expect. The image becomes a set of numerical embeddings carrying a rough sense of composition: light areas, dark areas, blocks, text zones. Not “this button is misaligned by 8 pixels from the grid.”

This asymmetry is structural. For code, the training data contains annotated files, code reviews, bug fix commits, test suites: a rich record of correct and incorrect states with rationale. For visual design, what exists is mostly static HTML and screenshots, with almost no “this layout was broken, here’s how it was improved” pairs.

The context problem

On smaller tasks, context window limitations stay in the background. At the scale this project required, they didn’t.

Context is what the model can “see” at once: the active conversation, the files in scope, the spec you’ve shared. Modern models have large windows, but large isn’t the same as reliable. Researchers have a name for the degradation: context rot. The longer the session runs, the less reliable the model becomes, and not in any obvious or traceable way. Some models start making things up with increasing confidence; others stop committing to answers. The model just begins to forget, quietly and unevenly.

The developer needs, for lack of a better term, to act as a kind of “RAM manager” for the model: deciding what stays in context, what needs to be restated, and when to start a fresh session. The effective window, the range where performance stays consistent, is typically less than half the declared maximum. The rest is territory where the model still responds, but less reliably than advertised.

This is where context rot becomes most costly: any task that requires holding many constraints simultaneously. Design implementation is the clearest example in this project, but complex debugging sessions told the same story. The difference is feedback. Code tells you when something is wrong — an error, a failing test. Design just drifts.

Looking forward

The constraints I’ve run into here are, I suspect, temporary ones.

I expect most of what I’ve described here to be solved, progressively, by the companies building these models. Context rot, miscalibrated confidence, the failure to step back rather than just iterate — these are documented limitations, not mysteries. I think the design gap will close too, as multimodal training improves and more tools appear to address exactly this problem. None of this is guaranteed to go fast. But the direction is clear, and so is what’s already possible.

For most of software’s history, tools were built by specialists for general audiences. You got the features the developer thought you needed, the workflow that fit most users rather than you. That’s changing. It’s now possible to build the exact tool you need, with only the features that serve your specific use case, at a cost that makes the experiment worth running even if the result is imperfect. Khi is a modest example of that. The principle scales.

Anton Sten frames the deeper point better than I can:

“Building something silly is how you practice that. Not because the tool matters, but because the act of building rewires how you think. You stop being a passive consumer of software and start being someone who shapes their own tools. That’s adaptability in action.”

Khi wasn’t silly, but the logic holds. The real output of a project like this isn’t the app. It’s the rewiring.