A few days back, I was religiously watching the latest video from Andrej Karpathy where he introduced Vibe Coding.
I have been using AI code assistants for a while but am still highly involved in the coding process. AI mostly acts as smart auto-completion, taking over tedious boilerplate, writing docstrings, or occasionally explaining the code that I wrote yesterday and have conveniently forgotten. Vibe coding sounded more like graduating from autocomplete to actual co-creation – a shift from AI as an assistant to AI as a genuine coding partner.
Intrigued by this potential paradigm shift, I decided to experience it firsthand. In the same video, Andrej shared that he uses voice as his primary input roughly 50% of the time because it's more intuitive and efficient. He suggested some options for Mac that act as universal speech-to-text tools working across multiple apps. I primarily work on Windows for interacting with apps, and on Linux over the command line. As I couldn't find any good options for Windows (other than the built-in Voice Access, which is super slow), I decided to build one for myself. I had a few other potential ideas but settled on this one:
Goal – Build a minimal, private, universal speech-to-text desktop application
I call it Vaani (वाणी), meaning "speech" or "voice" in Sanskrit.
This article chronicles the journey of building Vaani. It's a practical exploration of what vibe coding actually feels like – the exhilarating speed, the unexpected roadblocks, the moments of genuine insight, and the lessons learned when collaborating intensely with an AI. I decided to use Claude Sonnet 3.7, the best (again, based on general vibe) coding assistant available at the time. Partway through, Google Gemini 2.5 Pro was released, and I brought it in as a code reviewer.
BTW, this article was largely dictated using Vaani 😊.
Let's vibe.
AI Developer: Claude Sonnet 3.7
AI Code Reviewer: Gemini 2.5 Pro Preview 03-25
The AI Developer and AI Code Reviewer always had the complete code as context for each prompt. I initiated a fresh conversation once a certain goal was achieved, e.g. a bug was fixed or a feature was implemented and working successfully. I did this to manage the context window and ensure the best AI performance. I did not use any agentic IDEs (e.g. Cursor or Windsurf) and instead relied on Claude Desktop and Google AI Studio to keep it pure. I also avoided any manual code changes, with the intent of releasing the code open source for community scrutiny.
So, where do we begin? Traditionally, this involves meticulous planning: outlining components, designing interfaces, choosing libraries, and setting up the project structure. Instead, I decided to start with the lazy prompt shown below.
I want to build a lightweight speech to text app in Python for Windows users. The idea is to help Windows users write things quickly using voice in any application e.g. word, powerpoint, browser etc. The app should work locally without the internet for privacy. Should activate using hot key or hot word.
And true to its reputation, Claude Sonnet 3.7 went completely berserk. It generated a comprehensive application structure almost instantly.
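I won't reproduce Claude's first pass here, but to give a flavor of the shape such a skeleton takes, below is a minimal sketch. The library choices (sounddevice for capture, faster-whisper for local transcription, keyboard for the hotkey and typed output) are my illustrative picks for this article, not necessarily what Claude generated.

```python
# Minimal sketch of a hotkey-activated, fully local speech-to-text app.
# Library choices are illustrative assumptions, not Vaani's actual stack.
import sounddevice as sd
import keyboard
from faster_whisper import WhisperModel

SAMPLE_RATE = 16_000
RECORD_SECONDS = 5  # fixed-length capture keeps the sketch simple

# Load a small Whisper model once; it runs offline after the initial download.
model = WhisperModel("base.en", device="cpu", compute_type="int8")

def dictate():
    # Record a short clip from the default microphone.
    audio = sd.rec(int(RECORD_SECONDS * SAMPLE_RATE),
                   samplerate=SAMPLE_RATE, channels=1, dtype="float32")
    sd.wait()
    # Transcribe locally, then type the text into whichever app has focus.
    segments, _ = model.transcribe(audio.flatten(), language="en")
    text = " ".join(seg.text.strip() for seg in segments)
    if text:
        keyboard.write(text + " ")

# Ctrl+Alt+V triggers a dictation burst; Esc exits.
keyboard.add_hotkey("ctrl+alt+v", dictate)
keyboard.wait("esc")
```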
The initial phase perfectly captured the allure of vibe coding: bypassing hours of design and foundational coding, moving directly from the idea to a tangible (albeit buggy) application skeleton. The feeling was one of incredible acceleration and possibility.
With the basic components in place, the real development began as I settled into a distinct rhythm – the core loop of AI-assisted vibe coding: describe the desired change in plain language, let the AI generate the code, run it, report what breaks, and repeat.
This loop was incredibly fast, but also heavily reactive. We weren't following a grand design; we were navigating by sight, fixing problems only after they surfaced. Key challenges, such as the fragmented text output and audio calibration issues described below, emerged rapidly.
This phase highlighted the raw power of AI for iteration, but also the potential chaos of debugging code generated by another entity, relying on the AI to fix its own mistakes based on my observations.
While Claude demonstrated impressive coding capabilities, getting most things right on the first go, it wasn't infallible, particularly regarding architectural choices. Several instances forced us to fundamentally rethink Claude's suggestions. Claude seemed eager to over-engineer solutions, making them extremely complex in an attempt to keep them generic. For example, when tackling the fragmented text output, Claude proposed a sophisticated data-buffering class. It worked but felt overly complex. Once I questioned the need for this complexity, it conceded, and we pivoted to a much simpler direct implementation that detects natural pauses. This is when I realized that developer intuition about simplicity and pragmatism is a valuable counterbalance to potential AI over-enthusiasm.
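To illustrate the kind of simplification I mean, here is a rough sketch of the pause-detection idea (the class name, thresholds, and frame sizes are my assumptions for this article, not Vaani's actual code): buffer incoming frames, and once a sustained run of quiet frames appears, flush everything before the pause to the transcriber as one phrase.

```python
import numpy as np

SILENCE_RMS = 0.01   # assumed energy threshold (set by calibration in Vaani)
PAUSE_FRAMES = 15    # ~0.5 s of consecutive quiet 30 ms frames

class PauseSegmenter:
    """Buffers audio frames and emits one chunk per naturally paused phrase."""

    def __init__(self):
        self.frames = []      # frames gathered since the last flush
        self.quiet = 0        # consecutive quiet frames seen so far
        self.voiced = False   # has any speech been heard since the last flush?

    def feed(self, frame: np.ndarray):
        """Feed one frame; returns a complete phrase as an array, or None."""
        rms = float(np.sqrt(np.mean(frame ** 2)))
        if rms >= SILENCE_RMS:
            self.voiced, self.quiet = True, 0
        else:
            self.quiet += 1
        self.frames.append(frame)
        # A sustained pause marks a phrase boundary: hand everything spoken
        # so far (minus the trailing silence) to the transcriber in one piece
        # instead of emitting word-by-word fragments.
        if self.quiet >= PAUSE_FRAMES:
            phrase = np.concatenate(self.frames[:-PAUSE_FRAMES]) if self.voiced else None
            self.frames, self.quiet, self.voiced = [], 0, False
            return phrase
        return None
```

Compared with a generic buffering layer, all the state lives in one small class that maps directly onto the observable behavior: people pause between phrases.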
Another instance was when we implemented audio calibration and (after Gemini rightly pointed out the efficiency issue) persisted it in settings. Later, a practical thought emerged: "Won't this calibration be specific to the microphone used?" This real-world usage scenario revealed a bug missed during generation. Claude first suggested storing calibration settings per device, but obliged with a simpler solution: just recalibrate if the input device changes. Again, considering the practical usage context (most users are unlikely to switch input devices frequently), persisting audio calibration for only one device made sense.
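A minimal sketch of that single-device approach (the file name and function names are hypothetical, for illustration only): store the threshold together with the device name, and treat the saved value as stale whenever the default input device no longer matches.

```python
import json
import pathlib

import sounddevice as sd

SETTINGS = pathlib.Path("settings.json")  # hypothetical settings location

def current_mic() -> str:
    """Name of the current default input device."""
    return sd.query_devices(kind="input")["name"]

def load_calibration() -> float | None:
    """Return the saved noise threshold, or None if the mic changed."""
    if not SETTINGS.exists():
        return None
    saved = json.loads(SETTINGS.read_text())
    # A different device invalidates the calibration and forces a redo.
    return saved["threshold"] if saved.get("device") == current_mic() else None

def save_calibration(threshold: float) -> None:
    """Persist the calibrated threshold for the current device only."""
    SETTINGS.write_text(json.dumps({"device": current_mic(),
                                    "threshold": threshold}))
```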
These moments underscore that effective vibe coding isn't passive acceptance; it's an active dialogue where the developer guides, questions, and sometimes corrects the AI's trajectory.
Reflecting on the Vaani development journey, I can see that it exhibited the core characteristics often associated with vibe coding.
This aligns well with the current discussion defining vibe coding by its speed, its reliance on natural language, and sometimes a lesser degree of developer scrutiny. However, my experience suggests it exists on a spectrum. While Vaani started near the "pure vibe" end, the project naturally shifted toward more structure (requesting modularization and code reviews) as it matured and approached release.
Working so closely with the AI on a complete project yielded insights that go beyond the typical "AI is fast but makes mistakes" narrative.
Vibe coding is undoubtedly a powerful tool, but it requires skill to wield effectively. If you are an experienced developer considering this approach, here are some general recommendations based on my experience building Vaani.
My journey building Vaani confirmed that AI-assisted "vibe coding" is more than just hype. It fundamentally changes the development workflow, offering unprecedented speed in translating ideas into functional code. It allowed me, a single developer, to build a reasonably complex application in a fraction of the time (~ 15 hours) it might have traditionally taken.
However, it's not a magic wand. It's a collaboration requiring communication, critical thinking, and oversight. The AI acts like an incredibly fast, knowledgeable, but sometimes over-enthusiastic assistant. It can generate intricate logic in seconds but might miss the simplest solution or overlook real-world constraints and practicality. It can fix bugs instantly but might struggle with the nuances.
The real power emerges when the developer actively engages – guiding the AI, questioning its assumptions, validating its output, and applying fundamental software engineering principles. Vibe coding doesn't replace developer skills; it shifts them towards architecture, validation, effective prompting, and critical integration. It's an exciting, powerful, and sometimes challenging new way to build, offering a glimpse into a future where human creativity and artificial intelligence work hand in hand, guided by sound engineering judgment.