Yes, even the ones who find vibe-coding stupid and complain about it. But everyone has the right to an opinion—free country.


Why this article exists

This is not a benchmark.

This is not a ranking.

This is not another “X vs Y” post pretending to be objective.

This is a retrospective on what I learned after months of using, breaking, over-trusting, and disbelieving AI tools built for developers.

Evaluating AI dev tools is harder than it looks, not because the space is complex, but because:

  1. There are many evaluation criteria.
  2. Most developers don’t know which ones matter.
  3. Marketing language hides architectural differences.
  4. Early positive experiences create long-term bias.
  5. And honestly, almost nobody cares enough to check.

As a result, most developers are not choosing tools — they just stick with whichever one impressed them on a problem they found hard to solve.

They find the first tool that “kind of works”, build habits around it, and never seriously question that choice again.

And even that is the best-case scenario. The worse one is mindlessly using whatever tool your boss is paying for because his 14-year-old nephew mentioned it over dinner last weekend.

This article is about breaking that pattern.


Let’s clear something up about price

Price is not irrelevant.

But price obsession is usually a symptom of shallow understanding.

The relevant unit is not:

  • dollars per month
  • dollars per token
  • subscription tier

The relevant unit is:

cost per correct outcome

Bad tools are expensive.

Bad agent behavior is expensive.

Bad orchestration is very expensive.

A tool that:

  • re-reads entire files unnecessarily
  • fails and retries blindly
  • hallucinates confidently
  • forces you to babysit every step

…will cost you more than a “premium” tool that quietly does the right thing.

Price matters only once you understand what you’re paying for.

Until then, it’s noise.
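
To make that metric concrete, here is a tiny back-of-the-envelope sketch. Every number in it (subscription prices, success rates, hourly rate) is purely illustrative:

```python
# "Cost per correct outcome" with purely illustrative numbers.

def cost_per_correct_outcome(subscription, tasks, success_rate, rework_hours, hourly_rate):
    """Monthly tool cost plus your time spent fixing failures, divided by tasks that land correctly."""
    correct = tasks * success_rate
    rework_cost = tasks * (1 - success_rate) * rework_hours * hourly_rate
    return (subscription + rework_cost) / correct

# "Cheap" tool: $20/month, but 40% of tasks need 30 minutes of babysitting or rework.
cheap = cost_per_correct_outcome(20, tasks=100, success_rate=0.6, rework_hours=0.5, hourly_rate=60)

# "Premium" tool: $100/month, 90% of tasks land cleanly.
premium = cost_per_correct_outcome(100, tasks=100, success_rate=0.9, rework_hours=0.5, hourly_rate=60)

print(f"cheap:   ${cheap:.2f} per correct outcome")    # ~$20.33
print(f"premium: ${premium:.2f} per correct outcome")  # ~$4.44
```

Swap in your own estimates; the point is that rework time swamps the subscription fee almost immediately.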


The real problem: developers are vibe-choosing tools


Most developers evaluate AI tools like this:

“This one feels smart.”
“This one completed my task once.”
“This one matches my editor.”

That’s not evaluation. That’s pattern recognition mixed with confirmation bias.

The problem is not that developers are lazy.

The problem is that the important criteria are invisible unless you know what to look for.

So let’s talk about those criteria.


1. Codebase "indexing", actually embedding (vectorization) — the silent multiplier

This is the most misunderstood concept in the entire space.

Indexing does not mean “can read files”.

Indexing means:

  • Your codebase is embedded into a vector space
  • The tool can retrieve relevant fragments on demand
  • Context is persistent, not re-sent every time

Why this matters:

  • The model does not need the whole repo
  • It reads only what is semantically relevant
  • It can jump across files intelligently
  • It scales with repo size

Without indexing:

  • Tools rely on brute-force context stuffing
  • Long context windows become a crutch
  • Costs explode
  • Reasoning degrades as noise increases

Important reality check:

A large context window is a poor substitute for real indexing and retrieval.

If a tool doesn’t index your codebase, it is guessing more than you think — even if it sounds confident.

Indexing is not sufficient on its own — retrieval quality and ranking matter — but without indexing, scalable reasoning is structurally impossible.

Deep dive in the bonus section below.


2. Context window length — powerful, overrated, often misused


Context length is the most marketed number because it’s easy to sell.

But context length only matters in two cases:

  1. No indexing exists
  2. The task is genuinely global (rare)

Most tools misuse context windows by:

  • dumping entire files
  • re-sending unchanged code
  • trusting the model to “figure it out”

This causes:

  • latency spikes
  • hallucinations
  • degraded precision

The real distinction is not large vs small context.

It’s:

  • static context (dumped blindly)
  • dynamic context (retrieved when needed)

Most tools don’t tell you which one they use.

That should already worry you.

Further reading on context usage: GitHub Copilot, Augment Code, and Cursor.


3. Tooling access: search, MCP, filesystem, shell

This is where assistants turn into agents.

Capabilities that matter:

  • Can search the web
  • Can decide when to search
  • Can verify claims
  • Can refuse to guess
  • Can read and write files selectively

Many tools technically support web search or MCP.

Few tools:

  • choose to use them intelligently
  • stop when information is insufficient
  • escalate instead of hallucinating

A tool that always answers is not helpful.

A tool that knows when to stop is.
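
Here is a minimal sketch of what "knowing when to stop" can look like at the orchestration layer. The thresholds are arbitrary and `Evidence` is a made-up stand-in for whatever the retrieval layer returns; no real tool works exactly like this:

```python
# Toy decision gate: answer only when evidence is strong enough,
# otherwise escalate to a tool or admit insufficiency instead of guessing.
from dataclasses import dataclass

@dataclass
class Evidence:
    text: str
    score: float  # retrieval similarity / confidence, 0..1

def answer_or_escalate(question: str, evidence: list[Evidence],
                       answer_threshold: float = 0.75,
                       search_threshold: float = 0.40) -> str:
    best = max((e.score for e in evidence), default=0.0)
    if best >= answer_threshold:
        return f"ANSWER from retrieved context (confidence {best:.2f})"
    if best >= search_threshold:
        return "SEARCH the web or the docs before answering"
    return f"STOP: not enough information to answer {question!r}, ask the user or escalate"
```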


4. Agent behavior under uncertainty (the real differentiator)


This is the most under-discussed axis — and the most important.

When a tool is unsure, does it:

  • ask clarifying questions?
  • dig deeper?
  • explore alternatives?
  • backtrack when wrong?

Or does it:

  • confidently ship nonsense
  • “kind of” complete the task
  • leave you with a 96% solution

This has little to do with raw model intelligence.

A weaker model with better agent policy often outperforms a stronger model with naive orchestration.

If a tool gives up silently, it will betray you at scale.
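
As a contrast to blind retries, here is a hedged sketch of the policy this section argues for: propose, verify, feed the failure back into the next attempt, and escalate instead of shipping an unverified change. The `propose_patch`, `run_tests`, and `ask_user` callables are hypothetical placeholders, not any specific tool's API:

```python
# Toy agent loop: verify every attempt, retry with new information,
# and stop to ask rather than silently shipping a "96% solution".
from typing import Callable, Optional

def solve_with_backtracking(task: str,
                            propose_patch: Callable[[str, list[str]], str],
                            run_tests: Callable[[str], tuple[bool, str]],
                            ask_user: Callable[[str], None],
                            max_attempts: int = 3) -> Optional[str]:
    failures: list[str] = []                    # memory of what already went wrong
    for _ in range(max_attempts):
        patch = propose_patch(task, failures)   # each attempt sees earlier failures
        ok, log = run_tests(patch)
        if ok:
            return patch                        # verified, safe to apply
        failures.append(log)                    # backtrack with new information
    # Out of attempts: escalate instead of guessing.
    ask_user(f"Stuck on {task!r} after {max_attempts} verified attempts: {failures[-1]}")
    return None
```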


5. Write performance & file discipline (where trust is won or lost)

Almost nobody talks about this, but developers feel it immediately.

Key behaviors:

  • Can it edit multiple files atomically?
  • Does it read only what it needs?
  • Does it respect existing structure?
  • Does it avoid reformatting everything?
  • Does it minimize diffs?

This is not a model problem.

This is a tooling and orchestration problem.

And it’s often the difference between:

  • “I trust this thing”
  • “I’ll double-check everything forever”
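
To make "minimize diffs" concrete, here is a small sketch with Python's standard difflib. The file content is illustrative; the point is that a disciplined tool reports and applies only the lines that actually changed:

```python
# Produce a minimal unified diff instead of rewriting the whole file.
import difflib

original = [
    "def greet(name):\n",
    '    print("hello " + name)\n',
]
edited = [
    "def greet(name: str) -> None:\n",
    '    print("hello " + name)\n',
]

diff = difflib.unified_diff(original, edited, fromfile="a/greet.py", tofile="b/greet.py")
print("".join(diff))
# Only the signature line appears in the diff; nothing else is reformatted or moved.
```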

6. Feedback loops & learning latency (trust is built here)


Trust is not built when a tool succeeds once.

Trust is built when mistakes do not repeat.

A good coding tool has short, visible feedback loops:

  • When you correct it, the correction sticks
  • When you constrain it, the constraint is respected
  • When it fails, it fails differently next time

Most tools treat every prompt as a stateless event.

That’s fine for chat.

It’s unacceptable for coding.

This is not model training — it’s session-level memory and constraint reinforcement.

Session-level intelligence matters more than raw model strength.

If a tool:

  • keeps making the same class of mistake
  • ignores corrections made minutes ago
  • requires you to restate invariants constantly

…then it is not learning. It is pattern-matching.

And pattern-matching tools scale frustration, not productivity.

A good tool reduces repeated explanation, not just keystrokes.
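
A minimal sketch of what session-level constraint reinforcement can look like: a correction is stored once and re-injected into every later prompt, so you stop restating invariants. The prompt format here is an assumption, not any particular tool's internals:

```python
# Toy session memory: constraints accumulate and are prepended to every prompt,
# so a correction made once keeps applying for the rest of the session.
class Session:
    def __init__(self) -> None:
        self.constraints: list[str] = []

    def correct(self, constraint: str) -> None:
        """Record a user correction as a standing constraint."""
        if constraint not in self.constraints:
            self.constraints.append(constraint)

    def build_prompt(self, request: str) -> str:
        rules = "\n".join(f"- {c}" for c in self.constraints)
        return f"Standing constraints:\n{rules}\n\nTask: {request}"

session = Session()
session.correct("Never reformat untouched lines.")
session.correct("Use the project's existing logger, not print().")
print(session.build_prompt("Add input validation to the upload endpoint."))
```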


7. Cognitive load reduction (the real KPI nobody measures)

The true value of a coding tool is not how much code it writes.

It’s how much mental energy it saves.

A good tool:

  • reduces decision fatigue
  • minimizes defensive prompting
  • avoids unnecessary confirmations
  • lets you stay in “problem space” instead of “tool management space”

Bad tools force you to:

  • constantly validate outputs
  • re-explain context
  • guard against obvious mistakes
  • babysit file changes

This is why two tools with similar capabilities can feel radically different.

One makes you calmer.

The other makes you alert.

That difference compounds over weeks.

If a tool increases your cognitive load, it is not helping — no matter how impressive it looks in demos.


Why people stick with the wrong tools

Once a developer has:

  • learned a tool’s quirks
  • adapted their workflow
  • internalized its limitations

Switching feels risky.

So they rationalize:

  • “All tools are the same”
  • “This one is good enough”
  • “The others are hype”

In reality, they are optimizing comfort, not capability.

The worst part is that there is a high chance you are missing out on good opportunities. Like when you missed Rovo's free 20 million daily tokens (including Claude Sonnet 4.5), or the fact that, as of January 2026, you can use Opus 4.5 for free in Google's new IDE, Antigravity.


About tool comparisons (and why they age badly)

Comparisons rot fast in this space.

Features change.

Indexing appears or disappears.

Pricing shifts.

Agent behavior evolves.

If you publish comparisons:

  • Be explicit about the date
  • Separate observed behavior from documented claims
  • Admit uncertainty

“I don’t know” is more credible than false precision.


So what actually makes a coding tool “good”?

Here is the definition this article implicitly builds toward:

A good coding tool is one that reliably produces correct, minimal, and explainable changes to a codebase, with predictable behavior, low cognitive overhead, and failure modes that are visible and correctable — at the scale and constraints you operate in.

Not:

  • the smartest model
  • the longest context window
  • the most features
  • the best marketing

But:

  • the most trustworthy system over time.

The uncomfortable conclusion

Most developers are not under-equipped.

They are under-informed.

AI dev tools didn’t become confusing because they’re complex.

They became confusing because marketing vocabulary replaced architectural explanations.

If you don’t know:

  • whether your tool indexes your code
  • how it builds context
  • how it behaves under uncertainty
  • why it succeeds when it does

Then you are not evaluating a tool.

You are trusting a vibe.

And vibes don’t scale.

TL;DR: Use Claude Code and you will not need to know about anything above (true as of January 2026).


Bonus

Bonus 1: Deeper dive into indexing (or more precisely, embedding—the process of converting code into vectors)


For indexing, language barely matters.

Take a sentence:

“The authentication system is broken.”

Now mix languages word by word:

“The authentification System ist cassé.”

It looks wrong to a human.

To a vector model, it’s the same meaning.

That’s because embeddings don’t encode language or grammar — they encode meaning as geometry.

Equivalent words across languages land in the same region of vector space.
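
You can check this yourself with any multilingual sentence-embedding model. Here is a hedged sketch using the open-source sentence-transformers library; it is just one embedder among many, and real coding tools typically use code-specific embedding models:

```python
# Compare an English sentence with its word-by-word mixed-language version:
# both should land close together in vector space.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual embedder

a = "The authentication system is broken."
b = "The authentification System ist cassé."   # mixed EN/FR/DE, same meaning
c = "The coffee machine is out of milk."       # unrelated control sentence

emb = model.encode([a, b, c])
print(util.cos_sim(emb[0], emb[1]))  # high similarity: same meaning, mixed languages
print(util.cos_sim(emb[0], emb[2]))  # much lower: different meaning
```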

Just so you know, embedding code is more complex than embedding natural language, but there are robust solutions for it.

Why this is fast (and why context stays small)

With vector indexing:

  • the codebase is embedded once
  • retrieval is just fast math (vector similarity)
  • only relevant fragments are loaded into context

Instead of brute-forcing thousands of tokens, the model works with tiny, precise context.

Less context → less noise → faster responses → more stable behavior.

Why tools like Cursor feel “in-style” without rules

This explains a common experience with tools like Cursor:

Even without explicit style rules, the generated code often:

  • matches existing patterns
  • respects naming conventions
  • feels consistent with the project

Not because the model “guessed well” — but because it’s likely retrieving nearby code patterns from the indexed codebase and continuing them.

It’s not following rules.

It’s following geometry.

That’s why it feels like “coding inside the project”, not next to it.

Additionally, Cursor stores only vector representations of your code in the cloud rather than the raw source itself, which is an encouraging signal for privacy.

The part most developers miss

Developers attribute this to:

  • vibes
  • UX
  • model choice

In reality, it’s indexing + math quietly doing the heavy lifting.

Most people never notice.

So they vibe-choose — and argue about models instead.

Funny fact: writing a prompt in French consumes roughly 1.5 to 2 times more tokens than the same prompt in English, even with the same character count. Try it here.
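
If you want to check it yourself, here is a quick sketch with OpenAI's tiktoken tokenizer. The exact ratio depends on the tokenizer and the text; the sentences below are just an example:

```python
# Count tokens for roughly the same request written in English and in French.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era models

en = "Please refactor the authentication module and add unit tests."
fr = "Merci de refactoriser le module d'authentification et d'ajouter des tests unitaires."

print(len(enc.encode(en)), "tokens (EN)")
print(len(enc.encode(fr)), "tokens (FR)")
# French usually comes out noticeably more expensive for comparable content.
```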


Bonus 2: The Promise of MoE (Mixture-of-Experts)

Not one of the criteria above, strictly speaking, but closely related: there is an elegant theoretical solution, Mixture of Experts (MoE), or more broadly, automatic model routing. The system would automatically route each task to the best-suited model without manual intervention.

In theory, MoE addresses the routing challenge through:

  • Automatic assessment: The system evaluates task requirements and matches them to available models
  • Dynamic distribution: Work is automatically allocated to specialized agents based on their strengths
  • Zero manual model selection: Developers don't need to think about which model to use—the system handles it

Research on LLM routing strategies and academic surveys demonstrate the theoretical foundations.
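
For intuition only, here is a toy sketch of the routing idea. It is deliberately naive, the model names and heuristics are made up, and it is nowhere near the production-grade systems discussed below:

```python
# Toy task router: classify the task, then map it to the best-suited
# (hypothetical) model profile, i.e. the "zero manual selection" promise.

MODELS = {
    "quick-edit":  "fast-small-model",        # cheap, low latency
    "deep-design": "strong-reasoning-model",  # slow, expensive, thorough
    "coding":      "code-tuned-model",        # balanced, code-specialized
}

def assess(task: str) -> str:
    """Crude automatic assessment of what the task needs (illustrative heuristics only)."""
    t = task.lower()
    if any(word in t for word in ("architecture", "design", "trade-off", "why")):
        return "deep-design"
    if any(word in t for word in ("implement", "refactor", "bug", "test")):
        return "coding"
    return "quick-edit"

def route(task: str) -> str:
    return MODELS[assess(task)]

print(route("Rename this variable everywhere"))          # fast-small-model
print(route("Refactor the auth module and add tests"))   # code-tuned-model
print(route("Why is this architecture hard to scale?"))  # strong-reasoning-model
```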

The Current Reality: Still Abstract

These solutions remain largely theoretical. There are no production-ready implementations developers can use today.

The gap between theory and practice is substantial:

  • Research focuses on specific scenarios, not general-purpose systems
  • Implementation details require significant engineering effort
  • Integration with existing development tools is minimal

We have academic papers and proof-of-concepts. We don't have production-grade libraries or battle-tested systems.

Why This Matters

Despite knowing how to solve the routing problem theoretically, developers are still stuck with manual complexity.

Until MoE implementations arrive:

  • You still need to manually select models
  • You still need to understand indexing, context windows, and agent behavior
  • You still need to make architectural decisions based on incomplete information

The promise is real. The theory is sound. But the implementation is missing.

PS: Kimi K2 uses MoE to be wide without being slow — high capacity, sparse compute — but MoE alone doesn’t replace good tooling, indexing, or agent design.

Best bonus

Did you notice that a model's disobedience is highly correlated with how "smart" it is?

Yes, the sad truth is that it tries to compensate for your own imperfections by assuming what you truly meant — and that assumption inevitably affects the dev experience as well.

Not sponsored, but I personally want to say that with Claude Code, the assumptions it makes feel right to me, since it behaves as if I had told it:

  • “If you do not one-shot this to the level of expectation I would have if you could read my mind, you fail.” or
  • “I just want you to do exactly what I asked, in one shot, with no irrelevant questions and every relevant check that saves me from wasting my time.”