When an LLM writes your code, different things matter than when a human does. You stop caring about IDE autocomplete quality and start caring about token count. You stop debating type systems and start measuring first-pass correctness. You stop choosing languages for their ecosystem and start choosing them for how efficiently they compress into — and out of — a context window.

We have been building with LLMs writing significant portions of our production code for the past year. We tried TypeScript, Python, and Ruby. Ruby won, and it was not close.

The data

We wrote the same four features — a CRUD controller, a data transformation pipeline, a test suite, and a background job — in Ruby, TypeScript, and Python. Identical functionality. Then we tokenized all twelve files with both OpenAI's tokenizer (used by GPT-4o, GPT-4.1, o3, and o4-mini) and Anthropic's tokenizer (used by Claude Sonnet 4 and Opus 4).

OpenAI tokenizer (o200k_base)

Example Ruby TypeScript Python Ruby vs TS Ruby vs Py
CRUD controller 230 361 406 -36% -43%
Data pipeline 174 371 203 -53% -14%
Test suite 338 458 377 -26% -10%
Background job 165 368 270 -55% -39%
Total 907 1,558 1,256 -42% -28%

Claude tokenizer (Sonnet 4 / Opus 4)

Example Ruby TypeScript Python Ruby vs TS Ruby vs Py
CRUD controller 269 460 508 -42% -47%
Data pipeline 220 473 248 -53% -11%
Test suite 390 586 482 -33% -19%
Background job 218 481 373 -55% -42%
Total 1,097 2,000 1,611 -45% -32%

Ruby saves 42% of tokens on OpenAI and 45% on Claude compared to TypeScript. Against Python, it saves 28% and 32% respectively. The savings are actually larger on Claude's tokenizer because Claude uses more tokens for code overall, which makes Ruby's structural conciseness matter more.

Where the tokens go

The savings are not a tokenizer quirk. They are structural. Look at the data pipeline example — one of the largest gaps in the tables, at 53% fewer tokens than TypeScript on both tokenizers.

Ruby:

class OrderReport
  def initialize(orders)
    @orders = orders
  end

  def generate
    @orders
      .select { |o| o.status == "completed" && o.total > 0 }
      .group_by { |o| o.created_at.strftime("%Y-%m") }
      .transform_values do |monthly_orders|
        {
          count: monthly_orders.size,
          revenue: monthly_orders.sum(&:total),
          average: monthly_orders.sum(&:total) / monthly_orders.size.to_f,
          top_product: monthly_orders
            .flat_map(&:line_items)
            .tally_by(&:product_name)
            .max_by { |_, count| count }
            &.first
        }
      end
      .sort_by { |month, _| month }
      .to_h
  end
end

174 tokens on GPT-4o. 220 on Claude. Now the TypeScript equivalent:

import { Order, LineItem } from "./types";

interface MonthlyReport {
  count: number;
  revenue: number;
  average: number;
  topProduct: string | undefined;
}

export function generateOrderReport(
  orders: Order[]
): Record<string, MonthlyReport> {
  const completed = orders.filter(
    (o) => o.status === "completed" && o.total > 0
  );

  const grouped = new Map<string, Order[]>();
  for (const order of completed) {
    const month = order.createdAt.toISOString().slice(0, 7);
    const existing = grouped.get(month) ?? [];
    existing.push(order);
    grouped.set(month, existing);
  }

  const result: Record<string, MonthlyReport> = {};
  const sortedKeys = [...grouped.keys()].sort();

  for (const month of sortedKeys) {
    const monthlyOrders = grouped.get(month)!;
    const revenue = monthlyOrders.reduce((sum, o) => sum + o.total, 0);

    const productCounts = new Map<string, number>();
    for (const order of monthlyOrders) {
      for (const item of order.lineItems) {
        productCounts.set(
          item.productName,
          (productCounts.get(item.productName) ?? 0) + 1
        );
      }
    }

    let topProduct: string | undefined;
    let maxCount = 0;
    for (const [name, count] of productCounts) {
      if (count > maxCount) {
        maxCount = count;
        topProduct = name;
      }
    }

    result[month] = {
      count: monthlyOrders.length,
      revenue,
      average: revenue / monthlyOrders.length,
      topProduct,
    };
  }

  return result;
}

371 tokens on GPT-4o. 473 on Claude. Same output. The TypeScript version needs import statements, an interface definition, explicit type annotations on every variable, manual Map construction instead of group_by, manual iteration instead of transform_values, and a manual max-finding loop instead of max_by.

None of that is bad TypeScript. It is idiomatic, well-typed, clean code. It just takes twice as many tokens to say the same thing.
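The structural gap is easy to check for yourself. Below is a trimmed, self-contained sketch of the same pipeline that runs in plain irb: Struct stand-ins replace the ActiveRecord models, and tally replaces tally_by (which needs Ruby 3.4+); the chaining style is the same.

```ruby
Order    = Struct.new(:status, :total, :created_at, :line_items, keyword_init: true)
LineItem = Struct.new(:product_name, keyword_init: true)

orders = [
  Order.new(status: "completed", total: 50, created_at: Time.utc(2025, 1, 10),
            line_items: [LineItem.new(product_name: "Widget")]),
  Order.new(status: "completed", total: 30, created_at: Time.utc(2025, 1, 20),
            line_items: [LineItem.new(product_name: "Widget"),
                         LineItem.new(product_name: "Gadget")]),
  Order.new(status: "pending", total: 99, created_at: Time.utc(2025, 2, 1),
            line_items: [])
]

report = orders
  .select { |o| o.status == "completed" && o.total > 0 }   # drop unpaid/incomplete
  .group_by { |o| o.created_at.strftime("%Y-%m") }          # bucket by month
  .transform_values do |monthly|
    {
      count: monthly.size,
      revenue: monthly.sum(&:total),
      top_product: monthly.flat_map(&:line_items)
                          .map(&:product_name)
                          .tally                             # tally_by on Ruby 3.4+
                          .max_by { |_, c| c }&.first
    }
  end

# report => {"2025-01" => {count: 2, revenue: 80, top_product: "Widget"}}
```

Every step the TypeScript version hand-rolls — the Map construction, the sorted iteration, the max-finding loop — is a single Enumerable call here.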

Why fewer tokens matter more than you think

The obvious argument is cost. At GPT-4o rates ($2.50/1M input, $10/1M output), a team of five developers doing 20 LLM sessions per day saves on the order of $10/month by using Ruby instead of TypeScript (the arithmetic is in the economics section below). That is not life-changing.

The real argument is context window budget. Every LLM has a fixed context window — 128K tokens for GPT-4o, 200K for Claude. When you send code to an LLM, you are spending from that budget. When the LLM sends code back, it spends more. When you include project context, documentation, or examples, that spends even more.

A 42% reduction in token usage per code block means 42% more room for everything else. More files in context. Longer conversations before the model starts forgetting. More examples in your prompts. That is the difference between an LLM that understands your codebase and one that lost track three messages ago.

And there is a compounding effect: when the LLM generates Ruby, the output is also shorter. The model finishes faster, costs less, and leaves more room in the context for the next round-trip. Over a multi-turn coding session, the savings stack.
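That compounding is easy to model. A back-of-the-envelope sketch, where the 1.7x multiplier is the measured TypeScript/Ruby ratio from the tables above and the per-turn sizes are illustrative assumptions, not measurements:

```ruby
# Each turn resends the running conversation, so per-turn code size compounds.
TS_RATIO = 1.7  # measured TypeScript/Ruby token ratio from the tables above

def cumulative_tokens(turns, tokens_per_turn)
  history = 0
  total = 0
  turns.times do
    history += tokens_per_turn  # this turn's code joins the conversation
    total += history            # the whole history is resent on every turn
  end
  total
end

# Assume 1,200 tokens of code per turn in Ruby (800 in + 400 out).
ruby_total = cumulative_tokens(10, 1_200)                    # => 66,000
ts_total   = cumulative_tokens(10, (1_200 * TS_RATIO).round) # => 112,200
```

The ratio stays 1.7x, but the absolute gap — here 46,200 tokens over ten turns — grows with every turn, which is exactly the context-window budget the previous paragraphs describe.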

RSpec: the best testing story for LLM output

When an LLM generates code, you need to verify it works. This is where Ruby's testing ecosystem — specifically RSpec — becomes a genuine competitive advantage.

RSpec.describe LlmCodeGenerator do
  subject(:generator) { described_class.new(model: "gpt-4o") }

  describe "#generate_migration" do
    let(:prompt) { "Create a posts table with title, body, and published_at" }
    let(:result) { generator.generate_migration(prompt) }

    it "produces valid Ruby syntax" do
      expect { RubyVM::InstructionSequence.compile(result.code) }.not_to raise_error
    end

    it "includes all requested columns" do
      expect(result.code).to include("t.string :title")
        .and include("t.text :body")
        .and include("t.datetime :published_at")
    end

    context "when the model hallucinates a gem dependency" do
      let(:hallucinated_response) { "require 'made_up_gem'\n\nclass CreatePosts < ActiveRecord::Migration[8.0]\nend" }

      before { allow(generator).to receive(:raw_response).and_return(hallucinated_response) }

      it "strips require statements for unknown gems" do
        expect(result.code).not_to match(/^require/)
      end
    end

    context "with token budget tracking" do
      it "stays within budget" do
        expect(result.tokens_used).to be < 500
      end

      it "costs less than a penny" do
        expect(result.cost_usd).to be < 0.01
      end
    end
  end
end

338 tokens. Compare to the Vitest equivalent at 458 tokens — Ruby needs 26% fewer tokens for the same coverage. But the token count is secondary to what makes RSpec actually better for this use case:

Composable matchers. expect(result.code).to include("t.string :title").and include("t.text :body") chains multiple assertions in a single readable line. In Vitest, that is three separate expect().toContain() calls, each awaiting the same async result.

Context blocks as specifications. RSpec's context "when the model hallucinates" reads like a specification document. When you are defining expected behavior for non-deterministic LLM output, this structure matters — you are writing a contract, not a test.

Lazy evaluation with let. let(:result) { generator.generate_migration(prompt) } is evaluated once per test, on first access. No beforeEach boilerplate, no repeated async calls, no stale variable bugs.
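Outside RSpec, let behaves roughly like a lazily-initialized, memoized method. A plain-Ruby sketch of the semantics, not RSpec's actual implementation:

```ruby
class Example
  def initialize(&block)
    @block = block
    @calls = 0
  end

  attr_reader :calls

  # Roughly what `let(:result) { ... }` defines: compute on first access, cache after.
  # (RSpec's real `let` also memoizes nil/false results, which ||= alone would not.)
  def result
    @result ||= begin
      @calls += 1
      @block.call
    end
  end
end

ex = Example.new { 21 * 2 }  # stands in for `let(:result) { generator.generate_migration(prompt) }`
ex.result   # block runs here, on first access
ex.result   # cached; the block does not run again
ex.calls    # => 1
```

One evaluation per example, triggered only if a test actually touches it — that is why an expensive LLM call in a let block costs nothing in tests that never reference it.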

Built-in syntax validation. RubyVM::InstructionSequence.compile(result.code) checks that generated Ruby code is syntactically valid without executing it. TypeScript requires spawning a tsc process or using new Function() which only validates JavaScript, not TypeScript.
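That check wraps naturally into a helper. The valid_ruby? name is ours, but RubyVM::InstructionSequence.compile is the stdlib API (CRuby-specific); it parses and compiles without running anything:

```ruby
# Returns true if +code+ parses as Ruby. Nothing is executed either way:
# compile only builds the instruction sequence, it never runs it.
def valid_ruby?(code)
  RubyVM::InstructionSequence.compile(code)
  true
rescue SyntaxError
  false
end

valid_ruby?("def fulfill!; @status = :done; end")  # => true
valid_ruby?("def broken(")                          # => false
```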

Ruby syntax is closer to natural language

LLMs are language models. They were pretrained on trillions of tokens of natural language before they ever saw code. Ruby's syntax is closer to English than any mainstream language:

5.times do
  order.fulfill! if order.paid?
end

users.select(&:active?).sort_by(&:created_at).first(10)

retry_on Net::ReadTimeout, wait: :polynomially_longer, attempts: 5

Every line reads like a sentence. retry_on ReadTimeout, wait polynomially_longer, attempts 5 — that is almost English. Compare to the TypeScript equivalent: backoff: { type: "exponential", delay: 1000 } inside a configuration object three levels deep.

This is not an aesthetic preference. It means LLMs produce correct Ruby on the first attempt more often than they produce correct TypeScript. The code is closer to the natural language description the model works from internally. Fewer tokens to express intent means fewer places for the model to diverge from what you asked for.

The counterarguments

"TypeScript has type safety." It does, and types are valuable for human developers navigating large codebases. But LLMs do not need type annotations to generate correct code — they infer types from context, variable names, and usage patterns. Type annotations are for humans and compilers. To an LLM, name: string is redundant information that costs tokens. A parameter called name is obviously a string.

"Python has the ML ecosystem." True. PyTorch, scikit-learn, and transformers are Python-first. But you do not build your web application in the same language you train models in. Use Python for training and data science. Use Ruby for the product that talks to the LLM API.

"Ruby is slower." With YJIT (standard since Ruby 3.1) and ZJIT in development, Ruby is fast enough for any web application. Your bottleneck is the database and the LLM API response time, not the language runtime. An LLM API call measured in seconds dwarfs any difference between Ruby and TypeScript execution speed.

"Nobody uses Ruby anymore." Ruby has a smaller community than TypeScript or Python, which is actually an advantage for LLM-generated code. The Ruby ecosystem is more stable and opinionated — there is usually one good way to do things (Rails, RSpec, Sidekiq/Solid Queue). LLMs generate more consistent Ruby because there are fewer competing patterns to confuse the model. Ask an LLM to write a TypeScript API and you might get Express, Fastify, Hono, Koa, or NestJS. Ask it to write a Ruby API and you get Rails.

The economics

Here is a concrete scenario. A team of five developers, each running 20 LLM-assisted coding sessions per day, averaging 800 input tokens and 400 output tokens per session (in Ruby token counts). Per developer, over 22 working days a month:

Language Daily input Daily output Monthly cost
Ruby (baseline) 16,000 8,000 $2.64
Python (1.3x) 20,800 10,400 $3.43
TypeScript (1.7x) 27,200 13,600 $4.49

At GPT-4o rates. For Claude, multiply by roughly 1.2x across the board (Claude's tokenizer produces more tokens for code, so the absolute costs are higher, but Ruby's percentage savings are also larger).

The monthly dollar difference is small. The context window difference is not. Every TypeScript session burns 70% more of the context budget than the same Ruby session. Over a multi-turn conversation with file context, that is the difference between the model seeing your whole feature and the model losing track of your data model halfway through.
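For reference, the table's arithmetic spelled out in Ruby — the figures match the table if you read it per developer over 22 working days a month:

```ruby
INPUT_RATE_PER_M  = 2.50   # GPT-4o, USD per 1M input tokens
OUTPUT_RATE_PER_M = 10.00  # USD per 1M output tokens
WORKING_DAYS      = 22     # working days per month

def monthly_cost(daily_in, daily_out)
  daily = daily_in  * INPUT_RATE_PER_M  / 1_000_000.0 +
          daily_out * OUTPUT_RATE_PER_M / 1_000_000.0
  (daily * WORKING_DAYS).round(2)
end

monthly_cost(16_000,  8_000)  # Ruby            => 2.64
monthly_cost(20_800, 10_400)  # Python (1.3x)   => 3.43
monthly_cost(27_200, 13_600)  # TypeScript (1.7x) => 4.49
```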

What we actually do

We are not theorizing. This is how we build software at Bytecode. Our production stack is Ruby on Rails with RSpec. When we use LLMs to generate code — and we do, extensively — the code is Ruby. The token efficiency of Ruby was not why we chose it originally, but it is a meaningful advantage now that LLMs are part of our daily workflow.

Our fine-tuned Rails models

We went further than just choosing Ruby. We fine-tuned our own models specifically for Ruby on Rails code generation, trained on 111,000 samples extracted from our own internal Rails projects. The models are open and available on Hugging Face:

  • qwen3-coder-30b-rails — 31B parameter MoE model. Available in Q4_K_M (18.6 GB) and Q5_K_M (21.7 GB) GGUF formats. This is the flagship: it writes idiomatic Rails code that follows our conventions out of the box.

  • qwen3-8b-rails — 8B parameter dense model. Q4_K_M at just 5 GB. Runs on a laptop with 8 GB of RAM. Fast enough for inline code completion, small enough to run locally alongside your dev server.

Both models run locally with Ollama:

ollama run bytecodehr/qwen3-8b-rails
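Ollama also serves an HTTP API on localhost:11434, so a Ruby app can call the model with nothing but the standard library. A minimal non-streaming sketch; the helper names and prompt are ours:

```ruby
require "net/http"
require "json"

OLLAMA_URI = URI("http://localhost:11434/api/generate")

# Build the request body for Ollama's /api/generate endpoint.
def generate_payload(prompt, model: "bytecodehr/qwen3-8b-rails")
  { model: model, prompt: prompt, stream: false }
end

# Send the prompt and return the model's text (requires a running Ollama server).
def generate(prompt)
  response = Net::HTTP.post(OLLAMA_URI, generate_payload(prompt).to_json,
                            "Content-Type" => "application/json")
  JSON.parse(response.body).fetch("response")
end

# generate("Write a Rails migration that adds a published_at column to posts")
```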

The combination is what matters: Ruby's 42–45% token savings mean our models can fit more context into the same window, and the fine-tuning means the code they generate already follows our patterns — Devise authentication, namespaced concerns, Sidekiq, state-as-records. No prompt engineering needed to avoid the generic defaults that general-purpose models fall back on.

The full story of how we built the dataset and trained the models is in Part 1: Dataset Engineering and Part 2: Training, Quantization, and Deployment.

Conclusion

If you are starting a new project and you plan to use LLMs as a core part of your development process, consider the language you are asking them to think in. Ruby is shorter, more expressive, closer to natural language, and has the best testing story for validating AI-generated output. The token savings are real and they compound.

And if you want models that already speak fluent Rails, we built those too.