How We Doubled AI Code Acceptance by Teaching Models to Think Like Roblox Engineers

Domain-Aware Code Intelligence

Everybody gets excited about the next breakthrough AI model, but the secret to doubling the effectiveness of internal AI tools at Roblox wasn’t a new model. It was embedded in the history of our codebase. By leveraging years of code and reviews from our domain experts, we took AI-generated pull request (PR) suggestion acceptance rates from approximately 30% to over 60% across a 10,000-PR set and boosted an agentic code cleanup project’s eval accuracy above 90% over the same period.

Closing the AI Quality Gap

Across the industry, 50% to 60% of coding time is spent on software maintenance.¹ Roblox is no different.

On paper, repetitive maintenance tasks with well-defined requirements and a constrained problem space are perfect candidates for AI automation. In practice, our AI assistants struggled with evaluation accuracy and engineering acceptance.

At Roblox, the problem isn’t capability; it’s context. A generic model hasn’t lived through two decades of Roblox engineering. It hasn’t seen the 700,000 pull requests we’ve merged in the last three years, or learned from the 1.7 million code review comments where our most experienced engineers define and defend our coding standards.

AI assistants that ignore this history fail to win the trust of world-class engineers. Despite half of Roblox engineers adopting AI-powered assistants, only around 20% of AI-generated suggestions are accepted after human review. Our quarterly engineering productivity survey echoes this reality. Engineers score AI’s impact on productivity at 4.02 out of 5 but score confidence in AI code quality at only 3.09 out of 5. In short, AI helps, but trust remains limited, particularly in legacy C++ and more complex code domains.

To close this context gap, we invested in an agentic code intelligence platform built with Roblox’s own engineering history, aligned with expert exemplars, and validated through rigorous evaluation. This code intelligence platform is designed not merely to generate code suggestions but to iterate with the institutional depth of a Roblox engineer.

Learning From the Best of Roblox Engineering Experience

Roblox’s engineering corpus spans nearly 20 years of commits, design docs, and production telemetry, a uniquely rich dataset capturing how our systems evolved and how our engineers solved hard problems.

The code intelligence platform aims to transform that data into a structured knowledge graph, a significant engineering challenge. In a massive polyglot environment, code isn’t just a set of text files; it’s a complex web of build targets, C++ template instantiations, and dynamic Lua dependencies. Simply parsing the text is insufficient: the system must understand the deep semantic relationships buried in the codebase, many of them specific to our unique architecture.

Another challenge is tracing and temporal alignment. In order to reason across interconnected systems, an agentic system must link static code repos to noisy runtime telemetry and map millions of production signals back to the exact version of the code that generated them, even as the codebase continues to evolve.

To solve this, our strategy is to unify version control, build graphs, and runtime telemetry into a hybrid symbolic-vector representation, preserving syntax, semantics, and relationships. This allows the code intelligence platform to understand code the way senior engineers do: as interconnected systems shaped by design rationale, trade-offs, and performance data, not isolated text files.
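To make this concrete, here is a minimal sketch of what one node in a hybrid symbolic-vector representation could look like: symbolic fields preserve exact, version-pinned relationships (build graph, call graph, the commit a telemetry signal maps back to), while an embedding carries the fuzzy semantic signal used for similarity search. The schema and field names are illustrative, not our production implementation:

```python
from dataclasses import dataclass, field

@dataclass
class CodeNode:
    """One node in a hybrid symbolic-vector code graph (illustrative schema)."""
    symbol: str                   # fully qualified name, e.g. a C++ function
    language: str                 # "cpp", "lua", ...
    commit_sha: str               # version of the code this node was built from
    build_targets: list[str]      # symbolic edges into the build graph
    callers: list[str]            # symbolic call-graph edges
    embedding: list[float]        # vector half: semantic embedding of the code
    telemetry: dict[str, float] = field(default_factory=dict)  # e.g. p99 latency

def callers_of(graph: dict[str, CodeNode], symbol: str) -> list[CodeNode]:
    """Symbolic traversal: exact, version-pinned relationships, no similarity."""
    return [graph[c] for c in graph[symbol].callers if c in graph]
```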

Extracting Expert Signals Through Exemplar Alignment

True expertise hides in patterns, review comments, commit histories, and subtle code idioms. Code intelligence surfaces this implicit wisdom via an exemplar alignment engine, which allows engineers to curate “gold” examples of ideal implementations or review rationale.

Previously, an experienced engineer might spend hours every week reviewing PRs, repeatedly flagging the use of a blocking FetchData call inside high-frequency loops, a pattern that looks semantically correct but causes severe latency at Roblox scale. If the expert is out of town or misses an error, their knowledge may not be applied, and an anti-pattern could slip into production and cause an outage for our community.

Using the alignment engine, that engineer can encode their judgment into a natural language exemplar. This is a structured definition that combines the code pattern (the “what”) with the reasoning (the “why”). Now, the system automatically detects the blocking call, flags it, explains the latency risk, and links directly to the internal documentation on asynchronous best practices:

Blocking inside a high-frequency loop leads to increased latency and thread exhaustion. When a `FetchData` call is made in an async task, warn the author about latency and thread exhaustion. `FetchData` is OK as long as the task has been awaited already. Provide a direct link to async best practices at: internal_guidance/async.
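Under the hood, a curated exemplar can be thought of as a small structured record. The sketch below is hypothetical (the field names and schema are our illustration, not the platform’s actual format), but it captures the what/why/exception/link structure of the example above:

```python
from dataclasses import dataclass

@dataclass
class Exemplar:
    """Hypothetical record for a curated exemplar (field names are illustrative)."""
    pattern: str     # the "what": the code pattern to detect
    rationale: str   # the "why": the risk, in the reviewer's own words
    exception: str   # when the pattern is acceptable
    doc_link: str    # internal guidance to cite in the generated comment

blocking_fetch = Exemplar(
    pattern="`FetchData` called inside a high-frequency loop or un-awaited async task",
    rationale="Blocking here causes increased latency and thread exhaustion at scale.",
    exception="`FetchData` is OK once the task has already been awaited.",
    doc_link="internal_guidance/async",
)
```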

This effectively encodes an engineer’s hard-won knowledge from years of experience. The system transforms a one-off review comment into a permanent, automated guardrail.

“What makes the alignment engine powerful isn’t just that it uplevels code quality—it scales mentorship. We encode the expertise and intuition of our most seasoned experts into the platform itself. It’s like having a senior Roblox domain expert pair programming with you all day, every day.” —Tom Knych, Senior Technical Director

But our experts also have a lot on their plates, and asking them to recall and write down all of their key insights is a time-consuming and lossy process at best. So, how do we help them capture their best advice throughout their time at Roblox?

It’s already there, written across their meticulous code review comments and memorialized in each and every PR that makes it to production:

[Figure 2]

We route historical PR comments through a pipeline that cleans the data and extracts the highest-value themes from Roblox experience. The raw data is noisy: it’s filled with non-actionable comments like praise or typo fixes, and the valuable feedback is often written in shorthand that relies heavily on context. For example, a note like “use the new pattern here” is meaningless without an understanding of the specific file and diff. The system must translate these specific interactions into reusable, generalizable rules.

To solve this, we employ a multistage algorithm that detects recurring themes across thousands of PRs without human intervention. The system embeds historical comments into vector space, uses greedy clustering to find neighborhoods of related feedback, and applies LLM‑guided refinement to merge them into high-value patterns.
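A minimal sketch of the greedy clustering stage, assuming normalized embeddings and a fixed cosine-similarity threshold (both are illustrative choices, not the production configuration); an LLM-guided pass would then merge and refine these neighborhoods into named patterns:

```python
import numpy as np

def greedy_cluster(embeddings: np.ndarray, threshold: float = 0.8) -> list[list[int]]:
    """Greedily group comment embeddings into neighborhoods of related feedback.

    Each comment joins the closest existing cluster whose centroid is within
    `threshold` cosine similarity; otherwise it seeds a new cluster.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    clusters: list[list[int]] = []
    centroids: list[np.ndarray] = []
    for i, vec in enumerate(unit):
        best, best_sim = None, threshold
        for c, centroid in enumerate(centroids):
            sim = float(vec @ centroid)
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append([i])
            centroids.append(vec.copy())
        else:
            clusters[best].append(i)
            merged = unit[clusters[best]].mean(axis=0)   # update running centroid
            centroids[best] = merged / np.linalg.norm(merged)
    return clusters
```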

The result is a ranked list of candidate exemplars (or learnings), prioritized by how often they appear and how widely they are cited by different reviewers, complete with citations to the original comments. Our domain experts then review the candidates, make edits if necessary, and decide which ones to promote to the knowledge base as core best practices. After the first previews of this pipeline, repository leads were excited to see their favorite topics bubble up as key guardrails and immediately wanted to sign their repos up for analysis.   
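Ranking can then weigh each theme’s frequency against its reviewer breadth while keeping citations, roughly like the sketch below (the comment fields and scoring formula are assumptions for illustration, not the production heuristic):

```python
def rank_candidates(clusters: list[list[int]], comments: list[dict]) -> list[dict]:
    """Rank clustered themes by frequency and reviewer breadth, keeping citations.

    Assumes each comment dict carries hypothetical "reviewer" and "url" keys.
    """
    candidates = []
    for members in clusters:
        reviewers = {comments[i]["reviewer"] for i in members}
        candidates.append({
            "score": len(members) * len(reviewers),    # frequency x breadth
            "citations": [comments[i]["url"] for i in members],
        })
    return sorted(candidates, key=lambda c: c["score"], reverse=True)
```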

The final step is the alignment agent, which assists human engineers and AI coding agents alike by checking all changes against the exemplar knowledge base. This flexible assessment can be applied throughout the software development lifecycle: at coding time, at merge time, and even with a continuous improvement agent that autonomously grooms the Roblox codebase as the knowledge base grows.  
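In spirit, the alignment agent’s check is a retrieval-augmented review: fetch the exemplars most relevant to a change, then ask a model to judge the change against them. A minimal sketch, with `knowledge_base.nearest`, `embed`, and `llm` standing in for whatever retrieval and model endpoints are actually used:

```python
def review_change(diff: str, knowledge_base, embed, llm) -> str:
    """Check a change against the exemplar knowledge base via in-context learning.

    `knowledge_base.nearest`, `embed`, and `llm` are placeholders, not real APIs.
    """
    relevant = knowledge_base.nearest(embed(diff), k=5)  # retrieve matching exemplars
    prompt = "Review this diff against the following Roblox standards:\n"
    for ex in relevant:
        prompt += f"- {ex.pattern}: {ex.rationale} (see {ex.doc_link})\n"
    prompt += f"\nDiff:\n{diff}\n\nFlag any violations and explain the risk."
    return llm(prompt)
```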

By using this in-context learning to anchor AI behavior to Roblox standards, we saw one AI coding agent’s pass rates on its golden evaluation dataset jump from 84% to 100%. We aren’t just teaching Roblox AI how to code; we’re teaching our AI how Roblox engineers think.

Learning From Negative Signals

While exemplar alignment has significantly raised our baseline for codebase quality, our ultimate goal is to reach the point where the first pass of AI-suggested code is as trusted as the work of our most experienced engineers. That’s why we use every rejected AI suggestion, failed refactor, or regression-inducing merge as a high-value signal that we can feed back into the system. This creates a pipeline for agents to continuously improve and learn from their mistakes.

Negative outcomes can be filtered and labeled by domain experts with detailed reasoning, a chain of thought, and any additional context around the failure. This data is then embedded semantically and indexed for retrieval. When our code intelligence platform proposes new output, it performs a semantic search through this data, recalling past mistakes and reviewer feedback to avoid repeating them.
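A simplified sketch of that recall step, assuming failures are stored as labeled, embedded records (the record fields and interfaces are illustrative):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """Illustrative record for one labeled negative outcome."""
    snippet: str            # the rejected suggestion or reverted diff
    label: str              # e.g. "rejected", "reverted", "regression"
    reasoning: str          # the expert's chain of thought on why it failed
    embedding: np.ndarray   # semantic index key

def recall_failures(index: list[FailureRecord], query: np.ndarray, top_k: int = 3):
    """Return the past mistakes most semantically similar to the proposed output."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda r: cosine(query, r.embedding), reverse=True)
    return ranked[:top_k]
```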

This closed feedback loop transforms each code review into structured learning data, continuously refining future agent behavior through adversarial and critique-based training.

Constructing a Robust Evaluation Framework

Trust is built through reliable, predictable behavior that starts with measurement. We have designed a dedicated evaluation system to track our agents’ performance over time.

The framework includes:

  • Task-level benchmarks: Precision and recall across thousands of Roblox engineering activities, like refactoring, testing, and bug-fix tasks.

  • Simulation harnesses: Synthetic PRs with deterministic outcomes for reproducible scoring (see the sketch after this list).

  • Human-in-the-loop panels: Expert comparison of AI output vs. gold-standard implementations.

  • Execution framework: When merging agent improvements, relevant evals are parallelized and run as part of the pre-merge continuous integration (CI) suite, giving engineers high confidence in their changes. 

  • Longitudinal metrics: Post-merge regressions, revert frequency, and latency changes tracked across releases.

  • Pervasive observability: Automatic tracing and visualization of agent activity to connect agents with the rest of Roblox and to feed smoothly into online and offline evaluation.
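To illustrate the simulation harness referenced above, here is a hedged sketch of deterministic scoring over synthetic PRs; the `agent.review` interface and the PR structure are assumptions for illustration:

```python
def score_agent(agent, synthetic_prs: list[dict]) -> dict[str, float]:
    """Score an agent revision on synthetic PRs with known expected findings.

    Deterministic ground truth makes runs reproducible and directly comparable
    across agent revisions and model versions.
    """
    tp = fp = fn = 0
    for pr in synthetic_prs:
        flagged = set(agent.review(pr["diff"]))    # issues the agent raises
        expected = set(pr["expected_findings"])    # deterministic ground truth
        tp += len(flagged & expected)
        fp += len(flagged - expected)
        fn += len(expected - flagged)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```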

This system produces an agent quality score that accurately tracks performance shifts over time, enabling standardized comparisons across agent revisions and model versions. Since we introduced exemplar alignment and a full eval suite, one Roblox code intelligence agent’s PR suggestion acceptance rate improved from approximately 30% to over 60% on a 10,000-PR set, an early indication of trustworthy, domain-aligned performance. Through the same process, our feature flag cleanup agent increased its overall accuracy from 46% to over 90%.

The Road Ahead: Weaving Expert Judgment Into Every Tool 

We’re enhancing the utility of our established internal systems by building a layer of MCP (Model Context Protocol) and tool wrappers, and evolving the code intelligence platform from targeted tasks to a system that keeps the Roblox codebase healthy.

We envision an engineering future where historically hard-to-scale knowledge, like runtime context and expert judgment, is woven into every tool and workflow. When code intelligence, exemplar alignment, and observability come together, we unlock durable leverage: better quality, faster delivery, and a healthier, evolving codebase. The long-term goal is to give every engineer the power of institutional memory, every team the confidence to ship fast, and every engineer the freedom to focus on innovation, not maintenance. 

¹ Based on industry data sourced from Deloitte’s IT Spending Analysis 2024.