
Snorkel AI: How Judge0 Transformed Our Code Preference Analysis Pipeline

Nithin Krishnamurthi
Staff Software Engineer @ Snorkel AI


When we first began exploring the domain of code preference, the vision was straightforward: present developers with two approaches to solving the same problem and determine which implementation was superior. The goal was to evaluate code quality, efficiency, and best practices across various programming solutions, then leverage that data downstream to fine-tune models and improve their ability to reason about code.

But "which code is better?" can be a loaded question. Our expert contributors could look at a problem and, within moments, analyze and compare the overarching code patterns. This high-level qualitative analysis was a key part of sourcing this data and a key reason we needed humans in the loop in the first place. What ended up being trickier was the "dumb" question: did the code even run?

The Manual Review Nightmare

In our initial approach, we delegated both the review and execution processes to expert contributors (ECs). Each EC would receive code snippets, run them in their local environment, and provide feedback on which solution worked better. This decentralized system seemed reasonable at first, but it created a cascade of problems that threatened the entire project.

The first major issue was proof of work. How could we verify that reviewers had actually executed the code rather than making assumptions based on visual inspection alone? Without a standardized execution trail, we had no way to confirm that the analysis was thorough or that runtime behavior had been properly evaluated. This lack of accountability meant results were inconsistent and unreliable.

Then came the classic "works on my machine" problem. Without a centralized execution environment, code that ran perfectly for one reviewer would mysteriously fail for another. Different Python versions, missing dependencies, conflicting libraries, varied operating systems—the variables were endless. What should have been an objective comparison became muddied by environmental inconsistencies that had nothing to do with the code quality itself.

But the most concerning issue was security. When you're asking people to run arbitrary code on their personal machines, you're essentially inviting potential disaster. Malicious code could delete files, exfiltrate data, or compromise entire systems. Even well-intentioned code could have unintended consequences in different environments. We couldn't ask our team to take those risks, and we certainly couldn't scale a review process that put people's machines in jeopardy.

Enter Judge0

We found Judge0 after investigating event-driven code executors, and it fundamentally transformed how we approached validation. Judge0 provided exactly what we needed: a safe, isolated, and consistent environment for running code across multiple programming languages.

The platform solved our proof of work problem immediately. Every code execution through Judge0 generates clear output, error messages, and execution metadata. We could now verify that code had actually been run and see exactly what happened during execution. The guesswork disappeared.
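To make this concrete, here is a minimal sketch of what a submission and its audit record can look like against a Judge0 CE deployment. The URL, helper names, and logging format are illustrative assumptions, not Snorkel's actual pipeline code; the request body and response fields match Judge0's public `POST /submissions` API.

```python
"""Hedged sketch of submitting code to a Judge0 CE instance.

JUDGE0_URL and the helper names are illustrative assumptions;
the payload and response fields follow Judge0's public API.
"""

JUDGE0_URL = "http://localhost:2358"  # hypothetical local Judge0 CE instance
PYTHON3_LANGUAGE_ID = 71  # Python (3.8.1) in the default Judge0 CE config


def build_submission(source_code: str, stdin: str = "") -> dict:
    """Construct the JSON body Judge0's POST /submissions endpoint expects."""
    return {
        "source_code": source_code,
        "language_id": PYTHON3_LANGUAGE_ID,
        "stdin": stdin,
    }


def summarize_result(result: dict) -> str:
    """Reduce a Judge0 result to a one-line audit record.

    Judge0 returns stdout, stderr, a status object, and execution
    metadata (time in seconds, memory in KB) for every run, which
    is what makes proof of work possible.
    """
    status = result.get("status", {}).get("description", "Unknown")
    return (
        f"status={status} "
        f"time={result.get('time')}s "
        f"memory={result.get('memory')}KB"
    )


# Sending the submission requires the `requests` package and a
# reachable Judge0 instance, so it is shown here but not executed:
#
#   import requests
#   resp = requests.post(
#       f"{JUDGE0_URL}/submissions?base64_encoded=false&wait=true",
#       json=build_submission("print(2 + 2)"),
#   )
#   print(summarize_result(resp.json()))
```

Because every run produces a record like `status=Accepted time=0.01s memory=3456KB`, there is no longer any question of whether the code was actually executed.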

The centralized execution environment eliminated compatibility issues entirely. Whether our reviewers were on Windows, Mac, or Linux, whether they had Python 3.8 or 3.11 installed, it didn't matter. Judge0 provided a standardized container where code executed the same way every single time. If code worked in Judge0, it just worked.

Most importantly, Judge0's sandboxed execution environment meant we could safely run any code without putting our team at risk. Malicious scripts were contained, filesystem access was restricted, and network operations were controlled. We could finally evaluate code objectively without worrying about security implications.

What This Enabled

With Judge0 integrated into our workflow, code preference analysis became not just possible, but scalable. We could process hundreds of code comparisons, run them through identical execution environments, and generate reliable insights about which approaches truly performed better.

The system allowed us to move beyond subjective code reviews and into objective, data-driven analysis. We could ensure correctness, catch edge cases that only appeared during execution, and provide developers with concrete evidence they could use to make judgments about code quality.
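The comparison step described above can be sketched as a small scoring function. The result dicts mirror Judge0's response shape (a status object plus stdout); `prefer`, `passed`, and the tie-breaking rule are illustrative assumptions, not Snorkel's actual preference logic.

```python
"""Hedged sketch: picking a preferred solution from execution results.

The result dicts mirror Judge0's response shape; the scoring and
tie-breaking rules here are illustrative, not Snorkel's pipeline.
"""

ACCEPTED = "Accepted"  # Judge0's status description for a successful run


def passed(result: dict, expected_stdout: str) -> bool:
    """A run passes if it was accepted and printed the expected output."""
    ok = result.get("status", {}).get("description") == ACCEPTED
    return ok and (result.get("stdout") or "").strip() == expected_stdout.strip()


def prefer(results_a: list, results_b: list, expected: list) -> str:
    """Compare two solutions run over the same test cases.

    Both solutions were executed in identical Judge0 environments, so
    a difference in pass rate reflects the code, not the machine.
    """
    score_a = sum(passed(r, e) for r, e in zip(results_a, expected))
    score_b = sum(passed(r, e) for r, e in zip(results_b, expected))
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"  # identical pass rates: escalate to expert review
```

When execution alone cannot separate the two candidates, the comparison falls back to the human qualitative review that the expert contributors were best at all along.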

Why This Matters

Building tools for code analysis in an AI-driven world requires solving fundamental infrastructure challenges first. You can't evaluate code quality if you can't run code safely and consistently. Judge0 removed those barriers and let us focus on what actually mattered: helping developers write better code through meaningful comparisons and actionable feedback.

For anyone building systems that need to execute untrusted code, whether for educational platforms, coding assessments, or automated review tools, Judge0 provides the foundation that makes these projects viable. It's the difference between a fragile, manual process and a robust, scalable system that actually works.