Max Agency Podcast

How Benchling builds agents when the smartest AI isn't smart enough

James Donner

June 11, 2026

Nicholas Larus-Stone is the Head of AI at Benchling, the R&D data platform that life science companies use to store and manage their experiments, samples, instruments, and analysis. Benchling has been around since 2012. In October 2025, it launched Benchling AI, an intelligence layer with a chat interface, backed by an agent, that helps scientists find data, design experiments, and write reports. Nick came to Benchling through its acquisition of Sphinx Bio, the analysis startup he founded.

In this conversation with LangChain Co-Founder & CEO Harrison Chase, Nick walks through what it takes to build agents for scientific work, including where the playbook from coding agents holds up and where it breaks down.

🎧 Watch the full conversation on YouTube, or listen & subscribe on Apple Podcasts or Spotify.

What we learned

Why Benchling runs multiple models on the same task

Instead of running the same model multiple times, Benchling runs the same task across models from different providers. Different model families make different mistakes, so agreement across them is a stronger quality signal than repeated runs of a single model. When multiple models agree, that indicates good data quality. When they disagree, there is usually an error.

"Each of them will make slightly different errors... being able to ask different model providers, we found gives us much better performance."
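
The agreement check described above can be sketched in a few lines of Python. This is a minimal illustration, not Benchling's implementation: the function name `cross_provider_consensus`, the provider labels, the `min_agreement` threshold, and the exact-match comparison after normalization are all assumptions made for the example. A real system would call each provider's API and would likely need a looser notion of "agreement" than string equality.

```python
from collections import Counter
from typing import Dict, NamedTuple


class ConsensusResult(NamedTuple):
    answer: str          # most common normalized answer across providers
    agreement: float     # fraction of providers that gave that answer
    needs_review: bool   # True when agreement is below the threshold


def normalize(text: str) -> str:
    """Collapse case and whitespace so trivially different answers compare equal."""
    return " ".join(text.lower().split())


def cross_provider_consensus(
    answers: Dict[str, str], min_agreement: float = 1.0
) -> ConsensusResult:
    """Compare one answer per provider and flag disagreement for human review.

    `answers` maps a (hypothetical) provider label to that provider's output
    for the same task. Disagreement is treated as a sign of a likely error.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers.values())
    top_answer, top_count = counts.most_common(1)[0]
    agreement = top_count / len(answers)
    return ConsensusResult(top_answer, agreement, agreement < min_agreement)
```

Calling it with three answers where one provider dissents returns an agreement of two thirds and `needs_review=True`, which is the signal to send the task to a person or a follow-up check. With the default threshold of 1.0, only unanimous answers pass without review.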

How Benchling approaches trace review

In scientific research, evals can only get you so far, so Benchling leans on a structured approach to reviewing production traces. Each week, a rotating "fire chief" flags issues, which are then addressed in the weekly tech operations meeting. For external signals, the team looks at thumbs-up and thumbs-down user feedback.

"People who are working on specific features are gonna go look at the traces — our product managers, our engineers who are building something will actually go and see how people are using that feature after releasing it."

Agents are having a big impact in scientific work

Nick points out that agents are compressing workflows and reducing the number of experiments needed to get an answer. Because agents cut the dead time between steps, a day saved can often become a week saved. Agents are also helping scientists design experiments more rigorously upfront, which reduces the number of runs needed to reach a conclusion.

Timestamps

00:00 Intro
01:22 What Benchling AI is, and the 14-year data platform underneath it
04:36 Why a decade of structured data is a core advantage
05:57 The architecture under the hood
08:28 Similarities and differences compared to a coding harness
11:14 Benchling’s multi-agent architectures
14:36 Dealing with verifiable vs non-verifiable tasks
16:19 Doing evals when clean benchmarks aren’t possible
18:13 Context engineering: SQL vs. file-based harnesses
22:11 Memory: agents that create and update their own skills
25:30 What user education for scientists looks like
30:33 Why understanding LLMs is closer to biology than software
33:28 When will agents discover a novel cure for disease?
44:58 The future of harnesses in science
48:13 Why fine-tuning on biology hasn't beaten frontier models

Get More Max Agency

Hosted by Harrison Chase, CEO of LangChain, each episode goes deep with the builders designing, deploying, and learning from real agent systems in the wild. From architecture decisions to evals, tooling, and failure modes, Max Agency is for people who want to understand what it really takes to build useful agents.
