Published in August 2025

by Xiuxi Pan, PhD

Cloud Sandboxes as the Execution Substrate for the Contextual Agentic Vision Platform

An infrastructure note from building the execution layer underneath the Contextual Agentic Vision Platform.

When the Platform moves from a demo to production, the part of the architecture that becomes load-bearing in unexpected ways is not the model. It is the execution substrate: the place where agent-emitted tool calls actually run, where the customer's own rule packs and post-processing scripts execute against scout outputs, and where the blast radius of any individual action is bounded.

We have ended up running most of that surface inside ephemeral cloud sandboxes, and the choice has shaped enough of the Platform's behavior that it is worth writing down why, and what we have learned in production.

Why the execution layer needs its own substrate

A naive deployment pattern looks like this: the agent emits a tool call, a long-lived backend process receives it, the process executes whatever code is required, and the result is returned. This works until any one of the following becomes true:

  • The tool call invokes customer-provided code or queries. A customer's integrity guidebook rules, defect-classification logic, or report templates need to run inside their logical boundary, not ours. In the Platform's terms, this is the customer's inner-context layer being exercised at decision time.
  • The tool call comes from an agent's reasoning step and is therefore semi-untrusted. Even with strong prompting, model-generated code or shell commands can have unintended effects when executed in a shared process.
  • Different tenants need different runtime environments. One customer pins to an older Python or a specific CUDA version; another requires strict egress controls.
  • Workloads need to be cancellable, time-boxed, and resource-isolated so that one slow tool call cannot starve the rest of the Platform.

All four cases point to the same conclusion: the process boundary of a long-lived backend is the wrong unit of isolation for the work an agent triggers. The right unit is something closer to a fresh, disposable machine.

What a sandbox is, in this context

By cloud sandbox we mean a programmatically provisioned, time-boxed runtime (typically a microVM or a hardened container) that an orchestrator can spin up, hand a task to, and tear down. The defining properties for our use:

  • Per-invocation provisioning. The sandbox is allocated when work arrives and released when the work completes. There is no persistent state by default.
  • Strict network egress controls. The sandbox starts with no outbound network access; the orchestrator opens specific egress lanes for specific tools when required.
  • Bounded blast radius. Anything that runs inside the sandbox cannot affect anything outside the sandbox. The orchestrator can terminate it at any time without coordination.
  • SDK-driven control. The orchestrator interacts with the sandbox the same way an agent interacts with a tool: through a typed interface, not through SSH or shell calls (a minimal sketch of this interface follows below).

The pattern is described in production form by E2B [1] and by the broader literature on lightweight VMs for short-lived workloads such as Firecracker [2]. Browser-isolation systems [3] developed for end-user protection use the same architectural idea applied to a different threat surface.
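To make the SDK-driven control property concrete, here is a minimal sketch of what the orchestrator-side interface might look like. The `SandboxSpec` and `Sandbox` names, their fields, and the provisioning behavior are assumptions for illustration; this is not the API of E2B or any particular provider, only the shape of provision, run, and tear down through typed calls.

```python
# Hypothetical sketch of SDK-driven sandbox control; not the API of any real provider.
from __future__ import annotations

import contextlib
import uuid
from dataclasses import dataclass, field


@dataclass
class SandboxSpec:
    """Parameters the orchestrator chooses per invocation."""
    image: str                                   # immutable image reference
    timeout_s: int = 60                          # hard wall-clock limit
    memory_mb: int = 512
    egress_allowlist: list[str] = field(default_factory=list)  # empty = no egress


class Sandbox:
    """Stand-in for a provider SDK client: provision, run, tear down."""

    def __init__(self, spec: SandboxSpec) -> None:
        self.spec = spec
        self.sandbox_id = f"sbx-{uuid.uuid4().hex[:8]}"

    def run(self, program: str, inputs: dict) -> dict:
        # A real client would ship `program` and `inputs` into the microVM and
        # stream back a result; this stub only models the call shape.
        return {"sandbox_id": self.sandbox_id, "status": "ok", "echo": inputs}

    def terminate(self) -> None:
        # Release the runtime; nothing inside it survives this call.
        pass


@contextlib.contextmanager
def ephemeral_sandbox(spec: SandboxSpec):
    """Per-invocation lifecycle: allocate on entry, always release on exit."""
    sbx = Sandbox(spec)
    try:
        yield sbx
    finally:
        sbx.terminate()


if __name__ == "__main__":
    spec = SandboxSpec(image="rulepack-base@sha256:...", timeout_s=30)
    with ephemeral_sandbox(spec) as sbx:
        print(sbx.run("print('hello')", {"depth_mm": 3.2}))
```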

What we run inside the sandbox

In our platform the sandbox is the home for three classes of work.

1. Customer-specific evaluation logic. A customer's domain rules, for instance the corrosion-grading schedule from a particular industry standard or the geometry tolerances from a specific drawing, are encoded as scripts or rule packs that we do not want running in shared platform code. Each customer's rule pack runs inside its own per-invocation sandbox, with the relevant slice of the context pack mounted read-only.

2. Agent-emitted tool calls that involve code execution. When the agent decides it needs to compute a derived measurement (e.g. "compare this depth value to the tolerance specified by the drawing"), the computation is rendered as a small program and executed inside a sandbox. The program has no network access; it has read-only access to the relevant scout outputs and context records, and it returns a typed result.

3. Workload-specific inference environments. A customer with a pinned model version, a non-standard runtime, or a regulated data-handling requirement gets a sandbox image tailored to those constraints, while still being scheduled by the same orchestrator that schedules everything else.

In all three cases the orchestrator never executes code itself. It composes a task, picks an image, provisions a sandbox, hands the task off, and waits for a typed result. This is the same shape as an agent invoking a remote tool, and that symmetry is the reason the pattern works at all.
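As a hedged sketch of that hand-off shape: the orchestrator composes a task, picks an image for the class of work, provisions a sandbox, and waits for a typed result. The names and the stub sandbox below are illustrative stand-ins, not the platform's actual code; the point is that the orchestrator never runs the task body itself.

```python
# Illustrative orchestrator hand-off; names and the stub sandbox are assumptions.
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    kind: str      # "rule_pack", "agent_code", or "inference"
    tenant: str
    program: str   # code or rule-pack entry point to execute
    inputs: dict   # read-only scout outputs and context records


@dataclass(frozen=True)
class TypedResult:
    task_kind: str
    value: dict
    image: str     # recorded later in the run trace


class StubSandbox:
    """Keeps the sketch self-contained; a real one would be a microVM client."""
    def __init__(self, image: str) -> None:
        self.image = image
    def run(self, program: str, inputs: dict) -> dict:
        return {"echo": inputs}
    def terminate(self) -> None:
        pass


IMAGE_FOR_KIND = {
    "rule_pack": "rulepack-base@sha256:...",
    "agent_code": "agent-exec@sha256:...",
    "inference": "inference-gpu@sha256:...",
}


def execute(task: Task, provision: Callable[[str], StubSandbox] = StubSandbox) -> TypedResult:
    """Compose, pick an image, provision, hand off, and await a typed result."""
    image = IMAGE_FOR_KIND[task.kind]
    sandbox = provision(image)                        # fresh per-invocation sandbox
    try:
        raw = sandbox.run(task.program, task.inputs)  # executes *inside* the sandbox
        return TypedResult(task_kind=task.kind, value=raw, image=image)
    finally:
        sandbox.terminate()                           # nothing persists past this point


if __name__ == "__main__":
    task = Task(kind="agent_code", tenant="customer-a",
                program="result = depth_mm <= tolerance_mm",
                inputs={"depth_mm": 3.2, "tolerance_mm": 4.0})
    print(execute(task))
```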

Ephemeral by default, persistent only on purpose

Sandbox platforms typically support both ephemeral and persistent modes. We default to ephemeral-per-invocation for three reasons.

  • Cross-session leakage is the dominant risk. A persistent sandbox accumulates state that the next invocation might inadvertently inherit: temp files, cached credentials, partially written logs. With ephemeral provisioning the default is empty, and any state that does carry over must be explicitly written to durable storage.
  • Reproducibility comes for free. A sandbox started from a known image is, by construction, reproducible. We rely on this to make the platform's outputs auditable: the runtime that produced a given decision can be reconstructed from the image hash and the input record, as sketched below.
  • Cost amortization works in our favor for typical workload shapes. Provisioning latency is meaningful (tens to hundreds of milliseconds), but it is paid once per invocation, not once per tool call within the invocation. The agent's reasoning loop runs in a long-lived process; only the discrete units of code execution drop into a sandbox.

Persistent sandboxes have their place (long-running data preparation jobs, interactive notebooks for the platform's own engineering use), but we treat them as the exception rather than the default.
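One way to make that reconstruction concrete, as a small illustrative sketch: key every run by the image digest plus a hash of the canonicalized input record. The helper names are assumptions, not the platform's actual code.

```python
# Illustrative sketch: derive a reproducibility key from image digest + input record.
import hashlib
import json


def input_digest(record: dict) -> str:
    """Hash a canonical JSON form of the input record (key order fixed)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_key(image_digest: str, record: dict) -> str:
    """Image digest + input digest identify the runtime and data behind a decision."""
    return f"{image_digest}:{input_digest(record)}"


if __name__ == "__main__":
    record = {"scout_output": {"depth_mm": 3.2}, "context_ref": "drawing-rev-c"}
    print(run_key("sha256:abc123", record))
```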

The orchestration layer that makes this work

The sandbox is not useful in isolation. Three pieces around it carry most of the operational weight.

A task router decides which sandbox image runs a given task, applies tenancy rules, and enforces time, memory, and egress limits. This is the place where customer-specific policies (network allowlists, data residency constraints, GPU quotas) are realized as concrete sandbox parameters.
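A hedged sketch of how tenancy rules might be realized as concrete sandbox parameters; the policy fields, defaults, and image map are assumptions chosen for illustration rather than the platform's actual schema.

```python
# Hypothetical router sketch: tenant policy -> concrete sandbox parameters.
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    egress_allowlist: tuple[str, ...] = ()   # empty means no outbound network
    region: str = "eu-west"                  # data residency constraint
    gpu_quota: int = 0
    max_seconds: int = 120
    max_memory_mb: int = 1024


@dataclass(frozen=True)
class SandboxParams:
    image: str
    region: str
    timeout_s: int
    memory_mb: int
    gpus: int
    egress_allowlist: tuple[str, ...]


def route(task_kind: str, tenant: str, policies: dict[str, TenantPolicy],
          images: dict[str, str]) -> SandboxParams:
    """Apply the tenant's policy as hard limits on the sandbox the task will run in."""
    policy = policies[tenant]
    return SandboxParams(
        image=images[task_kind],
        region=policy.region,                     # residency enforced at placement
        timeout_s=policy.max_seconds,             # time-boxed by construction
        memory_mb=policy.max_memory_mb,
        gpus=policy.gpu_quota,
        egress_allowlist=policy.egress_allowlist, # closed by default
    )


if __name__ == "__main__":
    policies = {"customer-a": TenantPolicy(gpu_quota=1, region="eu-central")}
    images = {"rule_pack": "rulepack-base@sha256:..."}
    print(route("rule_pack", "customer-a", policies, images))
```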

A typed-IO contract sits between the orchestrator and the sandboxed task. Inputs are serialized as typed records; outputs are validated against a schema before the result is allowed back into the rest of the platform. A sandbox cannot return free-form text into the trust boundary of the orchestrator.
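What that boundary check might look like, as a sketch using plain dataclass validation rather than any particular schema library; the result fields are assumptions. The point is that a sandbox result is rejected unless it parses into the expected typed record.

```python
# Illustrative typed-IO boundary: validate sandbox output before it re-enters the platform.
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementResult:
    """The only shape a measurement task is allowed to return."""
    value_mm: float
    within_tolerance: bool
    rule_id: str


class ContractViolation(Exception):
    pass


def parse_result(raw: str) -> MeasurementResult:
    """Reject anything that is not exactly the expected record; free-form text never passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation("sandbox output is not valid JSON") from exc

    expected = {"value_mm": float, "within_tolerance": bool, "rule_id": str}
    if set(data) != set(expected):
        raise ContractViolation(f"unexpected fields: {sorted(set(data) ^ set(expected))}")
    for key, typ in expected.items():
        if not isinstance(data[key], typ):
            raise ContractViolation(f"field {key!r} is not {typ.__name__}")
    return MeasurementResult(**data)


if __name__ == "__main__":
    print(parse_result('{"value_mm": 3.2, "within_tolerance": true, "rule_id": "corr-7"}'))
```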

A run trace store records what ran, in which image, against which inputs, with what outputs. This is what turns ephemeral execution into auditable execution: nothing is retained inside the sandbox, but a complete record of each run is retained outside it.
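A sketch of the kind of record such a store might keep per run. The fields are assumptions, chosen so that a decision can be reconstructed from outside the sandbox that produced it, which no longer exists by the time anyone asks.

```python
# Illustrative run-trace record: the fields are assumptions, not the platform's schema.
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunTrace:
    run_id: str
    tenant: str
    task_kind: str       # "rule_pack", "agent_code", or "inference"
    image_digest: str    # exact runtime image the task executed in
    input_digest: str    # hash of the serialized inputs (computed at hand-off)
    output_digest: str   # hash of the validated typed output
    started_at: str      # ISO-8601 timestamps
    finished_at: str
    exit_status: str     # "ok", "timeout", "contract_violation", ...


def record_run(store: list, trace: RunTrace) -> None:
    """Append-only: the trace outlives the sandbox that produced it."""
    store.append(trace)


if __name__ == "__main__":
    store = []
    record_run(store, RunTrace(
        run_id="run-0001", tenant="customer-a", task_kind="agent_code",
        image_digest="sha256:abc123", input_digest="sha256:def456",
        output_digest="sha256:789abc",
        started_at="2025-08-01T09:00:00Z", finished_at="2025-08-01T09:00:02Z",
        exit_status="ok",
    ))
    print(asdict(store[0]))
```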

The combination (sandbox + router + typed IO + trace) is what we mean when we say "execution substrate."

What this enables at the platform level

Treating the sandbox as a first-class component shifts what is feasible at the platform layer.

  • Customers can ship logic into the platform without us shipping a release. A new domain rule pack is a new sandbox image plus a config entry, not a code change in the core orchestrator; a sketch of such an entry follows this list.
  • The agent can write code. Because anything an agent emits runs in a sandbox with no network and no access to the platform's internals, the failure mode of "the agent generated a destructive command" is bounded to the sandbox's lifetime and resources.
  • Auditability is structural. Every decision the platform emits is associated with a chain of sandbox runs whose inputs, outputs, and image hashes are retained. The audit story is not "we logged what the model said" but "here is the exact code that produced the measurement, and here is the image it ran in."
  • Tenancy is honest. A customer's data does not enter another customer's sandbox by construction, not by policy.
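To make the image-plus-config-entry claim concrete, here is a hedged sketch of what such an entry might contain. Every field name, the registry path, and the digest placeholder are illustrative assumptions, not the platform's actual config schema.

```python
# Illustrative config entry for onboarding a new customer rule pack; names are assumptions.
NEW_RULE_PACK = {
    "tenant": "customer-a",
    "task_kind": "rule_pack",
    "image": "registry.example.com/customer-a/corrosion-grading@sha256:...",
    "context_pack_slice": "corrosion/grading-schedule",   # mounted read-only
    "limits": {"timeout_s": 60, "memory_mb": 512, "egress_allowlist": []},
    "output_schema": "MeasurementResult",                 # enforced at the typed-IO boundary
}
```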

Open problems

The substrate is not solved.

  • Cold-start latency still dominates for sub-second interactive workloads. We work around this with image pre-warming and pooled sandboxes (a minimal pooling sketch follows this list), but neither is free.
  • GPU sandboxing is meaningfully harder than CPU sandboxing. The maturity of GPU partitioning and isolation is improving, but the operational story is still rougher than the CPU equivalent.
  • Observability inside the sandbox has to be designed in. By default the sandbox is opaque; making its internals legible to platform operators without breaking the isolation properties requires care.
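A minimal sketch of the pooled-sandbox workaround mentioned above: keep a small number of pre-provisioned sandboxes ready, hand one out when work arrives, and replenish in the background. The pool size and replenish policy here are illustrative assumptions.

```python
# Illustrative warm-pool sketch for hiding cold-start latency; sizes are assumptions.
import queue
import threading


class SandboxPool:
    """Keep a few pre-provisioned sandboxes ready; replenish after each acquire."""

    def __init__(self, provision, size: int = 4) -> None:
        self._provision = provision             # callable that cold-starts a sandbox
        self._ready: queue.Queue = queue.Queue()
        for _ in range(size):
            self._ready.put(provision())        # pay the cold starts up front

    def acquire(self):
        try:
            sandbox = self._ready.get_nowait()  # warm path: no provisioning latency
        except queue.Empty:
            sandbox = self._provision()         # pool exhausted: fall back to cold start
        # Refill asynchronously so the caller never waits on provisioning.
        threading.Thread(target=lambda: self._ready.put(self._provision()),
                         daemon=True).start()
        return sandbox


if __name__ == "__main__":
    counter = iter(range(1000))
    pool = SandboxPool(provision=lambda: f"sbx-{next(counter)}", size=2)
    print(pool.acquire(), pool.acquire(), pool.acquire())
```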

These are problems we have partial answers to; they are not problems we consider closed.

References

  1. E2B. Open-source infrastructure for AI-generated code in secure isolated cloud sandboxes. e2b.dev
  2. Agache, A. et al., Firecracker: Lightweight Virtualization for Serverless Applications. NSDI 2020. usenix.org/conference/nsdi20/presentation/agache
  3. Cloudflare. What is browser isolation? cloudflare.com