
Published July 2025

by Xiuxi Pan, PhD

Training Vision Agents Behind the Firewall: PEFT, Local Serving, and Trainable Tool-Use


A training and serving note, drawn from work on the on-prem deployment path of our Contextual Agentic Vision Platform.

In the industries we work with (heavy industry, infrastructure, semiconductor manufacturing, energy and offshore assets, defense-adjacent inspection programs), a sentence we hear early in every conversation is some variant of "the data does not leave the building." Drawings, SOPs, inspection histories, internal regulations, personnel imagery captured in scene: all of it is bound by either explicit compliance constraints or an implicit corporate posture against sending proprietary content to third-party AI APIs.

This note is about what that constraint actually requires once you decide to take it seriously: how we train and serve the agentic vision stack inside customer infrastructure, and the open research problem that sits underneath the deployment work.

Why on-prem is not just "host the model locally"

The naive read of an on-prem deployment is: take an open-weights model, run it behind a firewall, done. In practice an agentic vision platform has three distinct things that need to live on-prem, each with its own constraints.

1. The specialist CV models. Object detection, segmentation, OCR, depth, tracking. These are typically the easiest to host. They are well-bounded, GPU-friendly, and have mature serving stories. They are also the part of the stack most likely to be fine-tuned per customer, since defect taxonomies and acceptable-tolerance bands vary by industry and site.

2. The VLM orchestrator. Larger, harder to serve, and the part of the stack that benefits most from continued improvements at the frontier. The realistic choice is an open-weights foundation model (e.g. Qwen-VL or a comparable family [1]) hosted on customer GPUs, with parameter-efficient adaptation applied on top.

3. The retrieval and context layer. The customer's drawings, SOPs, and inspection history live in their environment. The retrieval index has to live there too, and the VLM has to call it without anything traversing the customer boundary.

Each layer has its own training and serving story. Treating them as one monolithic model is the mistake.

Parameter-efficient adaptation, not full retraining

For both the VLM orchestrator and per-customer specialist tuning we rely on parameter-efficient fine-tuning (PEFT), primarily LoRA and its variants [2].

The choice is not just about compute cost (although that is meaningful, since full-parameter fine-tuning of a frontier-scale VLM on customer GPUs is rarely an option). The more important reasons:

Preserving general capability. A VLM that has been fine-tuned on a narrow industrial corpus often degrades on the long tail of general visual understanding that makes it useful as an orchestrator in the first place. LoRA's low-rank update preserves the base model's behavior on out-of-domain inputs more reliably than full retraining at typical fine-tuning budgets.

Stackable adaptations per customer or task. A LoRA adapter is small and swappable. A single hosted VLM can serve multiple customers by loading the relevant adapter at request time, without needing a full model replica per customer. This is the difference between an on-prem deployment that scales to many internal teams and one that does not.

Auditability. A LoRA adapter is a small artifact whose training corpus we can document and version. A full-retrained checkpoint is opaque by comparison.
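For concreteness, the adapter setup is roughly the following sketch, using the Hugging Face peft library; the model identifier and hyperparameters are illustrative, not the values from any specific deployment.

```python
# Minimal LoRA sketch; model id and hyperparameters are illustrative.
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of the base weights
```

The artifact that falls out of training this is the small, swappable adapter described above: it can be saved, versioned, and loaded per customer without touching the frozen base weights.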

For the specialist CV models, the equivalent move is freezing the backbone and adapting only the task heads, plus careful data curation to avoid the failure mode where a model that was excellent on the supplier's test set degrades on the customer's site distribution.
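The same pattern in code, sketched with a stock torchvision detector; the class count, learning rate, and the model itself are placeholders rather than anything from a customer deployment.

```python
# Head-only adaptation sketch for a specialist detector (illustrative model and sizes).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box predictor with a head sized for the site-specific defect taxonomy.
num_classes = 7  # illustrative: defect classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Freeze everything except the new head.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("roi_heads.box_predictor")

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```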

Data sanitization is part of the model training pipeline

Before any adapter training occurs, customer training data passes through a sanitization pipeline that combines named-entity recognition with rule-based redaction. The pipeline strips PII, client identities, and sensitive transaction or location fields before they enter the training corpus, reducing the risk that the adapter memorizes those fields and surfaces them in later generations.

The same principle applies broadly to industrial inspection data: serial numbers, GPS coordinates on infrastructure imagery, identifiable personnel in scene captures, all of these are stripped before adapter training, and the redaction itself is logged so that auditors can verify what the training set did and did not contain.
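A stripped-down view of the rule-based half of that pass, with illustrative patterns; the NER component and the real customer-specific field inventory are omitted.

```python
# Rule-based redaction sketch; patterns and placeholder labels are illustrative.
import re

RULES = {
    "SERIAL": re.compile(r"\bSN[-_ ]?\d{6,}\b"),
    "GPS":    re.compile(r"\b-?\d{1,3}\.\d{4,},\s*-?\d{1,3}\.\d{4,}\b"),
    "EMAIL":  re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> tuple[str, list[dict]]:
    """Replace matches with typed placeholders and return a redaction log."""
    log = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            log.append({"label": label, "span": match.span()})
        text = pattern.sub(f"[{label}]", text)
    return text, log
```

Rules like these run alongside the NER pass, and the returned log is what the audit trail is built from.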

Serving: the choices that matter

The serving stack is where the difference between "we ran a model locally" and "we ran it well enough that analysts use it" is decided. Three components carry the weight.

vLLM as the inference engine. PagedAttention [3] turns KV-cache memory from a hard cap on batch size into something the scheduler can manage; on the workloads we run (long-context VLM calls with mixed image and text inputs), the practical throughput improvement over a naive HF transformers serving loop is large enough that it is the difference between feasible and infeasible on a typical on-prem GPU budget.
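For orientation, a minimal local vLLM call looks roughly like the sketch below; the model name, prompt, and sampling settings are illustrative, and the real deployment sits behind Triton rather than being called inline like this.

```python
# Illustrative local vLLM call; model id, prompt, and settings are placeholders.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(
    model="Qwen/Qwen2-VL-7B-Instruct",
    max_model_len=8192,
    gpu_memory_utilization=0.90,  # PagedAttention manages the KV cache within this budget
)
params = SamplingParams(temperature=0.2, max_tokens=512)

image = Image.open("frame_0001.png")
prompt = "Describe any visible corrosion on the flange."  # real prompts use the model's image placeholder tokens

out = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    sampling_params=params,
)
print(out[0].outputs[0].text)
```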

Triton as the front door. Triton Inference Server with a vLLM backend [4] gives us a uniform serving surface across the VLM and the specialist models, with batching, queuing, and multi-model hosting handled outside the model code. The point is not Triton specifically; the point is that the serving layer has to be a real piece of infrastructure, not a Python script behind a Flask app.
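As a sketch of what a uniform serving surface means in practice, a client call against a Triton-hosted specialist model looks roughly like this; the model name and tensor names are hypothetical and are defined by each model's config.pbtxt in the local model repository.

```python
# Illustrative Triton client call; model and tensor names are hypothetical.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

image = np.random.rand(1, 3, 640, 640).astype(np.float32)  # placeholder input batch

inp = httpclient.InferInput("image", image.shape, "FP32")
inp.set_data_from_numpy(image)

result = client.infer(model_name="defect_detector", inputs=[inp])
detections = result.as_numpy("detections")
```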

Local retrieval-augmented generation. The VLM does not answer from its weights alone. Every call retrieves from a local vector index over the customer's drawings, SOPs, and inspection history, and the prompt requires citation of the retrieved chunks. This is the same RAG pattern that has become standard for text LLMs [5], lifted into the agentic vision setting: the VLM's analyst-layer prose has to point to retrievable evidence, which is the property the judge layer downstream relies on.
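The retrieval step itself is ordinary; a toy version with an in-memory index is sketched below, where the embedding model, chunks, and prompt wording are all illustrative.

```python
# Toy local retrieval sketch; embedding model, chunks, and prompt are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "SOP-114 rev C: flange bolts torqued to 210 Nm, re-check after 24 h.",
    "Inspection 2024-03-12: pitting corrosion noted on weld seam W-7.",
]
index = embedder.encode(chunks, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[tuple[int, str]]:
    """Return the top-k chunks with their ids so the VLM can cite them."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return [(int(i), chunks[i]) for i in top]

evidence = retrieve("torque spec for flange bolts")
prompt = "Answer using only the evidence below and cite chunk ids.\n" + \
         "\n".join(f"[chunk {i}] {c}" for i, c in evidence)
```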

The three together (efficient inference, real serving infrastructure, mandatory retrieval) are what makes an on-prem deployment usable rather than just compliant.

The harder problem: jointly training the orchestrator and the specialists

Everything above is deployment engineering. The research problem underneath is harder, and it is the part of our work that we believe has the longest half-life.

The architecture of our platform is a VLM orchestrating a set of specialist CV models. The natural training question is: how do you train the orchestrator's tool-use policy so that the right specialists are called, in the right order, with the right queries, and the final decision comes out correct? Correct in outcome, not merely plausible at the level of any individual tool call.

This is hard for a specific reason. The workflow (which scout to call, in what sequence, with what query) is discrete. The model parameters (the VLM's policy head, the specialists' weights) are continuous. Gradient-based optimization assumes a differentiable loss; discrete tool-selection decisions do not provide one out of the box. The two regimes do not line up, and naive joint training reduces to either ignoring the workflow structure (and over-fitting individual model calls) or freezing the models and only learning the workflow (and leaving most of the available signal on the table).

The line of work we are pursuing here is counterfactual credit assignment for trainable tool-using vision: assigning credit to individual tool calls based on counterfactual rollouts of what the outcome would have been had a different call been made, and using that signal to update both the orchestrator's tool-selection policy and the specialists' parameters. The mathematical inspiration comes from the broader literature on credit assignment in reinforcement learning [6] and from recent work on training tool-using language models with outcome-level rewards [7]. The vision-specific instantiation, where tool outputs are typed records with calibrated uncertainty rather than free-form text, has its own structure that we have not seen addressed in the existing literature, and that is the part we are actively writing up.
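To make the shape of that signal concrete without overstating where the research stands, here is a schematic of counterfactual credit for a single tool-call decision; reward, rollout, and the policy interface are stand-ins, and this is not the training procedure we run in production.

```python
# Schematic counterfactual credit for one tool-call decision; all interfaces are stand-ins.
import torch

def counterfactual_advantage(state, tools, policy, rollout, reward):
    """Score the chosen tool call against rollouts that swap in each alternative."""
    logits = policy(state)                        # scores over available tool calls
    probs = torch.softmax(logits, dim=-1)
    chosen = int(torch.argmax(probs))

    # Outcome-level reward for each counterfactual choice of tool call.
    outcomes = [reward(rollout(state, t)) for t in range(len(tools))]
    baseline = sum(p * r for p, r in zip(probs.tolist(), outcomes))

    advantage = outcomes[chosen] - baseline       # credit relative to the counterfactuals
    loss = -advantage * torch.log(probs[chosen])  # REINFORCE-style policy update term
    return loss
```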

For this note: the deployment story above (LoRA, vLLM, Triton, RAG) is the substrate that makes the research tractable, because each layer's input and output is structured well enough that a reward model can score it. The on-prem story and the research story are not separate threads. They are the same architecture viewed from two angles.

What we are not claiming

A few honest caveats.

  • PEFT is not magic. There are tasks where full-parameter fine-tuning is the right answer, and we use it where the budget and data justify it. The default is PEFT; the choice is per-customer and per-task.
  • On-prem is not always faster. Cloud APIs run on hardware most customers do not own. The on-prem story competes on control, auditability, and data residency, not raw throughput. We have a serving stack that closes most of the gap, but we do not claim parity in all settings.
  • Trainable tool-use is research. The architecture is in production; the training procedure is still being validated. We will write up the latter separately when the results are publishable.

References

  1. Bai, J. et al., Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. Alibaba. arxiv.org/abs/2308.12966
  2. Hu, E. et al., LoRA: Low-Rank Adaptation of Large Language Models. Microsoft Research. arxiv.org/abs/2106.09685
  3. Kwon, W. et al., Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023. arxiv.org/abs/2309.06180
  4. NVIDIA Triton Inference Server, vLLM backend. docs.nvidia.com/deeplearning/triton-inference-server
  5. Lewis, P. et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. arxiv.org/abs/2005.11401
  6. Sutton, R. and Barto, A., Reinforcement Learning: An Introduction. MIT Press, 2nd edition. incompleteideas.net/book
  7. Schick, T. et al., Toolformer: Language Models Can Teach Themselves to Use Tools. Meta AI. arxiv.org/abs/2302.04761