
Published in March 2026

by Xiuxi Pan, PhD

Corner-Case Detection in Industrial Inspection: A VLM-Centered Approach to Long-Tail Defects


A research note on corner-case detection in industrial inspection, drawn from a deployment with an infrastructure inspection team.

The hardest scenes in infrastructure inspection are not the common ones. They are the rare and the ambiguous: a woodpecker hole on a utility pole, a hairline crack on a wind turbine blade, a corrosion pattern on a bridge fitting that nobody has seen on this asset class before. These corner cases are too infrequent to populate a closed-class training set, and yet missing one of them is the failure mode that motivates an inspection program in the first place.

Closed-class detectors can flag uncertainty, but they cannot recognize "something here is wrong" when the something has never appeared in their training set. That recognition is open-world reasoning, and it requires a different shape of model on top of the perception pipeline.

This note works through one design for that layer: a VLM-based detector served through vLLM, adapted in three stages that map to deployment maturity, and embedded inside a hybrid pipeline where a fast first-pass detector filters obvious normal cases before the VLM runs. It also documents what is genuinely hard and where the open problems remain.


Why a closed-set detector cannot solve corner cases

The fundamental mismatch is between the closed-world assumption of supervised detectors and the open-world nature of real infrastructure failures.

Defect distributions are inherently long-tailed. A few common types (surface cracks, minor corrosion) dominate datasets, while critical rare anomalies occupy the tail with vanishingly few examples. Among inspected wind turbines, only ~8.5% of 35,000 units showed hairline cracks, the most critical yet hardest-to-detect defect type [1]. Anomaly rates in real-world mass inspection typically fall under 1% of all samples [2]. Human manual inspection error rates reach 10–20%, with overlook probabilities for specific defect types as high as 25% [3].

Each infrastructure domain presents its own corner cases. Power-line inspection must handle woodpecker holes (sub-centimeter), corona discharge (UV-only), and subtle conductor fraying on fittings that are extremely small relative to the overall structure [4]. Bridge inspection confronts internal micro-cracks invisible on the surface, multi-defect overlap, and defects occluded by dirt and vegetation. Wind turbine blade inspection faces hairline cracks barely visible to the human eye, lightning-induced erosion, and subsurface delamination, where a single blade replacement can cost over $300,000.

A YOLO-style detector trained on ten defect classes will never flag an eleventh, regardless of how much more data the known classes are given. The limit is structural to the closed-set paradigm rather than a question of dataset size. A detector whose strongest available output is "none of the classes I was trained on" when the right answer is "something here is wrong" is the exact failure mode an inspection program cannot tolerate.


The two approaches considered

Two families of methods are commonly proposed for open-ended detection under data scarcity. Both are worth walking through, because the choice between them is what most of the design discussion below rests on.

Open-set / one-shot detection

Models like T-Rex2 [5], Grounding DINO [6], and DINO-X [7] localize objects using text prompts, visual prompts (example bounding boxes or points), or both. T-Rex2 introduced a finding directly relevant to industrial settings: for common objects, text prompts outperform visual prompts; for rare objects (ranked 800–1,200 by frequency), visual prompts significantly outperform text [5]. Unusual defect patterns are difficult to describe verbally but easy to demonstrate with an example image.

Figure: Open-Set Detectors, Accuracy vs Speed. Open-set detection models sit on a clear trade-off between accuracy and speed. DINO-X reports the highest zero-shot AP on both COCO and LVIS rare classes among the models surveyed, while YOLO-World offers real-time performance suitable for first-pass screening. Bubble size is proportional to LVIS rare-class AP, the metric most relevant to corner-case detection.

DINO-X reports strong zero-shot results: 56.0 AP on COCO and 63.3 AP on LVIS rare classes, an improvement of 5.8 AP over the previous best cited in the same paper [7]. YOLO-World reports 35.4 AP at 52 FPS, orders of magnitude faster than the Grounding DINO family [8].

For corner-case work specifically, open-set detectors face a practical limit. In this deployment, the targets were often too general or ambiguous for prompt-grounded detection to handle reliably. Concepts like "anything structurally abnormal" or "unexpected deformation" do not map cleanly to text prompts, and visual prompts require reference boxes that may not generalize across environmental conditions.

VLM-based reasoning

Vision-language models such as Qwen2-VL [9], InternVL [10], and LLaVA-OneVision [11] take a fundamentally different approach. Rather than producing bounding boxes from prompts, they reason over images in natural language. A VLM can be asked "is there anything unusual about this utility pole?" and respond with a structured description of the anomaly, its likely cause, and its approximate location.

The evolution of VLMs for industrial anomaly detection has been rapid. AnomalyGPT (AAAI 2024) showed that with a single normal reference image, a VLM could reach 86.1% accuracy and 94.1% image-level AUC on MVTec-AD, the standard industrial anomaly benchmark, while supporting multi-turn diagnostic dialogues [12]. LogicAD (AAAI 2025) tackled logical anomalies (missing components, wrong arrangements), reporting 86.0% AUROC on MVTec LOCO AD, an 18.1% improvement over prior methods cited in the paper [13]. InfraGPT (2025) demonstrated an end-to-end VLM-based framework for urban infrastructure defect detection and management [14].

Figure: MMAD Industrial Anomaly Detection Benchmark. The MMAD benchmark (ICLR 2025), one of the more comprehensive VLM evaluations for industrial anomaly detection, shows that even leading frontier models have significant headroom. The gap between frontier API models and open-weight models motivates domain-specific adaptation. Data from Jiang et al. [15].

The MMAD benchmark (ICLR 2025), with 39,672 questions across 8,366 industrial images, reports even GPT-4o at only 74.9% average accuracy [15]. This is sobering and instructive: raw VLM capability is insufficient, and domain-specific adaptation is essential for production. The compensating property is that VLMs natively explain why something is anomalous, a capability closed-set detectors do not have at all, and that downstream review processes need in order to act on a flagged finding.


Adaptation stack

Based on this evaluation, we settled on a VLM-based detector with three adaptation modes. Each can run independently, and they are designed to compose (for example, a LoRA-fine-tuned model further augmented with RAG at inference time). The three modes map to different stages of deployment maturity and can be adopted progressively.

Figure: Three Adaptation Paths. The three adaptation modes map to different stages of deployment maturity. In-context learning enables day-one operation; LoRA fine-tuning provides production stability; Visual RAG adds domain knowledge without retraining.

In-context learning: immediate deployment

When a new corner case is identified, in-context learning lets the model incorporate it the same day. Reference images (normal and defective) are placed directly in the VLM prompt alongside instruction templates. Ueno et al. (2025) showed that fine-tuned ViP-LLaVA using single-shot ICL achieved MCC 0.804 and F1 0.950 on MVTec-AD, competitive with specialized models [16]. Their finding that Euclidean-distance example selection outperforms cosine-similarity RICES has practical implications for retrieval design.
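As a concrete sketch of what this looks like against a vLLM-served model with an OpenAI-compatible endpoint, the snippet below interleaves reference images with short text labels ahead of the query image. The endpoint URL, model name, and file paths are illustrative, not taken from the deployment.

```python
import base64
from openai import OpenAI  # client for vLLM's OpenAI-compatible endpoint

def image_part(path: str) -> dict:
    """Encode a local image as a base64 data URL for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Few-shot ICL: normal and defective references precede the query image.
messages = [
    {"role": "system", "content": "You are an infrastructure inspection assistant. "
     "Compare the query image against the reference images and report any anomaly."},
    {"role": "user", "content": [
        {"type": "text", "text": "Reference: normal utility pole."},
        image_part("refs/pole_normal.jpg"),
        {"type": "text", "text": "Reference: pole with a woodpecker hole."},
        image_part("refs/pole_woodpecker_hole.jpg"),
        {"type": "text", "text": "Query: is there anything unusual about this pole?"},
        image_part("query/pole_0421.jpg"),
    ]},
]

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct", messages=messages, max_tokens=200
)
print(resp.choices[0].message.content)
```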

The trade-off is mechanical: ICL requires zero training compute, but each high-resolution inspection image consumes 2,000–4,000 visual tokens, rapidly filling context windows. Performance plateaus around 4–8 reference images.

Few-shot LoRA fine-tuning: production stability

For recurring inspection operations that require stable, repeatable behavior, LoRA [17] introduces small decomposition matrices into transformer attention layers, training only 0.1–0.5% of total parameters while keeping base weights frozen. QLoRA further quantizes the base model to 4-bit NF4; Qwen2.5-VL-7B can be QLoRA-fine-tuned at rank 8 on a single GPU with ~16–24 GB VRAM.
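A minimal sketch of that QLoRA setup, assuming the Hugging Face transformers, peft, and bitsandbytes stack; the exact model class and attention-projection module names vary by model family and library version.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base model (QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Rank-8 adapters on the attention projections; base weights stay frozen.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```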

Data requirements are surprisingly modest. PLG-DINO (2025) showed that LoRA-fine-tuned Grounding DINO outperforms all YOLO variants in low-resource industrial defect scenarios [18], though that result is for an open-set detector, not a VLM. In our own VLM experiments, 500–2,000 labeled examples yielded significant improvement over zero-shot baselines, with diminishing returns beyond 5,000 examples. The resulting adapter weighs 200–400 MB versus 14+ GB for full model weights, making per-deployment version management and A/B testing straightforward, and giving us a small, auditable artifact for each engagement.

Visual RAG: grounding in domain knowledge

When the customer maintains internal knowledge (defect catalogs, engineering guidelines, prior similar cases), retrieval-augmented generation injects this context dynamically at inference time. Known defect images are indexed in a vector database using CLIP or DINOv2 embeddings; for each query, the top-k visually similar examples are retrieved and injected into the VLM prompt.
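A minimal indexing-and-retrieval sketch for this layer, assuming CLIP embeddings and a FAISS flat index; the catalog paths and query image are placeholders, and the retrieved entries (images plus their catalog notes) would be injected into the VLM prompt in the same way as the ICL references above.

```python
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths: list[str]) -> np.ndarray:
    """L2-normalized CLIP image embeddings, so inner product equals cosine similarity."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return feats.cpu().numpy().astype("float32")

# Index the known-defect catalog once (paths are illustrative).
catalog_paths = ["catalog/crack_001.jpg", "catalog/corrosion_007.jpg"]
index = faiss.IndexFlatIP(512)  # 512-dim projection for CLIP ViT-B/32
index.add(embed(catalog_paths))

# At inference time: retrieve the top-k visually similar catalog entries for the query.
scores, ids = index.search(embed(["query/bridge_fitting_0042.jpg"]), 3)
retrieved = [catalog_paths[i] for i in ids[0] if i >= 0]
```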

VisRAG demonstrated 20–40% end-to-end gains over text-based RAG by embedding documents as images directly [19]. Wallace et al.'s InspectVLM (2025) offers a cautionary counterpoint: unified VLM architectures degrade significantly across varying inspection domains without careful domain-specific adaptation [20].

RAG's distinguishing advantage at this layer is that every output can be linked to specific retrieved evidence, which is what makes the downstream review traceable. Traceability is not the same as strict auditability; the VLM's final output is not guaranteed to be faithful to the retrieved evidence. But traceability is the property a human reviewer needs to argue with the model's output, and it is the property closed-set classifiers cannot offer.


vLLM: the serving engine that makes this viable

A VLM-based detector is only viable if inference is fast enough and memory-efficient enough for production. vLLM, through PagedAttention and continuous batching, makes this possible [21][22].

PagedAttention eliminates the KV-cache bottleneck

During autoregressive generation, the model stores key and value matrices for all previous tokens (the KV cache). For a VLM processing high-resolution images, this is particularly demanding. For Qwen2-VL-7B (FP16 KV cache, 28 layers, GQA with 4 KV heads, 128-dim heads), each token holds roughly 0.055 MB of KV cache, meaning a single 1024×1024 image producing ~4,096 visual tokens can consume over 200 MB of KV cache alone.
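That per-token figure follows directly from the architecture parameters listed above rather than anything vLLM-specific; a back-of-the-envelope check:

```python
# KV cache per token = 2 tensors (K and V) x layers x KV heads x head dim x bytes (FP16).
layers, kv_heads, head_dim, fp16_bytes = 28, 4, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # bytes per token
per_image = per_token * 4096                                # ~4,096 visual tokens per image
print(f"{per_token / 2**20:.3f} MB/token, {per_image / 2**20:.0f} MB/image")
# -> 0.055 MB/token, 224 MB/image
```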

Traditional serving pre-allocates contiguous memory blocks per sequence, wasting 60–80% of KV-cache memory through fragmentation and over-reservation [21].

Figure: PagedAttention Memory Management. PagedAttention borrows virtual-memory concepts from operating systems. KV-cache blocks are stored in non-contiguous physical memory and mapped through block tables. Memory waste drops from 60–80% to under 4%, enabling 2–4× throughput improvement [21].

PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens) stored non-contiguously in GPU memory, with each sequence maintaining a block table, analogous to an OS page table. Physical blocks are allocated on demand with copy-on-write sharing for common prefixes. The result: under 4% memory waste and 2–4× throughput improvement [21].

Continuous batching maximizes GPU utilization

Static batching forces all requests in a batch to wait for the slowest sequence. vLLM's continuous batching operates at iteration-level granularity: at every decode step, the scheduler removes completed sequences and inserts waiting ones. Benchmarks show 14–24× higher throughput vs HuggingFace Transformers and 2.2–3.5× over Text Generation Inference [21].

VLM-specific optimizations in vLLM V1

vLLM V1 (2025) introduced critical multimodal capabilities [23]. An encoder cache stores computed vision embeddings on GPU, eliminating redundant re-execution of the vision encoder across similar prompts. Metadata-enhanced prefix caching uses image content hashes rather than just token IDs, preventing cache collisions between different images sharing the same <image> placeholder. The hybrid parallelism flag (--mm-encoder-tp-mode data) runs the vision encoder with data parallelism while the language model uses tensor parallelism, reducing all-reduce communication during vision encoding.

Red Hat's developer team reported ~40% throughput improvement over V0 on Molmo-72B across 4×H100 GPUs [23]. AMD's ROCm team independently confirmed significant speedups for image-heavy workloads from enabling data-parallel vision encoding [24].


Production deployment: hardware, tiling, and the hybrid pipeline

GPU memory and hardware selection

VLMs consume additional VRAM beyond text-only models due to vision encoder weights, visual token embeddings, and cross-modal attention. Concrete requirements (our estimates): Qwen2-VL-7B needs ~16–17 GB at FP16 (fits a single L40S with room for KV cache), dropping to 8–9 GB at INT8. Qwen2-VL-72B needs ~144 GB at FP16; at FP8, fits on 4×A100-80GB. Users report OOM on 24 GB GPUs when processing high-resolution images without constraining min_pixels/max_pixels parameters [9].

In this deployment, an NVIDIA L40S (48 GB GDDR6) offered a workable balance of memory, throughput, and acquisition cost, handling a 7B VLM at full precision with room for KV cache. For a workload of ~1,000 images per day, a single L40S sufficed.

Handling high-resolution inspection imagery

Industrial cameras capture at 4K+, but VLM input limits require intelligent tiling. Qwen2-VL's 675M-parameter ViT processes images at native resolution into variable token counts, controlled via min_pixels and max_pixels [9]. InternVL divides images into 448×448 tiles (1–40 tiles, supporting up to 4K), with pixel shuffle reducing each tile to 256 visual tokens plus a global thumbnail [10].

A practical recipe for 4K inspection images: pre-resize to a bounded resolution (longest edge 2048–4096 px), use sliding-window crops for defect localization, process the full image at low resolution for global context alongside high-resolution crops of regions of interest, and aggregate results across tiles with non-maximum suppression.
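A sketch of that recipe's resize-and-tile step, with illustrative tile size, overlap, and resolution bound; per-tile detections would then be merged with non-maximum suppression as described above.

```python
from PIL import Image

def tile_image(path: str, tile: int = 1024, overlap: int = 128, max_edge: int = 4096):
    """Bound the longest edge, then emit overlapping crops plus a low-res global view.

    Tile size, overlap, and the edge bound are illustrative starting points, not tuned values.
    """
    img = Image.open(path).convert("RGB")
    scale = min(1.0, max_edge / max(img.size))
    if scale < 1.0:
        img = img.resize((round(img.width * scale), round(img.height * scale)))

    # Low-resolution view of the whole scene for global context.
    global_view = img.resize((img.width // 4, img.height // 4))

    # Overlapping sliding-window crops for defect localization.
    step = tile - overlap
    crops = []
    for top in range(0, max(img.height - overlap, 1), step):
        for left in range(0, max(img.width - overlap, 1), step):
            box = (left, top, min(left + tile, img.width), min(top + tile, img.height))
            crops.append((box, img.crop(box)))
    return global_view, crops  # per-tile detections are later merged with NMS
```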

The hybrid pipeline

Figure: Hybrid Detection Pipeline. The production pipeline combines a fast first-pass detector with the VLM. In our deployment, roughly 85–95% of images were filtered by the first-pass detector, yielding a 7–20× reduction in VLM inference volume.

The hybrid architecture combines speed with depth. Image ingestion from camera or drone feeds flows into a preprocessing service (resize, normalize, tile). A lightweight first-pass detector (object detectors like YOLO, anomaly methods like PatchCore, or vision-language encoders like SigLIP) filters obvious normal cases. The non-obvious design choice at this stage is that the first-pass detector has to be tuned for high recall rather than high precision. Its job is to confidently exclude only clearly normal images, while routing anything uncertain or borderline to the VLM. A precision-tuned first stage would filter out exactly the corner cases the VLM is meant to catch, which defeats the purpose of having a VLM downstream at all.

In our deployment, the first-pass detector's operating point retained all images with anomaly scores above a deliberately low threshold, plus a configurable fraction of "uncertain" samples. The VLM, served via vLLM with an OpenAI-compatible API, processes only flagged images using structured system prompts and returns JSON with defect type, location coordinates, severity, and natural-language rationale. Post-processing aggregates multi-tile results, applies business rules, and cross-references the customer's defect catalog. Alerts integrate with existing MES/ERP systems.
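The snippet below sketches the VLM call for a single flagged image through vLLM's OpenAI-compatible API, with an illustrative JSON schema matching the fields above; the service URL, schema details, and the guided-decoding extension are assumptions about the setup rather than a record of the deployed configuration.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://vlm-service:8000/v1", api_key="EMPTY")

# Illustrative output schema; field names mirror the pipeline description above.
schema = {
    "type": "object",
    "properties": {
        "defect_type": {"type": "string"},
        "bbox": {"type": "array", "items": {"type": "number"}, "minItems": 4, "maxItems": 4},
        "severity": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "rationale": {"type": "string"},
    },
    "required": ["defect_type", "bbox", "severity", "rationale"],
}

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[
        {"role": "system", "content": "You inspect infrastructure images flagged by a "
         "first-pass detector. Report findings as JSON only."},
        {"role": "user", "content": [
            {"type": "text", "text": "Describe any structural anomaly in this image."},
            {"type": "image_url", "image_url": {"url": "https://example.org/flagged/0042.jpg"}},
        ]},
    ],
    max_tokens=200,
    # Guided JSON decoding is a vLLM server extension; availability depends on the version.
    extra_body={"guided_json": schema},
)

finding = json.loads(resp.choices[0].message.content)
```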

Key vLLM configuration: --gpu-memory-utilization 0.9 to maximize KV cache; prefix caching enabled for the repeated system prompt; --limit-mm-per-prompt "image=5" to bound memory per request; chunked prefill to prevent long image prompts from blocking decode.
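For offline batch runs, the same knobs can be expressed through vLLM's Python API; parameter names track the CLI flags but can shift across vLLM versions.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    gpu_memory_utilization=0.9,          # maximize room for KV cache on the L40S
    limit_mm_per_prompt={"image": 5},    # bound multimodal memory per request
    enable_prefix_caching=True,          # reuse the repeated system prompt
    enable_chunked_prefill=True,         # keep long image prefills from blocking decode
)
```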

Performance metrics, with operating conditions. Throughput and latency are different metrics with different operating points; we report both, with the conditions under which each was measured, to avoid the implication that any single figure characterizes the system.

  • VLM-stage throughput. On a single L40S with Qwen2.5-VL-7B at FP16, structured-output prompts with image inputs at 1024–2048 px longest edge and 60–200 output tokens, vLLM V1 with prefix caching enabled and batch concurrency 4–8, we measured 10–60 images/minute depending on the configuration point. This is the figure relevant to capacity planning for a day's worth of inspection imagery.
  • Single-flagged-image latency (P50). Same hardware and model. With prefix-cached system prompt, pre-warmed vision encoder, and no batched concurrency, a single flagged image returns a full structured response in roughly 1–4 seconds. This is the figure relevant when an operator is waiting on a specific finding.
  • Hybrid-pipeline operator-visible latency. In the hybrid setup, only the 5–15% of images flagged by the first-pass detector reach the VLM at all. For the 85–95% the first-pass excludes, operator-visible latency is governed by the first-pass frame-rate (tens of milliseconds). Sub-second response, end-to-end, is therefore the typical case for the routine majority, while flagged corner cases reflect the per-image latency above.
  • Latency-engineering target. The "end-to-end response under one second" target discussed in our latency-engineering note refers to a different operating regime entirely: smaller models, lighter-weight prompts, streaming output, and a UX that surfaces partial results before the full response completes. That regime is not the same as the structured-JSON behavior described here.

Quoting one of these numbers without the others is misleading, and we have tried not to.


What we learned and where the open work is

Across the work, three observations stand out.

The hybrid architecture turned out to be the strongest design for this engagement, and the pattern is likely applicable to similar image-heavy inspection workloads with low anomaly rates, though that remains a single-engagement claim. Fast first-pass detectors running at millisecond latency filtered the large majority of normal images (85–95% under the conditions we saw), and the VLM supplied the reasoning depth needed for the corner cases that closed-set detectors structurally cannot address.

The adaptation stack mattered more than the base model. A 7B VLM, fine-tuned with LoRA on roughly 1,000 domain-specific examples and augmented with visual RAG, substantially outperformed a raw frontier model on the target inspection tasks. This is a single-engagement observation, not a controlled benchmark. The staged path (zero-shot in days, LoRA in weeks, domain specialization in months) gave immediate value while building toward production accuracy.

For image-heavy workloads of this shape, the multimodal optimizations in vLLM V1 also moved the economics in a way that older serving setups did not. Encoder caching, hybrid parallelism, and metadata-enhanced prefix caching specifically target the memory and throughput bottlenecks of image-heavy inference.

The remaining gap is accuracy. Even the best VLMs reach only 74.9% on MMAD [15], and the MVTec AD 2 dataset, designed to expose current method limitations, reports leading methods below 60% average AU-PRO [25]. Closing this gap is an active research area: domain-specific fine-tuning, reinforcement learning from inspection feedback (as in EMIT [26]), and trainable tool-using vision, where the VLM's tool-selection policy is jointly optimized with the specialist models it calls.


References

  1. Shihavuddin et al. "Barely-Visible Surface Crack Detection for Wind Turbine Sustainability." arXiv:2407.07186, 2024.
  2. Baitieva et al. "Supervised Anomaly Detection for Complex Industrial Images." CVPR 2024.
  3. Li et al. "Surface Defect Detection Methods for Industrial Products with Imbalanced Samples: A Review of Progress in the 2020s." Knowledge-Based Systems, 2024.
  4. Zhang et al. "Deep Learning in Automated Power Line Inspection: A Review." arXiv:2502.07826, 2025.
  5. Jiang et al. "T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy." ECCV 2024. arXiv:2403.14610.
  6. Liu et al. "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection." ECCV 2024. arXiv:2303.05499.
  7. Ren et al. "DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding." arXiv:2411.14347, 2024.
  8. Cheng et al. "YOLO-World: Real-Time Open-Vocabulary Object Detection." CVPR 2024. arXiv:2401.17270.
  9. Wang et al. "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution." arXiv:2409.12191, 2024.
  10. Chen et al. "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." CVPR 2024.
  11. Li et al. "LLaVA-OneVision: Easy Visual Task Transfer." arXiv:2408.03326, 2024.
  12. Gu et al. "AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models." AAAI 2024 (Oral). arXiv:2308.15366.
  13. Kim et al. "LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction." AAAI 2025.
  14. Alani et al. "InfraGPT Smart Infrastructure: An End-to-End VLM-Based Framework for Detecting and Managing Urban Defects." arXiv:2510.16017, 2025.
  15. Jiang et al. "MMAD: A Comprehensive Benchmark for Multimodal Large Language Models in Industrial Anomaly Detection." ICLR 2025. arXiv:2410.09453.
  16. Ueno et al. "Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model." arXiv:2502.09057, 2025.
  17. Hu et al. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. arXiv:2106.09685.
  18. Chen et al. "PLG-DINO: Industrial Defect Detection via Prompt-Learning Grounding DINO." OpenReview, 2025.
  19. Yu et al. "VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents." arXiv:2410.10594, 2024.
  20. Wallace et al. "InspectVLM: Unified in Theory, Unreliable in Practice." ICCV 2025 Workshop.
  21. Kwon et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023. arXiv:2309.06180.
  22. vLLM Project. github.com/vllm-project/vllm
  23. Red Hat Developer. "vLLM V1: Accelerating Multimodal Inference for Large Language Models." 2025.
  24. AMD ROCm Blogs. "Accelerating Multimodal Inference in vLLM: The One-Line Optimization for Large Multimodal Models." 2025.
  25. Bergmann et al. "The MVTec AD 2 Dataset: Advanced Scenarios for Unsupervised Anomaly Detection." arXiv:2503.21622, 2025.
  26. Li et al. "EMIT: Enhancing MLLMs for Industrial Anomaly Detection via Difficulty-Aware GRPO." arXiv:2507.21619, 2025.