cs.CV updates on arXiv.org

AI
Updated 2026-05-15 01:27, 100 items in total
  1. UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
  2. ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation
  3. Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation
  4. STORM: Segment, Track, and Object Re-Localization from a Single Image
  5. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
  6. LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters
  7. Compact 3D Gaussian Splatting For Dense Visual SLAM
  8. M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification
  9. VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority
  10. M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement
  11. Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising
  12. SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting
  13. What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
  14. CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
  15. GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents
  16. Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration
  17. Gradient-Free Noise Optimization for Reward Alignment in Generative Models
  18. DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction
  19. Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography
  20. 3D Primitives are a Spatial Language for VLMs
  21. Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning
  22. TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
  23. SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision
  24. A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline
  25. DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection
  26. MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation
  27. Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
  28. DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery
  29. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  30. CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
  31. Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
  32. No One Knows the State of the Art in Geospatial Foundation Models
  33. Data Agent: Learning to Select Data via End-to-End Dynamic Optimization
  34. Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
  35. Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models
  36. MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
  37. Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation
  38. Inline Critic Steers Image Editing
  39. Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs
  40. Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models
  41. LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
  42. Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs
  43. UNIV: Unified Foundation Model for Infrared and Visible Modalities
  44. WildPose: A Unified Framework for Robust Pose Estimation in the Wild
  45. When Diffusion Breaks Constraints: Sequential Autoregressive Generation with RL and MCTS
  46. FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection
  47. Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding
  48. AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects
  49. Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
  50. PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification
  51. NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results
  52. Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy
  53. COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
  54. Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification
  55. PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset
  56. GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting
  57. Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning
  58. Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis
  59. A Mimetic Detector for Adversarial Image Perturbations
  60. AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters
  61. VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
  62. CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation
  63. Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation
  64. DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport
  65. Test-Time Training with KV Binding Is Secretly Linear Attention
  66. Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation
  67. (Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models
  68. Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
  69. ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels
  70. AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
  71. On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods
  72. GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
  73. What Limits Vision-and-Language Navigation?
  74. Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering
  75. DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field Microscopy
  76. Asymmetric Flow Models
  77. Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein
  78. ImageAttributionBench: How Far Are We from Generalizable Attribution?
  79. History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
  80. Amortized Guidance for Image Inpainting with Pretrained Diffusion Models
  81. PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution
  82. OCH3R: Object-Centric Holistic 3D Reconstruction
  83. NFR: Neural Feature-Guided Non-Rigid Shape Registration
  84. PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution
  85. Scalable Object Detection in the Car Interior With Vision Foundation Models
  86. ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence
  87. Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing
  88. CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy
  89. The Joint Gromov Wasserstein Objective for Multiple Object Matching
  90. EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing
  91. Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters
  92. Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency
  93. Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification
  94. Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images
  95. Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
  96. BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability
  97. Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
  98. Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
  99. MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows
  100. HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization