cs.CV updates on arXiv.org Hot List - Hot点·热榜

1 UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation ↗

2 ReLIC-SGG: Relation Lattice Completion for Open-Vocabulary Scene Graph Generation ↗

3 Distill, Diffuse, and Semanticize (DDS): Annotation-Free 3D Scene Understanding Based on Multi-Granularity Distillation and Graph-Diffusion-Based Segmentation ↗

4 STORM: Segment, Track, and Object Re-Localization from a Single Image ↗

5 GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking ↗

6 LoREnc: Low-Rank Encryption for Securing Foundation Models and LoRA Adapters ↗

7 Compact 3D Gaussian Splatting For Dense Visual SLAM ↗

8 M3Net: A Macro-to-Meso-to-Micro Clinical-inspired Hierarchical 3D Network for Pulmonary Nodule Classification ↗

9 VideoSEAL: Mitigating Evidence Misalignment in Agentic Long Video Understanding by Decoupling Answer Authority ↗

10 M2Retinexformer: Multi-Modal Retinexformer for Low-Light Image Enhancement ↗

11 Pyramid Self-contrastive Learning Framework for Test-time Ultrasound Image Denoising ↗

12 SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting ↗

13 What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs ↗

14 CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference ↗

15 GUIGuard-Bench: Toward a General Evaluation for Privacy-Preserving GUI Agents ↗

16 Improving Diffusion Posterior Samplers with Lagged Temporal Corrections for Image Restoration ↗

17 Gradient-Free Noise Optimization for Reward Alignment in Generative Models ↗

18 DistractMIA: Black-Box Membership Inference on Vision-Language Models via Semantic Distraction ↗

19 Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography ↗

20 3D Primitives are a Spatial Language for VLMs ↗

21 Reasoning to Edit: Hypothetical Instruction-Based Image Editing with Visual Reasoning ↗

22 TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking ↗

23 SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision ↗

24 A Data Efficiency Study of Synthetic Fog for Object Detection Using the Clear2Fog Pipeline ↗

25 DeCo-DETR: Decoupled Cognition DETR for efficient Open-Vocabulary Object Detection ↗

26 MambaPanoptic: A Vision Mamba-based Structured State Space Framework for Panoptic Segmentation ↗

27 Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting ↗

28 DIVER: Diving Deeper into Distilled Data via Expressive Semantic Recovery ↗

29 CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving ↗

30 CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis ↗

31 Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents ↗

32 No One Knows the State of the Art in Geospatial Foundation Models ↗

33 Data Agent: Learning to Select Data via End-to-End Dynamic Optimization ↗

34 Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty? ↗

35 Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models ↗

36 MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence ↗

37 Taming the Long Tail: Rebalancing Adversarial Training via Adaptive Perturbation ↗

38 Inline Critic Steers Image Editing ↗

39 Realtime-VLA FLASH: Speculative Inference Framework for Diffusion-based VLAs ↗

40 Is Video Anomaly Detection Misframed? Evidence from LLM-Based and Multi-Scene Models ↗

41 LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models ↗

42 Just Ask for a Table: A Thirty-Token User Prompt Defeats Sponsored Recommendations in Twelve LLMs ↗

43 UNIV: Unified Foundation Model for Infrared and Visible Modalities ↗

44 WildPose: A Unified Framework for Robust Pose Estimation in the Wild ↗

45 When Diffusion Breaks Constraints: Sequential Autoregressive Generation with RL and MCTS ↗

46 FRAME: Forensic Routing and Adaptive Multi-path Evidence Fusion for Image Manipulation Detection ↗

47 Aligning Forest and Trees in Images & Long Captions for Visually Grounded Understanding ↗

48 AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects ↗

49 Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance ↗

50 PRISM: Perinuclear Ring-based Image Segmentation Method for Acute Lymphoblastic Leukemia Classification ↗

51 NTIRE 2026 The Second Challenge on Day and Night Raindrop Removal for Dual-Focused Images: Methods and Results ↗

52 Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy ↗

53 COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts ↗

54 Adaptive Conformal Prediction for Reliable and Explainable Medical Image Classification ↗

55 PicoEyes: Unified Gaze Estimation Framework for Mixed Reality with a Large-Scale Multi-View Dataset ↗

56 GuardMarkGS: Unified Ownership Tracing and Edit Deterrence for 3D Gaussian Splatting ↗

57 Evidence-based Decision Modeling for Synthetic Face Detection with Uncertainty-driven Active Learning ↗

58 Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis ↗

59 A Mimetic Detector for Adversarial Image Perturbations ↗

60 AuraMask: An Extensible Pipeline for Developing Aesthetic Anti-Facial Recognition Image Filters ↗

61 VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference ↗

62 CRePE: Curved Ray Expectation Positional Encoding for Unified-Camera-Controlled Video Generation ↗

63 Energy Scaling Laws for Diffusion Models: Quantifying Compute in Image Generation ↗

64 DirectTryOn: One-Step Virtual Try-On via Straightened Conditional Transport ↗

65 Test-Time Training with KV Binding Is Secretly Linear Attention ↗

66 Debunking Grad-ECLIP: A Comprehensive Study on Its Incorrectness and Fundamental Principles for Model Interpretation ↗

67 (Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models ↗

68 Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation ↗

69 ThermalTap: Passive Application Fingerprinting in VR Headsets via Thermal Side Channels ↗

70 AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding ↗

71 On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods ↗

72 GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion ↗

73 What Limits Vision-and-Language Navigation? ↗

74 Reducing Bias and Variance: Generative Semantic Guidance and Bi-Layer Ensemble for Image Clustering ↗

75 DeepFilters: Scattering-Aware Pupil Engineering with Learned Digital Filter Reconstruction for Extended Depth of Field Microscopy ↗

76 Asymmetric Flow Models ↗

77 Min Generalized Sliced Gromov Wasserstein: A Scalable Path to Gromov Wasserstein ↗

78 ImageAttributionBench: How Far Are We from Generalizable Attribution? ↗

79 History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions ↗

80 Amortized Guidance for Image Inpainting with Pretrained Diffusion Models ↗

81 PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution ↗

82 OCH3R: Object-Centric Holistic 3D Reconstruction ↗

83 NFR: Neural Feature-Guided Non-Rigid Shape Registration ↗

84 PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution ↗

85 Scalable Object Detection in the Car Interior With Vision Foundation Models ↗

86 ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence ↗

87 Unmasking Puppeteers: Leveraging Biometric Leakage to Expose Impersonation in AI-Based Videoconferencing ↗

88 CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy ↗

89 The Joint Gromov Wasserstein Objective for Multiple Object Matching ↗

90 EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing ↗

91 Make-It-Poseable: Feed-forward Latent Posing Model for 3D Characters ↗

92 Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency ↗

93 Inference-Time Dynamic Modality Selection for Incomplete Multimodal Classification ↗

94 Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images ↗

95 Perception with Guarantees: Certified Pose Estimation via Reachability Analysis ↗

96 BrainAnytime: Anatomy-Aware Cross-Modal Pretraining for Brain Image Analysis with Arbitrary Modality Availability ↗

97 Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge ↗

98 Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling ↗

99 MedOpenClaw and MedFlowBench: Auditing Medical Agents in Full-Study Workflows ↗

100 HarmoGS: Robust 3D Gaussian Splatting in the Wild via Conflict-Aware Gradient Harmonization ↗