CVPR 2026
Samsung Electronics
*Corresponding Author
Overview of DepthFocus. DepthFocus resolves depth ambiguities in see-through scenes by selectively estimating surfaces aligned with a user-intended focus distance. Unlike passive systems, our steerable architecture modulates internal features as an adaptive opacity filter to resolve layered occlusions. As the intended focus increases, the model actively "peels away" foreground geometry to reconstruct hidden layers, such as transparent partitions and backgrounds. Feature visualizations confirm that the network dynamically reconfigures its computation to perceive only the geometry relevant to the user's intent.
Depth in the real world is rarely singular. Transmissive materials create layered ambiguities that confound conventional perception systems. Existing models remain passive: they typically estimate static depth maps anchored to the nearest surface, and even recent multi-head extensions suffer from a representational bottleneck imposed by fixed feature representations. This stands in contrast to human vision, which actively shifts focus to perceive a desired depth. We introduce DepthFocus, a steerable Vision Transformer that redefines stereo depth estimation as condition-aware control. Instead of extracting fixed features, our model dynamically modulates its computation based on a physical reference depth, integrating dual conditional mechanisms to selectively perceive geometry aligned with the desired focus. Leveraging a newly curated large-scale synthetic dataset, DepthFocus achieves state-of-the-art results across all evaluated benchmarks, including both standard single-layer and complex multi-layered scenarios. While maintaining high precision in opaque regions, our approach effectively resolves depth ambiguities in transparent and reflective scenes by selectively reconstructing geometry at a target distance. This capability enables robust, intent-driven perception that significantly outperforms existing multi-layer methods, marking a substantial step toward active 3D perception.
We formulate a general framework for steerable depth estimation where a scalar conditioning variable c ∈ [0, 1] acts as a proxy for physical distance. The ideal conditional estimator is governed by three core properties:
(1) Invariance: in non-transmissive, opaque regions, the estimated depth remains strictly invariant to changes in the control variable.
(2) Monotonicity: in transmissive regions, the directional ordering of layers is preserved; an increase in focus never reverts the estimate to a shallower layer.
(3) Selection: the model discretely selects a valid surface layer that is optimally positioned relative to the physical reference depth plane.
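Given depth maps sampled at increasing control values and a mask of transmissive pixels, the first two properties can be phrased as a simple runtime check. This is an illustrative sketch with assumed names, not the paper's code; the third property additionally requires the scene's set of valid surfaces and is checked against ground truth instead:

```python
import numpy as np

def check_steerability(depths, transmissive, tol=1e-3):
    """Check invariance and monotonicity of a conditional estimator.

    depths:       list of HxW depth maps, ordered by increasing control c
    transmissive: HxW boolean mask of transmissive (see-through) pixels
    """
    opaque = ~transmissive
    for prev, curr in zip(depths, depths[1:]):
        # Property (1): opaque depth must not change with the control c.
        if np.abs(curr[opaque] - prev[opaque]).max() > tol:
            return False
        # Property (2): transmissive depth never moves back to a
        # shallower layer as the intended focus increases.
        if (curr[transmissive] < prev[transmissive] - tol).any():
            return False
    return True
```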
To ensure c acts as a direct proxy for distance, we first convert a physical reference depth Zref into disparity dref using the stereo camera calibration parameters (focal length f and baseline B):
dref = (f × B) / Zref
This reference disparity is then normalized by the maximum valid disparity dmax of the scene to obtain the control variable c ∈ [0, 1]:
c = 1 - dref / dmax, i.e., dref(c) = (1 - c) × dmax
By grounding the supervision in the disparity domain while maintaining a depth-based control intent, the network effectively scans through overlapping surfaces in a step-wise manner according to the user's focus.
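The two equations above compose into a single depth-to-control mapping. A minimal sketch, where the function name and the clamping to [0, 1] are our assumptions:

```python
def depth_to_control(z_ref, f, B, d_max):
    """Map a physical reference depth (meters) to the control variable c.

    Follows dref = (f * B) / Zref and dref(c) = (1 - c) * d_max,
    hence c = 1 - dref / d_max, clamped to [0, 1].
    """
    d_ref = (f * B) / z_ref      # reference disparity (pixels)
    c = 1.0 - d_ref / d_max      # normalized control variable
    return min(max(c, 0.0), 1.0)
```

With f = 1000 px, B = 0.1 m, and dmax = 200 px, a reference depth of 1 m gives dref = 100 px and c = 0.5; pushing the focus to 2 m gives c = 0.75, so larger c corresponds to farther intended focus.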
To realize the active "peeling" of foreground geometry, the model assigns the ground truth by selecting a valid surface under a directional proximity criterion. When multiple transmissive layers overlap, the network is trained to estimate the surface disparity d* that is closest to the reference disparity dref while lying at or behind the reference plane:
d* = max{ d ∈ S : d ≤ dref }
Here, S is the set of all valid surface disparities in the scene. Enforcing d ≤ dref (equivalently, physical depth Z ≥ Zref) guarantees monotonic depth transitions as the focus distance increases, without regressing to shallower layers.
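The directional proximity criterion reduces to a one-line selection rule. A sketch under assumed names (the real assignment operates per pixel over dense disparity maps, not a flat list):

```python
def select_surface(surfaces, d_ref):
    """Among valid surface disparities S, pick the one closest to d_ref
    subject to d <= d_ref, i.e., physically at or behind the reference
    plane. Returns None if every surface lies in front of it."""
    behind = [d for d in surfaces if d <= d_ref]
    if not behind:
        return None
    return max(behind)  # largest disparity not exceeding d_ref
```

As c increases, dref sweeps downward, so the selected target jumps from the nearest layer to progressively deeper ones: the step-wise "peeling" behavior described above.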
We realize the steerable framework through two complementary modulation modules integrated into our Vision Transformer. Instead of re-executing a heavy backbone, we append a conditional fusion stage where pre-computed multi-resolution features are progressively aggregated.
The Conditional Mixture-of-Experts (C-MoE) employs a router to produce continuous weights, enabling dynamic feature transformation paths based on the scalar control variable. Concurrently, Direct Condition Injection (DCI) provides explicit guidance via an attention-based mechanism. These components work in parallel to selectively extract features relevant only to the target depth layer.
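The continuous-routing idea behind C-MoE can be illustrated with a toy layer: a router maps the scalar control c to soft expert weights, and features pass through the weighted blend of expert transforms. This is purely a sketch of the mechanism; the shapes, names, and linear experts are our assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class ConditionalMoE:
    """Toy conditional mixture-of-experts: the scalar control c steers a
    softmax router over expert projections, so the feature transform
    changes continuously with the intended focus."""

    def __init__(self, dim, n_experts, seed=0):
        rng = np.random.default_rng(seed)
        # One linear projection per expert (placeholder for learned weights).
        self.experts = rng.standard_normal((n_experts, dim, dim)) / np.sqrt(dim)
        self.router_w = rng.standard_normal(n_experts)
        self.router_b = rng.standard_normal(n_experts)

    def __call__(self, feats, c):
        # Continuous routing conditioned on the control variable c.
        weights = softmax(self.router_w * c + self.router_b)
        # Blend the expert matrices, then transform the features.
        blended = np.tensordot(weights, self.experts, axes=1)
        return feats @ blended
```

Because the routing weights vary smoothly with c, the same input features are transformed differently at different focus settings, which is the behavior the feature visualizations in the overview depict.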
We curated a large-scale synthetic dataset specifically designed for condition-aware depth estimation in see-through scenes. It features diverse transmissive materials such as glass, plastic, and transmissive meshes. Furthermore, the dataset provides comprehensive ground truth including dense disparities for multiple surfaces and precise semantic segmentation labels to train the network's layer-selective capabilities.
To evaluate real-world generalization, we captured a high-quality real stereo dataset in a controlled laboratory environment. This benchmark features complex bi-layer configurations using acrylic plates with two distinct levels of transmissivity (60% and 80%). Crucially, it provides high-precision dense depth annotations for both layers, serving as a rigorous benchmark to assess the model's robustness and steerability against real physical noise and light interactions.
We evaluate DepthFocus using a standard single-layer benchmark protocol, measuring accuracy strictly on the first visible surface. While our model achieves leading performance on standard opaque datasets like Middlebury, its architectural advantages are most evident in see-through scenes.
Crucially, the performance margin over existing top-tier models becomes remarkably large in complex multi-layered environments. On datasets featuring severe depth ambiguities, namely Booster, LayeredFlow, and our custom dataset (Ours), DepthFocus demonstrates a substantial lead. This confirms the robustness of our condition-aware control in resolving transmissive layered geometry.
This section visualizes the continuous depth transitions and feature modulations generated by our steerable architecture in see-through scenes. The results demonstrate how the network actively resolves layered ambiguities according to the intended focus.
@inproceedings{min2026depthfocus,
  title={DepthFocus: Controllable Depth Estimation for See-Through Scenes},
  author={Min, Junhong and Kim, Jimin and Kim, Minwook and Min, Cheol-Hui and Jeon, Youngpil and Choi, Minyong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}