S2M2: Scalable Stereo Matching Model for Reliable Depth Estimation

Junhong Min¹, Youngpil Jeon¹, Jimin Kim¹, Minyong Choi¹
¹Samsung Electronics
International Conference on Computer Vision (ICCV) 2025

Figure 1: Qualitative comparison of 3D point clouds. Compared to SOTA models (Selective-IGEV, FoundationStereo), our model produces more reliable reconstructions of fine structures such as bicycle spokes.

Resources

Paper · Supplement · Poster · Code

Abstract

The pursuit of a generalizable stereo matching model, capable of performing well across varying resolutions and disparity ranges without dataset-specific fine-tuning, has revealed a fundamental trade-off. Iterative local search methods achieve high scores on constrained benchmarks, but their core mechanism inherently limits the global consistency required for true generalization. However, global matching architectures, while theoretically more robust, have historically been rendered infeasible by prohibitive computational and memory costs. We resolve this dilemma with S2M2: a global matching architecture that achieves state-of-the-art accuracy and high efficiency without relying on cost volume filtering or deep refinement stacks. Our design integrates a multi-resolution transformer for robust long-range correspondence, trained with a novel loss function that concentrates probability on feasible matches. This approach enables a more robust joint estimation of disparity, occlusion, and confidence. S2M2 establishes a new state of the art on Middlebury v3 and ETH3D benchmarks, significantly outperforming prior methods in most metrics while reconstructing high-quality details with competitive efficiency.


Motivation & Objective

Prior stereo matching models have struggled to generalize across diverse input conditions, and attempts to scale them have often come at prohibitive computational cost. We aim to develop a unified architecture that achieves high accuracy, high efficiency, and reliable generalization across varying resolutions and disparity ranges without dataset-specific fine-tuning.


Highlights

Our key contributions include:

- S2M2, a scalable global matching architecture that achieves state-of-the-art accuracy and high efficiency without cost volume filtering or deep refinement stacks.
- A multi-resolution transformer for robust long-range correspondence.
- A novel loss function, Probabilistic Mode Concentration (PMC), that concentrates probability on feasible matches and enables robust joint estimation of disparity, occlusion, and confidence.
- New state-of-the-art results on the Middlebury v3 and ETH3D benchmarks, with high-quality detail reconstruction at competitive efficiency.


Method

Our proposed model, S2M2, is designed to revitalize the global matching paradigm by addressing its long-standing scalability challenges. To achieve this, our architecture is composed of four main stages, as illustrated in the figure below: (1) Feature Extraction, (2) Global Matching, (3) Refinement, and (4) Upsampling.

Figure 2: Overview of the S2M2 architecture. It consists of a hierarchical feature extraction stage with a Multi-Resolution Transformer (MRT) and an Adaptive Gated Fusion Layer (AGFL), a global matching stage using Optimal Transport, and iterative refinement and upsampling stages.
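
To make the staged design concrete, below is a minimal PyTorch skeleton of the four-stage flow. It is a hedged sketch: every submodule is a simplified stand-in (the matching step here uses a plain softmax, whereas the actual model uses Optimal Transport, sketched further below), and all dimensions are illustrative assumptions rather than the published configuration.

```python
import torch
from torch import nn

class S2M2Skeleton(nn.Module):
    """Structural sketch of the four-stage pipeline; all submodules are
    simplified stand-ins, not the paper's MRT/AGFL/refinement blocks."""

    def __init__(self, feat_dim=64):
        super().__init__()
        # (1) Feature extraction: shared encoder for both views (stand-in).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # (3) Refinement: residual correction of the coarse disparity (stand-in).
        self.refine = nn.Conv2d(1, 1, 3, padding=1)

    def forward(self, left, right):
        fl, fr = self.encoder(left), self.encoder(right)       # stage (1)
        # (2) Global matching: per-scanline similarities over all columns;
        # disparity = left column minus expected matched right column.
        scores = torch.einsum("bchw,bchv->bhwv", fl, fr)
        probs = scores.softmax(dim=-1)
        cols = torch.arange(fl.shape[-1], device=left.device, dtype=fl.dtype)
        match_col = (probs * cols).sum(-1)                     # (B, H, W)
        disp = (cols[None, None, :] - match_col).clamp(min=0)[:, None]
        disp = disp + self.refine(disp)                        # stage (3)
        # (4) Upsampling back to the input resolution (bilinear stand-in);
        # disparities are rescaled to full-resolution pixel units.
        return nn.functional.interpolate(
            disp * 4, scale_factor=4, mode="bilinear", align_corners=False)
```
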
Key Components:

- Multi-Resolution Transformer (MRT): hierarchical feature extraction that preserves robust long-range correspondence across scales.
- Adaptive Gated Fusion Layer (AGFL): gated fusion of features across resolutions during extraction.
- Optimal Transport-based global matching: dense correspondence without cost volume filtering (see the sketch after this list).
- Probabilistic Mode Concentration (PMC) loss: concentrates probability on feasible matches for joint disparity, occlusion, and confidence estimation (illustrated below).
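
As a hedged illustration of the Optimal Transport-based matching named above, the sketch below applies Sinkhorn-style row/column normalization to per-scanline feature similarities. The temperature, iteration count, and tensor shapes are assumptions, not the paper's exact formulation.

```python
import torch

def sinkhorn_matching(feat_l, feat_r, n_iters=10, temperature=0.1):
    """Soft global matching along rectified scanlines via Sinkhorn-style
    normalization. feat_l, feat_r: (H, W, C) left/right feature maps.
    Returns an (H, W, W) tensor whose rows approximate doubly-stochastic
    matching distributions over right-image columns for each left pixel.
    """
    # Similarity between every left/right column pair on each scanline.
    scores = torch.einsum("hwc,hvc->hwv", feat_l, feat_r) / temperature
    log_p = scores.log_softmax(dim=-1)
    for _ in range(n_iters):
        # Alternate row and column normalization in log space.
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)
        log_p = log_p - torch.logsumexp(log_p, dim=-2, keepdim=True)
    # A full OT formulation would also add "dustbin" rows/columns so that
    # occluded pixels can remain unmatched.
    return log_p.exp()
```
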
Illustration of our Probabilistic Mode Concentration (PMC) loss.
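
The exact PMC formulation is given in the paper; the sketch below is one plausible reading of "concentrates probability on feasible matches": it penalizes probability mass falling outside a small window around the ground-truth disparity. The windowed-mass form and the window size are assumptions, not the published loss.

```python
import torch

def pmc_loss_sketch(log_probs, gt_disp, window=1.0, eps=1e-8):
    """Mode-concentration loss sketch: maximize the probability mass the
    model assigns within +/- `window` of the ground-truth disparity.

    log_probs: (N, D) log-probabilities over D disparity candidates.
    gt_disp:   (N,) ground-truth disparities in candidate-index units.
    """
    d = log_probs.shape[-1]
    candidates = torch.arange(d, device=log_probs.device, dtype=gt_disp.dtype)
    # Candidates lying within the feasible window around the ground truth.
    in_window = (candidates[None, :] - gt_disp[:, None]).abs() <= window
    mass = (log_probs.exp() * in_window).sum(dim=-1)
    # Driving -log(mass) to zero concentrates the distribution's mode
    # (and its immediate neighbors) on the feasible match.
    return -(mass + eps).log().mean()
```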

Results

3D Visualization

Middlebury benchmark comparison (FoundationStereo vs. S2M2 (Ours)): Bicycle and Staircase scenes.
Performance on transparent/reflective objects (Booster dataset): Barrel, Bottles, and Lid scenes.
Benchmark Performance

S2M2 establishes a new state of the art on diverse and challenging benchmarks.

As of July 2025, it ranks first on both the ETH3D and Middlebury v3 leaderboards.
In October 2025, it also achieved the top rank on the Booster benchmark, further demonstrating its robustness across real-world scenarios.

ETH3D low-res two-view benchmark leaderboard (July 2025).
Middlebury v3 benchmark leaderboard (July 2025).
Booster benchmark leaderboard (Oct 2025).
Comprehensive evaluation on ETH3D (Bad-0.5) and Middlebury v3 (Bad-2.0); lower is better. Circle size indicates model parameters.
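
For reference, the Bad-τ metrics quoted above measure the percentage of valid pixels whose absolute disparity error exceeds a threshold τ (0.5 px for ETH3D, 2.0 px for Middlebury v3). A minimal NumPy implementation; the NaN-marks-missing-ground-truth convention is an assumption:

```python
import numpy as np

def bad_tau(pred, gt, tau, valid=None):
    """Percentage of valid pixels whose absolute disparity error exceeds tau.
    This is the standard Bad-τ metric reported by ETH3D (Bad-0.5) and
    Middlebury v3 (Bad-2.0); lower is better."""
    err = np.abs(pred - gt)
    if valid is None:
        valid = np.isfinite(gt)  # assume NaN/inf marks missing ground truth
    return 100.0 * (err[valid] > tau).mean()
```
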
Scalability Analysis

Our S2M2 family forms a compelling Pareto front, offering significantly better performance at every level of computational budget and validating the scalability of our architecture.

Accuracy vs. efficiency on the synthetic benchmark. The S2M2 family (red) achieves higher or comparable accuracy with significantly less computation than larger models like FoundationStereo (cyan).
Our High-Resolution Synthetic Dataset

To rigorously test our model, we created a new high-resolution synthetic dataset using Blender. This dataset includes challenging scenarios like complex objects, reflective surfaces, and large disparity ranges, which are often not covered by existing benchmarks.

Overview of our high-resolution synthetic data generation pipeline in Blender.
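
When rendering rectified stereo pairs, ground-truth disparity follows directly from the rendered depth via d = f·B/Z, with focal length f in pixels and baseline B in meters. A minimal conversion sketch; variable names are illustrative, since the paper does not publish its generation code:

```python
import numpy as np

def depth_to_disparity(depth, focal_px, baseline_m):
    """Convert a rendered depth map (meters) to ground-truth disparity
    (pixels) for a rectified stereo pair: d = f_px * B / Z."""
    with np.errstate(divide="ignore"):
        disparity = focal_px * baseline_m / depth
    disparity[~np.isfinite(disparity)] = 0.0  # mark invalid/background pixels
    return disparity
```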

Critical Re-evaluation of the KITTI Benchmark

We argue that the KITTI benchmark’s leaderboard scores are an unreliable indicator of true generalization due to the inherent noise and systematic biases in its LiDAR-based ground truth.

Our analysis shows a contradiction: while fine-tuning on KITTI improves error metrics like EPE, it simultaneously degrades photometric consistency (measured by SSIM), suggesting overfitting to dataset artifacts. The visualizations below show how fine-tuned models produce distorted structures that align with noisy GT labels rather than the actual image content.
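
One way to run the photometric consistency check described above: warp the right image into the left view with the predicted disparity, then score the agreement with SSIM. The sketch below (nearest-pixel warping, scikit-image's structural_similarity) is a simplified stand-in for the paper's evaluation protocol, not its exact code.

```python
import numpy as np
from skimage.metrics import structural_similarity  # assumes scikit-image

def photometric_consistency(left, right, disp):
    """Warp the right image into the left view using the predicted disparity,
    then score photometric agreement with SSIM.

    left, right: (H, W) grayscale images in [0, 1]; disp: (H, W) disparities.
    """
    h, w = left.shape
    xs = np.arange(w)[None, :] - disp          # source columns in the right image
    xs = np.clip(np.round(xs).astype(int), 0, w - 1)
    warped = np.take_along_axis(right, xs, axis=1)
    # High SSIM => the predicted geometry explains the image pair well.
    return structural_similarity(left, warped, data_range=1.0)
```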

Figure 5: Negative effects of fine-tuning on KITTI. Zero-shot models (FoundationStereo, S2M2) reconstruct clean 3D structures, whereas fine-tuned models adapt to noise in the GT annotations, resulting in distorted geometry.