STPro: Spatial and Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Video Grounding


CVPR 2025

1 BITS Pilani
2 University of Central Florida

Tubelet Referral Grounding (TRG) Failures: Removing the last action from a multi-action caption causes the model’s temporal prediction to shift incorrectly. The model mispredicts 65% of the time when the first action is trimmed and 54% of the time when the last action is trimmed, indicating poor action understanding. As scenes grow more complex with multiple actors, the model also struggles to ground both the correct actor type (e.g., woman) and referral attributes (e.g., brown hair).

STPro Overview: STPro applies parallel spatial and temporal curricula to address TRG’s shortcomings. Sub-Action Temporal Curriculum Learning (SA-TCL) incrementally builds action and composition understanding by breaking down complex captions. Congestion-Guided Spatial Curriculum Learning (CG-SCL), comprising Soft-Label Filtering (SLF) and Congestion-Guided Sampling (CGS), helps the model progressively learn actor-type and referral-attribute grounding by transitioning from similar to more distinct backgrounds, enhancing discriminative feature learning.
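
The two curricula can be pictured as easy-to-hard schedules. The sketch below is only an illustration of that idea, not the authors' implementation: the `Sample` class, its `sub_actions` and `num_actors` fields, the use of actor count as a congestion score, and both staging functions are hypothetical names and assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    video_id: str
    sub_actions: List[str]  # caption already split into sub-actions (hypothetical field)
    num_actors: int         # stand-in congestion score (assumption, not the paper's measure)


def temporal_stages(sample: Sample) -> List[str]:
    """Captions of growing length: first sub-action, then the first two, and so on."""
    return [
        " then ".join(sample.sub_actions[:k])
        for k in range(1, len(sample.sub_actions) + 1)
    ]


def spatial_stages(samples: List[Sample], num_stages: int) -> List[List[Sample]]:
    """Stage s keeps the s/num_stages least congested share of the data."""
    ordered = sorted(samples, key=lambda x: x.num_actors)
    stages = []
    for s in range(1, num_stages + 1):
        cutoff = (len(ordered) * s + num_stages - 1) // num_stages  # ceiling division
        stages.append(ordered[:cutoff])
    return stages


samples = [
    Sample("v1", ["walks to the door", "opens it"], num_actors=5),
    Sample("v2", ["sits down"], num_actors=1),
    Sample("v3", ["stands up", "turns around", "leaves"], num_actors=3),
]

for caption in temporal_stages(samples[0]):
    print(caption)
for stage in spatial_stages(samples, num_stages=3):
    print([s.video_id for s in stage])
```

Here the temporal schedule grows a caption one sub-action at a time, and the spatial schedule admits more crowded videos at each stage. How STPro actually splits captions, scores congestion, and filters soft labels is not specified on this page.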

Abstract

In this work, we study Weakly Supervised Spatio-Temporal Video Grounding (WSTVG), a challenging task of localizing subjects spatio-temporally in videos using only textual queries and no bounding box supervision. Inspired by recent advances in vision-language foundation models, we investigate their utility for WSTVG, leveraging their zero-shot grounding capabilities. However, we find that a simple adaptation lacks essential spatio-temporal grounding abilities. To bridge this gap, we introduce Tubelet Referral Grounding (TRG), which connects textual queries to tubelets to enable spatio-temporal predictions. Despite its promise, TRG struggles with compositional action understanding and dense scene scenarios. To address these limitations, we propose STPro, a novel progressive learning framework with two key modules: (1) Sub-Action Temporal Curriculum Learning (SA-TCL), which incrementally builds compositional action understanding, and (2) Congestion-Guided Spatial Curriculum Learning (CG-SCL), which adapts the model to complex scenes by spatially increasing task difficulty. STPro achieves state-of-the-art results on three benchmark datasets, with improvements of 1.0% on VidSTG-Declarative and 3.0% on HCSTVG-v1.

Video (coming soon)

Results

Qualitative Analysis

Spatial Grounding Improvement: TRG misidentifies the actor, while TRG + STPro refines the localization, even though the two predict similar temporal boundaries. Red = ground truth, blue = TRG, violet = TRG + STPro.


Temporal Grounding Improvement: TRG mispredicts the temporal boundaries, causing spatial errors. TRG + STPro refines the temporal boundary, achieving near-perfect localization.
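
One common way to quantify how closely a predicted temporal segment matches the ground truth is temporal intersection-over-union (tIoU). This page does not say which metric underlies the comparison above, so the sketch below is illustrative only. The function name `temporal_iou` and the `(start, end)` tuple convention are assumptions made for this example.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds or frames; assumed convention


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Overlap between two segments divided by the length of their union."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


print(temporal_iou((0.0, 4.0), (0.0, 4.0)))  # identical segments give 1.0
print(temporal_iou((0.0, 2.0), (3.0, 5.0)))  # disjoint segments give 0.0
```

A prediction whose boundaries drift away from the ground truth scores lower under this measure, which is the kind of error the figure contrasts between TRG and TRG + STPro.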