Aaryan Garg

I'm a fourth-year undergraduate student at BITS Pilani studying Computer Science. I'm interested in deep learning as a whole, but have recently spent most of my time playing with vision and multimodal models.

I've been a research intern at CRCV at UCF since November 2023, working with Dr. Yogesh S Rawat. At CRCV, my research has spanned adaptive classification models, efficient transformer-based segmentation, and most recently spatio-temporal video grounding.

I'm always looking for new things to work on, so if you have an interesting project or idea, please reach out!

Email  /  Scholar  /  Github  /  CV  /  Blog


Updates

  Mar'25: STPro was accepted to the PixFoundation & MAR workshops at CVPR
  Feb'25: STPro was accepted at CVPR 2025 💥

Talks

  April'25: Building effective tools for LLM-based agents (DevRev Community Event) - Slides | Recording

Research

Some representative papers are highlighted.

STPro: Spatial & Temporal Progressive Learning for Weakly Supervised Spatio-Temporal Video Grounding
Computer Vision & Pattern Recognition (CVPR) - 2025 (Poster)
🎨 PixFoundation Workshop
🧠 Multimodal Algorithmic Reasoning Workshop
🏛️ Multimodal Foundation Models Workshop
👁️ Emergent Visual Abilities & Limits of Foundation Models Workshop
🌐 Domain Generalization Workshop
Aaryan Garg, Akash Kumar, Yogesh S Rawat

🛠️ Project page / arXiv / 🤗 HuggingFace

Improved VLMs' grounding capabilities via action composition and complex spatio-temporal scene understanding.

Other