The rapid evolution toward next-generation embodied AI has driven remarkable convergence across computer vision, graphics, and robotics, with growing emphasis on robust 3D understanding for in-the-wild, real-world applications. Recent breakthroughs have introduced powerful 3D foundation models that demonstrate unprecedented zero-shot generalization across diverse 3D reconstruction tasks. Meanwhile, advances in dynamic reconstruction representations and vision-language-action (VLA) models have enabled practical real-to-sim-to-real transfer, language-grounded robotic manipulation, and navigation in real-world settings.
This workshop will explore how 3D foundation models enable robust, generalizable scene understanding for embodied systems, examining the interplay between geometric reconstruction, semantic grounding, and physical interaction to advance the next generation of vision, graphics, and robotics research.
- Andreas Geiger (University of Tübingen)
- Derek Hoiem (UIUC)
- Song-Chun Zhu (BIGAI, PKU, THU)