Deva Ramanan (CMU) | Angel X. Chang (SFU) | Carl Vondrick (Columbia) | Chenfanfu Jiang (UCLA)
Iro Armeni (Stanford) | Kiana Ehsani (Vercept) | Guanya Shi (CMU) |
TBD
Recent developments in AI have spurred calls for next-generation systems, such as Embodied AI and General AI, that can physically interact with their environments and carry out a broad range of tasks in a human-like manner. Towards this goal, researchers from diverse fields, e.g., computer vision, computer graphics, and robotics, have made separate efforts and achieved progress across various topics, including 3D representations (e.g., NeRF, Gaussian Splatting), foundation models (e.g., SAM(2), Stable (Video) Diffusion), datasets (e.g., Objaverse (XL), Open X-Embodiment), and end-to-end vision-language-action (VLA) models (e.g., RT-X).
However, new fundamental questions arise about how to attain a substantially more comprehensive understanding of the environment, unite these efforts, and facilitate the future development of General and Embodied AI. For example, what role do traditional scene parsing, detection, and localization play in today’s development? How can scene understanding techniques be leveraged to improve physical interaction? Can pure end-to-end models trained on ever-larger datasets suffice, or are intermediate representations, even symbolic ones, more suitable for certain tasks?
This year’s focus is on exploring the fundamental aspects of enhancing interaction between agents and 3D scenes in the new era of AI, and on promoting future directions and ideas expected to emerge within the next two to five years.
Yixin Chen (BIGAI) | Baoxiong Jia (BIGAI) | Yao Feng (Stanford) | Songyou Peng (DeepMind)
Chuhang Zou (Amazon) | Sai Kumar Dwivedi (MPI) | Yixin Zhu (PKU) | Siyuan Huang (BIGAI)
Marc Pollefeys (ETH Zurich) | Derek Hoiem (UIUC) | Song-Chun Zhu (BIGAI, PKU, THU)