Chen Change (Cavan) Loy is a President's Chair Professor at the College of Computing and Data Science, Nanyang Technological University (NTU), Singapore. He is the Director of MMLab@NTU and Co-Associate Director of S-Lab. He received his Ph.D. in Computer Science from Queen Mary University of London in 2010. Prior to joining NTU, he served as a Research Assistant Professor at the Multimedia Laboratory of The Chinese University of Hong Kong.
His research focuses on large multimodal models, generative AI, spatial intelligence, and representation learning. Prof. Loy serves, or has served, as an Associate Editor of leading journals, including the International Journal of Computer Vision (IJCV), IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), and Computer Vision and Image Understanding (CVIU). He has also served as an Area Chair or Senior Area Chair for major conferences such as CVPR, ICCV, ECCV, ICLR, and NeurIPS. He serves as Program Co-Chair of CVPR 2026 and General Co-Chair of ACCV 2028.
MMLab@NTU
MMLab@NTU was formed on 1 August 2018, with a research focus on computer vision and deep learning. Its sister lab is MMLab@CUHK. The group now comprises four faculty members and more than 40 members in total, including research fellows, research assistants, and PhD students. Members of MMLab@NTU conduct research primarily in large multimodal models, generative AI, and embodied AI.
Visit the MMLab@NTU Homepage
Recent Papers
Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
K. Liao, S. Wu, Z. Wu, L. Jin, C. Wang, Y. Wang, F. Wang, W. Li, C. C. Loy
International Conference on Learning Representations, 2026 (ICLR)
[PDF]
[arXiv]
[Project Page]
[Demo]
Puffin is a camera-centric multimodal model that unifies camera understanding and controllable image generation. By treating camera parameters as language tokens, it aligns geometric reasoning with vision–language models, enabling spatially consistent cross-view generation, camera reasoning, and scene exploration. The model is trained on Puffin-4M, a large dataset of vision–language–camera triplets.
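As a rough illustration of the camera-as-tokens idea, the sketch below discretises a few camera parameters into special vocabulary tokens that can be appended to a text prompt; the bin count, parameter ranges, and token names are our assumptions for illustration, not details from the paper.

# Minimal sketch of treating camera parameters as language tokens.
# Bin count, parameter ranges, and token names are illustrative assumptions.

def camera_to_tokens(roll_deg, pitch_deg, fov_deg, num_bins=256):
    """Discretise continuous camera parameters into special vocabulary tokens."""
    def quantise(value, lo, hi):
        value = max(lo, min(hi, value))
        return int(round((value - lo) / (hi - lo) * (num_bins - 1)))

    return [
        f"<roll_{quantise(roll_deg, -90.0, 90.0)}>",
        f"<pitch_{quantise(pitch_deg, -90.0, 90.0)}>",
        f"<fov_{quantise(fov_deg, 20.0, 120.0)}>",
    ]

# The tokens can be appended to a prompt so that a vision-language model can
# condition on, or predict, the camera in the same way as ordinary words.
prompt = "A photo of a cathedral interior " + " ".join(camera_to_tokens(5.0, -12.0, 75.0))
print(prompt)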
Next Visual Granularity Generation
Y. Wang, Z. Wang, Z. Wu, Q. Tao, K. Liao, C. C. Loy
International Conference on Learning Representations, 2026 (ICLR)
[arXiv]
[Project Page]
NVG (Next Visual Granularity Generation) proposes a structured approach to image generation by representing images as sequences of visual granularities. Starting from an empty canvas, the model progressively refines images from coarse layouts to fine details through hierarchical token representations, enabling controllable, structured generation and strong performance on large-scale image generation benchmarks.
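A minimal sketch of this coarse-to-fine loop is given below; the granularity schedule and the stand-in predict_tokens function are assumptions made for illustration, not the actual NVG tokenizer or model.

# Minimal sketch of coarse-to-fine generation over visual granularities.
# The granularity schedule and the stand-in predictor are illustrative only.
import numpy as np

GRANULARITIES = [(2, 2), (4, 4), (8, 8), (16, 16)]  # assumed token-grid sizes

def predict_tokens(canvas, grid_hw, vocab_size=1024, rng=np.random.default_rng(0)):
    """Stand-in for the generative model: returns a token grid of the requested size."""
    return rng.integers(0, vocab_size, size=grid_hw)

canvas = None                                     # start from an empty canvas
for grid_hw in GRANULARITIES:
    tokens = predict_tokens(canvas, grid_hw)      # predict the next, finer granularity level
    canvas = tokens                               # the refined canvas conditions the next level
    print(f"level {grid_hw}: token grid of shape {tokens.shape}")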
STream3R: Scalable Sequential 3D Reconstruction with Causal Transformer
Y. Lan, Y. Luo, F. Hong, S. Zhou, H. Chen, Z. Lyu, S. Yang, B. Dai, C. C. Loy, X. Pan
International Conference on Learning Representations, 2026 (ICLR)
[arXiv]
[Project Page]
STream3R reformulates multi-view 3D reconstruction as a streaming Transformer problem. Using causal attention and feature caching across frames, it incrementally reconstructs dense scene geometry from image streams, enabling scalable and efficient online 3D perception. The model learns geometric priors from large-scale 3D data and achieves strong performance on both static and dynamic scenes.
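The streaming pattern can be sketched as a per-frame loop in which the current frame attends causally to cached features of earlier frames; the module choices and tensor shapes below are assumptions for illustration, not the STream3R architecture.

# Minimal sketch of streaming causal attention with a feature cache.
# Module choices and tensor shapes are illustrative assumptions.
import torch
import torch.nn as nn

dim, tokens_per_frame = 64, 16
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

cache = []                                                 # features of previously seen frames
for t in range(5):                                         # pretend five incoming frames
    frame_feats = torch.randn(1, tokens_per_frame, dim)    # encoder output for frame t
    context = torch.cat(cache + [frame_feats], dim=1)      # causal: past frames plus the current one
    fused, _ = attn(frame_feats, context, context)         # current frame attends to the cache
    cache.append(frame_feats)                              # grow the cache for the next frame
    print(f"frame {t}: fused feature shape {tuple(fused.shape)}")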
4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
Y. Luo, S. Zhou, Y. Lan, X. Pan, C. C. Loy
Technical report, arXiv:2602.10094, 2026
[arXiv]
[Project Page]
4RC is a unified feed-forward framework for 4D reconstruction from monocular videos. It learns a compact spatio-temporal latent representation that jointly models scene geometry and motion, enabling an encode-once, query-anytime paradigm to recover dense 3D structure and motion between arbitrary frames and timestamps efficiently.
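The encode-once, query-anytime interface can be sketched as below; the function names, latent layout, and outputs are placeholders we assume for illustration rather than the 4RC implementation.

# Minimal sketch of an "encode once, query anytime" interface.
# Function names, the latent layout, and the outputs are illustrative placeholders.
import numpy as np

def encode_video(frames):
    """Run the (hypothetical) feed-forward encoder once to obtain a compact latent."""
    return {"latent": np.stack(frames).mean(axis=0), "num_frames": len(frames)}

def query(rep, t_src, t_tgt):
    """Decode geometry and motion between two arbitrary timestamps from the cached latent."""
    assert 0.0 <= t_src <= 1.0 and 0.0 <= t_tgt <= 1.0
    return {"depth_src": rep["latent"], "flow_src_to_tgt": t_tgt - t_src}

frames = [np.zeros((4, 4)) for _ in range(8)]   # placeholder monocular video
rep = encode_video(frames)                      # the expensive step, done once
near = query(rep, 0.00, 0.25)                   # cheap queries, repeated as needed
far = query(rep, 0.25, 1.00)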
VLANeXt: Recipes for Building Strong VLA Models
X. M. Wu, B. Fan, K. Liao, J. J. Jiang, R. Yang, Y. Luo, Z. Wu, W. S. Zheng, C. C. Loy
Technical report, arXiv:2602.10094, 2026
[arXiv]
[Project Page]
VLANeXt systematically studies the design space of vision–language–action (VLA) models and distills practical recipes for building strong robotic policies. By analyzing key components across perception, foundation backbones, and action modeling, it proposes a unified framework that improves task success and generalization on robotic manipulation benchmarks.
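The component breakdown that such a recipe study varies can be sketched as a simple policy interface; the class and function names below are our own illustration, not the VLANeXt code.

# Minimal sketch of a vision-language-action (VLA) policy decomposed into the
# components a recipe study would vary; all names here are illustrative.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class VLAPolicy:
    perceive: Callable[[object], Sequence[float]]               # image -> visual features
    ground: Callable[[Sequence[float], str], Sequence[float]]   # fuse features with the instruction
    act: Callable[[Sequence[float]], Sequence[float]]           # fused features -> action chunk

    def __call__(self, image, instruction):
        feats = self.perceive(image)
        fused = self.ground(feats, instruction)
        return self.act(fused)

# Example wiring with trivial stand-ins for each component.
policy = VLAPolicy(
    perceive=lambda img: [0.0, 1.0],
    ground=lambda feats, text: list(feats) + [float(len(text))],
    act=lambda fused: [v * 0.1 for v in fused],
)
print(policy(image=None, instruction="pick up the red block"))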