MMLab@NTU

Multimedia Laboratory @
Nanyang Technological University
Affiliated with S-Lab

About

MMLab@NTU

MMLab@NTU was formed on 1 August 2018, with a research focus on computer vision and deep learning. Its sister lab is MMLab@CUHK. It is now a group of three faculty members and more than 40 other members, including research fellows, research assistants, and PhD students.

Members of MMLab@NTU conduct research primarily in low-level vision, image and video understanding, creative content creation, and 3D scene understanding and reconstruction. Have a look at the overview of our research. All publications are listed here.

We are always looking for motivated PhD students, postdocs, and research assistants who share our research interests. Check out the careers page and follow us on Twitter.

New Challenges

07/2022: We are hosting the PointCloud-C Challenge (robustness of 3D models, deadline: Sep 19, 2022) and the OmniBenchmark Challenge (generalization of 2D models, deadline: Oct 9, 2022).

ECCV 2022

07/2022: The team has a total of 18 papers (including 3 oral papers) accepted to ECCV 2022.

CVPR 2022

03/2022: The team has a total of 18 papers (including 6 oral papers) accepted to CVPR 2022.

AISG FELLOWSHIP AWARD

12/2021: Haonan Qiu, Bo Li, Yuhan Wang, Siyao Li, Quanzhou Li, and Jianyi Wang have been awarded the competitive AISG Fellowship 2022 to pursue their PhD studies. Congrats!

Check Out

News and Highlights

  • 12/2021: We release MMHuman3D, a new toolbox under OpenMMLab, for the use of 3D human parametric models in computer vision and computer graphics.
  • 09/2021: Kelvin Chan and Fangzhou Hong have been awarded the highly competitive Google PhD Fellowship 2021 in the area “Machine Perception, Speech Technology and Computer Vision”.
  • 09/2021: The team has a total of 8 papers accepted to NeurIPS 2021.
  • 09/2021: Six outstanding ICCV 2021 reviewers from our team! Congrats to Chongyi Li, Kelvin Chan, Jingkang Yang, Liang Pan, Zhongang Cai, and Kaiyang Zhou.
  • 07/2021: We are organizing two challenges in conjunction with the ICCV 2021 Sensing, Understanding and Synthesizing Humans Workshop: the MVP Point Cloud Challenge and the Face Forgery Analysis Challenge. The deadline has passed. Check out the workshop for more details.
  • 07/2021: The team has a total of 11 papers accepted to ICCV 2021 (including one oral).

ECCV 2022

MIPI Workshop

We are organizing a new workshop, Mobile Intelligent Photography and Imaging (MIPI), in conjunction with ECCV 2022. We have invited a lineup of speakers from both academia and industry to share their recent work.

Recent

Projects

Extract Free Dense Labels from CLIP
C. Zhou, C. C. Loy, B. Dai
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page]

We examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. With minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.

Open-Vocabulary DETR with Conditional Matching
Y. Zang, W. Li, K. Zhou, C. Huang, C. C. Loy
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page]

We propose a novel open-vocabulary detector based on DETR, which, once trained, can detect any object given its class name or an exemplar image. This first end-to-end Transformer-based open-vocabulary detector achieves non-trivial improvements over the current state of the art.

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan, F. Hong, M. Zhang, C. C. Loy, L. Yang, Z. Liu
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page] [YouTube]

We contribute HuMMan, a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences and 60M frames. HuMMan has several appealing properties: 1) multi-modal data and annotations, including color images, point clouds, keypoints, SMPL parameters, and textured meshes; 2) a popular mobile device included in the sensor suite; 3) a set of 500 actions designed to cover fundamental movements; 4) support for and evaluation of multiple tasks, such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction.

AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, Z. Liu
ACM Transactions on Graphics, 2022 (SIGGRAPH - TOG)
[arXiv] [Project Page]

We propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP lets lay users customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions, using natural language alone.

Text2Human: Text-Driven Controllable Human Image Generation
Y. Jiang, S. Yang, H. Qiu, W. Wu, C. C. Loy, Z. Liu
ACM Transactions on Graphics, 2022 (SIGGRAPH - TOG)
[arXiv] [Project Page] [YouTube]

We present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps: 1) Given text describing the shapes of clothes, the human pose is first translated to a human parsing map. 2) The final human image is then generated by providing the system with further attributes describing the textures of clothes.

Bailando: 3D Dance Generation by Actor-Critic GPT with Choreographic Memory
S-Y. Li, W. Yu, T. Gu, C. Lin, Q. Wang, C. Qian, C. C. Loy, Z. Liu
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022 (CVPR, Oral)
[PDF] [arXiv] [Supplementary Material] [Project Page] [YouTube]

We address the spatial and temporal challenges of 3D dance generation by proposing a novel framework named Bailando. It is composed of two parts: a choreographic memory, which addresses the spatial constraint by encoding and quantizing dancing-style poses, and an actor-critic GPT, which realizes temporal coherency with music by translating and aligning various motion tempos and music beats.

Video K-Net: A Simple, Strong, and Unified Baseline For End-to-End Dense Video Segmentation
X. Li, W. Zhang, J. Pang, K. Chen, G. Cheng, Y. Tong, C. C. Loy
in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022 (CVPR, Oral)
[PDF] [arXiv] [Supplementary Material] [Project Page]

Video K-Net is a simple, strong, and unified framework for fully end-to-end video panoptic segmentation. It achieves state-of-the-art results on popular benchmarks including Cityscapes-VPS and KITTI-STEP.
