MMLab@NTU was formed on 1 August 2018, with a research focus on computer vision and deep learning. Its sister lab is MMLab@CUHK. It has since grown into a group of three faculty members and more than 40 members, including research fellows, research assistants, and PhD students.
Members of MMLab@NTU conduct research primarily in low-level vision, image and video understanding, creative content creation, and 3D scene understanding and reconstruction. Have a look at the overview of our research. All publications are listed here.
News and Highlights
- 10/2022: Chongyi Li, Shuai Yang, and Kaiyang Zhou have been selected as outstanding reviewers for ECCV 2022. Congrats!
- 09/2022: Call for Papers: IJCV Special Issue on The Promises and Dangers of Large Vision Models. The full-paper submission deadline is March 1, 2023.
- 07/2022: We are hosting PointCloud-C Challenge (robustness of 3D models, deadline: Sep 19, 2022) and OmniBenchmark Challenge (generalization of 2D models, deadline: Oct 9, 2022).
- 07/2022: Our journal paper 'Learning to Enhance Low-Light Image via Zero-Reference Deep Curve Estimation' was selected as the 'Most Popular Article' by IEEE Transactions on Pattern Analysis and Machine Intelligence in July 2022.
- 12/2021: Haonan Qiu, Bo Li, Yuhan Wang, Siyao Li, Quanzhou Li, and Jianyi Wang have been awarded the competitive AISG Fellowship 2022 to pursue their PhD studies. Congrats!
- 12/2021: We release MMHuman3D, a new toolbox under OpenMMLab for using 3D human parametric models in computer vision and computer graphics.
We are organizing a new workshop, Mobile Intelligent Photography and Imaging (MIPI), in conjunction with ECCV 2022 (Sunday, Oct 23). We have invited a cool lineup of speakers from both academia and industry to share their recent work. Come and join us!
VToonify: Controllable High-Resolution Portrait Video Style Transfer
S. Yang, L. Jiang, Z. Liu, C. C. Loy
ACM Transactions on Graphics, 2022 (SIGGRAPH Asia - TOG)
[arXiv] [Project Page] [YouTube] [Demo]
We present a novel VToonify framework for controllable high-resolution portrait video style transfer. VToonify leverages the mid- and high-resolution layers of StyleGAN to render high-quality artistic portraits from multi-scale content features extracted by an encoder, which better preserves frame details. The resulting fully convolutional architecture accepts non-aligned faces in videos of variable size as input, producing complete face regions with natural motions in the output.
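To make the fully convolutional design concrete, here is a minimal PyTorch sketch of such an encoder-decoder pipeline. It is illustrative only, not the VToonify implementation: an encoder keeps content features at several scales, and a convolutional decoder fuses them while upsampling, which is why variable-size, non-aligned input is handled naturally. The real model fuses these features into the mid- and high-resolution layers of a pretrained StyleGAN and modulates them with a style code, which is omitted here.

```python
# Minimal sketch of a VToonify-style encoder-decoder (illustrative only;
# style-code modulation and the pretrained StyleGAN layers are omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentEncoder(nn.Module):
    """Downsamples the input frame, keeping a feature map at every scale."""
    def __init__(self, channels=(3, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            for cin, cout in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # features at 1/2, 1/4, 1/8 resolution
        return feats

class ToonifyDecoder(nn.Module):
    """Fully convolutional decoder that fuses the multi-scale content
    features while upsampling back to the input resolution."""
    def __init__(self):
        super().__init__()
        self.up1 = nn.Sequential(nn.Conv2d(256 + 128, 128, 3, padding=1), nn.LeakyReLU(0.2))
        self.up2 = nn.Sequential(nn.Conv2d(128 + 64, 64, 3, padding=1), nn.LeakyReLU(0.2))
        self.to_rgb = nn.Conv2d(64, 3, 3, padding=1)

    def forward(self, feats):
        f1, f2, f3 = feats
        x = self.up1(torch.cat([F.interpolate(f3, scale_factor=2.0), f2], dim=1))
        x = self.up2(torch.cat([F.interpolate(x, scale_factor=2.0), f1], dim=1))
        return torch.tanh(self.to_rgb(F.interpolate(x, scale_factor=2.0)))

encoder, decoder = ContentEncoder(), ToonifyDecoder()
frame = torch.randn(1, 3, 256, 384)   # any size divisible by 8 works
print(decoder(encoder(frame)).shape)  # torch.Size([1, 3, 256, 384])
```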
We propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data.
We examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. With minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
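To illustrate the core zero-shot step, the sketch below classifies a dense feature map against text embeddings of class names, pixel by pixel. It is not the MaskCLIP code: `dense_feats` stands in for per-pixel features extracted from CLIP's image encoder (obtaining these is the paper's modification), and `text_embeds` for CLIP text embeddings of prompts such as "a photo of a {class}"; both are random placeholders here.

```python
# Zero-shot dense classification with CLIP-style features (illustrative).
import torch
import torch.nn.functional as F

num_classes, dim, H, W = 20, 512, 32, 32
dense_feats = torch.randn(dim, H, W)         # placeholder per-pixel features
text_embeds = torch.randn(num_classes, dim)  # placeholder text embeddings

# Normalize so the dot product below is a cosine similarity, as in CLIP.
dense_feats = F.normalize(dense_feats, dim=0)
text_embeds = F.normalize(text_embeds, dim=1)

# Per-pixel classification: the text embeddings act as a fixed 1x1
# convolutional classifier over the feature map.
logits = torch.einsum("cd,dhw->chw", text_embeds, dense_feats)
segmentation = logits.argmax(dim=0)          # (H, W) map of class indices
print(segmentation.shape)                    # torch.Size([32, 32])
```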
We propose a novel open-vocabulary detector based on DETR which, once trained, can detect any object given its class name or an exemplar image. This first end-to-end Transformer-based open-vocabulary detector achieves non-trivial improvements over the current state of the art.
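As a toy illustration of this kind of conditioning (not the paper's implementation), the sketch below adds a class embedding, e.g. a CLIP embedding of a class name or an exemplar image, to DETR-style object queries; each query then predicts a box plus a binary "matches the conditioning class" score rather than a fixed-class logit. All names and shapes here are assumptions.

```python
# Toy sketch of class-conditioned object queries (illustrative only).
import torch
import torch.nn as nn

dim, num_queries = 256, 100
queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable object queries
class_embed = torch.randn(dim)       # placeholder CLIP embedding of a class
proj = nn.Linear(dim, dim)           # maps CLIP space into query space

# Condition every query on the class we want to detect.
conditioned = queries + proj(class_embed)              # broadcasts over queries

# A DETR decoder would attend to image features with these conditioned
# queries; here we only show the open-vocabulary heads on top.
match_head = nn.Linear(dim, 1)       # "does this box match the class?"
box_head = nn.Linear(dim, 4)         # normalized box coordinates
match_scores = match_head(conditioned).sigmoid()       # (num_queries, 1)
boxes = box_head(conditioned).sigmoid()                # (num_queries, 4)
print(match_scores.shape, boxes.shape)
```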
HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan, F. Hong, M. Zhang, C. C. Loy, L. Yang, Z. Liu
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page] [YouTube]
We contribute HuMMan, a large-scale multi-modal 4D human dataset with 1,000 human subjects, 400k sequences, and 60M frames. HuMMan has several appealing properties: 1) multi-modal data and annotations, including color images, point clouds, keypoints, SMPL parameters, and textured meshes; 2) a popular mobile device is included in the sensor suite; 3) a set of 500 actions designed to cover fundamental movements; 4) multiple tasks, such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction, are supported and evaluated.
We propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and to drive the avatar with the described motions, using natural language alone.
We present a text-driven controllable framework, Text2Human, for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps: 1) given text describing the shapes of clothes, the human pose is first translated into a human parsing map; 2) the final human image is then generated by providing the system with additional attributes describing the textures of clothes.
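At inference time this decomposes into two calls, sketched below with placeholder stubs; the function names and signatures are hypothetical, not the released Text2Human API.

```python
# Hypothetical sketch of the two-stage Text2Human pipeline (stubs only).
import numpy as np

def pose_to_parsing(pose_map: np.ndarray, shape_text: str) -> np.ndarray:
    """Stage 1: translate a pose map into a human parsing map, guided by
    text describing the shapes of clothes. (Stub: returns a dummy map.)"""
    return np.zeros(pose_map.shape[:2], dtype=np.int64)

def parsing_to_image(parsing_map: np.ndarray, texture_text: str) -> np.ndarray:
    """Stage 2: generate the final human image from the parsing map, guided
    by attributes describing clothing textures. (Stub: returns a dummy image.)"""
    h, w = parsing_map.shape
    return np.zeros((h, w, 3), dtype=np.uint8)

pose = np.zeros((512, 256, 3), dtype=np.uint8)  # input human pose image
parsing = pose_to_parsing(pose, "a long-sleeve shirt and long pants")
image = parsing_to_image(parsing, "denim pants and a cotton shirt")
print(parsing.shape, image.shape)               # (512, 256) (512, 256, 3)
```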