MMLab@NTU

Multimedia Laboratory @
Nanyang Technological University
Affiliated with S-Lab

About

MMLab@NTU

MMLab@NTU was formed on 1 August 2018, with a research focus on computer vision and deep learning. Its sister lab is MMLab@CUHK. The group now has three faculty members and more than 40 members, including research fellows, research assistants, and PhD students.

Members of MMLab@NTU conduct research primarily in low-level vision, image and video understanding, creative content creation, and 3D scene understanding and reconstruction. Have a look at the overview of our research. All publications are listed here.

We are always looking for motivated PhD students, postdocs, and research assistants who share our research interests. Check out the careers page and follow us on Twitter.

CVPR 2023

03/2023: The team has a total of 14 papers (including 4 highlights) accepted to CVPR 2023.

View more

ICLR 2023

01/2023: The team has a total of 5 papers (including 2 oral papers and 1 spotlight paper) accepted to ICLR 2023.

View more

Google PhD Fellowship 2022

11/2022: Yuming Jiang and Jiawei Ren have been awarded the highly competitive and prestigious Google PhD Fellowship 2022 in the area of “Machine Perception, Speech Technology and Computer Vision”.

View more

The AI Talks

09/2022: We launched a new initiative, The AI Talks, inviting active researchers from around the globe to share their latest research in AI, machine learning, computer vision, and related fields. Subscribe to the newsletter here.

View more

Check Out

News and Highlights

View more

CVPR 2023

Second MIPI Workshop

The second Mobile Intelligent Photography and Imaging (MIPI) workshop will be held in conjunction with CVPR 2023 (Sunday, June 18). We are organizing several challenge tracks and also calling for workshop papers.

  • Paper submission deadline: Feb 12, 2023
  • Challenge start date: Dec 25, 2022
  • Challenge end date: Feb 20, 2023

Recent

Projects

VToonify: Controllable High-Resolution Portrait Video Style Transfer
S. Yang, L. Jiang, Z. Liu, C. C. Loy
ACM Transactions on Graphics, 2022 (SIGGRAPH ASIA - TOG)
[arXiv] [Project Page] [YouTube] [Demo]

We present VToonify, a novel framework for controllable high-resolution portrait video style transfer. VToonify leverages the mid- and high-resolution layers of StyleGAN to render high-quality artistic portraits from multi-scale content features extracted by an encoder, better preserving frame details. The resulting fully convolutional architecture accepts non-aligned faces in videos of variable size as input, producing complete face regions with natural motions in the output.

Text2Light: Zero-Shot Text-Driven HDR Panorama Generation
Z. Chen, G. Wang, Z. Liu
ACM Transactions on Graphics, 2022 (SIGGRAPH ASIA - TOG)
[arXiv] [Project Page] [YouTube] [Demo]

We propose a zero-shot text-driven framework, Text2Light, to generate 4K+ resolution HDRIs without paired training data.

Extract Free Dense Labels from CLIP
C. Zhou, C. C. Loy, B. Dai
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page]

We examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. With minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.

Open-Vocabulary DETR with Conditional Matching
Y. Zang, W. Li, K. Zhou, C. Huang, C. C. Loy
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page]

We propose a novel open-vocabulary detector based on DETR which, once trained, can detect any object given its class name or an exemplar image. This first end-to-end Transformer-based open-vocabulary detector achieves non-trivial improvements over the current state of the art.

HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan, F. Hong, M. Zhang, C. C. Loy, L. Yang, Z. Liu
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page] [YouTube]

We contribute HuMMan, a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences, and 60M frames. HuMMan has several appealing properties: 1) multi-modal data and annotations including color images, point clouds, keypoints, SMPL parameters, and textured meshes; 2) a popular mobile device included in the sensor suite; 3) a set of 500 actions designed to cover fundamental movements; 4) support for and evaluation of multiple tasks such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction.

AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars
F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, Z. Liu
ACM Transactions on Graphics, 2022 (SIGGRAPH - TOG)
[arXiv] [Project Page]

We propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and to drive the avatar with the described motions using natural language alone.

Text2Human: Text-Driven Controllable Human Image Generation
Y. Jiang, S. Yang, H. Qiu, W. Wu, C. C. Loy, Z. Liu
ACM Transactions on Graphics, 2022 (SIGGRAPH - TOG)
[arXiv] [Project Page] [YouTube]

We present Text2Human, a text-driven controllable framework for high-quality and diverse human generation. We synthesize full-body human images starting from a given human pose in two dedicated steps: 1) given texts describing the shapes of clothes, the human pose is first translated into a human parsing map; 2) the final human image is then generated by providing the system with additional attributes describing the textures of clothes.
