Chen Change Loy is a Nanyang Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He is also an Adjunct Associate Professor at The Chinese University of Hong Kong. He is the Lab Director of MMLab@NTU and Co-associate Director of S-Lab. He received his Ph.D. (2010) in Computer Science from the Queen Mary University of London. Prior to joining NTU, he served as a Research Assistant Professor at the MMLab of The Chinese University of Hong Kong, from 2013 to 2018.
His research interests include computer vision and deep learning with a focus on image/video restoration and enhancement, generative tasks, and representation learning. He serves as an Associate Editor of the International Journal of Computer Vision (IJCV) and IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). He also serves/served as an Area Chair of top conferences such as ICCV 2021, CVPR 2021, CVPR 2019, ECCV 2022 and ECCV 2018. He is a senior member of IEEE.
MMLab@NTU was formed by Chen Change Loy on the 1 August 2018, with a research focus on computer vision and deep learning. Its sister lab is MMLab@CUHK. It is now a group with three faculty members and more than 40 members including research fellows, research assistants, and PhD students. Members in MMLab@NTU conduct research primarily in low-level vision, image and video understanding, creative content creation, 3D scene understanding and reconstruction.Visit MMLab@NTU Homepage
We examine the intrinsic potential of CLIP for pixel-level dense prediction, specifically in semantic segmentation. With minimal modification, we show that MaskCLIP yields compelling segmentation results on open concepts across various datasets in the absence of annotations and fine-tuning. By adding pseudo labeling and self-training, MaskCLIP+ surpasses SOTA transductive zero-shot semantic segmentation methods by large margins.
This paper presents Dense Siamese Network (DenseSiam), a simple unsupervised learning framework for dense prediction tasks. It learns visual representations by maximizing the similarity between two views of one image with two types of consistency, i.e., pixel consistency and region consistency. In contrast to previous methods that require negative pixel pairs, momentum encoders or heuristic masks, DenseSiam benefits from the simple Siamese network and optimizes the consistency of different granularities. It also shows that the simple location correspondence and interacted region embeddings are effective enough to learn the similarity.
We propose a novel open-vocabulary detector based on DETR, which once trained, can detect any object given its class name or an exemplar image. This first end-to-end Transformer-based open-vocabulary detector achieves non-trivial improvements over current state of the arts.
This paper provides a comprehensive study of distilling from the popular StyleGAN-like architecture. Our key insight is that the main challenge of StyleGAN distillation lies in the output discrepancy issue, where the teacher and student model yield different outputs given the same input latent code. Standard knowledge distillation losses typically fail under this heterogeneous distillation scenario. We conduct thorough analysis about the reasons and effects of this discrepancy issue, and identify that the mapping network plays a vital role in determining semantic information of generated images. Based on this finding, we propose a novel initialization strategy for the student model, which can ensure the output consistency to the maximum extent.
CelebV-HQ: A Large-Scale Video Facial Attributes Dataset
H. Zhu, W. Wu, W. Zhu, L. Jiang, S. Tang, L. Zhang, Z. Liu, C. C. Loy
European Conference on Computer Vision, 2022 (ECCV)
[arXiv] [Project Page]
We propose a large-scale, high-quality, and diverse video dataset, named the High-Quality Celebrity Video Dataset (CelebV-HQ), with rich facial attribute annotations. CelebV-HQ contains 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering appearance, action, and emotion. We conduct a comprehensive analysis in terms of ethnicity, age, brightness, motion smoothness, head pose diversity, and data quality to demonstrate the diversity and temporal coherence of CelebV-HQ. Besides, its versatility and potential are validated on unconditional video generation and video facial attribute editing tasks.
StyleGAN-Human: A Data-Centric Odyssey of Human Generation
J. Fu, S. Li, Y. Jiang, K.-Y. Lin, C. Qian, C. C. Loy, W. Wu, Z. Liu
European Conference on Computer Vision, 2022 (ECCV)
[arXiv] [Project Page] [YouTube]
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures. Equipped with this large dataset, we rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
HuMMan: Multi-Modal 4D Human Dataset for Versatile Sensing and Modeling
Z. Cai, D. Ren, A. Zeng, Z. Lin, T. Yu, W. Wang, X. Fan, Y. Gao, Y. Yu, L. Pan, F. Hong, M. Zhang, C. C. Loy, L. Yang, Z. Liu
European Conference on Computer Vision, 2022 (ECCV, Oral)
[arXiv] [Project Page] [YouTube]
We contribute HuMMan, a large-scale multi-modal 4D human dataset with 1000 human subjects, 400k sequences and 60M frames. HuMMan has several appealing properties: 1) multi-modal data and annotations including color images, point clouds, keypoints, SMPL parameters, and textured meshes; 2) popular mobile device is included in the sensor suite; 3) a set of 500 actions, designed to cover fundamental movements; 4) multiple tasks such as action recognition, pose estimation, parametric human recovery, and textured mesh reconstruction are supported and evaluated.
Recovering a textured 3D mesh from a monocular image is highly challenging, particularly for in-the-wild objects that lack 3D ground truths. In this work, we present MeshInversion, a novel framework to improve the reconstruction by exploiting the generative prior of a 3D GAN pre-trained for 3D textured mesh synthesis. Reconstruction is achieved by searching for a latent space in the 3D GAN that best resembles the target mesh in accordance with the single view observation. Since the pre-trained GAN encapsulates rich 3D semantics in terms of mesh geometry and texture, searching within the GAN manifold thus naturally regularizes the realness and fidelity of the reconstruction. Importantly, such regularization is directly applied in the 3D space, providing crucial guidance of mesh parts that are unobserved in the 2D space.
Night photography typically suffers from both low light and blurring issues due to the dim environment and the common use of long exposure. While existing light enhancement and deblurring methods could deal with each problem individually, a cascade of such methods cannot work harmoniously to cope well with joint degradation of visibility and textures. Training an end-to-end network is also infeasible as no paired data is available to characterize the coexistence of low light and blurs. We address the problem by introducing a novel data synthesis pipeline that models realistic low-light blurring degradations. With the pipeline, we present the first large-scale dataset for joint low-light enhancement and deblurring. The dataset, LOL-Blur, contains 12,000 low-blur/normal-sharp pairs with diverse darkness and motion blurs in different scenarios. We further present an effective network, named LEDNet, to perform joint low-light enhancement and deblurring.
We presented BRACE, a new dataset for audio-conditioned dance motion syn- thesis. BRACE was collected to challenge the main assumptions taken by motion synthesis models, i.e. relative simple poses and movement captured with controlled data. We designed an efficient pipeline to annotate poses in videos, striking a good balance between labelling cost and keypoint quality. Our pipeline is flexible and can be adopted for any pose estimation task involving complex movements and dynamic recording settings.
We present a new lighting estimation and editing framework to generate high-dynamic-range (HDR) indoor panorama lighting from a single limited field-of-view (FOV) image captured by low-dynamic-range (LDR) cameras. StyleLight takes FOV-to-panorama and LDR-to-HDR lighting generation into a unified framework and thus greatly improves lighting estimation. StyleLight also enables intuitive lighting editing on indoor HDR panoramas.
Particle-based systems provide a flexible and unified way to simulate physics systems with complex dynamics. Most existing data-driven simulators for particle-based systems adopt graph neural networks (GNNs) as their network backbones, as particles and their interactions can be naturally represented by graph nodes and graph edges. However, while particle-based systems usually contain hundreds even thousands of particles, the explicit modeling of particle interactions as graph edges inevitably leads to a significant computational overhead, due to the increased number of particle interactions. In this paper, we propose a novel Transformer-based method, dubbed as Transformer with Implicit Edges (TIE), to capture the rich semantics of particle interactions in an edge-free manner.
We are looking for Research Fellows who want to do exciting research in the area of machine leanring and computer vision.