3D Human Texture Estimation from a Single Image with Transformers

ICCV 2021, Oral


Abstract

We propose a Transformer-based framework for 3D human texture estimation from a single image. The proposed Transformer is able to effectively exploit the global information of the input image, overcoming the limitations of existing methods that rely solely on convolutional neural networks. In addition, we propose a mask-fusion strategy to combine the advantages of the RGB-based and texture-flow-based models, and we introduce a part-style loss to help reconstruct high-fidelity colors without introducing unpleasant artifacts. Extensive experiments demonstrate the effectiveness of the proposed method against state-of-the-art 3D human texture estimation approaches, both quantitatively and qualitatively.
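The part-style loss is only summarized above; for intuition, below is a minimal PyTorch sketch of one plausible realization, a Gram-matrix style loss restricted to per-part masks. The function names, tensor shapes, and the Gram-matrix formulation are assumptions for illustration, not the exact loss used in the paper.

import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a feature map. feat: (B, C, H, W)."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def part_style_loss(pred_feat, gt_feat, part_masks):
    """Style (Gram) loss computed separately inside each body-part mask.

    pred_feat, gt_feat: (B, C, H, W) features from a fixed network (e.g. VGG).
    part_masks: (B, P, H, W) soft masks, one channel per body part.
    NOTE: a sketch under our assumptions, not the paper's exact formulation.
    """
    loss = 0.0
    for p in range(part_masks.shape[1]):
        m = part_masks[:, p:p + 1]  # (B, 1, H, W), broadcast over channels
        loss = loss + F.mse_loss(gram_matrix(pred_feat * m),
                                 gram_matrix(gt_feat * m))
    return loss / part_masks.shape[1]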

(a) Existing methods are based on CNNs and thus cannot effectively exploit global information. (b) We propose a Transformer-based framework for 3D human texture estimation from a single image, which overcomes this difficulty.

The Framework

Texformer

We propose a Transformer-based framework, termed Texformer, for 3D human texture estimation from a single image. Based on the attention mechanism, the proposed network is able to effectively exploit the global information of the input. This naturally overcomes the limitations of existing algorithms that rely solely on CNNs and facilitates higher-quality 3D human texture reconstruction.
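To make the role of attention concrete, here is a minimal single-head cross-attention unit in PyTorch. Unlike a convolution, whose receptive field is local, every query position here can aggregate information from all key/value positions. The layer names and sizes are illustrative and not the exact Texformer unit.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention: each query position attends to
    *all* key/value positions, i.e. the receptive field is global.
    Illustrative only, not the exact Texformer unit."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, key, value):
        # query: (B, Nq, C); key/value: (B, Nkv, C), e.g. flattened feature maps
        q, k, v = self.to_q(query), self.to_k(key), self.to_v(value)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, Nq, Nkv)
        attn = attn.softmax(dim=-1)
        return attn @ v  # (B, Nq, C)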

The Query is a pre-computed color encoding of the UV space, obtained by mapping the 3D coordinates of a standard human body mesh to the UV space. The Key is the concatenation of the input image and a 2D part-segmentation map. The Value is the concatenation of the input image and its 2D coordinates. We first feed the Query, Key, and Value into three CNNs to transform them into feature space. The multi-scale features are then sent to the Transformer units, which generate the Output features. The multi-scale Output features are processed and fused in another CNN, which produces the RGB UV map T, the texture flow F, and the fusion mask M. The final UV map is generated by blending T with the textures sampled from the input image via F, weighted by the fusion mask M.
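This last step can be sketched as follows, assuming F stores sampling coordinates normalized to [-1, 1] as expected by grid_sample and M is a single-channel mask in [0, 1]; the tensor shapes are our assumptions.

import torch
import torch.nn.functional as F

def fuse_uv_map(rgb_uv, flow, mask, image):
    """Combine the predicted RGB UV map with textures sampled from the
    input image via texture flow, weighted by the fusion mask.

    rgb_uv: (B, 3, H, W)   RGB UV map T predicted by the network
    flow:   (B, H, W, 2)   texture flow F, normalized to [-1, 1]
    mask:   (B, 1, H, W)   fusion mask M in [0, 1]
    image:  (B, 3, Hi, Wi) input image
    Shapes and normalization are assumptions for this sketch.
    """
    sampled = F.grid_sample(image, flow, align_corners=False)  # (B, 3, H, W)
    return mask * rgb_uv + (1.0 - mask) * sampled

Intuitively, the mask lets the model copy flow-sampled pixels where the input image directly observes the body surface, and fall back on the generated RGB map elsewhere.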

Quantitative Evaluation

We compare against state-of-the-art 3D human texture estimation methods: CMR [4], HPBTT [5], RSTG [3], and TexGlo [1]. The proposed Texformer achieves consistently better results than the baseline approaches on all metrics.

Method       CosSim ↑   CosSim-R ↑   SSIM ↑   LPIPS ↓
CMR [4]      0.5241     0.4978       0.7142   0.1275
HPBTT [5]    0.5246     0.5027       0.7420   0.1168
RSTG [3]     0.5282     0.4924       0.6735   0.1778
TexGlo [1]   0.5408     0.5048       0.6658   0.1776
Texformer    0.5747     0.5422       0.7422   0.1154
(↑: higher is better; ↓: lower is better)
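For reference, a sketch of how such metrics can be computed with common libraries (skimage for SSIM, the lpips package for LPIPS). The feature extractor used for the cosine-similarity metric, and all preprocessing details, are assumptions here and may differ from the paper's exact evaluation protocol.

import torch
import lpips                                   # pip install lpips
from skimage.metrics import structural_similarity

def eval_metrics(pred, gt, feat_net):
    """pred, gt: (3, H, W) float tensors in [0, 1] on CPU.
    feat_net: a fixed image-feature extractor (e.g. a person re-ID
    network) for the cosine-similarity metric -- an assumption here.
    """
    # SSIM on HWC numpy arrays
    ssim = structural_similarity(pred.permute(1, 2, 0).numpy(),
                                 gt.permute(1, 2, 0).numpy(),
                                 channel_axis=2, data_range=1.0)
    # LPIPS expects (B, 3, H, W) inputs scaled to [-1, 1]
    loss_fn = lpips.LPIPS(net='alex')
    lp = loss_fn(pred[None] * 2 - 1, gt[None] * 2 - 1).item()
    # Cosine similarity between global image features
    f1, f2 = feat_net(pred[None]), feat_net(gt[None])
    cos = torch.nn.functional.cosine_similarity(
        f1.flatten(1), f2.flatten(1)).item()
    return cos, ssim, lp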

Visual Results

For each example, the image on the left is the input and the image on the right is the rendered 3D human, whose texture is predicted by the proposed Texformer and whose geometry is predicted by RSC-Net [2].
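For readers who want to reproduce this kind of visualization, below is a minimal PyTorch3D sketch that wraps a predicted UV map onto a body mesh and renders it. The mesh source, UV parametrization, camera, and lighting are placeholder assumptions, not the exact setup used for these figures.

import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import (
    FoVPerspectiveCameras, RasterizationSettings, MeshRasterizer,
    SoftPhongShader, MeshRenderer, PointLights, TexturesUV,
)

def render_textured_mesh(verts, faces, verts_uvs, faces_uvs, uv_map, device="cpu"):
    """Render a body mesh textured with a predicted UV map.

    verts: (V, 3), faces: (F, 3)       mesh geometry (e.g. from SMPL / RSC-Net)
    verts_uvs: (Vt, 2), faces_uvs: (F, 3)  UV parametrization of the mesh
    uv_map: (H, W, 3)                  RGB texture in [0, 1]
    All inputs are assumptions for this sketch.
    """
    textures = TexturesUV(maps=uv_map[None].to(device),
                          faces_uvs=faces_uvs[None].to(device),
                          verts_uvs=verts_uvs[None].to(device))
    mesh = Meshes(verts=[verts.to(device)],
                  faces=[faces.to(device)],
                  textures=textures)

    cameras = FoVPerspectiveCameras(device=device)
    raster_settings = RasterizationSettings(image_size=512)
    lights = PointLights(device=device, location=[[0.0, 0.0, 3.0]])
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(cameras=cameras, raster_settings=raster_settings),
        shader=SoftPhongShader(device=device, cameras=cameras, lights=lights),
    )
    return renderer(mesh)  # (1, 512, 512, 4) RGBA image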

Datasets

We use the following datasets in our paper:


Citation

@InProceedings{xu2021texformer,
  author    = {Xu, Xiangyu and Loy, Chen Change},
  title     = {{3D} Human Texture Estimation from a Single Image with Transformers},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
  year      = {2021}
}

References

  1. 3D Human Pose, Shape and Texture from Low-Resolution Images and Videos
    X. Xu, H. Chen, F. Moreno-Noguer, L. A. Jeni, and F. De la Torre
    in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021 (TPAMI)
  2. 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning
    X. Xu, H. Chen, F. Moreno-Noguer, L. A. Jeni, and F. De la Torre
    in Proceedings of European Conference on Computer Vision, 2020 (ECCV)
  3. Re-Identification Supervised Texture Generation
    J. Wang, Y. Zhong, Y. Li, C. Zhang, and Y. Wei
    in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019 (CVPR)
  4. Learning Category-Specific Mesh Reconstruction from Image Collections
    A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik
    in Proceedings of European Conference on Computer Vision, 2018 (ECCV)
  5. Human Parsing Based Texture Transfer from Single Image to 3D Human via Cross-View Consistency
    F. Zhao, S. Liao, K. Zhang, and L. Shao
    in Advances in Neural Information Processing Systems, 2020 (NeurIPS)

Contact

Xiangyu Xu
Email: xiangyu.xu at ntu.edu.sg