Yu Rong

I am currently a Research Engineer of (Meta) Reality Labs Research and was a former intern in Facebook AI Research (FAIR). Prior to that, I obtained the Ph.D. degree from The Chinese University of Hong Kong, Multimedia Laboratory, in September 2021. I received my B.E from computer science and technology department of Tsinghua University in 2016.

My research interests include multimodal LLM, video generation, and 3D computer vision.

Email / Github / Google Scholar / LinkedIn / CV

Industry Experience

Mar. 2022 - Now, Reality Labs Research
Research Engineer. Redmond, WA, U.S.

Build human centric foundation model for processing large-scale images and videos. The model is built upon ViT backbone pretrained on billions images and post-trained with synthetic datasets. Additional prediction heads are added for estimating human motions and dense landmarks. The model is used to process millions videos which are used to train the foundational avatar model.

Jan. 2020 - May. 2020, Facebook AI Research (FAIR)
Research Intern. Menlo Park, CA, U.S.

We use SMPL-X to represent 3D hands and bodies and adopt separate modules for predicting independent hand and body motion first. Hand and body motion predictions are then combined and finetuned to get unified hand and body motion results. Our model runs 10x faster than previous methods with better performance on challenging in-the-wild scenarios with motion blur.

Education

July. 2017 - September 2021, The Chinese Unviersity of Hong Kong
Department of Information Engineering
Doctor of Philosophy

Aug. 2012 - Jul. 2016 , Tsinghua University
Department of Computer Science and Technology,
Bachelor of Engineering

Selected Publications [Full Publication List]

Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining [PDF] [Project Page]

We present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner. we introduce a pre/post-training paradigm for 3D avatar modeling at scale: pretraining on 1M in-the-wild videos to learn broad priors, then post-training on high-quality multi-view studio data for enhanced fidelity.

LUCAS: Layered Universal Codec Avatars
Di Liu, Teng Deng, Giljoo Nam, Yu Rong, Stanislav Pidhorskyi, Junxuan Li, Jason Saragih, Dimitris N. Metaxas, Chen Cao,
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025
[PDF] [Project Page]

We present a layered representation to building avatar with separate hair and face. Our method supports both real-time mesh avatar and high-fidelity Gaussian avatar.

Towards Diverse and Natural Scene-aware 3D Human Motion Synthesis
Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, Bo Dai
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022
[PDF] [Demo]

We present a multi-stage framework to synthesize natural and diverse human motions interacting with given scenes under the guidance of action labels.

Monocular 3D Reconstruction of Interacting Hands via Collision-Aware Factorized Refinements
Yu Rong, Jingbo Wang, Ziwei Liu, Chen Change Loy
International Conference on 3D Vision (3DV), 2021
[PDF] [Project Page] [Code] [Demo]

We present a two-stage framework for reconstructing collision-aware 3D interacting hands from monocular single images. The first stage uses a CNN to generate initial 3D hands and 2D/3D joints. The second stage refines initial results to diminish collisions via factorized refinement.

VoteHMR: Occlusion-Aware Voting Network for Robust 3D Human Mesh Recovery from Partial Point Clouds
Guanze Liu, Yu Rong, Lu Sheng
ACM Multimedia (MM), 2021, Oral
[PDF] [Code] [Demo]

We deisgn a framework, named VoteHMR, to reconstruct reliable 3D human pose and shapes from single-frame partial point clouds obtained from commercial depth sensors such as Kinect.

FrankMocap: A Monocular 3D Whole-Body Pose Estimation System via Regression and Integration
Yu Rong, Takaaki Shiratori, Hanbyul Joo
International Conference on Computer Vision Workshops (ICCVW), 2021
[PDF] [Project Page] [Code] [Demo]

We present a framework, named FrankMocap, to simultaneously capture 3D whole-body motioin (body, face, and hands) from monocular RGB inputs.

Chasing the Tail in Monocular 3D Human Reconstruction with Prototype Memory
Yu Rong, Ziwei Liu, Chen Change Loy
IEEE Transactions on Image Processing (TIP), 2022
[PDF] [Project Page] [Code]

We design a novel framework to increase the 3D human mocap accuracy for challenging poses.

Delving Deep into Hybrid Annotations for 3D Human Recovery in the Wild
Yu Rong, Ziwei Liu, Cheng Li, Kaidi Cao, Chen Change Loy
International Conference on Computer Vision (ICCV), 2019
[PDF] [Project Page] [Code]

We provided sufficient investigation of annotation design for in-the-wild 3D human reconstruction.

Pose-Robust Face Recognition via Deep Residual Equivariant Mapping
Kaidi Cao*, Yu Rong*, Cheng Li, Xiaoou Tang, Chen Change Loy
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
[PDF] [Project Page] [Code]

We presented a Deep Residual EquivAriant Mapping (DREAM) block to improve the performance of face recognition on profile faces.

Academic Activities

Co-organize a workshop on human sensing in computer vision at ICCV 2019.
Serve as reviewer for CVPR 2019~2022, ICCV 2019~2023, ECCV 2018~2022, ICLR 2022, and AAAI 2020/2021.
Serve as reviewer for TPAMI, IJCV, and TIP.