When it comes to estimating 3D models from 2D images, we run into a conflict between memory capacity and accuracy. Neural networks need high-resolution inputs to produce accurate, detailed renders, but high resolution quickly exhausts the memory needed to keep the whole subject in context. So far, applications in this field have favored low-resolution inputs to cover more ground overall. This study takes us a leap forward to a cozy middle ground.
Facebook Research tackles this issue with a multi-level analysis system. A coarse level takes in the whole image, focusing on holistic reasoning about what is where. A fine level then uses that output as a roadmap and, with the help of higher-resolution imagery, puts together a more detailed geometry.
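The two-level idea can be sketched in a few lines. This is a toy stand-in, not the actual architecture: an average-pooled grid plays the role of the coarse network's feature map, a raw pixel lookup plays the role of the fine network, and the fixed 50/50 blend replaces a trained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))          # hypothetical grayscale input

def coarse_level(image_lowres):
    # Global reasoning over the whole (downsampled) image; a 4x4
    # average-pooled grid stands in for a real CNN backbone here.
    H, W = image_lowres.shape
    return image_lowres.reshape(4, H // 4, 4, W // 4).mean(axis=(1, 3))

def fine_level(image_highres, coarse_feats, query_xy):
    # Local detail, conditioned on the coarse level's output (the "roadmap").
    x, y = query_xy
    H, W = image_highres.shape
    local = image_highres[int(y * H), int(x * W)]       # high-res pixel value
    global_ctx = coarse_feats[int(y * 4), int(x * 4)]   # aligned coarse feature
    return 0.5 * local + 0.5 * global_ctx               # toy blend, not a trained net

feats = coarse_level(image)                   # level 1: the whole image at once
value = fine_level(image, feats, (0.5, 0.5))  # level 2: a detailed local query
```

The point of the structure is that the fine level never has to hold the full high-resolution image in memory at once; it only needs local patches plus the compact coarse summary.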
This research is not the only endeavor in this field. Human digitization opens the door to possibilities in a variety of areas, from medical imaging to virtual reality to simply rendering a custom 3D emoji. Until now, this technology has been out of reach for the general public due to limitations like the need for multiple cameras and strict lighting requirements. The team at Facebook Research aims for a highly flexible reconstruction system that maintains high fidelity on details like folds in clothing, fingers, and nuances in facial features.
Previous technology
A notable example is SCAPE, published by Stanford in 2005, which fit pre-modeled meshes to image inputs to produce 3D renders. While these appear detailed on their own, they did not faithfully represent what they were modeling. In this project, however, no 3D geometry is imposed upon the images; instead, geometric context is resolved at progressively higher levels without making premature assumptions. In other words, from coarse input to detailed analysis, missing details are filled in incrementally, and the final determination of the model's geometric properties is only made at the last level.
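Instead of deforming a template mesh, this family of methods represents the surface implicitly: for any 3D point, a network answers "inside or outside the body?" from features sampled at that point's image location. Below is a minimal sketch of such a pixel-aligned query, with a random array as the hypothetical feature map and a single linear readout standing in for the decoder network.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((32, 128, 128))   # hypothetical C x H x W image features

def query_occupancy(point, weights):
    # Pixel-aligned implicit function: occupancy = f(F(x, y), z) in [0, 1].
    # `weights` is a toy linear readout standing in for the real MLP decoder.
    x, y, z = point
    feat = feature_map[:, int(y * 128), int(x * 128)]   # bilinear in a real model
    logit = feat @ weights[:-1] + z * weights[-1]       # tiny linear "network"
    return 1.0 / (1.0 + np.exp(-logit))                 # sigmoid occupancy

w = rng.standard_normal(33)
occ = query_occupancy((0.5, 0.5, 0.1), w)   # query one 3D point
```

Because the geometry only exists as answers to such queries, the mesh itself is extracted at the very end (as a level set of the occupancy field), which is what lets the method defer all geometric commitments to the final level.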
But what about the backside? It remains unobserved in a single-image reconstruction. Missing information would surely mean blurry butt and back estimations, right? Well, the team overcame this issue by inferring backside normals, as they put it: "We overcome this problem by leveraging image-to-image translation networks to produce backside normals. Conditioning our multi-level pixel-aligned shape inference with the inferred back-side surface normal removes ambiguity and significantly improves the perceptual quality of our reconstructions with a more consistent level of detail."
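Mechanically, "conditioning" here just means feeding the inferred normal maps to the shape network alongside the photo. A rough sketch of that plumbing, with a dummy function in place of the actual image-to-image translation network:

```python
import numpy as np

def infer_back_normals(front_image):
    # Stand-in for the image-to-image translation network that
    # hallucinates a plausible backside normal map from the front view.
    H, W, _ = front_image.shape
    normals = np.zeros((H, W, 3))
    normals[..., 2] = -1.0        # toy output: every normal points away (-z)
    return normals

def build_conditioned_input(front_image, front_normals):
    # Remove the back-side ambiguity by stacking RGB with the front and
    # the inferred back normal maps along the channel axis before inference.
    back_normals = infer_back_normals(front_image)
    return np.concatenate([front_image, front_normals, back_normals], axis=-1)

rgb = np.zeros((8, 8, 3))
front_n = np.zeros((8, 8, 3))
conditioned = build_conditioned_input(rgb, front_n)   # 3 + 3 + 3 = 9 channels
```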
If you are interested, they have released a demo on Google Colab, though to be fair, running it requires a certain amount of tech-savviness and a basic familiarity with programming environments.