Facebook Can Make VR Avatars Look—and Move—Exactly Like You
“There’s this big, ugly sucker at the door,” the young woman says, her eyes twinkling, “and he said, ‘Who do you think you are, Lena Horne?’ I said no, but that I knew Miss Horne like a sister.”
It’s the beginning of a short soliloquy from Walton Jones’ play The 1940’s Radio Hour, and as she continues with the monologue, it’s easy to see that the young woman knows what she’s doing. Her smile grows while she goes on to recount the doorman’s change of tune—like she’s letting you in on the joke. Her lips curl as she seizes on just the right words, playing with their cadence. Her expressions are so finely calibrated, her reading so assured, that with the dark background behind her, you’d think you were watching a black-box revival of the late-’70s Broadway play.
There’s only one problem: her body disappears below the neck.
Yaser Sheikh reaches out and stops the video. The woman is a stunningly lifelike virtual-reality avatar, her performance generated by data gathered beforehand. But Sheikh, who heads up Facebook Reality Labs’ Pittsburgh location, has another video he considers more impressive. In it, the same woman appears wearing a VR headset, as does a young man. Their headset-wearing real-life selves chat on the left-hand side of the screen; on the right side, simultaneously, their avatars carry on in perfect concert. As mundane as the conversation is—they talk about hot yoga—it’s also an unprecedented glimpse at the future.
For years now, people have been interacting in virtual reality via avatars, computer-generated characters who represent us. Because VR headsets and hand controllers are trackable, our real-life head and hand movements carry into those virtual conversations, the unconscious mannerisms adding crucial texture. Yet even as our virtual interactions have become more naturalistic, technical constraints have forced them to remain visually simple. Social VR apps like Rec Room and Altspace abstract us into caricatures, with expressions that rarely (if ever) map to what we’re really doing with our faces. Facebook’s Spaces is able to generate a reasonable cartoon approximation of you from your social media photos but depends on buttons and thumbsticks to trigger certain expressions. Even a more technically demanding platform like High Fidelity, which allows you to import a scanned 3D model of yourself, is a long way from being able to make an avatar feel like you.
That’s why I’m here in Pittsburgh on a ridiculously cold early March morning, inside a building very few outsiders have ever set foot in. Yaser Sheikh and his team are finally ready to let me in on what they’ve been working on since they first rented a tiny office in the city’s East Liberty neighborhood. (They’ve since moved to a larger space on the Carnegie Mellon campus, with plans to expand again in the next year or two.) Codec Avatars, as FRL calls them, are the result of a process that uses machine learning to collect, learn, and re-create human social expression. They’re also nowhere near being ready for the public. At best, they’re years away—if they end up being something that Facebook deploys at all. But Sheikh and his colleagues are ready to get this conversation started. “It’ll be big if we can get this finished,” Sheikh says with the not-at-all contained smile of a man who has no doubts they’ll get it finished. “We want to get it out. We want to talk about it.”
In the 1927 essay “The Unconscious Patterning of Behavior in Society,” anthropologist Edward Sapir wrote that humans respond to gestures “in accordance with an elaborate and secret code that is written nowhere, known by none, and understood by all.” Nearly a century later, replicating that elaborate code has become Sheikh’s abiding mission.
Before he came to Facebook, Yaser Sheikh was a Carnegie Mellon professor investigating the intersection of computer vision and social perception. When Oculus chief scientist Michael Abrash reached out to him in 2015 to discuss where AR and VR might be going, Sheikh didn’t hesitate to share his own vision. “The real promise of VR,” he says now, both hands around an ever-present bowl of coffee, “is that instead of flying to meet me in person, you could put on a headset and have this exact conversation that we’re having right now—not a cartoon version of you or an ogre version of me, but looking the way you do, moving the way you do, sounding the way you do.”
(In his founding document for the facility, Sheikh described it as a “social presence laboratory,” a reference to the phenomenon wherein your brain responds to your virtual surroundings and interactions as though they’re real. Then again, he also wrote that he thought they could accomplish photorealistic avatars within five years, using seven or eight people. While the mission remained the same, expectations necessarily changed. So did the name: Oculus Research became known as Facebook Reality Labs last year.)
The theory underlying Codec Avatars is simple and twofold, what Sheikh calls the “ego test” and the “mom test”: You should love your avatar, and your loved ones should as well. The process enabling the avatars is something far more complicated—as I discovered for myself during two different capture procedures. The first takes place in a domelike enclosure called Mugsy, the walls and ceiling of which are studded with 132 off-the-shelf Canon lenses and 350 lights focused toward a chair. Sitting at the center feels like being in a black hole made of paparazzi. “I had awkwardly named it ‘Mugshooter,’” Sheikh admits. “Then we realized it’s a horrible, unfriendly name.” That was a couple of versions ago; Mugsy has increased steadily in both cameras and capability, sending early kludges (like a ping-pong ball hung on a string to help participants hold their face in the right place, the way a ball dangling from a garage ceiling tells you where to stop your car) to deserved obsolescence.
In Mugsy, research participants spend about an hour in the chair, making a series of outsize facial expressions and reading lines out loud while an employee in another room coaches them via webcam. Clench your jaw. Relax. Show all your teeth. Relax. Scrunch up your whole face. Relax. “Suck your cheeks in like a fish,” technical program manager Danielle Belko tells me while I try not to succumb to paralyzing self-consciousness. “Puff your cheeks.”
If the word “panopticon” comes to mind, it should—though it would be better applied to the second capture area, a larger dome known internally as the Sociopticon. (Before joining Oculus/Facebook, Sheikh established its predecessor, Panoptic Studio, at Carnegie Mellon.) The Sociopticon looks a lot like Microsoft’s Mixed Reality Capture Studio, albeit with more cameras (180 versus 106) that are also higher-resolution (2.5K by 4K, versus 2K by 2K) and capture at a higher frame rate (90Hz versus 30 or 60). Where Mugsy concentrates on your face, the Sociopticon helps the Codec Avatar system learn how our bodies move—and our clothes. So my time in there is less about facial expression and more about what I’d describe as Lazy Calisthenics: shaking out limbs, jumping around, playing charades with Belko via webcam.
The point is to capture as much information as possible (Mugsy and the Sociopticon gather 180 gigabytes every second) so that a neural network can learn to map expressions and movements to sounds and muscle deformations, from every possible angle. The more information it captures, the stronger its “deep appearance model” becomes, and the better it can be trained to encode that information as data—and then decode it on the other end, in another person’s headset, as an avatar. As anyone who struggled through video compression woes in the early days of the internet knows, that’s where the “codec” in Codec Avatars comes from: coder/decoder.
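To make the coder/decoder idea a little more concrete, here is a minimal, purely illustrative sketch in Python. It is not FRL’s pipeline: the real system relies on deep neural networks trained on Mugsy and Sociopticon captures, while this toy stands in a simple linear (PCA-style) encoder and decoder, and the array sizes are invented for the example. What it does show is the basic codec flow the article describes—compress a high-dimensional capture into a small code on the sender’s side, transmit only that code, and reconstruct an approximation on the receiver’s side.

```python
# Toy "codec" sketch: compress a face capture into a small code, then decode it.
# Linear PCA stands in for the deep appearance model; all dimensions are made up.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "captures": 500 frames, each a 10,000-dimensional face measurement.
captures = rng.normal(size=(500, 10_000))

# "Training": learn a low-dimensional basis from the captured frames.
mean = captures.mean(axis=0)
_, _, components = np.linalg.svd(captures - mean, full_matrices=False)
basis = components[:128]          # keep a 128-number latent code per frame

def encode(frame: np.ndarray) -> np.ndarray:
    """Sender side: compress one capture frame into a compact code."""
    return basis @ (frame - mean)

def decode(code: np.ndarray) -> np.ndarray:
    """Receiver side: reconstruct an approximate frame from the code."""
    return mean + basis.T @ code

frame = captures[0]
code = encode(frame)              # this small code is all that gets transmitted
reconstruction = decode(code)
print(code.size, "numbers stand in for", frame.size)
```

In the sketch, only the 128-number code crosses the wire; the receiver’s decoder, trained on the same data, turns it back into a likeness—which is why the capture stage has to be so exhaustive in the first place.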
It’s not just raw measurements. As research scientist Jason Saragih tells me, the data has to be interpreted. Regular users won’t have Mugsy and the Sociopticon in their living rooms, after all—they’ll only have their VR and AR headsets. While VR wearables of today are known as “head-mounted displays,” researchers at FRL have created a line of “HMCs,” or head-mounted capture systems. Known internally as Argent, these HMCs point infrared LEDs and cameras at various areas of the face, allowing the software to re-constellate them into the person’s likeness.
Someday soon, Sheikh and his team want to be able to extend that face scan to the whole body, so the software will need to be able to work around what Saragih calls “extrinsics”—the weirdnesses that would otherwise make a virtual interaction less lifelike. If it’s dark where you are, for example, the system needs to be able to compensate. If you move your hand behind your back, the system needs to be able to account for that so that if your friend walks behind you (in VR), they can see what your hand is doing. There are other challenges too, like predicting how you move in order to keep your avatar’s motion as smooth as possible, but they’re all aimed at removing the variables and letting your avatar be an unfettered, undiluted representation of you.
Animating people is hard. That’s just the truth. Even mega-blockbuster videogames struggle with things like hair, eyes, and the inside of the mouth—and errant paths lead straight into the uncanny valley, that visceral discomfort brought on by seeing something that looks almost but not quite human. After my experience with the capture process, when I put a headset on to chat live with Sheikh and researcher Steve Lombardi, I’m fully expecting the reality of the virtuality to fall into that same trap.
Nope. Sheikh’s avatar doesn’t have the beard or owlishly round glasses he wears in real life (ostensibly they’re harder to get right, so he did the capture without them), but it’s him. It’s him so much that when he invites me to lean in and take a closer look at the stubble on his face, it feels incredibly invasive to do so. It’s so much Steve Lombardi that, when he later walks into the room for real, I feel like I already know him—despite never having met him in the flesh. The results aren’t perfect. When people are speaking excitedly, their avatars’ mouths don’t move quite as much as their tone would suggest; hair is visible to the individual strand, but has a hazy aura around it; tongues look a bit fuzzy. But the aggregate effect is overwhelmingly something along the lines of this shouldn’t be possible.
That’s a marvelous thing to experience. Troubling, too. While Codec Avatars are still little more than a research project, we’re learning about them at an uncertain time. Deepfakes, AI so powerful it can create faces from nothing, data privacy, misinformation campaigns, and toxic behavior have all become very real issues on a very real internet—and as VR and AR begin to make inroads toward becoming humanity’s dominant communications platforms, funded by a social media company that’s been at the epicenter of some of those issues, they’ll become even more pressing. You thought harassment was bad online? You thought VR, which adds embodiment and personal space to the mix, made it even more viscerally disturbing? You ain’t seen nothing yet.
Sheikh understands the concern. “Authenticity isn’t just crucial to the success of this, it’s crucial to protecting users as well,” he says. “If you get a call from your mother and you hear her voice, there isn’t an iota of doubt in your mind that what she says is what you hear, right? We have to build that trust and maintain it from the start.” He cites the sensors on the HMCs as a crucial means of authentication—our eyes, voices, even mannerisms are all biometrics of a sort. (Which, yes, allays one concern but also intensifies another.) Conversations around data privacy and VR have been growing louder over the past few years, but a breakthrough like this may well turn them up to 11.
For all the progress VR has made over the past decade, a thing like Codec Avatars represents a transition to an entirely new phase of experience—and those in the company who have seen it know that. Each year at the Oculus Connect developer conference, Michael Abrash gets onstage and gives a state of the union about the pace of research and innovation in the company’s research labs. Over time, he’s settled into being bullish on some VR breakthroughs, bearish on others. This past October, though, one of his habitually ursine stances started growing horns. “I’m not betting on having convincingly human avatars within four years,” he said, “but I’m no longer betting against it either.”
Sitting with Yaser Sheikh now, I ask him how he’d felt about Abrash’s proclamation at the time.
“He’s right,” he says, smiling and sipping his coffee.