This is part one of a three-part series based on a presentation we were invited to give at the University of Massachusetts Amherst.
The goal of the series is to outline, at a very high level and for a non-technical audience, the requirements for what Michael Abrash described as “Hard Augmented Reality (AR)” back in 2012.
Part two will cover Display Technology with part three covering the Content and Infrastructure Ecosystem.
Disclaimer: This series will not offer a solution to the problem of rendering black, and it is not a comprehensive roadmap to Hard AR.
A complex dance
When describing how computers understand their environment in an Augmented Reality context, it’s important to recognize that there is a delicate balance between many different systems, all working together in a constant feedback loop. Mapping, Tracking, and Loop Closure are the pieces in that system. Taken together, these pieces form the problem set of Simultaneous Localization and Mapping (SLAM). The simplified flowchart below should help when thinking about the process:
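For readers who prefer code to flowcharts, here is a rough sketch of that feedback loop; the function and variable names are hypothetical placeholders of our own invention, not Pair’s actual implementation:

```python
# A toy sketch of the SLAM feedback loop: on every camera frame we track the
# viewer against the map, extend the map from the new viewpoint, and check
# for loop closures. All names here are illustrative placeholders.

def estimate_pose(frame, world_map):
    """Tracking: figure out where the viewer is by matching the frame to the map."""
    return frame.get("pose_guess")

def extend_map(frame, pose, world_map):
    """Mapping: add newly observed geometry, positioned using the current pose."""
    world_map.extend(frame.get("new_points", []))

def close_loops(pose, world_map):
    """Loop closure: if this place has been seen before, correct accumulated drift."""
    return pose

def slam_step(frame, world_map):
    pose = estimate_pose(frame, world_map)   # where am I?
    extend_map(frame, pose, world_map)       # what does the world look like from here?
    return close_loops(pose, world_map)      # have I been here before?

# Example: feed a stream of (fake) frames through the loop.
world_map = []
for frame in [{"pose_guess": (0, 0, 0), "new_points": [(1.0, 0.0, 2.0)]}]:
    pose = slam_step(frame, world_map)
```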
Mapping the environment
In order to create an illusion of a virtual object that looks as real as a physical object, any AR system needs to understand its surroundings in a very precise way.
A real-time, highly accurate, 3D scanned map of the user’s environment is an ideal foundation for environmental understanding.
Sidenote: It is unclear whether sub-millimeter resolution mapping will be necessary for consumers to adopt AR, but we doubt it. We estimate that sub-centimeter scan resolution will be acceptable for the overwhelming majority of consumer applications. This is an important distinction when it comes to evaluating what kind of scanning process to build for an AR system.
Methods for mapping include passive image processing, which evaluates images and video; active systems, which send out some kind of emission such as infrared light or ultrasound; or some combination of the two. The look and feel of the device, as well as consumer safety, are major considerations when determining what mix is required between active and passive mapping systems. Because Pair runs on consumer phones and tablets with no accessories, our system is purely passive.
The simplest step up from a purely passive mapping system is typically the integration of a small form-factor infrared (IR) depth sensor, which consists of an IR LED and an IR depth camera. Consumer hardware such as the Kinect and the Structure Sensor, or platforms such as Google Tango and Intel RealSense, are the most readily available IR depth systems on the market. The benefits of IR depth systems are arguably shrinking as passive systems get more robust at mapping and tracking. However, one place IR will always beat passive visual mapping and tracking is in dark environments, as passive systems need to be able to take quality images to work.
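To show concretely what a depth sensor buys you, the sketch below back-projects a depth image into a 3D point cloud using the standard pinhole camera model; the intrinsics (fx, fy, cx, cy) are made-up example values, not those of any particular sensor:

```python
# Each pixel of a depth image can be turned into a 3D point once you know the
# camera intrinsics, which is why a depth sensor makes mapping so much easier.
import numpy as np

def depth_to_points(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """Convert an HxW depth image (in metres) into an Nx3 array of 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels with no reading

# Example: a fake 480x640 depth image where everything is 2 metres away.
cloud = depth_to_points(np.full((480, 640), 2.0))
```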
More advanced mapping methods use other active emitters, such as the ultrasonic transducers seen on small robots or the structured light systems used in handheld 3D scanners. These emitters can provide very high-resolution mapping in nearly any type of environment, which makes them the most robust of the mapping implementations. That quality comes at the cost of a larger form factor, a larger processing footprint, and a marginal but non-zero increase in risk to the user from emissions. The major advantage we see with these active systems is that, properly implemented, they can map behind, underneath, and around objects, making better maps with less work.
Tracking the user
Along with a virtual map of the environment, tracking (localizing) where the user is as they move around the environment is a critical component. The process of extracting that viewpoint is called pose estimation. The primary debate around tracking is whether “Inside-Out” or “Outside-In” tracking is the best approach.
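To make “pose” a little more concrete, here is a minimal sketch using the common convention of a 4x4 rigid transform; the motion numbers are invented for illustration:

```python
# A pose is just the rotation and position of the camera in the world.
# Frame-to-frame tracking estimates a small relative motion and composes it
# onto the previous pose.
import numpy as np

def make_pose(rotation, translation):
    """Build a 4x4 camera-to-world transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose

# Start at the world origin, looking straight ahead.
pose = make_pose(np.eye(3), np.zeros(3))

# Suppose tracking estimates that the camera moved 5 cm forward this frame.
relative_motion = make_pose(np.eye(3), np.array([0.0, 0.0, 0.05]))

# The new pose is the old pose composed with that relative motion.
pose = pose @ relative_motion
print(pose[:3, 3])   # camera position in the world: [0. 0. 0.05]
```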
A pose estimate can be derived from the environmental mapping system as the user moves, but this data does not fall out of mapping by default; complex processing systems are needed to make the whole process work. This approach is considered Inside-Out, as all of the tracking data comes from inside the user’s system rather than from external hardware doing time-difference triangulation. Describing inside-out tracking and odometry much further than this would go beyond the scope of this post.
In contrast, Outside-In systems typically use infrared or, in some cases, low-power lasers to track reflective surfaces on the user, like the Lighthouse system used with the Vive and other, more contained VR systems.
However, for AR there are significant logistical problems with this approach as soon as you step out of a controlled environment: for outside-in tracking to work everywhere, the entire world would need to be fitted with these emitters, and we don’t think that’s reasonable.
Needless to say, inside-out is much more difficult to do correctly, but has the advantage of working anywhere all the time and is the approach we take with Pair.
Relocalization aka “Loop Closure”
Ok, so we have a virtual map of the environment and we know where the viewer is within it. So far we aren’t seeing anything new in this system though, just a representation of the world around us. So now we want to add something virtual to the world so that we get some benefit out of this whole thing.
For simplicity’s sake, we won’t describe how to build the virtual 3D world beyond saying that it is identical to how you would do it for any 3D video game. The previous two sections described how to get the first-person view within the 3D world correct, and this section will cover how we make sure it stays correct.
A certain “stickiness” needs to be built into that virtual world in order to keep the virtual map and the real environment “stuck” together so that objects seem real. If the virtual map is not corrected against the real world, then any virtual object rendered within it will always be mismatched with the real world.
To solve this, a process called relocalization or “loop closure” verifies that the virtual world matches the real one. Loop closure detects when the viewer returns to a previously visited place and corrects the position of the virtual world, fixing any mismatches between the virtual map and the real world.
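As a toy illustration of the idea, loop closure boils down to recognizing a previously seen place and measuring how far the estimate has drifted from it. Real systems use image descriptors and pose-graph optimization; the vectors and place names below are placeholders we made up:

```python
# Compare a descriptor of the current view against previously visited places;
# on a match, measure the drift so the virtual map can be snapped back into
# alignment with the real world.
import numpy as np

visited_places = {
    "kitchen": {"descriptor": np.array([0.9, 0.1, 0.0]), "position": np.array([0.0, 0.0, 0.0])},
    "hallway": {"descriptor": np.array([0.1, 0.9, 0.2]), "position": np.array([3.0, 0.0, 0.0])},
}

def detect_loop(current_descriptor, threshold=0.3):
    """Return the name of a previously visited place that matches the current view, if any."""
    for name, place in visited_places.items():
        if np.linalg.norm(current_descriptor - place["descriptor"]) < threshold:
            return name
    return None

# The tracker thinks we are at (3.2, 0, 0.1), but the current view matches "hallway".
estimated_position = np.array([3.2, 0.0, 0.1])
match = detect_loop(np.array([0.12, 0.88, 0.2]))
if match:
    drift = estimated_position - visited_places[match]["position"]
    estimated_position = estimated_position - drift   # a full system spreads this correction over the whole map
```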
Think of it like this: If you close your eyes and walk around your house, no matter where you stop and open your eyes, you will know where you are because you have been there before. The critical piece here is that places need to be seen by an AR system before we can interact with them virtually – which is a large subject of discussion in and of itself.
This process must be repeated constantly in order to glue these two worlds together. Below you can see all of these elements exploded out into their component parts: virtual objects on top, a scan of the world in the middle, and the real world at the bottom:
In brief, any robust AR system needs to do three things in order to work effectively: create a virtual map of the real environment, track where the viewer is inside that virtual map, and keep those systems stuck together.
Next: Augmented Reality Displays