Volumetric Training

Four ways to show somebody how to service a motorcycle engine, compared head to head. A volumetric recording of the instructor beat an avatar, a hand, and an annotation on almost everything that mattered.

The three systems before this one in the series were about guidance: a remote expert helping someone find or move a thing, in the moment. This one asks a harder question. Can any of it teach you a skill you did not have?

That changes the requirements. Guidance can be vague and still work, because the expert is right there to correct you. Instruction has to survive on its own, and the thing being taught has to be complex enough that getting it wrong matters. So I picked a real one: servicing a 2008 Hyosung GT250R motorcycle engine, with tasks lifted from the actual service manual.

Conventional training and remote collaboration systems allow users to see each other's faces, heightening the sense of presence while sharing content like videos or slideshows. However, these methods lack depth information and a free 3D perspective of the training content. This paper investigates the impact of volumetric playback in a Mixed Reality (MR) spatial training system. We describe the MR system in a mechanical assembly scenario that incorporates various instruction delivery cues. Building upon previous research, four spatial instruction cues were explored: "Annotation", "Hand gestures", "Avatar", and "Volumetric playback". Through two user studies that simulated a real-world mechanical assembly task, we found that the volumetric visual cue enhanced spatial perception in the tested MR training tasks, exhibiting increased co-presence and system usability while reducing mental workload and frustration. We also found that the given tasks required less effort and mental load when eye gaze was incorporated. Eye gaze on its own was not perceived to be very useful, but it helped to complement the hand gesture cues. Finally, we discuss limitations, future work and potential applications of our system.

We developed a system capable of recording and replaying instructions using four different visual cues. The prototype system was built with an HMD (HTC Vive Pro Eye), three depth cameras (Azure Kinect), and Unity running on a Windows PC. For the assembly task, a motorcycle engine from a 2008 Hyosung GT 250R was used. The tasks were inspired by general engine maintenance procedures from the motorcycle's service manual.

The see-through video capabilities of the HTC Vive were chosen over an optical see-through AR headset due to the need for high tracking precision over a large area. Portable HMD devices were also ruled out, as volumetric playback requires significant computational resources not available on mobile hardware.

For recording instructions, the instructor performed real-world tasks while wearing the HMD, with actions saved to the PC. Annotations were recorded by moving the HTC Vive controller along the desired path while pressing the trigger; the controller's time and position were saved and played back in time-space synced format. Gestures were captured using the front-facing cameras of the Vive HMD and played back to recreate the instruction. For the Avatar representation, the instructor wore the HMD while holding both controllers; the position and rotation of the HMD and controllers were recorded over time and applied to an inverse-kinematics-rigged skeleton. Volumetric capture used the Azure Kinect cameras, with the instructor performing tasks while being captured by three cameras placed 1.2 m apart at the vertices of an equilateral triangle for optimum coverage.

A total of 15 basic motorcycle engine maintenance tasks were selected from the service manual. An informal trial with three participants had them perform all fifteen tasks across all four conditions, with completion time and self-rated difficulty recorded. Harder tasks consistently took longer. Based on this, the task set was narrowed to nine tasks classified into three difficulty levels (hard, medium, and easy) based on the number of individual actions required. For example, a medium task involved placing the oil filler cap in its corresponding location, while a hard task required removing a banjo bolt connecting the radiator feed to the engine, threading it through, and tightening it back. In any given condition, participants received a random subset of six instructions, two per difficulty level. Audio and visual confirmation followed each completed instruction before the system advanced. Completion time was logged at the end of each condition.

This paper presents an MR system for enhanced training that features annotation, hand gestures, avatar representation, and eye gaze as instruction delivery cues. Participants felt more connected with the instructor when volumetric playback was used. Volumetric playback significantly improved Social Presence and system usability while reducing mental workload and frustration compared to avatar representation, and was the most preferred visual cue overall.

In the follow-up study comparing eye gaze with hand gestures, participants reported that seeing both cues simultaneously reduced mental load and effort, as they complemented each other when one cue was lacking. The gesture-only interface was generally ranked highest; eye gaze added limited value and sometimes caused distraction through misleading information or obstructed views. Hand gestures are more effective than eye gaze alone in AR assembly training. Although combining both cues may not improve task performance, the combination is preferable as it reduces overall workload.

The pattern across the series

That eye gaze result is the third time this thread has landed on the same finding. The CHI study found workers describing gaze as redundant next to gesture. Here it is again: useful as a complement, close to useless alone. Gaze is cheap to capture and it keeps not being the thing that helps, because a continuous stream of where somebody's eyes went is not the same as a decision about what matters.

Volumetric playback wins for the opposite reason. It is expensive, three Azure Kinects and a machine that can keep up, and it carries the whole instructor: posture, approach angle, which hand does what, how hard they are pulling. An avatar approximates that and a hand gesture abstracts it. For a task where the difference between right and wrong is how you seat a banjo bolt, the full recording is worth what it costs.

The catch is that it is not portable. Volumetric playback needs computers that were not going in anybody's bag in 2021, which is the constraint the last post in this series runs straight into.

The paper is Spatial Perception Enhancement in Assembly Training Using Augmented Volumetric Playback, Frontiers in Virtual Reality 2021, with Soumith Chittajallu, Navindd Raj, Huidong Bai and Mark Billinghurst.

Read the next post in this series here:
Remote Collaboration: Bringing 3D to Zoom

Volumetric Training

Github

Series

Tags

Four ways to show somebody how to service a motorcycle engine, compared head to head. A volumetric recording of the instructor beat an avatar, a hand, and an annotation on almost everything that mattered.

The pattern across the series

You might also like …

You might also like …