Volumetric Training

2 January 2021 20 mins read

Volumetric visual cue enhanced spatial perception in the tested MR training tasks, exhibiting increased co-presence and system usability while reducing mental workload and frustration.

Conventional training and remote collaboration systems allow users to see each other's faces, heightening the sense of presence while sharing content like videos or slideshows. However, these methods lack depth information and a free 3D perspective of the training content. This paper investigates the impact of volumetric playback in a Mixed Reality (MR) spatial training system. We describe the MR system in a mechanical assembly scenario that incorporates various instruction delivery cues. Building upon previous research, four spatial instruction cues were explored: "Annotation", "Hand gestures", "Avatar", and "Volumetric playback". Through two user studies that simulated a real-world mechanical assembly task, we found that the volumetric visual cue enhanced spatial perception in the tested MR training tasks, exhibiting increased co-presence and system usability while reducing mental workload and frustration. We also found that the given tasks required less effort and mental load when eye gaze was incorporated. Eye gaze on its own was not perceived to be very useful, but it helped to complement the hand gesture cues. Finally, we discuss limitations, future work and potential applications of our system.

We developed a system capable of recording and replaying instructions using four different visual cues. The prototype system was built with an HMD (HTC Vive Pro Eye), three depth cameras (Azure Kinect), and Unity running on a Windows PC. For the assembly task, a motorcycle engine from a 2008 Hyosung GT 250R was used. The tasks were inspired by general engine maintenance procedures from the motorcycle's service manual.

The see-through video capabilities of the HTC Vive were chosen over an optical see-through AR headset due to the need for high tracking precision over a large area. Portable HMD devices were also ruled out, as volumetric playback requires significant computational resources not available on mobile hardware.

For recording instructions, the instructor performed real-world tasks while wearing the HMD, with actions saved to the PC. Annotations were recorded by moving the HTC Vive controller along the desired path while pressing the trigger; the controller's time and position were saved and played back in time-space synced format. Gestures were captured using the front-facing cameras of the Vive HMD and played back to recreate the instruction. For the Avatar representation, the instructor wore the HMD while holding both controllers; the position and rotation of the HMD and controllers were recorded over time and applied to an inverse-kinematics-rigged skeleton. Volumetric capture used the Azure Kinect cameras, with the instructor performing tasks while being captured by three cameras placed 1.2 m apart at the vertices of an equilateral triangle for optimum coverage.

A total of 15 basic motorcycle engine maintenance tasks were selected from the service manual. An informal trial with three participants had them perform all fifteen tasks across all four conditions, with completion time and self-rated difficulty recorded. Harder tasks consistently took longer. Based on this, the task set was narrowed to nine tasks classified into three difficulty levels (hard, medium, and easy) based on the number of individual actions required. For example, a medium task involved placing the oil filler cap in its corresponding location, while a hard task required removing a banjo bolt connecting the radiator feed to the engine, threading it through, and tightening it back. In any given condition, participants received a random subset of six instructions, two per difficulty level. Audio and visual confirmation followed each completed instruction before the system advanced. Completion time was logged at the end of each condition.

This paper presents an MR system for enhanced training that features annotation, hand gestures, avatar representation, and eye gaze as instruction delivery cues. Participants felt more connected with the instructor when volumetric playback was used. Volumetric playback significantly improved Social Presence and system usability while reducing mental workload and frustration compared to avatar representation, and was the most preferred visual cue overall.

In the follow-up study comparing eye gaze with hand gestures, participants reported that seeing both cues simultaneously reduced mental load and effort, as they complemented each other when one cue was lacking. The gesture-only interface was generally ranked highest; eye gaze added limited value and sometimes caused distraction through misleading information or obstructed views. Hand gestures are more effective than eye gaze alone in AR assembly training. Although combining both cues may not improve task performance, the combination is preferable as it reduces overall workload.

You might also like the following posts …