Immersion Analysis Through Eye-Tracking and Audio in Virtual Reality

In this study, we use a Head-Mounted Display (HMD), one of the biggest advantages of a Virtual Reality (VR) environment, to track the user's gaze in 360° video content and examine how the gaze pattern is distributed according to the user's immersion. Using a questionnaire to separate content with high user immersion from content with low user immersion, we analyzed the gaze pattern distribution of each and confirmed that the higher the immersion, the more the gaze distribution tends to be concentrated in the center of the screen. Through this experiment, we were able to understand the factors that make users immerse themselves in the VR environment, among which the audio of the content proved important. Furthermore, we found that the shape of the gaze distribution used to gauge the degree of immersion differed by the subject of the content. While reviewing the experimental results, we also confirmed the need for research on recognizing specific objects in a VR environment.


Introduction
As technology has progressed, smart cities and the Internet of Things (IoT) have been actively studied [1][2][3]. Several services such as remote healthcare services and IoT remote-home control services based on 5G mobile networks have been developed. These services have certain specific characteristics: they need higher network speeds and well-organized networks while improving mobility and providing flexibility and freedom of location. In particular, online services such as non-face-to-face lectures are a major breakthrough accomplished using smart city technology. First, they give people the means to use their time efficiently and freely. Second, these services enable people to maintain their regular routines even in the event of unforeseen emergencies such as epidemics or natural disasters. Given these advantages, several efforts have focused on improving the quality of online services, one of which is to use Virtual Reality (VR) technology.
VR using a Head-Mounted Display (HMD) provides more immersive content than existing 2D content, and research is underway to increase user immersion with HMDs by various means. For example, studies that deduce facial expressions from pupil movements and the shape of the eyes captured by eye-tracking cameras show that the quality of interactions between users and VR content has improved [4][5][6].
Several technological innovations have been applied in response to the COVID-19 pandemic. To prevent the spread of the virus, telecommuting, non-face-to-face lectures, and online meetings are being actively employed [7][8]. These measures have been positively evaluated regarding work efficiency. However, they have received more criticism in the educational domain. A typical reason is that it is difficult for the teacher to gauge the students' attention unless the teacher and the students are in the same physical space. For the same reason, students may have difficulty communicating immediately with the teacher.
In this study, we investigate the degree of immersion in a virtual environment presented as 360-degree video, using the user's gaze-tracking coordinate data. Building on existing research on eye-tracking data and immersion in 2D content, we check whether the same findings apply to VR content. In addition, given that sound is known to improve immersion in 2D content, we examine how the sound of VR content affects immersion compared with 2D content.

Related Works
To improve user immersion in 2D content, research was conducted on optimal localization with sound effects, gaze tracking, and environmental changes. Among these studies, descriptions of the content were added through sound, and it was confirmed that the degree of immersion clearly differs with the user's environment. Reference [9] proposed user modeling according to propensity by combining the user's location-positioning data, obtained using Wi-Fi Received Signal Strength (RSS) technology, with gaze-tracking data. Experiments on smartphones, tablet PCs, and laptops were used to determine how disposition and screen size differ with the distance between the user and the media. In addition, by classifying spaces such as public places or study rooms, the correlation between immersion score and concentration was derived based on the number of people sharing a space.
The COVID-19 epidemic is driving non-face-to-face activities such as social distancing and remote classes. As a result, research on improving content immersion in VR environments has become more active. With an HMD device, VR can deliver content to users with better immersion. Therefore, combining VR with various types of content such as games, education, and advertisements has been studied [10][11][12][13][14]. Specifically, from an educational perspective, VR can improve the quality of online lectures. Reference [15] studied VR for education, focusing on the use of immersive VR. In addition, research on experiencing 360° content in VR is also underway. Following this trend, studies on faster and simpler real-time streaming and on algorithms for video-frame stabilization and focus assistance for 360° content are also being conducted [16][17][18].
Eye-tracking technology tracks the user's gaze and measures eyeball position and movement. It is used in research fields such as psychology and cognitive linguistics [19]. Based on these studies, eye-tracking was examined in Human-Computer Interaction (HCI) for applications such as marketing. It was also used to determine what types of content users found more interesting in a desktop environment, rather than a VR environment, and what elements were eye-catching on a webpage [20]. There has also been a considerable amount of research into applying these eye-tracking techniques to the education and medical fields to remove the limitations of location [21][22][23].
The questionnaire in our experiment was taken from Reference [24], with several modifications because our content cannot be manipulated. Reference [24] conducted a total of three experiments on VR game content. Through their experiments, they confirmed that both objective measurement through eye movement and subjective measurement through a questionnaire can be effective. In addition, both positive and negative emotions had a strong influence when participants became immersed. The experiment in this paper directly visualizes participants' eye movement and shows the difference in the distribution of gazes in the HMD according to the degree of immersion, measured through questionnaire items of proven effectiveness. Fig. 1 below shows heatmaps on a 2D webpage and a 3D 360° video.

Experiment
There were 20 experiment participants in total, all of whom were in their 20s. There were 13 males and seven females. Due to COVID-19 restrictions, the experiment was conducted in a separate room. Participants always wore masks for the experiment and their masks did not interfere with the HMD device or earphones. The experiment was conducted following social distancing protocols; during the experiment, no one including the author was allowed to enter the separate room except where unavoidable. Participants were able to immediately stop the experiment if they felt abnormal in any way or were severely unable to concentrate.

360° Video Contents and Questionnaire
The subjects of the 360° video contents for the experiment were sports and travel. Each participant experienced each subject twice, once with sound and once without, so four pieces of content were experienced in total. Tab. 1 above shows example contents, and Fig. 2 below shows images of a participant's eyes, the HMD's location, and the gaze coordinates. While experiencing the content, subjects could quit and end the experience at any time. In addition, the subjects were allowed sufficient rest time before the next content was played. Immediately after each piece of content finished, the subject was asked to respond to a questionnaire, a modified version of the survey in Reference [21]; its content can be found in the Appendix.

Experiment Environment
In this study, we experimented with two subjects of 360° content in a VR environment. Tab. 2 shows the environment of our experiment. We used the FOVE VR eye-tracking HMD as the main display equipment and Apple's Bluetooth earphones, AirPods Pro, as the main audio equipment.

Data Collection
FOVE's SDK was used to gather eye-tracking data used in the experiment. Fig. 3 below is the screen that can be seen when executing FOVE's debug tool.
Tab. 3 describes the values of X, Y, Z, and W that can be seen in box (1) of Fig. 3. These values give the location of the HMD and indicate the horizontal, vertical, front-and-rear, and height positions, respectively. We used the FOVE SDK to obtain the gaze coordinates shown in the FOVE debug tool. The gaze coordinate data was received directly from the HMD via a C++ program. Figs. 4 and 5 below show the output of the C++ program and part of the csv files that record the gaze points of each frame. The Gaze Point corresponds to the "Screen Gaze" of the FOVE SDK. In Screen Gaze, the center of the screen where the eye is looking is (0, 0), and the maximum coordinate value along each axis is 1. Fig. 6 below presents an example of Screen Gaze.
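As an illustration of how Screen Gaze values could be related to a video frame, the following minimal Python sketch maps a normalized gaze point to pixel coordinates. It is not part of the FOVE SDK or of the C++ program used in the experiment. The function name `screen_gaze_to_pixel`, the assumed range of [-1, 1] on each axis, the upward-pointing y axis, and the frame size are all assumptions made for this example.

```python
def screen_gaze_to_pixel(gx, gy, width, height):
    """Map a normalized gaze point to pixel coordinates.

    Assumptions (not taken from the FOVE SDK): gx and gy lie in [-1, 1],
    (0, 0) is the screen center, +x points right and +y points up.
    """
    # Clamp so that slightly out-of-range samples still land on the frame.
    gx = max(-1.0, min(1.0, gx))
    gy = max(-1.0, min(1.0, gy))
    # Shift [-1, 1] to [0, 1], then scale to the last valid pixel index.
    px = (gx + 1.0) / 2.0 * (width - 1)
    # Image rows grow downward, so the y axis is flipped.
    py = (1.0 - gy) / 2.0 * (height - 1)
    return round(px), round(py)


if __name__ == "__main__":
    # Hypothetical 1281 x 721 frame chosen so the center is an exact pixel.
    print(screen_gaze_to_pixel(0.0, 0.0, 1281, 721))
```

If the SDK reports y increasing downward instead, the flip in `py` would simply be removed.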
The columns in Tab. 5 correspond to the x and y coordinates of the left and right eyes in HMD screen coordinates, respectively. When mapping the coordinates onto the video, a few things are worth keeping in mind. First, it is necessary to prepare for cases where the HMD cannot recognize the eyes, such as when blinking. In this paper, for instances of None Data, which is the input recorded when the eye blinks, the frame at that time is removed and mapping proceeds by five frames from the input point. Meanwhile, when None Data appears as output, the HMD screen also switches to a blank screen. Second, in a 360° image, the part that the user can actually see through the HMD is not the entire content but only the 180° section that forms the user's viewpoint. Therefore, only the screen shown in the HMD's mirror client was used as the image for mapping. Fig. 7 below shows the user's perspective when using the HMD.
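One possible reading of the None Data handling described above is sketched below in Python. The sketch assumes that each blink removes the blink frame together with the frames that follow it, up to a window of five frames in total; this interpretation, the function name `drop_blink_frames`, and the representation of a frame as either an (x, y) tuple or `None` are assumptions for illustration and may differ from the procedure actually used.

```python
def drop_blink_frames(samples, window=5):
    """Return (frame_index, gaze) pairs with blink-affected frames removed.

    samples: list where each entry is an (x, y) gaze tuple, or None when the
             HMD could not recognize the eye (for example during a blink).
    window:  number of frames dropped starting at each None entry
             (assumed to be 5, following one reading of the text).
    """
    kept = []
    skip_until = -1  # index of the last frame that must be skipped
    for idx, gaze in enumerate(samples):
        if gaze is None:
            # Extend the skip range so it covers `window` frames from here.
            skip_until = max(skip_until, idx + window - 1)
        if idx <= skip_until:
            continue
        kept.append((idx, gaze))
    return kept


if __name__ == "__main__":
    frames = [(0.1, 0.1), None, (0.2, 0.2), (0.3, 0.3),
              (0.4, 0.4), (0.5, 0.5), (0.6, 0.6)]
    print(drop_blink_frames(frames))
```

Setting `window=1` drops only the frames in which no gaze was recorded, which is the more conservative alternative.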

Experiment Result
Analyzing the collected data and looking at the positions of the participants' left and right eyes by content, we can derive graphs as shown in Fig. 9 below. Number 1, the representative question of the questionnaire, is related to emotional transfer, an important factor in the degree of immersion. On average, participants responded that Contents 1 and 3, which included sound, were emotionally immersive, whereas Contents 2 and 4, from which the sound had been removed, did not convey emotion well.
Number 4 is a question that checks how much the participant assimilated into the content based on the degree of immersion. On average, participants responded that they wanted to move for Contents 1 and 3, while Contents 2 and 4 drew neutral and negative responses, respectively. This question shows that the user's reaction may differ depending on the characteristics of the content.
In addition, in the experiment of this study, males made up a greater proportion of the participants than females. This factor helps explain why the response to Content 2 was neutral.
[Tab. 6 residue: Imm/S values 7.71, 4.14, 8.5, 7.14]
Number 13 shows how strongly the user was interested in the content. Due to the nature of video content, unlike games that are directly manipulated, participants experience events in order without knowing what they will experience later. For Contents 1 and 3, participants on average reported being interested in the content to the point of feeling shut off from the outside world.
Number 16 directly asks about the participant's immersion. On average, participants expressed that while experiencing the content they seemed to experience events directly within the content, rather than being in the real world. An important point, however, is that Contents 2 and 4 received neutral responses more often than disagreement. This suggests that experiencing VR content through an HMD can give users a sense of immersion.
Number 18 asks how deeply the participant experienced the content. On average, Contents 1 and 3 each enabled complete immersion in participants. However, the average response for Contents 2 and 4 was negative, suggesting that whether the content includes sound greatly affects the level of immersion.
We can confirm some facts from Tab. 6 and Fig. 9. Although the questionnaire is a subjective indicator, relatively few participants were classified as neutral (N) when responses were calculated as average values. This indicates that participants hold relatively similar opinions on the content. What can be inferred from the immersion score for each piece of content is that in the sports category, audio data strongly influences immersion. This is because sports content conveys immersion not only through visual elements but also through audio such as commentary, shouting, and cheering. In the case of travel, however, there are some differences, but the effect of audio is very small. In travel content, enhanced immersion can be expected through the guide's audio data, but the influence of audio data on immersion is understood to be relatively weak because traveling alone can be viewed as sufficiently common.
We can deduce one more fact here. An element that contributes to immersion in the subject of the content is more effective if the participant does not have to seek it out deliberately. For example, audio data in the sports category includes commentary, cheering, and shouting, as mentioned earlier. These elements can be heard sufficiently even if participants do not attend to them on purpose, so they can bring a sense of presence. In the case of travel, however, the only content that can be received as audio data is guide data or ambient white noise. Because audio data carries significantly less weight here than in sports content, the change is as small as shown in Fig. 9.

Discussion and Conclusion
Smart cities are still in the process of development, with IoT technology, future-generation networking services, and other technological innovations still in the pipeline. In this context, we identified a common feature of studies in this domain: an online system with various devices. We focused on the word "immersion" in the keywords and had some confidence in the growth of the online education market. Improving the quality of online educational services such as lectures, conferences, and presentations requires increasing participants' immersion in online environments. We consider a VR environment with an HMD device, which is commonly used when experiencing VR content, to be the best way to achieve this. To enhance VR immersion, we intensively analyzed several factors that have been extensively researched in the past. Based on eye-tracking coordinates and audio data, which can best be captured in VR environments, we investigated what characteristics can be identified when a user is immersed and when they are not.
The biggest problem factor in VR content is motion sickness. Research to relieve VR motion sickness has been progressing steadily [25][26][27][28]. VR motion sickness induces symptoms such as nausea, vomiting, and dizziness, and Reference [29] showed experimentally that the degree of motion sickness differs greatly depending on the amount of movement in the genre. In addition, although motion sickness depends on the user's environment and gender, prior experience of VR does not have a significant effect.
When experiencing content in either 2D or a VR environment, sound has a great influence on user immersion. First, Fig. 8 shows different gaze distributions between Contents 1 and 2, which belong to the same category. Content 1, the sports content played with sound, has a distribution concentrated in the center. Content 2, the same content with the sound removed, has a distribution that spreads in all directions.
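The contrast between a centered and a spread-out gaze distribution can be expressed numerically. The Python sketch below computes two simple summaries: the mean distance of gaze points from the screen center and the share of points that fall within a central radius. These measures, the function name `gaze_dispersion`, and the radius of 0.25 are assumptions chosen for illustration; they are not the analysis used to produce Fig. 8.

```python
import math


def gaze_dispersion(points, radius=0.25):
    """Summarize how concentrated gaze points are around the center (0, 0).

    points: iterable of (x, y) gaze coordinates in normalized screen units.
    radius: hypothetical radius of the "central" region.
    Returns (mean distance from center, share of points inside the radius).
    """
    pts = list(points)
    if not pts:
        raise ValueError("no gaze samples")
    dists = [math.hypot(x, y) for x, y in pts]
    mean_dist = sum(dists) / len(dists)
    central_share = sum(d <= radius for d in dists) / len(dists)
    return mean_dist, central_share


if __name__ == "__main__":
    # Made-up samples: a tight cluster and a widely scattered set.
    focused = [(0.0, 0.0), (0.05, -0.02), (-0.1, 0.1), (0.08, 0.0)]
    scattered = [(0.9, 0.1), (-0.7, 0.6), (0.2, -0.8), (0.0, 0.1)]
    print(gaze_dispersion(focused))
    print(gaze_dispersion(scattered))
```

Under this sketch, a smaller mean distance and a larger central share would correspond to the concentrated pattern observed for content with sound.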
Unlike Contents 1 and 2, which have the theme of sports, Contents 3 and 4 on the theme of travel show a different picture. Both appear relatively less influenced by sound than the sports-themed contents. With travel content, users can get a better view of the objects they want to see by moving the HMD device itself (that is, the position of the head). As can be seen in Tab. 6, the immersion ratings of Contents 3 and 4 do not differ significantly from each other on average, unlike those of the sports-themed Contents 1 and 2. This analysis covers both objective indicators (gaze data) and subjective indicators (questionnaire), following Reference [21].
However, a limitation of this study is that VR necessitates direct experimentation with an HMD, so it was not possible to recruit many participants under the social distancing and non-face-to-face measures caused by the COVID-19 pandemic. If the experiment were conducted with sample data from more participants, more accurate results could be obtained.
The lack of 360° video content is also a limitation. More detailed results could be expected if content on subjects other than the travel and sports used in this study were secured. In addition, motion sickness in VR can cause a big problem for immersion. In fact, one experimental participant whose data contained more than 70% missing values reported watching only the first half of each piece of content; once motion sickness arose, they simply waited for the content to end. Since it was difficult to judge this content experience as valid, the participant gave up and was replaced with another person. In VR, motion sickness usually occurs when the frame rate of the content actually viewed by the user differs from the frame rate of the VR environment viewed through the HMD.
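A check of this kind can be automated. The Python sketch below computes the share of frames without gaze data in a recording and flags recordings above a threshold. The 70% threshold mirrors the case described above, but the function names and the representation of a missing frame as `None` are assumptions for illustration, not the procedure used in the experiment.

```python
def missing_ratio(samples):
    """Return the fraction of frames that have no gaze data (None)."""
    samples = list(samples)
    if not samples:
        raise ValueError("no frames recorded")
    return sum(s is None for s in samples) / len(samples)


def exceeds_missing_threshold(samples, threshold=0.7):
    """Flag a recording whose missing-data share is above the threshold."""
    return missing_ratio(samples) > threshold


if __name__ == "__main__":
    # Made-up recording: 8 of 10 frames have no gaze data.
    recording = [None] * 8 + [(0.0, 0.0), (0.1, 0.1)]
    print(missing_ratio(recording), exceeds_missing_threshold(recording))
```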
Through this study, we were able to confirm the effect of sound on immersion in a VR environment and that understanding immersion through gaze data is related to the subject of the content. With the experiments in this paper, we were able to confirm the following:
1) Gaze data can be a good indicator of whether users are immersed, but more accurate indicators should be developed for gaze patterns by category. This experiment was conducted on contents with themes of sports and travel, but VR content spans various other themes as well.
2) As with 2D content, sound data has a great influence on content immersion. However, for the travel category in this study, it was confirmed that participants can still fully immerse themselves even without sound data. This finding allowed us to plan future experiments by category of VR content.
3) Because of the HMD device, which can be viewed as the key difference between 2D and 3D content, the gaze patterns used to gauge immersion were found to differ between the two.
For example, in 2D content, whether the user was immersed is checked by frequently mentioning the content's topic or continuously showing related objects. Therefore, to grasp the degree of immersion through the gaze pattern in 2D content, other factors (audio data contents, recognition of objects exposed in the content, etc.) must be identified. In 3D content, however, the gaze pattern tended to be concentrated in the center when the user was immersed. Conversely, when the user could not concentrate, a dispersed gaze pattern was observed.
The study was conducted with a relatively small number of participants due to the COVID-19 pandemic. In future research, we will examine the distribution of gazes in content beyond sports and travel, such as fashion, education, and advertisements, the similarities and differences between these types of content, and object recognition methods in the VR environment for measuring immersion. This requires collecting the additional content that was lacking in this study and repeating the experiment in detail across various categories. In that further research, we plan to conduct more accurate data analysis by covering more diverse topics and recruiting a more diverse sample of participants.
Furthermore, object recognition in VR content is becoming a very important index for immersion. Unlike 2D content, which shows only a fixed and limited screen, VR content gives users more choices, so simple coordinates alone have limited capacity to identify the objects being viewed. Currently, a widely used approach is to locate the HMD's Gaze Vector Point through Unity and check the ID of the Asset there. However, object recognition using Unity's assets is impossible for 360° video, which is a type of VR content. To solve this problem, future research should capture the scene facing the HMD in 360° video and study object recognition methods based on You Only Look Once (YOLO).
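Once an object detector has produced bounding boxes for the scene facing the HMD, relating the gaze point to an object reduces to a containment test. The Python sketch below illustrates that step only. It does not call YOLO or any detector; the detections are made-up tuples of the form (label, x_min, y_min, x_max, y_max) in pixel coordinates, and the function name `object_at_gaze` and the rule of preferring the smallest enclosing box are assumptions for this example.

```python
def object_at_gaze(gaze_px, detections):
    """Return the label of the detected object under the gaze point.

    gaze_px:    (x, y) gaze position in pixel coordinates.
    detections: iterable of (label, x_min, y_min, x_max, y_max) boxes,
                assumed to come from some object detector.
    When several boxes contain the gaze point, the smallest one is chosen,
    on the assumption that it is the most specific object. Returns None
    if no box contains the point.
    """
    gx, gy = gaze_px
    hits = [
        (label, (x1 - x0) * (y1 - y0))  # keep the box area for ranking
        for label, x0, y0, x1, y1 in detections
        if x0 <= gx <= x1 and y0 <= gy <= y1
    ]
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[1])[0]


if __name__ == "__main__":
    # Hypothetical detections for a single sports frame.
    boxes = [("crowd", 0, 0, 1000, 500), ("player", 400, 200, 500, 400)]
    print(object_at_gaze((450, 300), boxes))
```

Counting how often each label is returned over the frames of a video would give a per-object gaze share, which could complement the coordinate-based distribution.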

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.