Enhancing the Robustness of Visual Object Tracking via Style Transfer

Abstract: The performance and accuracy of computer vision systems are affected by noise in different forms. Although numerous solutions and algorithms have been presented for dealing with each type of noise, a comprehensive technique that can cover all the diverse noises and mitigate their damaging effects on the performance and precision of various systems is still missing. In this paper, we focus on the stability and robustness of one branch of computer vision, namely visual object tracking. We demonstrate that, without imposing a heavy computational load on a model or changing its algorithms, the drop in performance and accuracy that a system suffers when exposed to an unseen, noise-laden test dataset can be prevented by simply applying the style transfer technique to the training dataset and training the model with a combination of the stylized data and the original (unstylized) data. To verify the proposed approach, it is applied to the generic object tracking using regression networks (GOTURN) tracker. The method's validity is confirmed by testing it on a custom benchmark comprising 50 image sequences, with each sequence subjected to 15 types of noise at five different intensity levels. The one-pass evaluation (OPE) curves obtained show a 40% increase in the robustness of the proposed object tracker against noise, compared to the other trackers considered.


Introduction
Visual object tracking (VOT), which is a subset of computer vision systems, refers to the process of examining a region of an image in order to detect one/several targets and to estimate its/their positions in subsequent frames [1]. Computer vision includes other sub-branches such as object detection [2], classification [3], optical-flow computation [4], and segmentation [5]. Because of its greater challenges and more versatile applications, further attention has been paid to the subject of VOT, and it has become one of the main branches of computer vision, especially in the last two decades [6].
The VOT procedure is implemented in four steps: i) target initialization, ii) appearance modeling, iii) motion prediction, and iv) target positioning [15]. In the target initialization step, the object or objects to be tracked are usually specified by one or several bounding boxes in the first frame. The appearance model itself comprises two steps for detecting objects in image frames: visual representation (which constructs robust object descriptors from various visual features) and statistical modeling (which constructs mathematical models by means of statistical learning techniques) [16,17]. The target positions in the other frames are estimated in the motion prediction step. The final position of a target is determined in the last step by search methods such as greedy search [18] or by maximum posterior prediction techniques [19].
In any computer vision application, correct and precise object tracking can be achieved by feeding clean data and images to a system; image corruptions in any form can lead to a drop in system performance and robustness. For example, atmospheric haze can diminish the performance and accuracy of autonomous vehicles and surveillance systems. Mehra et al. [20] showed that the presence of haze or any type of suspended particles in the atmosphere has an adverse effect on an image, similar to snow noise, degrading its brightness, contrast, and texture features. These suspended particles may also alter the foreground and background of images and cause any type of computer vision task (e.g., VOT) to fail. In another study, the retrieval of lost information in LIDAR images acquired by autonomous vehicles in snowy and rainy conditions was investigated, and principal component analysis was used to improve the obtained images [21].
Other factors also influence VOT robustness, such as the quality of camera sensors, requirements for real-time processing, noise, loss of information during the projection from 3D to 2D space, and environmental changes. The environmental changes themselves can have several causes, e.g., occlusions, illumination problems, deformations, camera rotation, and other external disturbances [15]. In VOT, occlusions can occur in three forms: self-occlusion, inter-object occlusion, and occlusion by the background. For each of these, four intensity levels are considered: non-occlusion, partial occlusion, full occlusion, and long-term full occlusion [22].
Modeling an object's motion by means of linear and nonlinear dynamic models is one way of dealing with occlusion in object tracking. Such models can be used to predict the motion of an object from the moment of its occlusion to its reemergence. Other methods, such as silhouette projection, color histogram, and optical flow techniques, have also been employed for handling occlusions and boosting the robustness of object trackers [22]. Liu et al. [23] presented a robust technique for detecting traffic signs and claimed that all traffic signs with an occlusion of less than 50% could be identified by their method. In another study [24], the occlusion problem was addressed by using particle swarm optimization as a tracker and combining it with a Kalman filter.
In this paper, we propose a new method for increasing the robustness and preventing the performance drop of object trackers under different ambient conditions. The presented method can be applied to various types of trackers and detectors, and it does not impose a heavy computational load on a system. To substantiate our claim, we have implemented our approach on a visual tracker known as generic object tracking using regression networks (GOTURN). The main challenge we confronted was the lack of a specific benchmark for evaluating the proposed model and comparing it with other existing algorithms. To deal with this deficiency, we created a benchmark that includes most of the existing noises. The main contributions of this work can be summarized as follows:
• Building new training data from previous data through style transfer and combining them.
• Modeling and classifying 15 different types of noises in four groups with five different intensity levels and applying them to the benchmark.
• Applying the proposed method on one of the existing object trackers and comparing the obtained results with those of the other trackers.
It should be mentioned that the presented technique can be applied to multi-object trackers as well. In the rest of this paper, Section 2 reviews the research conducted to improve image quality and suppress the adverse effects of noise on visual tracker performance. The proposed methodology is fully explained in Section 3. In Section 4, the obtained results are given and compared with those of other techniques. Finally, the conclusions and future work are covered in Section 5.

Common Methods of Maintaining Robustness in Object Trackers
Image enhancement and image restoration are usually known as image denoising, deblocking, and deblurring [25]. Yu et al. [25] have defined the image enhancement and restoration process as follows: "A procedure that attempts to improve the image quality by removing the degradation while preserving the underlying image characteristics." The works conducted on the subject of robustness in object trackers can be generally divided into two categories: i) denoising techniques and ii) deep networks. Both have drawbacks: denoising techniques incur a high computational cost, while the low speed of deep networks in updating their weights has become a serious hurdle to the extensive use of these networks in visual tracking [26]. Each of these categories is explained in the following subsections.

Denoising Techniques
The first and simplest method for improving a system's accuracy and performance against noisy data is to use a denoising or image restoration technique. In this approach, before the data are fed to the system, different filters and algorithms are used to remove the noise from a corrupted image while keeping the edges and other details of the image intact as much as possible. Some of the better-known techniques of the last decade are the Markov random field [27], block-matching and 3D filtering (BM3D) [28], decision-based median filter (DBMF) [29], incremental multiple principal component analysis [30], histogram of oriented gradients [31], local binary pattern human detector [32], co-tracking with the help of a support vector machine (SVM) [33], and nonlocal self-similarity [34] methods [14]. For example, the BM3D filtering technique has been employed in [28] for image denoising by exploiting the redundant information in images.
The standard image processing filters have many problems. For example, the median filter acts on all image pixels and restores them without paying attention to the presence or absence of noise. To deal with this drawback, fuzzy smart filters have been developed. These filters are designed to act more intensely on the noisy regions of images and to leave the noise-free regions untouched. Fuzzy logic was used in [35] for the first time to improve the quality of color images, remove impulsive noise, and preserve image details and edges. Earlier, Yang et al. [36] had employed heuristic fuzzy rules to enhance the performance of multilevel median filters. Despite the mentioned advantages of the fuzzy smart filters, they have two fundamental flaws:
• New image corruptions: The mentioned techniques cause new corruptions in the processed images in proportion to the noise intensity levels. For example, when the median filter is applied, the edges in the improved images are displaced in proportion to the window size. As another example, when an image is denoised with a diffusion filter, the image details fade considerably, especially in images with high noise intensities.
• Application-based: The mentioned filters cannot be applied to every type of noise. For example, it was demonstrated in [37] that the Wiener filter performs better on speckle, Poisson, and Gaussian noises than the mean and the median filters.
The denoising techniques improved very little during the last decade. The denoising algorithms were believed to have reached their optimal performance, which cannot be further improved [38]. It was about this time that the emergence of machine learning techniques opened a new door to image quality improvement and denoising.

Learning-Based Methods
The first convolutional neural network (CNN), called LeNet, was presented by LeCun et al. [39] to deal with large data sets and complex inference-based operations. Later on, and since the development of AlexNet, CNNs have turned into one of the most common and successful deep learning networks for image processing. Jain et al. [40] have claimed that using CNNs to denoise natural images is more effective than using other image processing techniques such as the Markov random field. For face recognition in noisy images, Meng et al. [41] have proposed a deep CNN consisting of denoising and recognition sub-networks. Contrary to the classic methods, in which these two sub-networks are trained independently, they are trained as a sequence in the above-mentioned work.
Using a CNN and training it without a pre-learned pattern requires a large training dataset. Moreover, even if such data are available, it would take a long time (tens of minutes) for training a network and reaching the desired accuracy [26]. Considering this matter, a CNN-based object tracker consisting of four layers (two convolutional layers and two fully-connected layers) was presented in [26]. This tracker has been proposed by adding a robust sampling mechanism in mini-batches and modifying the stochastic gradient descent to update the parameters, significantly boosting the execution speed and robustness during training.
Based on machine learning knowledge, a prior image is required by the learning-based methods. Despite the simplicity of these techniques, they have two drawbacks:
• Extra computational cost, which is imposed on a system due to the optimization of these techniques.
• Manual adjustment, which has to be performed because of the non-convexity of these techniques and the need to enhance their performance.
To deal with these two issues, discriminative learning methods were proposed. Using a discriminative learning approach, Bhat et al. [42] presented an offline object tracking architecture based on a target model prediction network, which predicts a model in just a few optimization steps. Despite all the approaches presented so far, the problem of dependency on prior data still remains. Some researchers have tackled this problem with the help of correlation filters. Using correlation filters in object tracking techniques to improve performance and accuracy is common; this has led to two classes of object trackers: correlation filter-based trackers (CFTs) and non-correlation filter-based trackers (NCFTs).
A novel method of vehicle detection and tracking based on the YOLOv3 architecture has been presented in [43]. In that work, the researchers used a vision and image quality improvement technique that includes three steps: illumination enhancement, reflection component enhancement, and linear weighted fusion. In another study, based on a sparse collaborative model, Zhong et al. [44] presented a robust object tracking algorithm that simultaneously exploits holistic templates and local representations to handle severe appearance changes.
Another method for preventing accuracy loss when using corrupted data is to import these data directly into a training set. Zhao et al. [45] focused on blurred images as a particular case of data corruption. They showed that a low accuracy is obtained by evaluating the final model on blurred images, even when deeper or wider networks are used. To rectify this problem, they fine-tuned the model on a combination of clear and blurred images in order to improve its performance. A review of the different techniques used for enhancing the robustness of object trackers has been presented in Fig. 1.

Proposed Methodology
In this section, we describe the proposed procedure in full detail. First, we introduce the object tracker that will be used in this work. After selecting a tracker type, the process is divided into several subsections, which are then applied in sequence to the model considered.
Our methodology comprises three basic steps. In the first step, we train our network model with a set of initial data and then evaluate it on an OTB benchmark and compare it with other trackers. In the second step, we apply the modeled noises to the benchmark and again evaluate the model on the noisy benchmark. In the third step, we obtain the style transfer of every single training dataset, train the model with a combination of clean and stylized data, apply the trained model on the benchmark of the preceding step, and report the results.

Selecting an Object Tracker
In the early part of 2016, Held et al. [46] demonstrated that generic object trackers could be trained entirely offline by observing objects' motion in videos and then run in real time. In this regard, they presented their proposed model, known as generic object tracking using regression networks (GOTURN). They also claimed this tracker to be the first neural network tracker able to track generic objects at a speed of about 100 frames per second (100 fps). Thus, we decided to implement our method on this tracker and compare the results before and after applying the changes. It should be mentioned that the method presented in this paper can be applied to all object trackers and detectors that might be affected by various noises. Fig. 2 shows the performances of two of the most common object trackers (GOTURN and SiamMask [47]) in the presence of snow noise. Here, we applied the said noise at five different intensity levels to a dataset consisting of 70 image frames and evaluated the two trackers' performances on the noisy images. The figure includes only 18 sample frames (frame numbers 0, 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49, 53, 60, 65, and 69, starting from top left). As is observed in the figure, the GOTURN tracker fails in frame 30, at the noise intensity level of 3, and the SiamMask tracker fails in frame 52, at the noise intensity level of 4. Although the SiamMask tracker shows more robustness than the GOTURN tracker, the tracking operation in both trackers is hampered at different noise intensity levels.

Training/Testing with Clean Data
In this paper, we trained our network with a combination of common images and films. Also, to minimize the error between the predicted bounding box and the ground-truth bounding box, we used the L1 loss function.
The film set: This set contains 314 video sequences, each of which has been extracted from the ALOV300++ dataset [48]. On average, every 5th frame of each video sequence was labeled according to the position of the object to be tracked, and an annotation file was produced for these frames. The film set was then split into two portions: 20% as the test data and 80% as the training data.
The image set: The first set of images has been taken from the ImageNet detection challenge set, which contains 478,807 objects with labeled bounding boxes. The second set of images has been adopted from the common objects in context (COCO) set [49]. This dataset includes 330,000 images in 81 different object categories. More than 200,000 of these images have been annotated, and they cover almost 2 million instances.
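Two details of this subsection, the L1 loss between the predicted and ground-truth bounding boxes and the 80%/20% split of the film set, can be illustrated with a minimal sketch. The function names, the (x1, y1, x2, y2) box layout, the averaging over the four coordinates, and the random shuffling with a fixed seed are assumptions made for illustration; the text does not specify them.

```python
import random
from typing import List, Sequence, Tuple


def l1_bbox_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between two boxes given as (x1, y1, x2, y2).

    Averaging over the four coordinates is an assumption; a plain sum would
    differ only by a constant factor of 4.
    """
    if len(pred) != 4 or len(target) != 4:
        raise ValueError("each box must have exactly four coordinates")
    return sum(abs(p - t) for p, t in zip(pred, target)) / 4.0


def split_sequences(
    names: Sequence[str], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle sequence names reproducibly and split them into train/test lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Hypothetical boxes: the prediction is off by 1 px in x1 and 2 px in y2.
    print(l1_bbox_loss((0, 0, 10, 10), (1, 0, 10, 12)))
    # Hypothetical names standing in for the 314 video sequences of the film set.
    train, held_out = split_sequences([f"video_{i:03d}" for i in range(314)])
    print(len(train), len(held_out))
```

Splitting by whole sequence, rather than by frame, keeps frames of the same video from appearing in both portions.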

Model Evaluation with Corrupted Data
Most of the benchmarks presented in the literature include either clean data or only specific noises such as the Gaussian noise, while in the real world, our vision is affected by noises of different types and intensities. We needed a noisy benchmark for this work, so we decided to build our own custom benchmark. Note that the mentioned benchmark will only be employed to evaluate the system robustness against different types of noises, and it will never be used to train the proposed object tracker.
In 2019, Hendrycks et al. [50] introduced a set of 15 image corruptions with five different intensities (a total of 75 corruptions). They used it to evaluate the robustness of the ImageNet model in dealing with object detection. The names and details of these corruptions have been displayed in Fig. 3. Based on our viewpoint, we have divided these 15 visual corruptions into the following four categories and interpreted each one by a model of real-world events:
• Brightness: We consider the amount of image brightness equivalent to noise and model it with three types of common noises, i.e., the Gaussian noise, the Poisson noise (which is also known as the shot noise), and the impulse noise. For example, the authors in [50] have claimed that the Gaussian noise appears in images under low-lighting conditions.
• Blur: Image blurriness is often a camera-related phenomenon, and it can occur via different mechanisms such as the sudden jerking of the camera, improper focusing, insufficient depth-of-field, camera shaking, shutter speed, etc. We modeled these factors' effects on images with four types of blurriness: defocus blur, frosted glass blur, zoom blur, and motion blur.
• Weather: One of the most important parameters affecting computer vision systems' quality and reducing their accuracy is the weather condition. We considered a corresponding image corruption for each of the four types of common weather conditions (rainy, snowy, foggy/hazy, and sunny). The snow noise simulates the snowy weather, the frost noise reflects the rainy conditions, the fog noise indicates all the different situations in which a target object is shrouded, and finally, the brightness noise models the sunny conditions and the direct emission of light on camera sensors and lenses.
• Digital accuracy: Any type of change in the quality of an image during its saving, compression, sampling, etc., can be considered noise. In this section, such noises will be modeled by the changes of contrast, elastic transforms [51], saving in the JPEG format, and pixelation.
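The four-way grouping above, and the idea of applying one corruption at five severity levels, can be sketched as follows. This is a minimal illustration, not the implementation used for the benchmark: the identifier names are hypothetical, the image is a nested list of grayscale values in [0, 1], and the standard deviations per severity level are assumed values chosen only to show increasing intensity.

```python
import random
from typing import Dict, List

# Grouping of the 15 corruptions into the four categories described above.
# The identifier spellings are hypothetical labels.
CORRUPTION_GROUPS: Dict[str, List[str]] = {
    "brightness": ["gaussian_noise", "shot_noise", "impulse_noise"],
    "blur": ["defocus_blur", "glass_blur", "zoom_blur", "motion_blur"],
    "weather": ["snow", "frost", "fog", "brightness"],
    "digital": ["contrast", "elastic_transform", "jpeg_compression", "pixelate"],
}

# Assumed noise standard deviations for severity levels 1..5 (illustrative only).
GAUSSIAN_SIGMA = (0.04, 0.08, 0.12, 0.18, 0.26)


def gaussian_noise(
    image: List[List[float]], severity: int, rng: random.Random
) -> List[List[float]]:
    """Add zero-mean Gaussian noise to an image with values in [0, 1]."""
    if severity not in (1, 2, 3, 4, 5):
        raise ValueError("severity must be an integer from 1 to 5")
    sigma = GAUSSIAN_SIGMA[severity - 1]
    # Clip to [0, 1] so the result is still a valid image.
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


if __name__ == "__main__":
    total = sum(len(names) for names in CORRUPTION_GROUPS.values())
    print(total, "corruptions,", total * 5, "corruption/severity pairs")
    gray = [[0.5] * 4 for _ in range(4)]  # hypothetical 4x4 mid-gray frame
    noisy = gaussian_noise(gray, severity=3, rng=random.Random(0))
    print(len(noisy), len(noisy[0]))
```

Passing the random generator explicitly makes each corrupted sequence reproducible, which matters when several trackers are compared on the same noisy frames.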
This paper's basic benchmark (the OTB50) includes 50 different sequences such as basketball, box, vehicle, dog, doll, etc. [52]. We apply all the above noises at five different intensity levels (from 1 for the lowest to 5 for the highest intensity) to each of these sequences and build our own custom benchmark. In selecting a benchmark, we must ensure that the data and images in the different sequences of the benchmark do not overlap with the training data; otherwise, the obtained results will be inaccurate and biased and cannot be generalized to other models. For example, the VOT2015 benchmark cannot be used in this paper because of its overlap with the training data.
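The size of the resulting benchmark follows directly from this construction: every sequence is paired with every noise type at every intensity level. The sketch below enumerates these combinations; the sequence names, corruption labels, and the "sequence/corruption/severity" naming scheme are hypothetical placeholders.

```python
from itertools import product
from typing import List, Sequence


def corrupted_benchmark_entries(
    sequences: Sequence[str],
    corruptions: Sequence[str],
    severities: Sequence[int] = (1, 2, 3, 4, 5),
) -> List[str]:
    """List one entry per (sequence, corruption, severity) combination."""
    return [
        f"{seq}/{corr}/{sev}"
        for seq, corr, sev in product(sequences, corruptions, severities)
    ]


if __name__ == "__main__":
    # Placeholder names standing in for the 50 sequences and 15 noise types.
    sequences = [f"sequence_{i:02d}" for i in range(50)]
    corruptions = [f"corruption_{i:02d}" for i in range(15)]
    entries = corrupted_benchmark_entries(sequences, corruptions)
    print(len(entries))  # 50 * 15 * 5 combinations
```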

Model Training/Testing with Combined Data
One of the applications of deep learning in the arts is the style transfer technique, closely resembling the Deep Dream [53]. This technique was first presented by Gatys et al. [54] in 2016. In this transfer process, two images are used as inputs: the content image and the style reference image. Then, with the help of a neural network, these two images are combined to yield the output image. This network aims to construct a completely new image whose content is provided by the content image and whose style is adopted from the style reference image. This new image preserves the content of the original image in the style of another image.
We employ this technique here and generate a stylized version of each of our training datasets (with hyperparameter α = 1) by means of the adaptive instance normalization (AdaIN) method [55]. As before, an annotation file is created for the new dataset. Finally, we train our object tracker model with a combination of the initial (standard) dataset and the stylized dataset. An example of this transfer and the proposed methodology has been illustrated in Fig. 4. (The style transfer method used for training the proposed model has been taken from https://github.com/bethgelab/stylize-datasets).
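The core AdaIN operation aligns the per-channel mean and standard deviation of the content features with those of the style features, and α interpolates between the original content features (α = 0) and the fully stylized features (α = 1). The sketch below shows only this statistic-matching step on plain Python lists; in the actual method the inputs are feature maps from a pretrained encoder and the output is passed through a trained decoder, both of which are omitted here. The function names and the flat list layout are assumptions for illustration.

```python
import math
from typing import List, Tuple


def channel_stats(values: List[float], eps: float = 1e-5) -> Tuple[float, float]:
    """Return the mean and (epsilon-stabilized) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)


def adain(
    content: List[List[float]], style: List[List[float]], alpha: float = 1.0
) -> List[List[float]]:
    """Adaptive instance normalization on features given as [channel][position].

    Each content channel is normalized with its own statistics, rescaled with
    the matching style channel's statistics, then blended with the original
    content features according to alpha.
    """
    if len(content) != len(style):
        raise ValueError("content and style must have the same number of channels")
    output = []
    for c_ch, s_ch in zip(content, style):
        c_mean, c_std = channel_stats(c_ch)
        s_mean, s_std = channel_stats(s_ch)
        stylized = [s_std * (v - c_mean) / c_std + s_mean for v in c_ch]
        output.append([alpha * t + (1.0 - alpha) * v for t, v in zip(stylized, c_ch)])
    return output


if __name__ == "__main__":
    content_features = [[1.0, 2.0, 3.0, 4.0]]      # one hypothetical channel
    style_features = [[10.0, 20.0, 30.0, 40.0]]
    print(adain(content_features, style_features, alpha=1.0))
```

With α = 1, as used here, the output channel carries the style statistics while the relative ordering of the content activations is preserved.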

Experimental Results
In order to evaluate the performance of the proposed method and the results achieved by applying it to our custom benchmark, we need to define a specific measure. The most common method used for evaluating object tracking algorithms is the one-pass evaluation (OPE) approach. In this approach, for each algorithm, the target object is initialized with its ground truth in the first image frame. Then, the average accuracy or the success rate is reported for the rest of the frames [52]. Considering the many types of noises (15 noise models) and the intensity levels for each noise (5 intensities), a total of 75 graphs would be obtained. Plotting all these graphs and comparing them with one another would be neither practical nor helpful and would confuse the reader. Thus, we decided to adopt a criterion appropriate to our approach. In this criterion, the abscissa of each diagram is partitioned into n intervals of width $\Delta x$, so that $\Delta x = (b - a)/n$, where a and b represent the lower and the upper bounds of the abscissa and have values of 0 and 1, respectively.
The closer the partitions are, the higher the obtained accuracy. Therefore, we let n tend to infinity in order to reduce the distance between the partitions. Next, for each type of tracker, the success rates in the OPE diagrams are averaged over each of the four noise models (brightness, blur, weather, and digital). Here, x is the overlap threshold and f is the success rate. Also, N indicates the number of subsets in each of the four noise models, and its values are 3, 4, 4, and 4 for the brightness, blur, weather, and digital noise types, respectively.
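As a sketch only, one form of this averaged criterion that is consistent with the definitions above can be written as follows. The symbols $x_i$, $f_j$, and $\Delta x$, and the exact arrangement of the sums, are assumed notation for illustration and are not taken from the original equations.

```latex
% Assumed notation: f_j is the success-rate curve of the j-th noise type
% within one noise model, sampled at the partition points x_i.
\Delta x = \frac{b - a}{n}, \qquad x_i = a + i\,\Delta x, \quad i = 1, \dots, n

\mathrm{OPE} \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{n} f_j(x_i)\,\Delta x

% Limit of the sum as the partition is refined:
\lim_{n \to \infty} \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{n} f_j(x_i)\,\Delta x
  = \frac{1}{N} \sum_{j=1}^{N} \int_{a}^{b} f_j(x)\,dx

% With a = 0 and b = 1, \Delta x = 1/n, so the sum is an arithmetic mean:
\frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{n} f_j(x_i)\,\Delta x
  = \frac{1}{N n} \sum_{j=1}^{N} \sum_{i=1}^{n} f_j(x_i)
```

Under this reading, the criterion is the area under the success curve, averaged over the N noise types of one noise model.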
Similar to Riemann sum theory [56], the resulting sum is either a lower sum (called underestimation in the literature) or an upper sum (called overestimation in the literature), depending on the function values chosen in the partitioned intervals. This notion can also be described by the upper and lower Darboux sums. Therefore,
Lemma (1): As the number of partitioned intervals grows, the underestimated and overestimated values become equal to each other, which shows that the above function is integrable on the interval [a, b]. Thus, for $n \to \infty$:
$$\mathrm{OPE}_{\inf} = \mathrm{OPE}_{\sup} = \mathrm{OPE}_{\mathrm{new}} \tag{6}$$
Lemma (2): Using Riemann sums, the value of a definite integral of a continuous and non-negative function can be easily approximated.
Hypothesis (1): A bounded function is Riemann integrable over a compact interval if, and only if, it is continuous almost everywhere, i.e., the set of its points of discontinuity has Lebesgue measure zero. This characteristic is sometimes called "Lebesgue's integrability condition" or "Lebesgue's criterion for Riemann integrability."
By considering Lemma (2) and Hypothesis (1) and assuming equal lengths for the partitioned intervals, the criterion can be rewritten as an equally weighted sum of the success rates sampled at the partition points.
The simulation results obtained based on the defined criterion have been displayed in Fig. 5. As is observed in this figure, without altering the structure of a model, the proposed approach has been able to significantly enhance the robustness of the model against different types of noises.
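The convergence of the lower and upper sums can be checked numerically with a small sketch. It assumes that the success rate at threshold x is the fraction of frames whose overlap with the ground truth exceeds x, and that each noise subset is represented by a list of per-frame overlap values; the function names and the sample overlaps are hypothetical.

```python
from typing import List, Sequence, Tuple


def success_rate(ious: Sequence[float], threshold: float) -> float:
    """Fraction of frames whose overlap exceeds the given threshold."""
    return sum(1 for iou in ious if iou > threshold) / len(ious)


def riemann_bounds(ious: Sequence[float], n: int) -> Tuple[float, float]:
    """Lower and upper Riemann sums of the success curve on [0, 1].

    The success rate never increases with the threshold, so right endpoints
    give the lower sum and left endpoints give the upper sum.
    """
    dx = 1.0 / n
    samples = [success_rate(ious, i / n) for i in range(n + 1)]
    upper = sum(samples[:-1]) * dx
    lower = sum(samples[1:]) * dx
    return lower, upper


def ope_new(subsets: Sequence[Sequence[float]], n: int = 1000) -> float:
    """Average area under the success curve over the N subsets of a noise model."""
    areas: List[float] = []
    for ious in subsets:
        lower, upper = riemann_bounds(ious, n)
        areas.append(0.5 * (lower + upper))
    return sum(areas) / len(areas)


if __name__ == "__main__":
    overlaps = [0.2, 0.5, 0.8, 0.9]  # hypothetical per-frame overlaps
    for n in (10, 100, 1000):
        lower, upper = riemann_bounds(overlaps, n)
        print(n, round(lower, 4), round(upper, 4))
```

The gap between the two sums shrinks in proportion to 1/n, which is the behavior Lemma (1) relies on.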
Finally, using the results of Fig. 5, we calculated the average area under the curve (AUC) of each tracker, as well as the AUC drop of each tracker after applying noise at five different levels. The results are reported in Tab. 1.
In these calculations, M is the number of noise categories modeled, $L_0$ is the value of the AUC without noise, and s denotes the noise level, with $s \in \{1, 2, 3, 4, 5\}$.
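One plausible way to compute these two quantities is sketched below. It assumes that the average AUC at a noise level is the mean over the M noise categories and that the AUC drop is the relative decrease with respect to the clean value $L_0$, expressed as a percentage; both the formula and the function names are assumptions for illustration, not the original equations.

```python
from typing import Sequence


def mean_auc(auc_per_category: Sequence[float]) -> float:
    """Average AUC over the M noise categories at one noise level."""
    return sum(auc_per_category) / len(auc_per_category)


def auc_drop_percent(clean_auc: float, auc_per_category: Sequence[float]) -> float:
    """Relative AUC drop (in percent) with respect to the clean AUC L_0.

    Assumed definition: 100 * (L_0 - mean AUC at this level) / L_0.
    """
    if clean_auc <= 0:
        raise ValueError("clean_auc must be positive")
    return 100.0 * (clean_auc - mean_auc(auc_per_category)) / clean_auc


if __name__ == "__main__":
    # Hypothetical AUC values for the four noise categories at one noise level.
    print(auc_drop_percent(0.5, [0.40, 0.45, 0.35, 0.40]))
```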
Although our work has reached its aims, it has potential limitations. First, because clean data are combined with their stylized counterparts, the size of the final data set is more than doubled, which increases the network training time. Second, the choice of content layer, style layer, and optimization technique (e.g., the Chimp optimization algorithm [57]) might, to some extent, affect the obtained results and the performance of the tracker in the presence of noise.
According to the results, at the noise level of 1, all trackers showed relatively good robustness, and their AUC drop was less than 18%. At the noise level of 2, a small number of trackers experienced an AUC drop of more than 24%, and the rest of the trackers had a maximum AUC drop of 20%. From the noise level of 3 onwards, there is a significant drop in the trackers' robustness: the upper limit of the AUC drop among the trackers at the noise levels of 3, 4, and 5 was about 25%, 30%, and 40%, respectively. However, the GOTURN tracker trained according to the approach proposed in this paper showed excellent robustness at all five noise levels, and its maximum AUC drop did not exceed 5% at any level.

Conclusion and Future Work
Visual noises in images are unwanted and undesirable aberrations, which we always try to remove or reduce. In digital images, noises appear as random spots on a bright surface, and they can substantially reduce the quality of these images. Image noises can occur in different ways and by various mechanisms such as overexposure, sudden jerking or shaking of the camera, changes of brightness, magnetic fields, improper focusing, and environmental conditions like fog, rain, snow, dust, etc. Noises have negative effects on the performance and precision of computer vision systems such as object trackers. Dealing with each of these challenges separately is comparatively easy, but it is much more difficult to manage them collectively, which is more important in practice. In this paper, a novel method was presented for preserving the performance and accuracy of object trackers against noisy data. In this technique, the only change is that the tracker model is trained with a combination of standard training data and their stylized counterparts. To validate the presented approach, an object tracker was chosen from the commonly used trackers available, and the proposed technique was applied to it. This tracker was tested on a customized benchmark containing 15 types of noises at five different noise intensity levels. The obtained results show an increase in the proposed model's accuracy and robustness against different noises compared with the other object trackers considered. In future work, we intend to apply the Deep Dream technique to our custom training set and train the object tracker with the combination of this dataset and its style transfer. We also intend to test it on both single-object and multi-object trackers. It is worth mentioning that this method can be used as a kind of preprocessing block for maintaining robustness in any object detection or computer vision task.
Funding Statement: The authors received no specific funding for this study.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.