Computers, Materials & Continua
Spatial-Resolution Independent Object Detection Framework for Aerial Imagery
1Department of CSA, Utkal University, Bhubaneswar, 751004, India
2Department of Information Technology, VNRVJIET, Hyderabad, 500090, India
3Department of CSE, Sona College of Technology, Salem, 636005, India
4Research Institute for Innovation & Technology in Education (UNIR iTED), Universidad Internacional de La Rioja (UNIR), Logroño, 26006, Spain
*Corresponding Author: Daniel Burgos. Email: email@example.com
Received: 18 September 2020; Accepted: 14 February 2021
Abstract: Earth surveillance through aerial images allows more accurate identification and characterization of objects present on the surface from space and airborne platforms. The progression of deep learning and computer vision methods and the availability of heterogeneous multispectral remote sensing data make the field more fertile for research. With the evolution of optical sensors, aerial images are becoming more precise and larger, which leads to a new kind of problem for object detection algorithms. This paper proposes the “Sliding Region-based Convolutional Neural Network (SRCNN),” which is an extension of the Faster Region-based Convolutional Neural Network (RCNN) object detection framework to make it independent of the image’s spatial resolution and size. The sliding box strategy is used in the proposed model to segment the image while detecting. The proposed framework outperforms the state-of-the-art Faster RCNN model while processing images with significantly different spatial resolution values. The SRCNN is also capable of detecting objects in images of any size.
Keywords: Computer vision; deep learning; multispectral images; remote sensing; object detection; convolutional neural network; faster RCNN; sliding box strategy
1 Introduction
Surveying a large geographical area through aerial imagery is undoubtedly faster than conventional methods that use a horizontal perspective. Although aerial imagery cannot be used in some surveillance cases, such as person or facial detection and pedestrian or vehicle license plate detection, it can be used to determine the number and types of vehicles in a city or any other geographical area. Performing this task from a horizontal perspective is very expensive in terms of planning, procurement and execution, whereas it is computationally quite simple to analyse from an aerial perspective. The field of computer vision has resolved numerous surveillance problems, irrespective of their type and complexity. Surveying the Earth from an aerial view by using deep learning has not only reduced time and cost but has also become more accurate and robust with the availability of training data and computation power. There are many application areas, such as the study of vegetation distribution in an area and changes in the shape and size of agricultural land, towns or slums, where machines outperform humans in both time and efficiency.
1.1 Object Detection
Object detection is a computer vision technique widely used in surveillance. It is generally used to determine the number, type and position of a particular object in an image. There are many state-of-the-art object detection frameworks, such as the Region-based Fully Convolutional Network (RFCN), the Single-Shot Detector (SSD), You Only Look Once (YOLO) and its successors YOLO version 2 and YOLO version 3, and RCNN and its variants Fast RCNN, Faster RCNN and Mask RCNN. Each of these frameworks uses different methods and principles to detect objects, but all are based on deep neural networks. This study uses Faster RCNN rather than SSD or YOLO because of its accuracy, although it is slower and more resource-heavy than both. When detecting objects in an extremely large aerial image, time and computational resources can be traded for accuracy.
A change in the size of the objects in an image makes detection more difficult. When a trained model processes an input image with a higher or lower spatial resolution than the training dataset, the Region Proposal Network (RPN) of Faster RCNN fails to provide a Region of Interest (RoI), because the RPN uses anchor boxes of similar sizes to those evaluated during training. For example, an object detection model trained on a dataset with a spatial resolution of 7.5 cm cannot perform well on an image with a spatial resolution of 30 cm. The same holds for image size: a model trained on images of 250 px × 250 px cannot perform comparably on larger images of 1000 px × 1000 px or smaller images of 100 px × 100 px.
1.2 Problem Statement
Innovations in optical sensors, storage devices and sensor carriers like satellites, airplanes and drones have revolutionized the remote sensing and Geographic Information System (GIS) industries. These sensors are producing a huge number of multispectral images with different characteristics, such as spatial resolution. The spatial resolution of an aerial image can be defined as the actual size of an individual pixel on the surface, as demonstrated in Fig. 1. Images with lower spatial resolution values seem to be clearer and larger than those with relatively higher spatial resolution values.
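To make the definition concrete, the following minimal sketch converts a ground length into a pixel count for a given spatial resolution. The function name and the 4.5 m vehicle length are illustrative assumptions, not values taken from the paper.

```python
def object_size_in_pixels(object_length_m: float, spatial_resolution_cm: float) -> float:
    """Number of pixels spanned by an object of the given ground length.

    spatial_resolution_cm is the ground size of one pixel, in centimetres.
    """
    if spatial_resolution_cm <= 0:
        raise ValueError("spatial resolution must be positive")
    return object_length_m * 100.0 / spatial_resolution_cm


if __name__ == "__main__":
    # Assumed example: a vehicle about 4.5 m long (hypothetical value).
    for resolution_cm in (7.5, 12.5, 30.0):
        pixels = object_size_in_pixels(4.5, resolution_cm)
        print(f"{resolution_cm} cm/pixel -> {pixels:.1f} px")
```

The same object covers four times as many pixels along each axis at 7.5 cm as at 30 cm, which is why a detector tuned to one resolution sees objects at an unfamiliar scale at another.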
An object detection model trained on an aerial image dataset performs well on test images with the same spatial resolution, but its accuracy drops drastically when it is tested on images with a different spatial resolution. Almost all existing state-of-the-art frameworks fail to detect objects in this scenario. Although image cropping can be used when the spatial resolution of the training images is lower than that of the testing image, it cannot handle the reverse case, in which the spatial resolution of the training images is higher than that of the testing image.
1.3 Research Contributions
This paper proposes an extension to the state-of-the-art Faster RCNN. It is based on the sliding window strategy and uses a mathematically derived optimal window size for precise detection. The primary use cases of the proposed model are as follows:
i) To detect objects in images of any spatial resolution value and size, such as vehicles in a city and tree crowns in a forest.
ii) For object detection in images captured from drones or aircraft, where the elevation is not fixed; for a given sensor, the spatial resolution value is directly proportional to the elevation.
iii) For the detection of small and very small objects, such as headcounts in protests or social gatherings [15,16].
iv) For microscopic object detection, such as cells, molecules, pathogens, red blood cells and blob objects.
The rest of the paper is organized as follows. Section 2 provides an overview of some critical works on object detection in remote sensing and aerial imagery and methods to deal with size and resolution. Sections 3 and 4 provide the proposed model and its results, respectively. Finally, Section 5 contains the conclusion.
2 Related Works
Several reviews have examined the application of deep-learning-based computer vision techniques to aerial imagery. One survey covered about 270 publications related to object detection, including detection by (i) template matching, (ii) knowledge matching, (iii) image analysis and (iv) machine learning. Its authors also raised a concern about the availability of labelled data for supervised learning. Han et al. proposed a framework in which a weakly labelled dataset can be used to extract high-level features. The problem of object orientation in remote sensing imagery is addressed in [24–26].
Diao et al. proposed a deep belief network for object detection, whereas other work used a convolutional neural network. A basic RCNN model has also been used, as has a single-stage densely connected feature pyramid network designed specifically for very-high-resolution remote sensing imagery. The studies in [31,32] used the SSP and the state-of-the-art YOLO 9000, respectively, and Huang et al. used a densely connected YOLO based on the SSP. The proposed model aims to provide a framework that can process any aerial image with any spatial resolution value. Although very few studies have addressed this problem, the semantics of [34–36], together with the method of one further study, are close to the working principle of the proposed model.
3 Proposed Method
This study proposes an extension that is based on the sliding window strategy; therefore, it is called the Sliding Region-based Convolutional Neural Network. In the proposed model, the slider box shown in Fig. 2i(a) will roam all over the input image just like a convolution operation with a determined stride value. The stride value is derived from the spatial resolution of the input image. At each instance of the box position, the model will perform the object detection process according to the stock Faster RCNN on the fragment of the image that falls under the footprint of the slider box as demonstrated in Fig. 2i(b). Fig. 2 shows the architecture of the proposed SRCNN. The proposed SRCNN is divided into three phases.
• Phase 1: Image Analysis
• Phase 2: Image Pre-Processing
• Phase 3: Object Detection
3.1 Phase 1: Image Analysis
Phase 1 of the proposed model includes data acquisition, data analysis and a box dimension proposal. This phase plays a vital role in normalizing the spatial resolution factor. As illustrated in Fig. 1 in Section 1.2, the apparent size of an object changes according to the spatial resolution value. The image therefore has to be scaled in such a way that objects in the training and testing images appear similar in size. In Fig. 3, the visual object size is very similar in (a) and (b), as the image in (b) is down-scaled almost three times. For the proposed model, the original dimension of the scaled image can be the size of the slider box. The box length m can be derived from the average length s of the input image with dimensions a × b and the spatial resolutions of the training image r and the testing image R, as follows:
$$ m = s \times \frac{r}{R} \qquad (1) $$
Thus, the slider box width is the product of the training image width and the ratio between the spatial resolution of the training image and the input image. This value is also helpful when cropping a large image to process individually.
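The relationship described above can be sketched in a few lines. This is a minimal illustration that reads the description literally as m = s × (r / R); the function name is hypothetical, and the loop uses the 256 px image size and 12.5 cm training resolution reported in Section 4. Any rounding to whole pixels is left to the caller, since the rounding convention is not stated.

```python
def slider_box_length(s: float, r: float, R: float) -> float:
    """Slider box length m = s * (r / R), in pixels.

    s: image length in pixels (the average of the two image dimensions)
    r: spatial resolution of the training images
    R: spatial resolution of the testing image (same unit as r)
    """
    if s <= 0 or r <= 0 or R <= 0:
        raise ValueError("s, r and R must be positive")
    return s * r / R


if __name__ == "__main__":
    # 256 px images, model trained at 12.5 cm (values from Section 4).
    for test_resolution_cm in (7.5, 12.5, 15.5, 30.5):
        m = slider_box_length(256, 12.5, test_resolution_cm)
        print(f"R = {test_resolution_cm} cm -> m = {m:.1f} px")
```

When the testing resolution equals the training resolution, the box is exactly as long as the training image, so the sliding procedure reduces to a single pass of the stock detector.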
3.2 Phase 2: Image Pre-Processing
Phase 2 of the proposed model is image pre-processing, which includes image size analysis and padding. The size of the slider box, evaluated in Section 3.1, depends on the spatial resolutions of the training and testing images and the dimensions of the training image. However, the slider box has to traverse every pixel of the testing image, so it must be compatible with the image size. In Fig. 4ii, the original image is too short to accommodate the last set of slider boxes. The image area covered by these last boxes would be excluded from the object detection process, so it cannot be ignored. This problem can be solved by either resizing or padding the image in such a way that the end of the last slider box coincides with the end of the image, as demonstrated in Figs. 4i and 4iii.
Fig. 4 shows that the object size in the padded image is the same as in the original image, whereas the object size in the resized image is larger than in the original, similar to the effect shown in Fig. 1. This means that resizing the image significantly changes its spatial resolution. The proposed model therefore uses padding rather than resizing. The given image is padded with 0s in such a way that the sliding boxes cover the entire image area. To determine the padding amount, two cases have to be considered for a slider box of length m that takes p steps to cover an image of length n with an overlap of O percent. The best and worst cases are demonstrated in Fig. 5.
a) Best Case:
The last box converges perfectly with the image as shown in Fig. 5 (case 1). The size of the image is calculated as follows:
b) Worst Case:
The last box does not coincide with the end of the image, as shown in Fig. 5 (case 2). The box takes p steps to cover the image.
With p steps, the image must have a length n at which the last box coincides with its end, as in the best case. The same formula is applied for vertical sliding as well.
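The two cases can be illustrated with a short sketch. It is not a transcription of the paper's equations; it assumes the geometric reading that consecutive boxes advance by a stride d = m(1 − O/100), so that the best case satisfies n = m + (p − 1)d and the worst case needs padding up to that length. Rounding the stride down to whole pixels and the function name are further assumptions.

```python
from typing import Tuple


def steps_and_padding(n: int, m: int, overlap_pct: int) -> Tuple[int, int]:
    """Return (p, padding) for one axis.

    n: image length in pixels
    m: slider box length in pixels
    overlap_pct: overlap O between consecutive boxes, in percent
    """
    if m <= 0 or n <= 0 or not 0 <= overlap_pct < 100:
        raise ValueError("need m > 0, n > 0 and 0 <= overlap_pct < 100")
    # Assumed stride: box length minus the overlapping part, in whole pixels.
    stride = max(1, m * (100 - overlap_pct) // 100)
    if n <= m:
        steps = 1  # a single box already covers the image
    else:
        steps = -(-(n - m) // stride) + 1  # ceiling division, plus the first box
    covered = m + (steps - 1) * stride  # length at which the last box fits exactly
    return steps, covered - n


if __name__ == "__main__":
    print(steps_and_padding(985, 256, 5))   # best case: no padding needed
    print(steps_and_padding(1000, 256, 5))  # worst case: zeros must be appended
```

Applying the function once to the image width and once to the image height gives the number of zeros to append on the right and at the bottom, respectively.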
3.3 Phase 3: Object Detection
Phase 3 of the proposed model is detection. A trained Faster RCNN model is used to detect objects in the input image. Rather than taking the whole image at once, it takes the box image, i.e., the portion of the input image that falls under the footprint of the slider box, and processes that image matrix. By using Eq. (3), the row instance Pr and column instance Pc can be evaluated for an input image of dimensions a × b. The product of Pc and Pr is the total number of iterations I.
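The detection loop can be sketched as follows. This is an illustrative outline, not the authors' released code: `detect` is a hypothetical callable standing in for the trained Faster RCNN, the image is a plain list of pixel rows to keep the example dependency-free, and the integer stride is the same assumption as a box length reduced by the overlap percentage.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
Image = List[List[int]]


def window_origins(length: int, m: int, overlap_pct: int) -> List[int]:
    """Start offsets of the slider box along one axis (assumed integer stride)."""
    stride = max(1, m * (100 - overlap_pct) // 100)
    steps = 1 if length <= m else -(-(length - m) // stride) + 1
    return [i * stride for i in range(steps)]


def sliding_detect(image: Image, m: int, overlap_pct: int,
                   detect: Callable[[Image], List[Box]]) -> Tuple[List[Box], int]:
    """Run `detect` on every slider box position; return boxes and iteration count."""
    rows, cols = len(image), len(image[0])
    ys = window_origins(rows, m, overlap_pct)  # Pr positions
    xs = window_origins(cols, m, overlap_pct)  # Pc positions
    # Zero-pad on the right and bottom so the last box ends at the image edge.
    padded_cols, padded_rows = xs[-1] + m, ys[-1] + m
    padded = [list(row) + [0] * (padded_cols - cols) for row in image]
    padded += [[0] * padded_cols for _ in range(padded_rows - rows)]
    detections: List[Box] = []
    for y in ys:
        for x in xs:
            fragment = [row[x:x + m] for row in padded[y:y + m]]
            # Shift window-local boxes back into full-image coordinates.
            for x0, y0, x1, y1 in detect(fragment):
                detections.append((x0 + x, y0 + y, x1 + x, y1 + y))
    return detections, len(ys) * len(xs)  # I = Pr * Pc
```

With a non-zero overlap, an object lying in the shared strip can be reported by two neighbouring boxes; merging such duplicates (for example with non-maximum suppression) is a design choice this sketch leaves out.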
4 Results and Discussion
A computer with an 8th-generation Intel i5 processor, 8 GB of RAM and a dedicated 4 GB NVIDIA GTX 1050 Ti graphics card was used to train a Faster RCNN model with the TensorFlow open-source library. Pre-trained weights named “faster_rcnn_inception_v2_coco_2018” were used to initialize the parameters for transfer learning. The model was trained for nineteen hours on the benchmark VEDAI dataset. The experimental code and weights used in this paper for evaluation are available at https://github.com/sidharthsamanta/srcnn.
Four types of images, with spatial resolutions of 7.5, 12.5, 15.5 and 30.5 cm, were used for testing (Fig. 7); a sample image with ground truth is shown in Fig. 6. Each type contained three images of 256 px × 256 px, the same size as the images in the training dataset. All images were processed by both Faster RCNN and the proposed SRCNN to determine the accuracy and precision of the proposed framework.
i. Image Analysis: Details of the testing images are given below in Tab. 1. The Box Size column in the table is the length of the slider box, which is calculated by using the formula derived in Eq. (1).
ii. Image Pre-processing: The padding amount p is calculated for each image with 5% overlapping by using the mathematical formula from Eq. (6). The Padding Value column of Tab. 1 contains all the padding values for each test image. The first number represents the number of 0s to be added on the right side of the image and the second number represents the number of 0s to be appended at the bottom of the image. 0s can be padded on any side of the image, as there will be no effect on performance.
iii. Object Detection: Now the detector is deployed on top of the sliding window to process the image fragment that falls under its footprint. The process continues until the box reaches the vertical and horizontal end. Fig. 8 illustrates the sliding detection process.
iv. Evaluation: The outcomes of the proposed model with four sets of input images mentioned in Tab. 1 are compared with the Faster RCNN model in Tab. 2. The confusion matrix is used for calculating the accuracy (Eq. 7) and precision (Eq. 8).
a) True Positives (TP): Objects that are present in the ground truth and correctly detected in the output.
b) True Negatives (TN): Objects that are not present in the ground truth and not detected in the output. For object detection and localization, the TN is always considered 0.
c) False Positives (FP): Objects that are not present in the ground truth, but detected in the output.
d) False Negatives (FN): Objects that are present in the ground truth, but not detected in the output.
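The two measures can be computed from these counts as in the sketch below. It assumes the standard confusion-matrix definitions, accuracy = (TP + TN) / (TP + TN + FP + FN) and precision = TP / (TP + FP), with TN defaulting to 0 as stated above. Returning 0 when there are no detections at all is a convention chosen here, not something the text specifies.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int = 0) -> float:
    """(TP + TN) / (TP + TN + FP + FN); TN defaults to 0 for detection tasks."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP); defined here as 0 when nothing was detected."""
    detected = tp + fp
    return tp / detected if detected else 0.0


if __name__ == "__main__":
    # Hypothetical counts for illustration only, not results from the paper.
    tp, fp, fn = 8, 2, 2
    print(f"accuracy  = {accuracy(tp, fp, fn):.3f}")
    print(f"precision = {precision(tp, fp):.3f}")
```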
v. Discussion: When the spatial resolution was the same as that of the training data, i.e., 12.5 cm, both models performed identically, because the two models are equivalent in that case. When the spatial resolution increased or decreased, however, the performance of the stock Faster RCNN started to deteriorate. There was a significant change in both accuracy and precision when the Faster RCNN dealt with images having spatial resolutions of 7.5 cm and 15 cm. At a resolution of 30 cm, it performed worst, with zero accuracy and zero precision, whereas the proposed SRCNN showed better results at every spatial resolution.
5 Conclusion
Detection of an object is a complex task due to ambiguity in object position, orientation and light source. A small modification of the sensor might change the scale of the objects present in the image. This scaling can be normalized by the proposed method, as it segments the image before detection. The proposed SRCNN outperformed the stock Faster RCNN on image samples with completely different spatial resolution values. It was also observed that the model can work with images of much smaller or far larger dimensions.
The size problem can also be resolved by using an internal slider box during the convolution operation. However, when an image with very large dimensions undergoes a convolution operation directly, it creates a large number of hyperparameters. Storing and processing them could cause even a high-configuration personal computer to run out of memory. The extension proposed in SRCNN could also be implemented in other state-of-the-art frameworks, such as YOLO or SSD.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.