An AIoT Monitoring System for Multi-Object Tracking and Alerting

Abstract: Pig farmers need an effective solution that automatically detects and tracks multiple pigs and alerts them to the pigs' conditions, so that disease risk factors can be recognized quickly. In this paper, we therefore propose a novel monitoring system based on an Artificial Intelligence of Things (AIoT) technique that combines artificial intelligence with the Internet of Things (IoT). The proposed system consists of AIoT edge devices and a central monitoring server. First, an AIoT edge device extracts video frame images from a CCTV camera installed in a pig pen using a frame extraction method, detects multiple pigs in the images with a faster region-based convolutional neural network (RCNN) model, and tracks them with an object center-point tracking algorithm (OCTA) based on the bounding box regression outputs of the faster RCNN. Finally, it sends the multi-pig tracking images to the central monitoring server, which forwards them to pig farmers through a social networking service (SNS) agent in cooperation with a oneM2M-compliant IoT alerting method. Experimental results showed that the multi-pig tracking method achieved a multi-object tracking accuracy of about 77%. In addition, we verified the alerting operation by confirming the images received in the SNS smartphone application.


Introduction
Factory-type pig farming may cause serious problems such as diseases and odors, since it lowers the immunity of pigs and makes environmental management difficult [1,2]. The improper enlargement of factory-type pig pens and the reduction of workers inevitably lead to a disease-vulnerable breeding structure and to failures in hygiene management [3,4]. Indeed, pig farmers in Korea suffered from foot-and-mouth disease (FMD) in 2010 and African swine fever virus (ASFV) in 2019. In particular, the FMD virus can survive for up to 24-36 h in the human bronchus, allowing airborne transmission from humans to pigs, so people from infected farms and visitors should not come into contact with pigs for more than one week after entry [5]. Also, once a disease occurs in a pig pen, sick pigs tend to move little and to lie alone away from the herd [6]. However, it is almost impossible for pig farmers to monitor individual objects such as pigs and visitors every hour, and it is not appropriate for them to passively observe the status of all pig pens [7].
Smart farming aims to prevent diseases and manage livestock scientifically by exploiting automatic monitoring systems built on modern technologies. However, such systems still have limitations that have not been overcome. For instance, object tracking methods that rely only on identification devices such as radio-frequency identification (RFID), global positioning system (GPS), Bluetooth, and WiFi have difficulty pinpointing the exact condition of livestock [8][9][10]. Some of these problems can be solved by converting camera images into meaningful data for livestock detection and tracking without human intervention [11]. Conventional computer vision solutions include the polygon approximation algorithm [12], the Laplacian operator [13], and multilevel thresholding segmentation [14]. However, they require very complicated preprocessing. Image classification using convolutional neural networks (CNNs), a class of artificial intelligence (AI) or deep learning models, can easily surpass conventional methods [15]. Many multi-object detection models that build on image classification have been proposed, such as you only look once (YOLO) [16], single shot multi-box detector (SSD) [17], and faster region-based convolutional neural network (RCNN) [18,19]. Multi-object tracking methods have also attracted a lot of attention in recent years [20][21][22]. Furthermore, effective alerting solutions are required so that pig farmers can recognize disease risk factors as soon as possible based on the results of multi-object detection and tracking [23,24]. Therefore, this paper proposes a novel monitoring system that uses an Artificial Intelligence of Things (AIoT) technique, combining artificial intelligence and the Internet of Things (IoT), to automatically detect and track multiple pigs and to alert farmers to their conditions in pig pens.
The rest of this paper is organized as follows. Section 2 describes the proposed AIoT monitoring system in detail, including the multi-pig detection, multi-pig tracking, and oneM2M-compliant IoT alerting methods. Section 3 presents experimental results that verify its performance and usefulness. Concluding remarks are given in Section 4.

Description of a Proposed AIoT Monitoring System
The proposed AIoT monitoring system, illustrated in Fig. 1, consists of AIoT edge devices and a central monitoring server. We developed an AIoT edge device with four functions: video frame image extraction, multi-pig detection, multi-pig tracking, and an IoT alerting client. The edge device automatically detects and tracks multiple pigs in the video from a CCTV camera installed in a pig pen and sends alerts about their conditions to a pig farmer's smartphone through the central monitoring server, which hosts an IoT alerting server and a social networking service (SNS) agent.

Video Frame Images Extraction
Fig. 2 shows the procedure of the video frame images (VFI) extraction method with a network video recorder (NVR) client, which is implemented as a representational state transfer (REST) application programming interface (API).
It collects the VFI according to the predefined setup parameters in Tabs. 1 and 2 and then provides them to the subsequent multi-pig detection and tracking methods. Tab. 1 describes the setup parameters checktime, nvrsavetime, ftpip, ftpid, ftppasswd, and camname (the target camera name in the NVR server), which are used for saving the VFI from a CCTV camera NVR server to an NVR client that supports the file transfer protocol (FTP). Tab. 2 describes the setup parameters imagesavetime and imageformat, which are used for extracting the VFI searching results from the VFI interface to the VFI extraction. The steps of the VFI extraction method are as follows. In Steps 1 to 4, the frame extraction status is read and the extraction parameters are set up between the VFI interface and the VFI extraction. In Steps 5 to 8, the VFI extraction requests the start of frame extraction from the VFI interface and the start of the camera interface from the CCTV camera NVR interface. During the loop operation in Steps 9 to 11, the VFI searching images are extracted by receiving video files, processing their images, and searching them. Finally, the VFI extraction is finished in Steps 12 to 15.

Multi-Pig Detection with Faster RCNN
Pigs suspected of having diseases usually tend to move less and to stay away from the herd, so automated detection and tracking of behavioral changes in pigs is important for recognizing disease risk factors quickly. Detecting multiple pigs is a prerequisite for tracking them [25]. For multi-pig detection, we build a faster RCNN model trained on a dataset that we created in Common Objects in Context (COCO) format [26] from the VFI searching images. To annotate the images, we used an annotation tool modified from Imglab. Although the tool supports bounding box, keypoint, and segmentation annotations, we only consider the bounding box annotation. The faster RCNN model inherits the basic framework of the fast RCNN but calculates regions of interest (ROIs) with a region proposal network (RPN) instead of the selective search [27]. It is a leading framework in various object detection applications because of its effectiveness and efficiency. The ResNet50 architecture is applied as a backbone network, or shared layer, for feature extraction; it is a residual learning model that addresses the vanishing or exploding gradients that may occur as the neural network deepens. After receiving feature maps from the ResNet50, the RPN outputs a set of proposals, each of which has a score of its probability of being an object, called an objectness score, and also the class (or label) of the object. Then, these proposals are refined by a bounding box regression and a box classification (object or background) with sigmoid activation. Note that anchor boxes provide a predefined set of bounding boxes of different sizes and ratios as references when the RPN first predicts object locations. The loss function of the RPN is defined by Eq. (1).
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \qquad (1)$$
where $i$ represents the index of an anchor, $p_i$ represents the predicted probability that an object exists in anchor $i$, and $p_i^*$ is the ground truth (GT) label, which is 1 for an object and 0 for the background. Also, $t_i$ represents the coordinates of the bounding box $(x_i, y_i, w_i, h_i)$ and $t_i^*$ represents the coordinates of the GT box. For normalization, $N_{cls}$ and $N_{reg}$ are set to the minibatch size and the number of anchor positions, respectively, and $\lambda$ is used as a balancing parameter to prevent an imbalance between the terms normalized by $N_{cls}$ and $N_{reg}$. Here, $L_{cls}$ and $L_{reg}$ can be expressed by
$$L_{cls}(p_i, p_i^*) = -\left[ p_i^* \log p_i + (1 - p_i^*) \log (1 - p_i) \right]$$
$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x, y, w, h\}} L_{smooth}(t_{i,j} - t_{i,j}^*)$$
where $L_{smooth}$ is the smooth L1 function
$$L_{smooth}(x) = \begin{cases} 0.5 x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$
After passing through the RPN, proposed regions of different sizes are output. In the ROI layer, these regions are resized to the same size. Next, the fast RCNN object detection provides the final bounding boxes and their object classes (or labels) through a bounding box regressor and an object classifier with softmax activation, following fully connected (FC) layers. The loss function of the fast RCNN is similar to Eq. (1), but $L_{cls}$ uses the categorical cross entropy given as
$$L_{cls}(p_i, p_i^*) = -\sum_k p_{i,k}^* \log p_{i,k}$$
where $k$ indexes the object classes. As an optimizer to minimize the loss functions of the RPN and the fast RCNN, we used RMSprop [28].
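To make the RPN loss of Eq. (1) concrete, the following minimal Python sketch evaluates it for a handful of anchors. It assumes, as in the standard faster RCNN formulation, a binary cross-entropy classification term and a smooth L1 regression term that is counted only for object anchors. The function names and the default value of `lam` are hypothetical and are not taken from the implementation used in this paper.

```python
import math


def smooth_l1(x):
    """Smooth L1 function: quadratic near zero, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def binary_cross_entropy(p, p_star, eps=1e-12):
    """Log loss between predicted objectness p and GT label p_star (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))


def rpn_loss(p, p_star, t, t_star, n_cls, n_reg, lam=1.0):
    """Evaluate Eq. (1) for per-anchor lists.

    p, p_star : predicted probabilities and GT labels, one per anchor
    t, t_star : predicted and GT box coordinates (x, y, w, h), one per anchor
    lam       : balancing parameter (hypothetical default, not from the paper)
    """
    cls_term = sum(
        binary_cross_entropy(pi, si) for pi, si in zip(p, p_star)
    ) / n_cls
    # The factor si (p_i^*) switches the regression term off for background.
    reg_term = sum(
        si * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
        for si, ti, tsi in zip(p_star, t, t_star)
    ) / n_reg
    return cls_term + lam * reg_term
```

Because the regression term is multiplied by the GT label, background anchors contribute only to the classification term, no matter how far their predicted boxes are from any GT box.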

Multi-Pig Tracking with an Object Center-Point Tracking Algorithm
As shown in Fig. 3, multi-pig tracking is performed by the proposed object center-point tracking algorithm (OCTA) based on the bounding boxes from the faster RCNN model.
Next, in the same way as for the previous frame, whose tracking result consists of the center-points $c_i$, $1 \le i \le N_{m-1}$, it gets the number of center-points $N_m$ in the $m$-th image frame and then calculates their temporary center-points $d_j = (x_j^{\alpha}, y_j^{\alpha})$, $1 \le j \le N_m$. After that, each center-point is tracked according to Algorithm 1, which maps each previous center-point to the temporary center-point at the minimum distance from it. Assuming without loss of generality that the number of center-points $N$ is known, we append $(N - N_{m-1})$ and $(N - N_m)$ zero vectors $(0, 0)$ to the tracking result of the previous frame and the temporary result of the current frame, respectively. Note that the size and the indices of the temporary result change automatically when $d_{label}$ is deleted in line 14 of Algorithm 1. Consequently, $N_D$ is reduced as $N_D \leftarrow N_D - 1$.
We design the proposed algorithm considering three cases. In Case 1 in Fig. 3, it calculates the Euclidean distance between $c_i$ and each $d_j$ and compares it with a threshold obtained from experimentation (here, we use 40 as the threshold). Whenever a candidate distance is less than the threshold, the threshold is replaced with that candidate distance, so that the nearest center-point is found. When the nearest center-point is found, the corresponding $d_j$ becomes the output center-point $e_i$. In Case 2, the center-point $c_i$ itself is assigned to the output center-point $e_i$ if no distance less than the given threshold is found. In Case 3 in Fig. 3, a center-point $d_j$ is assigned to the output center-point $e_i$ when there is a center-point $d_j$ but no center-point $c_i$. This is the case in which a new object that did not exist in the previous frame is detected in the current frame.
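The three cases can be sketched in Python as follows. This is a minimal illustration under our own assumptions rather than a transcription of Algorithm 1: previous center-points are processed greedily in index order, an empty track slot is represented by the zero vector (0, 0), and the names `octa_step` and `EMPTY` are hypothetical.

```python
import math

EMPTY = (0.0, 0.0)  # zero vector used to pad missing center-points


def octa_step(prev, curr, threshold=40.0):
    """Map previous center-points c_i to current center-points d_j.

    prev : tracking result of the previous frame (list of (x, y))
    curr : temporary center-points detected in the current frame
    Returns the output center-points e_i, index-aligned with prev.
    """
    remaining = list(curr)  # detections not yet assigned to a track
    out = []
    for c in prev:
        if c == EMPTY:  # no previous center-point in this slot
            out.append(EMPTY)
            continue
        best_j, best_dist = None, threshold
        for j, d in enumerate(remaining):
            dist = math.dist(c, d)  # Euclidean distance
            if dist < best_dist:  # shrink the threshold to the candidate
                best_j, best_dist = j, dist
        if best_j is None:
            out.append(c)  # Case 2: nothing close enough, keep c_i
        else:
            out.append(remaining.pop(best_j))  # Case 1: nearest d_j wins
    for d in remaining:  # Case 3: new objects without a previous point
        if EMPTY in out:
            out[out.index(EMPTY)] = d
        else:
            out.append(d)
    return out
```

For example, `octa_step([(100, 100), (300, 200), (0, 0)], [(310, 205), (104, 97), (500, 400)])` keeps the first two track indices attached to their nearest detections and places the new object in the empty third slot. Removing an assigned detection from `remaining` plays the role of deleting the matched temporary center-point, so that one detection is never assigned to two tracks.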

oneM2M-Compliant IoT Alerting
The resulting images from multi-pig detection and tracking are encoded in Base64 to represent binary image data as an ASCII string [29]. An IoT alerting client sends the Base64-encoded images to an IoT server. The IoT alerting client and server were modeled as an application dedicated node-application entity (ADN-AE) and an infrastructure node-common service entity (IN-CSE), respectively, as described in the oneM2M specifications [30]. They were implemented using an open-source IoT platform called Mobius [31]. A social networking service (SNS) agent, modeled as an infrastructure node-application entity (IN-AE), sends a subscription request to the IoT server so that it receives an alerting message including the Base64-encoded images from the IoT server when certain notification criteria are satisfied. For instance, whenever the IoT server receives a content instance, such as the images sent from the IoT client, it immediately sends the alerting message to the SNS agent as a notification. Since the SNS agent was implemented with Telegram, users can visually monitor the situation in a pig pen on their smartphones.
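As an illustration of the encoding step, the sketch below wraps an image in a oneM2M content instance and recovers it again on the receiving side. It is a minimal sketch under stated assumptions: it uses the oneM2M short names `m2m:cin` (content instance) and `con` (content) in a JSON body, the function names are hypothetical, and the HTTP request to the IoT server and the exact message layout used by our implementation are not shown.

```python
import base64
import json


def build_content_instance(image_bytes):
    """Encode raw image bytes in Base64 and wrap them in a JSON body.

    Assumption: the image is carried in the "con" attribute of an
    "m2m:cin" (content instance) resource.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"m2m:cin": {"con": encoded}})


def decode_content_instance(body):
    """Recover the raw image bytes from a content instance body,
    as an SNS agent could do after receiving a notification."""
    con = json.loads(body)["m2m:cin"]["con"]
    return base64.b64decode(con)


if __name__ == "__main__":
    fake_image = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # placeholder bytes only
    body = build_content_instance(fake_image)
    print(body)
    print(decode_content_instance(body) == fake_image)
```

Base64 increases the payload size by roughly one third, which is the price of carrying binary image data inside a text-based message.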

Experimental Results and Discussion
This section describes the experimental conditions and results used to evaluate the performance of the proposed system. We collected 2182 images from an on-site CCTV camera in an actual indoor pig pen and created a COCO dataset augmented five times with vertical flips, horizontal flips, and random rotations. We split the COCO dataset into an 80% training set and a 20% evaluation set and used a CentOS 7 workstation with one GTX 1060 GPU to build the faster RCNN model with hyper-parameters such as 100 epochs, 1,000 iterations, and 32 ROIs per iteration. To test the performance of the OCTA, we used one hundred new consecutive frame images.
As the performance metric of the OCTA, we considered the multiple object tracking accuracy (MOTA) given in [32], which is defined as follows.
$$\mathrm{MOTA} = 1 - \frac{\sum_n (a_n + b_n + d_n)}{\sum_n l_n}$$
where $l_n$ denotes the number of objects in the frame with index $n$, $a_n$ denotes the number of miss errors, $b_n$ denotes the number of false positive errors, and $d_n$ denotes the number of mismatch errors.
In Fig. 5, the cumulative error rates (CERs) of miss, false positive, mismatch, and total errors are shown against the indices of the one hundred consecutive frame images. First, the number of miss errors does not increase significantly and its CER curve looks almost constant, because the OCTA tracks most objects detected by the faster RCNN model even though it may sometimes miss a few objects in consecutive frame images. Second, the number of false positive errors gradually increases and its CER curve looks like a straight line. This phenomenon is due to detection failures of the faster RCNN; accordingly, the OCTA cannot track the undetected objects. Third, the number of mismatch errors occasionally increases and its CER curve looks like a staircase. Mismatch errors can occur suddenly when the OCTA mistakenly swaps two or more detected objects as they pass close to each other or when the OCTA reinitializes with different object indices. Finally, the number of total errors is the sum of the miss, false positive, and mismatch errors, and its CER value in the last frame image is about 23%. Therefore, the MOTA performance of the proposed OCTA is about 77%.
Fig. 6 shows two images parsed from alerting messages in the Telegram application on a smartphone. As mentioned before, the alerting messages were sent from the IoT alerting server. By monitoring the output images from multi-pig detection and tracking, a pig farmer can recognize situations in a pig pen such as pig carcasses, abnormal behaviors, and environmental conditions.
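The relation between the CER curves and the MOTA can be checked with a short Python sketch. It assumes the standard MOTA definition, one minus the total number of errors divided by the total number of objects, and it assumes that each CER is normalized by the total number of objects over all frames, so that the last value of the total CER equals one minus the MOTA. The function names and the toy counts are hypothetical and are not our measured data.

```python
from itertools import accumulate


def mota(objects, misses, false_positives, mismatches):
    """MOTA from per-frame counts l_n, a_n, b_n and d_n."""
    total_objects = sum(objects)
    if total_objects == 0:
        raise ValueError("MOTA is undefined without any objects")
    total_errors = sum(misses) + sum(false_positives) + sum(mismatches)
    return 1.0 - total_errors / total_objects


def cumulative_error_rate(errors, objects):
    """Running sum of per-frame errors divided by the total object count
    (assumed normalization, see the lead-in)."""
    total_objects = sum(objects)
    return [e / total_objects for e in accumulate(errors)]


if __name__ == "__main__":
    # Toy counts for four frames with ten objects each.
    l = [10, 10, 10, 10]
    a = [1, 0, 0, 0]  # miss errors
    b = [0, 1, 1, 0]  # false positive errors
    d = [0, 0, 1, 0]  # mismatch errors
    total = [x + y + z for x, y, z in zip(a, b, d)]
    print(mota(l, a, b, d))
    print(cumulative_error_rate(total, l))
```

Under this normalization, a total CER of about 23% in the last frame corresponds directly to a MOTA of about 77%.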

Conclusion
In this paper, we proposed an AIoT monitoring system that efficiently recognizes situations in a pig pen by using faster RCNN multi-pig detection, OCTA multi-pig tracking, and oneM2M-compliant IoT alerting methods. We built the faster RCNN model from frame images taken in an actual indoor pig pen and tracked the pigs in the frame images with the proposed OCTA, which uses the bounding box regression outputs of the faster RCNN model. To evaluate the performance of the OCTA, we analyzed the CERs of its miss, false positive, mismatch, and total errors and found that false positive errors depend strongly on the performance of the multi-object detection method, that miss errors can be reduced by the multi-object tracking method, and that mismatch errors result from pig behaviors. As a result, the OCTA achieved a MOTA performance of about 77%. Finally, through experimental results for the oneM2M-compliant IoT alerting method, we confirmed the overall operation of the proposed AIoT monitoring system.