Adaptive scheme for crowd counting using off-the-shelf wireless routers

Since the outbreak of the worldwide novel coronavirus pandemic, crowd counting in public areas, such as shopping centers and commercial streets, has gained popularity among public health administrations as a means of preventing crowds from gathering. In this paper, we propose a novel adaptive method for crowd counting based on Wi-Fi channel state information (CSI) using common commercial wireless routers. Compared with previous research on device-free crowd counting, our proposed method adapts better to changes in the environment and achieves high accuracy in crowd count estimation. Because the distance between the access point (AP) and the monitor point (MP) is typically not fixed in real-world applications, the strength of the received signal varies, which causes traditional amplitude-related models to perform poorly across different environments. To make the crowd count estimation model adaptive, we used a convolutional neural network (ConvNet) to extract features from the correlation coefficient matrix of subcarriers, which is insensitive to changes in received signal strength. We conducted experiments in university classroom settings, and our model achieved an overall accuracy of 97.79% in estimating a variable number of participants. © 2022 CRL Publishing. All rights reserved.

where $a_i(t)$ is the amplitude attenuation factor, $\tau_i(t)$ is the propagation delay, and $f$ is the carrier frequency. For each subcarrier of one link, the channel can be modeled by $y = Hx + n$, where $y$ is the received signal, $x$ is the transmitted signal, $H$ is the CSI matrix, and $n$ is the environment noise. In this paper, the CSI matrix $H$ is estimated at the receiver side by evaluating the difference between the predefined transmitted signal $x$ and the received signal $y$ after OFDM demodulation using the Atheros CSI Tool [1].
Wireless sensing based on Wi-Fi signals has attracted tremendous attention because of its ubiquity and privacy-preserving features [2][3][4][5][6][7][8]. Many researchers have focused on human crowd counting based on the wireless routers widely deployed in public areas. Human crowd count estimation has also attracted increasing attention in many potential applications, such as intelligent surveillance, crowd management, urban security, and business decision-making. For example, accurate information on the distribution of a city's population can help government administrators make population-related decisions more efficiently. Since the outbreak of the worldwide novel coronavirus pandemic, crowd counting in public areas, such as shopping centers and commercial streets, has gained popularity among public health administrations as a means of preventing crowds from gathering. Traditionally, image-based methods are most often used to estimate the human crowd count, but they are limited by the illumination intensity of the environment, the line-of-sight propagation of light, and public concern about privacy [9][10][11][12][13][14][15][16][17][18]. In this paper, we introduce an adaptive model for human crowd count estimation that exploits the rich CSI data embedded in 802.11n Wi-Fi networks. To test the robustness of the proposed model, we evaluated its performance in four different scenarios, which are shown in Tab. 1.
The CSI data is collected from the AR9344 NIC, which is embedded in the TP-LINK WDR4310 wireless router, using the Atheros CSI Tool.
After the raw CSI data is collected, a Kalman filter with Mahalanobis distance is used to detect abnormalities and smooth the signal [19][20]. Then, the correlation coefficient matrix of subcarriers is calculated for each data link to generate images. To extract fine features from the images, a convolutional neural network (ConvNet) is used, and the trained classification model achieves satisfactory results on the evaluation dataset in the four scenarios [21].
The remainder of the paper is structured as follows. Section 2 presents the background and related work on crowd counting and Wi-Fi-based wireless sensing. Section 3 presents the procedure of the human crowd counting system, including data collection and analysis, data preprocessing, feature extraction, and construction of the classification model. Section 4 presents the implementation and evaluation of the crowd counting system. Section 5 presents the conclusion.
The authors of [22] designed a Wi-Fi-based real-time, calibration-free passive human motion detection system based on physical layer information using two schemes: short-term averaged variance ratio (SVR) and long-term averaged variance ratio (LVR). According to the experimental results, a high detection rate and a low false positive rate were achieved. In 2016, Domenico et al. [23] proposed, at WiMob, a trained-once device-free crowd counting and occupancy estimation method using Wi-Fi, based on a Doppler spectrum approach. The proposed approach analyzes the linear correlation relationship between the shape of the Doppler spectrum and the received signal. In 2017, Zhu et al. [24] proposed NotiFi, an abnormal activity detection system that achieved satisfactory accuracy, robustness, and stability. It is based on the fact that the amplitude and phase information of CSI change sensitively whenever a human body occludes the wireless signal from the access point (AP) to the monitor point (MP). Yen-Kai et al. extended the crowd counting technique to people-centric Internet of Things (IoT) applications, e.g., security monitoring and energy management for smart homes, based on fine-grained physical-layer wireless signatures. They achieved an average correct classification rate of 88% in estimating the exact size of crowds of up to nine people in general indoor scenarios. In 2014, Xi et al. [25] proposed the Percentage of nonzero Elements (PEM) in the dilated CSI matrix, and the monotonic relationship was then explicitly formulated by the Grey Verhulst Model. In 2019, Ibrahim et al. [26] proposed CROSS-COUNT, which uses a single Wi-Fi link to estimate the human crowd count based on the temporal link-blockage pattern and achieves high accuracy with non-labor-intensive data.

Data Collection and Analysis
Each CSI measurement contains several fields, which are shown in Tab. 2.
Each CSI measurement is an $N_r \times N_c \times \text{Numtones}$ three-dimensional tensor, where $N_r$ denotes the number of receiver antennas, $N_c$ denotes the number of sender antennas, and Numtones denotes the number of subcarriers in the frequency band used for communication in the experiment. In this experiment, $N_r = 2$, $N_c = 2$, and $\text{Numtones} = 56$. The sampling frequency is 30 Hz, and the sampling duration of each group is 60 s. The experiment was conducted in four different room situations: the room being empty, 1 person walking at normal speed, 5 people walking at normal speed, and 10 people walking at normal speed. A total of four groups of CSI data were collected, and each group contains 1,800 CSI measurements. Fig. 1 shows the amplitude change of the 56 subcarriers over 300 CSI packets in the empty room situation. Fig. 2 shows the amplitude change of a single subcarrier on the four different communication links between the AP and the MP in the empty room situation.

Data Preprocessing
Generally, the collected CSI is an estimate of the wireless channel and contains random noise and other inaccuracies. To obtain a better estimate of the wireless channel from the collected CSI, a Kalman filter is used in this paper to filter noise and remove outliers. The underlying state-space model is given in Eqs. (2) and (3).
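Assuming the standard linear state-space form that the symbol definitions below describe, the process and measurement models can be written as follows. This is a reconstruction from those definitions, not necessarily the authors' exact notation.

```latex
\begin{align}
x(t) &= A\,x(t-1) + B(t)\,u(t) + w(t), \\
y(t) &= C\,x(t) + v(t).
\end{align}
```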
where $A$ is the one-dimensional state transition matrix, set to $A = [[1.0]]$ in our case. $B(t)$ is the influence of the control action at time $t$, and $u(t)$ is the control vector at time $t$; in our case, $B(t)$ and $u(t)$ are not used. $w(t)$ is the process noise at time $t$, $C$ is the observation matrix, which maps the true state space into the measured space, $v(t)$ is the measurement noise at time $t$, $x(t)$ is the estimated system state at time $t$ derived from the state at time $t - 1$, and $y(t)$ is the measurement at time $t$. $w(t)$ and $v(t)$ are assumed to be drawn from the zero-mean normal distributions $\mathcal{N}(0, R_{ww})$ and $\mathcal{N}(0, R_{vv})$ respectively, where $R_{ww}$ denotes the covariance of the process noise and $R_{vv}$ denotes the covariance of the measurement noise.
The Kalman filter can be divided into two procedures: "Prediction" and "Update". Prediction procedure, Eqs. (4) and (5): where $\hat{x}(t|t-1)$ is the a priori estimated state of the system at time $t$ given the measurement at time $t - 1$, $A(t-1)$ is the state transition model at time $t - 1$ applied to the previous a posteriori estimated state $\hat{x}(t-1|t-1)$, $P(t|t-1)$ is the a priori estimated covariance, and $R_{ww}(t)$ is the covariance of the process noise at time $t$. In the prediction procedure, the a priori state at the current time is estimated from the a posteriori estimated state at the previous time.
Update procedure, Eqs. (6)-(10): where $e(t)$ denotes the innovation, $R_{ee}(t)$ denotes the innovation covariance, $K(t)$ denotes the optimal Kalman gain, $\hat{x}(t|t)$ denotes the a posteriori updated state, and $\hat{P}(t|t)$ denotes the a posteriori updated estimate covariance. Since only the current measurement and the estimated state from the previous time step are required to compute the estimate of the current state, the Kalman filter is a computationally efficient algorithm for real-time and lightweight applications.
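The prediction and update procedures can be illustrated with a minimal scalar sketch in Python. It assumes the textbook Kalman recursions with $A = 1$ and no control input, as stated in the text. The choice $C = 1$, the noise covariances `r_ww` and `r_vv`, and the initial values `x0` and `p0` are illustrative assumptions rather than the authors' settings, and the Mahalanobis-based outlier handling is not included.

```python
def kalman_filter_1d(measurements, r_ww=1e-3, r_vv=1e-1, x0=0.0, p0=1.0):
    """Scalar Kalman filter with A = 1 and no control input.

    r_ww, r_vv, x0 and p0 are hypothetical values chosen for illustration.
    """
    a, c = 1.0, 1.0          # state transition and observation (C = 1 is assumed)
    x_post, p_post = x0, p0  # a posteriori state and covariance
    filtered = []
    for y in measurements:
        # Prediction: a priori state and covariance from the previous posterior
        x_prior = a * x_post
        p_prior = a * p_post * a + r_ww
        # Update: innovation, innovation covariance, gain, posterior
        e = y - c * x_prior
        r_ee = c * p_prior * c + r_vv
        k = p_prior * c / r_ee
        x_post = x_prior + k * e
        p_post = (1.0 - k * c) * p_prior
        filtered.append(x_post)
    return filtered


# A constant amplitude with one spike: the spike is attenuated in the output.
smoothed = kalman_filter_1d([5.0] * 20 + [9.0] + [5.0] * 20, x0=5.0)
print(max(smoothed))
```

Only the previous posterior and the current measurement are kept in memory at each step, which is the property the text relies on for lightweight deployment.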
To detect and remove outliers, the weighted Mahalanobis distance $MD(t)$ between a given measurement $y(t)$ and the predicted value $\hat{x}(t|t-1)$ is used in this paper, as shown in Eq. (11), where $D$ and $n$ are constants that can be determined by analyzing the statistical features of the signal.
$R_{vv}$ can be regarded as a measure of how much the system trusts the measurement: the larger the value of $R_{vv}$, the less the system trusts the measurement. The value of $R_{vv}$ can be adaptively updated based on the amount of noise present, according to Eq. (12).
The amplitudes before and after Kalman filtering of the first subcarrier of link 1 in the empty room situation are shown in Fig. 3.

Feature Extraction
The correlation coefficient matrix is calculated using Eq. (13), the Pearson correlation coefficient $\rho_{X,Y} = \operatorname{cov}(X, Y) / (\sigma_X \sigma_Y)$, where $\operatorname{cov}$ is the covariance, $\sigma_X$ is the standard deviation of $X$, and $\sigma_Y$ is the standard deviation of $Y$. The Pearson correlation coefficient measures the linear correlation between two variables $X$ and $Y$ and takes a value between −1 and +1. A value of −1 means total negative linear correlation, 0 means no linear correlation between $X$ and $Y$, and +1 means total positive linear correlation.
For the task of recognizing the number of people in a room, a window size of $W = 10\,\mathrm{s}$ and a step size of $S = 0.5\,\mathrm{s}$ were selected when using the sliding window method to produce the samples of each scenario for classification. The total number of CSI measurements in one scenario is $N = 1800$, and a total of $(N - W)/S + 1 = 108$ windows can be generated from the collected data of one scenario.
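A minimal sketch of the sliding-window split follows, assuming the window and step are converted to samples at the 30 Hz sampling rate ($W = 300$, $S = 15$). The function name and the sample-based units are assumptions, not taken from the paper. With these particular values the formula yields 101 windows, so the 108 windows reported in the text presumably correspond to slightly different effective parameters.

```python
def sliding_windows(n_samples, window, step):
    """Return (start, end) index pairs of all full windows, end exclusive."""
    if window <= 0 or step <= 0 or window > n_samples:
        return []
    count = (n_samples - window) // step + 1
    return [(i * step, i * step + window) for i in range(count)]


fs = 30  # sampling frequency in Hz, from the text
windows = sliding_windows(1800, 10 * fs, int(0.5 * fs))
print(len(windows), windows[0], windows[-1])
```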
In this paper, only the amplitude information of CSI is used, because our experiments show that the amplitude correlation between subcarriers is sensitive to changes in the number of people in a closed room. The data shape of a single window is $N_r \times N_c \times \text{Numtones} \times W$, which is $2 \times 2 \times 56 \times 300$ in this case. For simplicity, the first receiver antenna and the first sender antenna are selected first, and the same method is applied to the other three links afterwards. The Pearson correlation coefficient of every pair of subcarriers in one window is calculated according to Eq. (14).
A gray-level image of $56 \times 56$ pixels is generated from the correlation coefficient matrix $M$. The gray-level images of the four classes are shown in Fig. 4. Since there are $2 \times 2$ links, the total number of images generated from the collected data is $2 \times 2 \times 108 \times 4 = 1728$.
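The step from one window of amplitudes to a gray-level image can be sketched as follows. The function names are hypothetical. The linear mapping of correlation values from $[-1, 1]$ to 8-bit gray levels $[0, 255]$, and the handling of a constant series as uncorrelated, are assumptions, since the text does not specify either.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / (sx * sy)


def correlation_image(window):
    """window: one amplitude series per subcarrier (e.g. 56 lists of 300 values).

    Returns the correlation matrix M and its gray-level image.
    """
    k = len(window)
    m = [[pearson(window[i], window[j]) for j in range(k)] for i in range(k)]
    # Assumed mapping: -1 -> 0 (black), +1 -> 255 (white), clamped to 8 bits
    image = [[min(255, max(0, int(round((r + 1.0) / 2.0 * 255)))) for r in row]
             for row in m]
    return m, image


# Three toy "subcarriers" instead of 56, to keep the example small.
m, image = correlation_image([[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]])
print(image)
```

For the setup in the text, `window` would hold 56 series of 300 amplitudes, giving a $56 \times 56$ image per link and per window.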

Construction of ConvNet Classification Model
The structure of ConvNet constructed in this paper is shown in Tab. 3.

The Convolutional Layer
The input is a tensor with shape $(N_I \times H_I \times W_I \times D_I)$, where $N_I$ is the number of images, $H_I$ is the height of the image, $W_I$ is the width of the image, and $D_I$ is the depth of the image. After passing through a convolutional layer, the tensor is abstracted to a feature map with shape $(N_I \times FH_I \times FW_I \times FC_I)$, where $FH_I$ is the feature map height, $FW_I$ is the feature map width, and $FC_I$ is the number of feature map channels. The convolutional kernel is $3 \times 3$ for all three convolutional layers, and the numbers of input and output channels are $(1, 8)$, $(8, 16)$, and $(16, 32)$ for conv_1, conv_2, and conv_3 respectively.
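As a worked example of how small these layers are, the sketch below counts the learnable parameters implied by the kernel size and channel pairs given above. It assumes one bias term per output channel, which the text does not state, and the helper name is hypothetical.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Number of learnable parameters in a 2-D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# (input channels, output channels) for each layer, from the text
layers = {"conv_1": (1, 8), "conv_2": (8, 16), "conv_3": (16, 32)}
counts = {name: conv_params(3, 3, cin, cout) for name, (cin, cout) in layers.items()}
total = sum(counts.values())
print(counts, total)
```

Under that assumption the three convolutional layers hold 80, 1,168, and 4,640 parameters, 5,888 in total.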

The Pooling Layer
Pooling is a form of non-linear down-sampling that partitions the input image into a set of non-overlapping sub-regions. The max pooling unit uses the function $f = \max(A(1,1), A(1,2), \ldots, A(m,n))$, where $A$ denotes the matrix of the sub-region with shape $m \times n$, to generate a single value from each partitioned sub-region. A pooling layer decreases the spatial size of the image and significantly reduces the number of parameters. Commonly, a filter of size $2 \times 2$ with a stride of 2 along both width and height is selected, so that 75% of the activations are discarded.
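A minimal sketch of $2 \times 2$ max pooling with a stride of 2 on a single-channel image stored as a list of rows. The function name is hypothetical, and dropping a trailing odd row or column is an assumption about edge handling.

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2; a trailing odd row or column is dropped."""
    h, w = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
print(pooled)  # [[6, 8], [14, 16]]
```

The 16 input activations are reduced to 4, which is the 75% reduction mentioned above.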

The ReLU Layer
The rectifier is an activation function defined in Eq. (15) as $f(x) = \max(0, x)$.
It maps negative values to zero and keeps the non-negative values unchanged. The rectified linear unit increases the nonlinear properties of the decision function.

The Learning Rate
The learning rate is a hyperparameter of an optimization algorithm that determines the step size at each iteration while moving towards a minimum of the cost function. A learning rate that is too high can make the optimization jump over minima. Conversely, a learning rate that is too low generally takes too long to converge and can even leave the optimization stuck in a local minimum. Therefore, a trade-off must be made when selecting the learning rate for a specific problem. In this paper, a common value of 0.01 was selected for the learning rate when training the ConvNet.
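The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on $f(x) = x^2$, which is an illustrative assumption and unrelated to the ConvNet cost function, with the learning rate used in the paper (0.01) and with a deliberately excessive one (1.1).

```python
def gradient_descent(x0, learning_rate, steps):
    """Minimize f(x) = x**2 (gradient 2x) with a fixed learning rate."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


slow = gradient_descent(1.0, 0.01, 100)      # shrinks steadily towards 0
diverged = gradient_descent(1.0, 1.1, 100)   # overshoots and grows every step
print(slow, diverged)
```

On this quadratic each step multiplies $x$ by $1 - 2\eta$, so a rate of 0.01 contracts slowly while a rate of 1.1 oscillates with growing magnitude.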

Batch Normalization
Batch normalization is a method that uses re-centering and re-scaling to accelerate training and make the neural network more stable. It improves performance by smoothing the objective function.
Batch normalization fixes the means and variances of the inputs of each layer.
where $B$ denotes a mini-batch of size $m$ drawn from the entire training set, $\mu_B$ denotes the mean of mini-batch $B$, and $\sigma_B^2$ denotes the variance of mini-batch $B$. For a ConvNet whose input layer has the shape $(N_I \times H_I \times W_I \times D_I)$, the batch normalization procedure is shown in Eqs. (16) and (17), and each element in the matrix $x$ is normalized separately.
$$\hat{x}_i(j,k) = \frac{x_i(j,k) - \mu_B(j,k)}{\sqrt{\sigma_B^2(j,k) + \epsilon}}$$
where $j \in [1, W_I]$, $k \in [1, H_I]$, and $i \in [1, m]$; $\mu_B(j,k)$ and $\sigma_B^2(j,k)$ are the mean and variance of each element in the matrix $x$ respectively; $\epsilon$ is an arbitrarily small constant added for numerical stability. In the end, $\hat{x}_i(j,k)$ will have zero mean and unit variance.
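The per-element normalization can be sketched for a mini-batch of single-channel images as follows. The function name and the value of `eps` are assumptions, and the learnable scale and shift that usually follow normalization are omitted because the text does not describe them.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each pixel position across a mini-batch.

    batch: list of m images, each an H x W list of lists.
    """
    m = len(batch)
    h, w = len(batch[0]), len(batch[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(m)]
    for k in range(h):
        for j in range(w):
            values = [img[k][j] for img in batch]
            mean = sum(values) / m                           # mini-batch mean at (j, k)
            var = sum((v - mean) ** 2 for v in values) / m   # mini-batch variance at (j, k)
            scale = math.sqrt(var + eps)
            for i in range(m):
                out[i][k][j] = (values[i] - mean) / scale
    return out


normalized = batch_norm([[[1.0, 2.0]], [[3.0, 6.0]]])
print(normalized)
```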

Softmax Function
The softmax function is a multi-dimensional generalization of the logistic function, which is a common S-shaped curve. The logistic function is $f(x) = \frac{L}{1 + e^{-k(x - x_0)}}$, where $x_0$ is the value of the midpoint, $L$ is the curve's maximum value, and $k$ is the logistic steepness of the curve. When $x_0 = 0$, $L = 1$, and $k = 1$, $f(x)$ is the standard logistic function. Similarly, the softmax function takes a vector $v$ as input and normalizes it into a probability distribution. After normalization, each component lies in the range $(0, 1)$ and all components sum to 1. Each output component can be interpreted as a probability, with a larger input value corresponding to a higher probability. The softmax function $\sigma: \mathbb{R}^K \to \mathbb{R}^K$ can be defined as follows:
$$\sigma(v)_i = \frac{e^{v_i}}{\sum_{j=1}^{K} e^{v_j}}, \quad i = 1, \ldots, K$$
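A minimal sketch of this normalization in Python. Subtracting the maximum before exponentiating is a common numerical-stability measure added here as an assumption; it does not change the result.

```python
import math


def softmax(v):
    """Normalize a list of real scores into a probability distribution."""
    shift = max(v)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

In a four-class setting such as the one in this paper, the index of the largest output would be taken as the predicted class.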

Layout of Experiment Classroom
The experiment was conducted in a university classroom, whose layout is shown in Fig. 5. The MP was placed at the front of the classroom and the AP at the back. The distance between the AP and the MP was 10 m. A given number of students walked at normal speed in the aisle. The AP was controlled remotely from outside the classroom to collect CSI data.

Specification of the Experiment Device
In this experiment, one TL-WDR4310 wireless router flashed with customized OpenWRT firmware was used to collect CSI data. Tab. 4 displays the specifications of the experiment device.

Atheros-CSI-Tool
The CSI data was collected using the Atheros-CSI-Tool which is an open source 802.11n measurement and experimentation tool. Based on this tool, detailed PHY wireless communication information was extracted from the Atheros Wi-Fi NICs, including CSI, data rate, the received packet payload, RSSI, etc. All functionalities of Atheros-CSI-Tool are implemented in software without any modification of the firmware. In this experiment, Atheros-CSI-Tool was implemented in the Wi-Fi router with customized OpenWRT firmware.

Training the ConvNet Classification Model
The ConvNet is implemented using the MATLAB Deep Learning Toolbox. Fig. 6 shows the training progress, and Fig. 7 shows the algorithm for estimating the crowd count. The confusion matrix of the evaluation is shown in Fig. 8. The model achieves perfect accuracy in recognizing Classes 1 and 2 and makes only a few mistakes in distinguishing Class 3 from Class 4. The overall accuracy across all four classes is 97.8%; the accuracy is 100% for Classes 1 and 2, and 94.5% and 97.7% for Classes 3 and 4 respectively.
Two different methods are compared with our proposed method. The overall accuracy of each method is compared in the bar graph in Fig. 9. The threshold-based methods use statistical properties of the CSI amplitude, such as variance and mean, to recognize the number of people. The eigenvalue-based methods extract the first several largest eigenvalues of the correlation matrix of subcarriers. A Support Vector Machine implemented with LIBSVM is used to train and evaluate these two methods [27]. Fig. 10 shows the accuracy of recognizing each class with the different methods. The threshold-based method almost fails when deployed in different environments, except for Scenario 3. The eigenvalue-based method still performs relatively well, but its accuracy in each scenario is lower than that of our proposed method.
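For reference, overall and per-class accuracies of the kind quoted from the confusion matrix can be derived as in the sketch below. The $4 \times 4$ matrix is made-up illustrative data, not the values from Fig. 8, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
def accuracies(confusion):
    """Overall accuracy and per-class accuracy (rows = true class, assumed)."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return correct / total, per_class


# Hypothetical counts for four classes, for illustration only
example = [[10, 0, 0, 0],
           [0, 10, 0, 0],
           [0, 0, 9, 1],
           [0, 0, 2, 8]]
overall, per_class = accuracies(example)
print(overall, per_class)
```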

Conclusion
In this paper, we presented the design, implementation, and evaluation of a novel lightweight and adaptive passive crowd counting method based on ConvNet. The system addresses the challenges found in the literature such as lack of robustness, low generalization ability, and high computational cost. The main idea is to generate images with fairly low resolution from the correlation coefficient matrix and

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.