|Computers, Materials & Continua |
Binocular Vision Positioning Method for Safety Monitoring of Solitary Elderly
1School of Mechanical Engineering, Nanjing University of Science and Technology, Nanjing, 210000, China
2Department of Electrical and Electronic Engineering, Colorado State University, Colorado, United States
*Corresponding Author: Yu Wang. Email: email@example.com
Received: 26 July 2021; Accepted: 13 September 2021
Abstract: In today's society, the safety of the elderly population is becoming a pressing concern, especially for those who live alone. They face daily risks such as accidental falls or attacks. To address these problems, indoor positioning can be a critical means of monitoring their state. With the rapid development of imaging techniques, wearable and portable cameras have become very popular and can be mounted on an individual. In view of the advantages of visual positioning, the authors propose a binocular visual positioning algorithm to locate the elderly indoors in real time. In this paper, the imaging model is established with the corrected image data from the binocular camera; feature extraction is then performed to provide references for adjacent-image matching based on the binary robust independent elementary features (BRIEF) descriptor; finally, the camera movement and the state of the elderly are estimated to identify their falling risk. In the experiments, Intel RealSense D435i sensors were adopted as the binocular cameras to obtain indoor images, and three experimental scenarios were carried out to test the proposed method. The results show that the proposed algorithm can effectively locate the elderly indoors and improve the real-time monitoring capability.
Keywords: Indoor positioning; binocular vision; feature matching; solitary elderly; safety monitoring
Owing to the decline of physical function in the elderly and the influence of various chronic diseases, it is of great social significance to improve the safety monitoring of their daily behaviors, especially for those who live alone. Indoor positioning technology with high accuracy and reliability can help determine an individual's position in real time. Various technologies for indoor positioning, such as Wi-Fi, Bluetooth, ultrasonic positioning, radio frequency identification (RFID) and ultra-wideband (UWB), have been proposed in the past decades. Wi-Fi and Bluetooth mainly rely on signal strength, which is easily interfered with. Ultrasonic positioning is greatly affected by multipath effects and non-line-of-sight propagation. RFID positioning has strong anti-interference capability, but it requires a large amount of hardware, such as readers and antennas, which makes deployment complex. UWB technology [4,5] has the advantages of high positioning accuracy and strong penetration, but its delay module needs to be accurately calibrated. Vision-based technology generally employs a camera to collect visual information and applies image processing algorithms to achieve indoor positioning. Its merits of low cost, high accuracy, and robustness in complex environments make it popular in various fields.
In the development of visual positioning, Sattler et al. proposed a visual positioning method based on an image database, which can quickly recover indoor coordinates from image characteristics, but it requires a number of environmental images in advance and therefore does not adapt to unknown environments. Davison et al. proposed a real-time monocular vision SLAM system, which can perform real-time positioning in unknown environments, but it requires external information to establish the spatial scale, and its depth estimation error is still rather large. Binocular vision can obtain depth information through stereo matching. RGB-D sensing calculates the distance between an object and the camera by emitting infrared light or pulses toward the object, which reduces the scale uncertainty of monocular vision, but it is too expensive to be widely used. By contrast, a low-cost binocular scheme can achieve the same function simply by combining two cameras.
For binocular indoor monitoring of the elderly, the cameras are mounted on the individual, and the most critical part is positioning the carrier across successive images. Popular implementations of binocular positioning mainly include the optical flow method and the feature point method. The optical flow method relies on the illumination invariance assumption: it uses the optical information of the image pixels to calculate the pixel velocity between two frames, and then estimates the camera motion with epipolar geometry. In practice, however, the illumination invariance assumption does not hold completely, resulting in low accuracy. Feature-based methods mainly use point features, edge features and block features. Among them, the point feature method outperforms the other two in identification and anti-noise ability. It usually goes through feature extraction and feature matching to achieve image association [10–12]. Feature points of good quality are repeatable and unique, and these properties are essential in popular feature-point methods such as the scale-invariant feature transform (SIFT) algorithm, the speeded-up robust features (SURF) algorithm, and the oriented FAST (features from accelerated segment test) and rotated BRIEF (ORB) algorithm. The SIFT algorithm fully considers illumination, rotation, scale invariance, viewing angle changes, affine transformation, and noise stability, but its computation is too complex for real-time application. SURF is an upgrade of SIFT, which not only retains SIFT's high accuracy and robustness, but also has lower computational complexity because its descriptor dimension is lower than that of SIFT. However, SURF is still not fast enough for real-time operation.
In comparison, ORB can guarantee real-time binocular positioning. It uses an improved FAST corner [17–19] extraction algorithm to increase speed while enhancing illumination, scale and rotation invariance. In view of its good robustness and real-time capability, the ORB algorithm is adopted in this paper.
The feature matching in binocular positioning can be divided into two aspects, namely the matching of successive frames and the matching between the left and right images. For the matching of successive frames, the widely used direct matching method has a large computational burden, so this paper adopts the approximate nearest-neighbor algorithm to simplify the process. The depth information of a feature point is obtained by matching the left and right images, which is known as stereo matching and can be divided into global and local stereo matching algorithms. The global stereo matching algorithm uses the global information of the image and has a heavy computational burden, so it is not suitable for real-time application. The classic area-based local algorithm based on regional gray-level detection [21,22] has good matching accuracy and fast speed, although it is weak in anti-interference; it is adopted to perform binocular matching in this paper. With the feature matching and association of successive images, the relative position and motion status of the carrier can be estimated by solving the perspective-n-point (PnP) problem, which relates three-dimensional (3D) points to their two-dimensional (2D) projections.
Given the need for accurate, real-time stereo matching in indoor positioning for the elderly living alone, this paper proposes a binocular positioning method based on a feature point extraction algorithm that, unconventionally, uses the BRIEF descriptor to construct the cost function and obtain accurate depth information. The proposed algorithm finds the feature correspondence between two images by tracking the ORB feature points, and a PnP 3D-2D model is built to perform real-time motion estimation of the individual. As a result, the position and movement of the elderly can be monitored in a timely manner, which makes it possible to send out warning messages for potential risks. The feasibility and effectiveness of the proposed indoor positioning technique are evaluated through a series of experiments.
The paper is organized as follows. Section 2 presents the general principles of binocular positioning and the schematic process of the proposed design. Section 3 describes the feature matching, the stereo matching method, and the implementation details of the BRIEF descriptor. Section 4 verifies the proposed algorithm through a set of indoor tests. Section 5 concludes the paper.
2 Formulation of the Binocular Positioning Problem
The main flow diagram of the binocular positioning system is shown in Fig. 1. Firstly, the image sequence is acquired through the binocular camera, and stereo correction is performed on the acquired left and right images. Secondly, the local stereo matching algorithm based on the BRIEF descriptor is employed to estimate the depth information, and the ORB feature points in the image are extracted. The camera coordinate system of the first frame is used as the world coordinate system to calculate the world coordinates of the key points; the approximate nearest-neighbor algorithm is then used to match two adjacent frames, and the random sample consensus (RANSAC) algorithm is applied to eliminate, to a certain extent, the mismatches produced by the approximate nearest-neighbor algorithm. Finally, the PnP 3D-2D model is used to estimate the pose of the carrier [23,24].
The detailed algorithm diagram is shown in Fig. 2. The image is stereo corrected according to the calibrated parameters of the binocular camera, and the key points of the corrected left image are then extracted. If the current frame is the first image since the system started, the key points in the image are extracted directly, and the coordinate frame of the left camera is taken as the world frame to calculate the coordinates of the key points, the current position and the attitude matrix. Otherwise, key points are tracked in the current left image to find the pixel correspondences and estimate the real-time pose. With the obtained pose, new key points can be matched to evaluate their depth and recover their camera coordinates and world coordinates.
In the key point tracking process, a time threshold ΔT is introduced. It is assumed that the elderly generally move slowly; that is, there should be plenty of common key points in sequential images if the person is in a normal situation. If the person falls accidentally, hardly any key points can be tracked. The key point tracking between two frames separated by the threshold is therefore used to evaluate whether the person is in danger of falling.
3 Binocular Positioning Algorithms
The binocular camera model is shown in Fig. 3. The focal lengths and the imaging planes of the two cameras are the same, the two optical axes are parallel, and the pixels in each row of the image are precisely aligned.
The camera coordinate of a certain point in space is Pc(Xc, Yc, Zc), where Zc is the depth. The projections of the points on the left and right cameras are the pixel coordinates pl(ul, v) and pr(ur, v), and their coordinates in the camera coordinate system are p′l(xl, y), p′r(xr, y), respectively. The depth of the point can be solved by the principle of similar triangles:
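The similar-triangle relation can be illustrated with a minimal Python sketch. It assumes the standard rectified-stereo form Zc = f·b / (ul − ur), where the focal length f (in pixels) and the baseline b are symbols introduced here for illustration, and the numbers in the example are hypothetical.

```python
def depth_from_disparity(u_left, u_right, focal_px, baseline_m):
    """Return the depth Z_c (metres) of a point seen at columns u_left and u_right.

    Uses the standard rectified-stereo relation Z_c = f * b / (u_l - u_r).
    focal_px (f, in pixels) and baseline_m (b, in metres) are assumed symbols.
    """
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity


# Hypothetical numbers: f = 640 px, b = 0.05 m, disparity = 16 px.
print(depth_from_disparity(336.0, 320.0, 640.0, 0.05))  # 2.0
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters much more for distant points than for near ones, which is why the accuracy of the stereo matching step is emphasized later.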
Based on this model, the stereo alignment, feature extraction, feature point matching and motion estimation algorithms are developed in turn to achieve indoor positioning of the elderly. In addition, the feature point matching algorithm is employed to monitor falls.
3.1 Stereo Alignment Algorithm
In actual situations, it is often difficult for two cameras to achieve ideal coplanar and line alignment conditions. Generally, the rotation matrix R and the displacement vector t of the right camera relative to the left camera are used to represent the relative pose between the two cameras. For a spatial point Pw(Xw,Yw,Zw), the coordinates in the left and right camera coordinate systems are Pcl(Xcl,Ycl,Zcl) and Pcr(Xcr,Ycr,Zcr), respectively. Then the relationship between Pcl and Pcr can be expressed as
The expressions of Pcl and Pcr relative to the world coordinate system coordinate Pw are:
In Eq. (4), Rwcl, twcl and Rwcr, twcr are the external parameters of the left and right cameras, respectively. Combining the three formulas gives:
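The three relations referred to in this passage can be written out as follows. This is a reconstruction from the surrounding definitions, not the original typesetting, and it assumes the convention that R and t map left-camera coordinates into right-camera coordinates; the equation numbers follow the references in the text.

```latex
\begin{align}
P_{cr} &= R\,P_{cl} + t \tag{3}\\
P_{cl} &= R_{wcl}\,P_w + t_{wcl}, \qquad P_{cr} = R_{wcr}\,P_w + t_{wcr} \tag{4}\\
R &= R_{wcr}\,R_{wcl}^{-1}, \qquad t = t_{wcr} - R\,t_{wcl} \tag{5}
\end{align}
```

Eq. (5) follows by solving the first relation of Eq. (4) for P_w, substituting it into the second, and comparing the result with Eq. (3).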
Because of projection errors, the rotation matrix R and displacement vector t calculated from each pair of points are different. To alleviate this problem, the camera is calibrated with the median of the R and t values calculated for each group of images used as the initial value of a maximum likelihood estimation. Accurate calibration results are then obtained by minimizing the reprojection error.
where E is the re-projection error, K is the internal parameter of the camera, k1, k2, k3 are the radial distortion parameters, and p1, p2 are the tangential distortion parameters.
With the estimated rotation matrix R and displacement vector t of the right camera relative to the left camera, the binocular camera is rectified. R is decomposed into two matrices rl and rr according to Eq. (7), where rl and rr are the half-angle rotation matrices of the left and right cameras, respectively, and rotate in opposite directions.
Afterwards, the left and right image planes are rotated by rl and rr, respectively, to achieve co-planarity; the effect is shown in Fig. 4.
However, the two image planes are not yet aligned with the baseline, so the transformation matrix Rrect = [e1T, e2T, e3T]T is constructed from the displacement vector t = [tx, ty, tz]T to align them, where e1 = t/||t||, e2 = [−ty, tx, 0]T/sqrt(tx² + ty²), and e3 = e1 × e2.
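The construction of Rrect from the displacement vector can be sketched in a few lines of plain Python. This is only an illustration under the assumption that e2 is normalized by sqrt(tx² + ty²) and that e3 is the cross product of e1 and e2; the baseline values are hypothetical.

```python
import math


def rectifying_rotation(t):
    """Build R_rect = [e1^T; e2^T; e3^T] from the baseline vector t = (tx, ty, tz).

    Assumes tx and ty are not both zero.
    """
    tx, ty, tz = t
    norm_t = math.sqrt(tx * tx + ty * ty + tz * tz)
    e1 = (tx / norm_t, ty / norm_t, tz / norm_t)   # unit vector along the baseline
    norm_xy = math.hypot(tx, ty)
    e2 = (-ty / norm_xy, tx / norm_xy, 0.0)        # orthogonal to e1 and to the optical axis
    e3 = (                                         # e3 = e1 x e2 completes the basis
        e1[1] * e2[2] - e1[2] * e2[1],
        e1[2] * e2[0] - e1[0] * e2[2],
        e1[0] * e2[1] - e1[1] * e2[0],
    )
    return [e1, e2, e3]


def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return tuple(sum(row[i] * v[i] for i in range(3)) for row in m)


# Hypothetical, slightly misaligned 5 cm baseline.
t = (0.05, 0.002, -0.001)
print(mat_vec(rectifying_rotation(t), t))  # approximately (|t|, 0, 0)
```

Applying Rrect to t leaves only an x-component, which is the sense in which the rotated image rows become parallel to the baseline.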
Finally, the transformation matrix of the stereo correction can be expressed as
3.2 Feature Extraction Algorithm
In this paper, the feature matching of adjacent images is completed by extracting ORB features, each of which is composed of a key point and a descriptor. Accordingly, the feature detection algorithm is divided into two parts, namely key point extraction and descriptor construction. The key point extraction algorithm, known as oFAST, evolved from the FAST algorithm; it introduces an image pyramid and the gray-scale centroid method to give FAST features scale and rotation invariance.
The main steps of the oFAST algorithm are presented as follows:
Step 1: A circle with a radius of three pixels is drawn with point P as the center. The boundary of the circle passes through 16 pixels, marked as P1∼P16, and the brightness value of point P is Ip, as shown in Fig. 5.
Step 2: Set the brightness threshold t. The brightness of the 1st, 5th, 9th, and 13th pixels on the circle is checked first. Only if at least three of these four pixels are brighter than Ip+t or darker than Ip−t can the current pixel be a corner point. If 12 contiguous pixels on the circle are then found to be brighter than Ip+t or darker than Ip−t, point P is recorded as a candidate feature point.
Step 3: Repeat the above two steps and perform the same operation on all pixels.
Step 4: Remove locally dense candidate feature points. The FAST score V of each candidate point is calculated through Eq. (9).
Here, In is the brightness of a pixel on the circle, Sb is the set of pixels whose brightness is greater than Ip+t, and Sl is the set of pixels whose brightness is less than Ip−t. The scores V of adjacent candidate feature points are then compared; the candidate with the larger V is kept as a key point, and the candidates with smaller V are removed.
Step 5: Construct a Gaussian pyramid to give the key points scale invariance. Set the number of pyramid levels n = 8 and the scale factor s = 1.2. The original image is scaled, and the pixel value I′ of the image in each layer is obtained with Eq. (10):
Step 6: The direction vector is constructed by the gray-centroid method to strengthen the rotation invariance of the key points. In the neighborhood image block of a key point, the moment mpq of the image block is defined as:
The centroid C of the image block can be determined from the moments:
Connecting the geometric center O and the centroid C of the image block gives a direction vector OC, and the direction of the feature point can be defined as:
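The segment test of Steps 1 and 2 and the gray-centroid direction of Step 6 can be sketched as follows. This is a minimal illustration in plain Python rather than the implementation used in the paper: the image is a list of rows, the direction is taken as atan2(m01, m10) with moments measured about the patch center, and non-maximum suppression and the pyramid (Steps 4 and 5) are omitted.

```python
import math

# Offsets (dx, dy) of the 16 pixels P1..P16 on a radius-3 circle around P.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def is_fast_corner(img, x, y, t, n_contiguous=12):
    """Segment test for pixel (x, y), which must lie at least 3 px from the border."""
    ip = img[y][x]
    ring = [img[y + dy][x + dx] for dx, dy in CIRCLE]
    # Quick rejection on P1, P5, P9, P13: at least three must be brighter or darker.
    quick = [ring[i] for i in (0, 4, 8, 12)]
    if sum(v > ip + t for v in quick) < 3 and sum(v < ip - t for v in quick) < 3:
        return False
    for passes in (lambda v: v > ip + t, lambda v: v < ip - t):
        flags = [passes(v) for v in ring]
        run = 0
        for flag in flags + flags:       # doubled list handles runs that wrap around
            run = run + 1 if flag else 0
            if run >= n_contiguous:
                return True
    return False


def orientation(patch):
    """Gray-centroid direction theta = atan2(m01, m10), moments about the patch center O."""
    h, w = len(patch), len(patch[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    m10 = sum((x - cx) * patch[y][x] for y in range(h) for x in range(w))
    m01 = sum((y - cy) * patch[y][x] for y in range(h) for x in range(w))
    return math.atan2(m01, m10)


# A dark pixel surrounded by bright pixels passes the test; a flat image does not.
img = [[200] * 7 for _ in range(7)]
img[3][3] = 10
print(is_fast_corner(img, 3, 3, 20))
```

The quick check on four pixels never rejects a true corner, because any 12 contiguous pixels out of 16 contain at least three of P1, P5, P9 and P13.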
After the oFAST key points are extracted, the descriptor of each point is calculated using an improved BRIEF. BRIEF is a binary descriptor whose vector consists of 0s and 1s, each encoding the brightness relationship between two pixels near the key point (denoted p and q): if p is brighter than q, the bit is 1; otherwise, it is 0. If 128 pairs (p, q) are taken in a circular neighborhood of a certain radius centered on the key point, a 128-dimensional vector of 0s and 1s is obtained. BRIEF uses randomly selected point pairs for comparison, so it is very fast to compute and compact to store, which makes it well suited to real-time image matching. In ORB, the rotation-aware BRIEF descriptor improves on BRIEF by adding a rotation factor derived from the key point direction.
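A minimal sketch of the plain BRIEF test described above is given below. The sampling pattern (random offsets in a square window), the radius and the seed are illustrative assumptions, and the rotation step of ORB is not included.

```python
import random


def make_pairs(n_pairs=128, radius=8, seed=0):
    """Hypothetical sampling pattern: n_pairs random offset pairs (p, q) near the key point."""
    rng = random.Random(seed)

    def offset():
        return (rng.randint(-radius, radius), rng.randint(-radius, radius))

    return [(offset(), offset()) for _ in range(n_pairs)]


def brief(img, x, y, pairs):
    """Return the descriptor as a list of bits: 1 if I(p) > I(q), else 0."""
    bits = []
    for (px, py), (qx, qy) in pairs:
        bits.append(1 if img[y + py][x + px] > img[y + qy][x + qx] else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors of equal length."""
    return sum(x != y for x, y in zip(a, b))


# Synthetic 32 x 32 image and a uniformly brightened copy of it.
img = [[(7 * x + 13 * y) % 256 for x in range(32)] for y in range(32)]
brighter = [[value + 40 for value in row] for row in img]
pairs = make_pairs()
print(hamming(brief(img, 16, 16, pairs), brief(brighter, 16, 16, pairs)))  # 0
```

The descriptor is unchanged by a uniform brightness offset because only the ordering of pixel pairs is stored, which is the property exploited later when BRIEF replaces raw brightness in the stereo cost function.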
3.3 Feature Point Matching Algorithm
For the feature point matching of two images, directly comparing the Hamming distances of all feature points imposes a heavy computational load, which hardly satisfies the real-time requirement. The approximate nearest-neighbor algorithm integrated in the open-source Fast Library for Approximate Nearest Neighbors (FLANN) is faster and suits real-time use, but some mismatches may occur. The RANSAC algorithm can eliminate the mismatches by robustly estimating the homography matrix. The homography matrix is a transformation matrix that describes the mapping between corresponding points of two images of a plane; it is defined as follows [25,26]:
where (u, v, 1)T and (u′, v′, 1)T are the homogeneous pixel coordinates of a pair of corresponding points, and H is the homography matrix.
Each pair of points provides two equations, and H, being defined only up to scale, has 8 degrees of freedom, so at least 4 pairs of matching points are needed to determine it. The actual number of initial matching pairs is much greater than 4, so RANSAC is used to obtain the solution. The main steps are as follows:
Step 1: Randomly select 4 pairs of matching points to fit the model (that is, estimate the homography matrix H);
Step 2: Because of matching errors, the data points fluctuate to some extent. Assuming an error envelope δ and taking the matrix H from Step 1 as the benchmark, calculate the residual matching error of every pair, find the points within the error envelope, and record their number n;
Step 3: Randomly select 4 pairs of points again, and repeat Step 1 and Step 2 until the iteration stops;
Step 4: Select the homography matrix H with the largest n.
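The four steps above can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the paper: H is normalized so that its last entry is 1, the 8 × 8 linear system of each 4-pair sample is solved by Gaussian elimination, and the iteration count and the envelope δ (in pixels) are hypothetical defaults.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        if abs(m[piv][c]) < 1e-12:
            return None                       # degenerate sample, e.g. collinear points
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_homography(pairs):
    """Estimate H (last entry fixed to 1) from exactly 4 pairs ((u, v), (u2, v2))."""
    a, b = [], []
    for (u, v), (u2, v2) in pairs:
        a.append([u, v, 1, 0, 0, 0, -u * u2, -v * u2])
        b.append(u2)
        a.append([0, 0, 0, u, v, 1, -u * v2, -v * v2])
        b.append(v2)
    h = solve(a, b)
    return None if h is None else [h[0:3], h[3:6], h[6:8] + [1.0]]


def apply_homography(h, p):
    """Map pixel p = (u, v) through H and dehomogenize."""
    u, v = p
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        return (float("inf"), float("inf"))
    return ((h[0][0] * u + h[0][1] * v + h[0][2]) / w,
            (h[1][0] * u + h[1][1] * v + h[1][2]) / w)


def ransac_homography(matches, delta=2.0, iterations=200, seed=0):
    """Return (best H, inlier matches) following Steps 1 to 4."""
    rng = random.Random(seed)
    best_h, best_inliers = None, []
    for _ in range(iterations):                               # Step 3: repeat
        h = fit_homography(rng.sample(matches, 4))            # Step 1: fit on 4 pairs
        if h is None:
            continue
        inliers = []
        for src, dst in matches:                              # Step 2: count inliers
            pu, pv = apply_homography(h, src)
            if (pu - dst[0]) ** 2 + (pv - dst[1]) ** 2 <= delta ** 2:
                inliers.append((src, dst))
        if len(inliers) > len(best_inliers):                  # Step 4: keep the largest n
            best_h, best_inliers = h, inliers
    return best_h, best_inliers
```

Matches that fall outside the envelope of the best H are treated as mismatches and discarded; in practice the final H is often refitted on all inliers, which is omitted here for brevity.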
When tracking the feature points in successive frames, a time threshold is set, and the number of feature points tracked across this interval is counted to evaluate the falling risk of the elderly. The elderly are assumed to move slowly in general, so if there are plenty of common feature points in successive frames, the person is considered safe; otherwise, if there are few common feature points, the person is at risk of falling.
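The decision rule can be sketched as a simple count of common key points between two frames sampled one time threshold apart. The function name, the use of key point identifiers, and the minimum count of 20 are hypothetical choices made for illustration; the text does not state a numerical threshold.

```python
def falling_risk(ids_previous, ids_current, min_common=20):
    """Flag a possible fall when too few key points are tracked across the interval.

    ids_previous and ids_current are identifiers of the key points seen in the two
    frames; min_common is a hypothetical threshold, not a value from the paper.
    Returns (at_risk, number_of_common_points).
    """
    common = len(set(ids_previous) & set(ids_current))
    return common < min_common, common


# Slow walking: most key points survive. Sudden fall: almost none do.
print(falling_risk(range(0, 300), range(40, 340)))    # (False, 260)
print(falling_risk(range(0, 300), range(295, 600)))   # (True, 5)
```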
For matching corresponding feature points between the left and right images, this paper uses an area-based algorithm built on the BRIEF descriptor. The traditional block matching algorithm is shown in Fig. 6. After stereo correction, the left image is used as the reference; the pixel coordinate of a key point is (ul, v) and its gray level is I(ul, v). The M×N area centered on this key point (denoted window W) is taken as the matching unit. According to the binocular imaging model, the corresponding point (ur, v) in the right image must lie to the left of the key point's column, that is, ur ≤ ul. Starting from the point in the right image with the same coordinates as the key point in the left image, the window W is slid from right to left along the row and compared pixel by pixel to calculate the image similarity. The point with the largest similarity value is regarded as the matching point.
Because of occlusion, a key point in the left image may have no matching point in the right image. In the matching process, the sum of absolute differences S1 is used as the similarity function to measure how well two points and their surrounding windows match; it is expressed in Eq. (15).
Here, Il and Ir are the brightness values of the left and right pixels, respectively. However, pixel brightness is susceptible to illumination changes and other external interference, which may introduce errors and reduce the matching accuracy. Since the BRIEF descriptor has already been obtained while extracting the ORB feature points, the BRIEF vector is used as the feature information instead of the brightness value to construct the cost function, which helps to improve the matching accuracy. The similarity measurement function S2 is shown in Eq. (16):
where L and R refer to the BRIEF features of the left and right pixels. When the similarity measurement function reaches its peak, the matching ends and the matching point is obtained.
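The row search with a brightness-based cost and a BRIEF-based cost can be compared in a short sketch. Several points are assumptions made for illustration: both measures are treated as costs to be minimized, the BRIEF cost is taken to be the Hamming distance between the two descriptors, the descriptor is computed directly at each candidate column, and the sampling pattern is random.

```python
import random


def random_pairs(n_pairs, radius, seed=0):
    """Hypothetical BRIEF sampling pattern: n_pairs random offset pairs (p, q)."""
    rng = random.Random(seed)

    def offset():
        return (rng.randint(-radius, radius), rng.randint(-radius, radius))

    return [(offset(), offset()) for _ in range(n_pairs)]


def sad_cost(left, right, ul, ur, v, half):
    """S1-style cost: sum of absolute brightness differences over the window W."""
    return sum(abs(left[v + dy][ul + dx] - right[v + dy][ur + dx])
               for dy in range(-half, half + 1)
               for dx in range(-half, half + 1))


def brief_bits(img, u, v, pairs):
    """BRIEF bits at pixel (u, v): 1 if I(p) > I(q), else 0."""
    return [1 if img[v + py][u + px] > img[v + qy][u + qx] else 0
            for (px, py), (qx, qy) in pairs]


def brief_cost(left, right, ul, ur, v, pairs):
    """Assumed S2-style cost: Hamming distance between the two BRIEF vectors."""
    a, b = brief_bits(left, ul, v, pairs), brief_bits(right, ur, v, pairs)
    return sum(x != y for x, y in zip(a, b))


def best_match(cost_of_column, ul, max_disparity, margin):
    """Scan ur = ul, ul - 1, ... from right to left and keep the lowest-cost column."""
    columns = range(ul, max(margin, ul - max_disparity) - 1, -1)
    return min(columns, key=cost_of_column)
```

For a right image that is a shifted and uniformly brightened copy of the left image, the brightness-based cost at the true match grows with the brightness offset, whereas the BRIEF-based cost stays at zero, which illustrates the robustness argument made in the text.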
3.4 Motion Estimation Algorithm
With the key points tracked, their matching relations can be used to estimate the camera motion. After stereo matching, the three-dimensional camera coordinates Pc of the key points are obtained. For the (k−1)-th frame, whose external parameters are known, the world coordinates of the key points can be recovered accordingly. The key points are then tracked to obtain their pixel coordinates in the subsequent frames. From the correspondence between the pixel coordinates and the world coordinates, the camera pose at the k-th frame can be recovered. The camera motion recovery is therefore a 3D-2D perspective-n-point (PnP) problem.
The reprojection error is obtained by comparing the observed pixel coordinates (after matching) with the projection of the 3D point under the current pose estimate, and a nonlinear optimization is adopted to minimize it. As shown in Fig. 7, p1 and p2, found through feature point matching, are the projections of the same spatial point P on the previous and the next image, while the pose of the camera is unknown. After substituting the initial pose, there is a certain distance between the projection p2^ of P on the next frame and the actual observation p2. The pose of the camera therefore needs to be adjusted to reduce this difference. Because many points are involved, the error of each individual point can hardly be reduced to zero, so the sum of the errors is minimized.
In Fig. 7, the homogeneous coordinates of the spatial point P are P = [X, Y, Z, 1]T, the pixel coordinates of its projection in image I1 are p1 = [u1, v1]T, the pixel coordinates of its reprojection in image I2 are p2^ = [u2′, v2′]T, and the observation of P in image I2 is p2 = [u2, v2]T, so e = p2 − p2^ represents the reprojection error. The ideal reprojection process is expressed by Eq. (17):
where s2 represents the depth of the spatial point P in the camera coordinate system of image I2, K represents the camera internal parameters, exp(ξ^) represents the pose transformation matrix of the camera from image I1 to image I2, which can also be denoted by T, and ξ represents the Lie algebra element corresponding to T. In practice, the reprojection usually deviates from the observed value. This error is defined in Eq. (18):
More than one feature point is usually observed in a camera pose. Assuming that there are N feature points, finding the camera pose ξ becomes a least squares problem:
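The cost minimized in this least squares problem can be sketched as follows. This is an illustrative evaluation of the cost for one candidate pose, not an optimizer: the pose is passed as an explicit rotation matrix R and translation t instead of the Lie algebra element ξ, the intrinsics are given as (fx, fy, cx, cy), the factor 1/2 is an assumed convention, and all numbers are hypothetical.

```python
def project_point(K, R, t, P):
    """Pin-hole projection of the 3D point P under the pose (R, t); K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    xc = [sum(R[i][j] * P[j] for j in range(3)) + t[i] for i in range(3)]
    return (fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy)


def reprojection_cost(K, R, t, points_3d, pixels_2d):
    """Half the sum of squared reprojection errors e_i = p_i - p_i_hat over all points."""
    total = 0.0
    for P, (u, v) in zip(points_3d, pixels_2d):
        u_hat, v_hat = project_point(K, R, t, P)
        total += (u - u_hat) ** 2 + (v - v_hat) ** 2
    return 0.5 * total


# Hypothetical intrinsics and points; observations generated with the identity pose.
K = (600.0, 600.0, 320.0, 240.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.1, -0.2, 2.0), (-0.3, 0.1, 3.0), (0.4, 0.3, 2.5)]
observed = [project_point(K, identity, (0.0, 0.0, 0.0), P) for P in points]
print(reprojection_cost(K, identity, (0.0, 0.0, 0.0), points, observed))   # 0.0
print(reprojection_cost(K, identity, (0.05, 0.0, 0.0), points, observed))  # larger than 0
```

A nonlinear solver would adjust the pose iteratively until this cost stops decreasing.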
4 Experimental Results and Analysis
The experimental platform is a laptop computer (Lenovo Xiaoxin Air15) running Ubuntu 16.04 in a VirtualBox virtual machine. The camera is Intel's RealSense D435i, which has a global shutter, a frame rate of 30 fps, an image resolution of 1280 × 720, and a baseline of 5 cm. The experimental scene is indoors, and the sensor is handheld and moved through the scene to estimate the pose. The camera installation and its connection with the computer are shown in Fig. 8.
4.1 ORB Feature Matching Experiments
During the experiment, ORB feature points were extracted and matched on two adjacent frames. Fig. 9(a) shows the result of the approximate nearest-neighbor algorithm alone, where mismatches have not been eliminated, and Fig. 9(b) shows the result of combining the approximate nearest-neighbor and RANSAC algorithms, where mismatches have been eliminated. There are many matching errors, circled by the red boxes, in Fig. 9(a), but fewer mismatches in the same area in Fig. 9(b). This illustrates that the combined algorithm has better matching ability, which benefits the accuracy of the binocular positioning.
Then, the hand-held camera performed a slow linear motion with a sampling interval of 1 s. The feature tracking between the two frames, shown in Fig. 10a, clearly indicates that plenty of key points were tracked.
Afterwards, the hand-held camera was swung quickly to simulate an elderly person falling, and two image frames were sampled 1 s apart. The result in Fig. 10b shows that almost no key points are tracked in the latter image, indicating that the two images have almost no common view, which corresponds to an abnormal movement of the elderly. In this case, it can be judged that the person might be in danger of falling.
4.2 Indoor Positioning Experiment
The positioning experiment was also carried out indoors and included linear reciprocating motion and arbitrary motion, to evaluate the positioning capability of the binocular scheme designed in this paper.
4.2.1 Linear Motion Scenario
In this scenario, the handheld camera was kept at a constant height while moving in a straight line between two points at three different distances: 1 m, 3 m, and 5 m. The camera coordinate system of the first image after system initialization was defined as the world coordinate system. The camera coordinate system was defined as follows: the facing direction of the camera lens was the positive z-axis, the x-axis pointed to the right of the camera, and the y-axis completed the right-handed coordinate system with the x- and z-axes, as shown in Fig. 8. The starting point coordinates were (0, 0) m, and the end point coordinates were (0, 1) m, (0, 3) m, and (0, −5) m. The images collected during the experiments are shown in Fig. 11.
The trajectories of the linear motion at the three distances are shown in Fig. 12. The positioning results are shown in Tab. 1.
It can be seen from Tab. 1 that when the carrier performs linear reciprocating motions of different lengths, the positioning error is within 65 cm. As the running length increases, the positioning error does not diverge significantly.
4.2.2 Arbitrary Movement Scenario
In this scenario, the camera was held and moved arbitrarily indoors before returning to the start point. The result of ORB feature point extraction on the collected images is shown in Fig. 13, and the pose estimated by the visual odometer is shown in Fig. 14. Taking the start position as (0, 0, 0) m, the calculated end position is (−0.03765, −0.1251, −0.01099) m, so the positioning error is 0.1311 m.
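The reported error is the Euclidean distance between the estimated end position and the start point, as the following short check illustrates (the coordinates are taken from the text above).

```python
import math

# Estimated end position in metres; the true end position is the start point (0, 0, 0).
end = (-0.03765, -0.1251, -0.01099)
error = math.sqrt(sum(c * c for c in end))
print(round(error, 4))  # 0.1311
```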
Combining the experimental results of the linear reciprocating motion and the arbitrary trajectory motion in the room, it can be seen that the binocular positioning algorithm designed in this paper has good positioning capability in indoor environments. The fluctuation along the y-axis is mainly caused by the up-and-down motion of the human body while walking.
5 Concluding Remark
Addressing the pressing concern of indoor monitoring of the elderly living alone, this paper proposed a positioning algorithm based on a binocular visual scheme that uses feature extraction, feature matching, and motion estimation to obtain an accurate indoor location of the elderly. The work focuses on and modifies the feature matching. On the one hand, the RANSAC algorithm is adopted to eliminate the mismatches caused by the approximate nearest-neighbor method; on the other hand, a cost function based on the BRIEF descriptor is proposed to improve the stereo matching accuracy. On this basis, the feature point comparison of two image frames within a certain time interval is used to determine whether the elderly person is in danger of falling. Three sets of experiments were carried out to verify the feasibility of the proposed method. The feature matching experiment shows intuitively that the RANSAC algorithm can effectively eliminate mismatches; the contrast between the walking and falling situations in the feature matching experiment also demonstrates the tracking behavior of two images within the designed time threshold and verifies the feasibility of the falling danger evaluation; furthermore, the effectiveness and accuracy of the improved method with the BRIEF descriptor are verified by the indoor positioning experiments in different situations. It is worth mentioning that position and attitude drifts occurred due to accumulated errors in the measurement system. Further study will address this issue by adding auxiliary navigation, for instance an inertial measurement unit, to improve accuracy.
Funding Statement: This work was supported by the National Natural Science Foundation of China (No. 61803203).
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|