Cluster Self-Refinement for Enhanced Online Multi-Camera People Tracking

Jeongho Kim
Research Engineer, Nota AI

 

Summary

  • Online multi-camera system for efficient individual tracking

  • Accurate ID management with Cluster Self-Refinement (CSR)

  • Improved performance with enhanced pose estimation

 

Introduction

  • In this paper, we introduce our online MCPT methodology, which achieved third place in Track 1 of the AI City Challenge workshop at CVPR 2024. Multi-camera people tracking (MCPT) involves detecting and tracking individuals across multiple cameras to understand and analyze their movements and behaviors. The MCPT process typically follows these steps: 1) As shown in Figure 1, footage from multiple cameras is fed into a people detection model, which localizes individuals as bounding boxes and coordinates. 2) Detected individuals are assigned local IDs by a single-camera tracking algorithm, which stores appearance and location information for each ID. 3) The information from each single-camera tracker is matched across cameras to assign global IDs. A minimal sketch of this pipeline appears after this list.

  • MCPT can be classified into online and offline MCPT based on the timing of the video frames used for analysis. Online MCPT uses only past frames to analyze the current frame, making it applicable to all video sources, including real-time streams. However, an incorrect prediction can propagate into subsequent predictions, leading to continuous errors. Conversely, offline MCPT uses both past and future frames, allowing incorrect predictions to be corrected after the fact; this yields higher performance but makes it inapplicable to real-time streaming.

  • Our research proposes applying the evaluation and correction methods of offline MCPT to online MCPT. Additionally, we propose an online MCPT system whose performance is further enhanced by making full use of pose estimation models.

  • Our code is available on GitHub.
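Below is a minimal sketch of the three-step pipeline described above, written in Python. Every class and method name here (detector, sc_trackers, global_matcher, and their calls) is a hypothetical placeholder standing in for a real people detector, single-camera tracker, and cross-camera matcher, not our released API.

```python
# Hypothetical sketch of the online MCPT pipeline; the detector,
# tracker, and matcher objects are illustrative placeholders.

def run_online_mcpt(cameras, detector, sc_trackers, global_matcher):
    """cameras: synchronized per-camera frame iterators."""
    for frames in zip(*cameras):  # one frame per camera per time step
        per_camera_tracks = []
        for cam_id, frame in enumerate(frames):
            # 1) Detect people as bounding boxes in each camera view.
            boxes = detector.detect(frame)
            # 2) Assign local IDs with a single-camera tracker, which
            #    stores appearance and location info per local ID.
            tracks = sc_trackers[cam_id].update(frame, boxes)
            per_camera_tracks.append(tracks)
        # 3) Match single-camera tracks across cameras to assign global IDs.
        yield global_matcher.assign(per_camera_tracks)
```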


Figure 1. An example of multi-camera people tracking: people are tracked across multiple cameras by mapping them to the same identities. The image in the center depicts a 2D map of the scene, showing the estimated positions of the people captured by the cameras; the numbers represent their global IDs.

 

Significance/Importance of the Paper

  • Achieved high performance with an online MCPT method, placing 3rd in AI City Challenge Track 1.

  • Obtained strong performance by applying methodology from offline MCPT to online MCPT.


Figure 2. Overview of our system’s architecture.

 

Summary of Methodology

  • We participated in the 2023 AI City Challenge on the same task, where our paper ranked 10th (Kim et al., 2023). Since the method we proposed at that time was also an online MCPT system, we used it as the baseline to improve upon.

  • We propose Cluster Self-Refinement (CSR), which allows the online MCPT system to periodically re-examine the appearance features stored so far and the tracklets that have been tracked.


Figure 3. Overview of Cluster Self-Refinement. The left side depicts appearance feature refinement, which uses agglomerative clustering to check whether features of different people are stored in a single cluster tracklet and, if so, refines the features of that tracklet. The right side illustrates overlapped cluster refinement, which addresses situations where one person holds more than one global ID. The CSR procedure is carried out at regular intervals, as denoted by the red circle shown above.

1. Appearance Feature Refinement

A cluster tracklet should store the appearance features of only a single person. If the appearance features of several people are stored within one tracklet, as illustrated in Figure 3, there is a risk of an ID switch with another cluster tracklet, which would compromise the quality of future tracking. We therefore first use agglomerative clustering with cosine distance as the metric to divide the appearance features stored in the tracklet into two feature clusters. We then measure the cosine distance between the two feature clusters; if it exceeds a certain threshold, we infer that the cluster tracklet contains different people and delete the appearance features of the feature cluster that was stored later. Even if three or more different people are present in the cluster tracklet, agglomerative clustering reduces the likelihood, as shown in Figure 3, of a single person being split across both feature clusters. And even if two people remain in the surviving cluster after one feature cluster is deleted, the periodic execution of CSR ensures that, eventually, only the appearance features of a single person remain in each cluster tracklet.
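The refinement above can be expressed compactly in code. Below is a minimal sketch in Python, assuming scikit-learn is available and that each tracklet keeps its re-ID embeddings in insertion order; the threshold value and the function name are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import cosine

SPLIT_THRESHOLD = 0.4  # hypothetical cosine-distance threshold

def refine_appearance_features(features: np.ndarray) -> np.ndarray:
    """Split a tracklet's feature bank into two clusters and drop the
    later-stored cluster if the two look like different people.

    features: (N, D) re-ID embeddings in insertion (temporal) order.
    """
    if len(features) < 2:
        return features

    # 1) Divide the stored features into two clusters, using cosine
    #    distance as the clustering metric.
    labels = AgglomerativeClustering(
        n_clusters=2, metric="cosine", linkage="average"
    ).fit_predict(features)

    # 2) Measure the cosine distance between the two cluster centroids.
    c0 = features[labels == 0].mean(axis=0)
    c1 = features[labels == 1].mean(axis=0)
    if cosine(c0, c1) <= SPLIT_THRESHOLD:
        return features  # likely the same person; keep everything

    # 3) Different people: delete the cluster whose features were,
    #    on average, stored later.
    idx = np.arange(len(features))
    later = 0 if idx[labels == 0].mean() > idx[labels == 1].mean() else 1
    return features[labels != later]
```

Because the routine runs at every CSR interval, a bank that still mixes two people after one pass is split again on the next pass, which is how a tracklet eventually converges to a single identity.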

2. Overlapped Cluster Refinement

This step checks whether the person in a newly added cluster tracklet is the same as a person in a tracklet that is already being tracked. We measure both the distance between the appearance features of the newly added tracklet and those of an existing tracklet, and the distance between their mapped points in the virtual space; if these distances fall below certain thresholds, the newly added tracklet is judged to be the same person and is deleted.
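A minimal sketch of this check in Python, assuming each tracklet exposes a mean appearance feature and a 2D position on the virtual map; the Tracklet class, both threshold values, and the function name are illustrative.

```python
import numpy as np
from dataclasses import dataclass
from scipy.spatial.distance import cosine

APPEARANCE_THRESHOLD = 0.3  # hypothetical cosine-distance threshold
POSITION_THRESHOLD = 1.0    # hypothetical distance on the 2D map

@dataclass
class Tracklet:
    mean_feature: np.ndarray  # mean re-ID embedding of the tracklet
    map_position: np.ndarray  # estimated (x, y) on the 2D virtual map

def remove_overlapped_tracklets(new_tracklets, existing_tracklets):
    """Delete newly added tracklets that duplicate an existing identity."""
    kept = []
    for new in new_tracklets:
        duplicate = any(
            cosine(new.mean_feature, old.mean_feature) < APPEARANCE_THRESHOLD
            and np.linalg.norm(new.map_position - old.map_position)
                < POSITION_THRESHOLD
            for old in existing_tracklets
        )
        if not duplicate:
            kept.append(new)  # genuinely new person; keep tracking it
    return kept
```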

These two refinements are performed periodically to ensure better tracking.

  • We leveraged a pose estimation model to develop a more advanced system. By mapping individuals' positions into a virtual space, we can conclude that people at similar locations are likely the same person. Specifically, we map the foot positions of individuals visible in the current camera frame into the virtual space. However, body parts, including the feet, are often occluded by structures. Previously, the bottom of the bounding box obtained from people detection was assumed to be the foot position, which led to inaccurate mapping whenever the feet were hidden. To address this, we pre-calculated body-part length ratios using a pose estimation model and used these ratios to estimate the positions of occluded feet, resulting in more accurate mapping (see the first sketch after this list).

  • Moreover, to assign global IDs, we stored the appearance features of individuals identified as the same person across different bounding boxes, then compared these stored features with those in new frames to match identities. However, confusion could arise when a bounding box contained multiple individuals. To prevent this, our system used the pose estimation model and was configured not to store a bounding box if the number of body keypoints it contained exceeded a certain threshold, indicating the presence of more than one person (see the second sketch after this list).
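The occluded-foot estimation can be sketched as follows, assuming COCO-style keypoints (shoulders, hips, ankles) and a pre-computed average leg-to-torso length ratio; all constants and names here are illustrative, not values from the paper.

```python
import numpy as np

LEG_TO_TORSO_RATIO = 1.4  # hypothetical pre-computed body-part ratio
VIS_THRESHOLD = 0.5       # confidence needed to count a keypoint as visible

# COCO keypoint indices used below.
L_SHOULDER, R_SHOULDER = 5, 6
L_HIP, R_HIP = 11, 12
L_ANKLE, R_ANKLE = 15, 16

def estimate_foot_point(kpts: np.ndarray) -> np.ndarray:
    """kpts: (17, 3) array of (x, y, confidence) per COCO keypoint.
    Returns the (x, y) image point to project onto the 2D virtual map."""
    ankles = kpts[[L_ANKLE, R_ANKLE]]
    visible = ankles[ankles[:, 2] > VIS_THRESHOLD]
    if len(visible) > 0:
        # Feet visible: use the mean of the visible ankle positions.
        return visible[:, :2].mean(axis=0)

    # Feet occluded: extrapolate downward from the hip midpoint using
    # the torso length scaled by the pre-computed leg-to-torso ratio.
    shoulder = kpts[[L_SHOULDER, R_SHOULDER], :2].mean(axis=0)
    hip = kpts[[L_HIP, R_HIP], :2].mean(axis=0)
    torso_len = np.linalg.norm(hip - shoulder)
    return hip + np.array([0.0, torso_len * LEG_TO_TORSO_RATIO])
```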
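The keypoint-count filter is equally short. The sketch below assumes the pose model returns every keypoint it detects inside a bounding box with a confidence score, and that one person contributes at most 17 COCO keypoints; the threshold and function name are illustrative.

```python
import numpy as np

MAX_KEYPOINTS_PER_PERSON = 17  # size of one COCO skeleton
VIS_THRESHOLD = 0.5            # hypothetical confidence cutoff

def should_store_box(keypoints_in_box: np.ndarray) -> bool:
    """keypoints_in_box: (N, 3) array of (x, y, confidence) for all
    keypoints the pose model detected inside the bounding box."""
    num_visible = int((keypoints_in_box[:, 2] > VIS_THRESHOLD).sum())
    # More visible keypoints than one skeleton can contain implies the
    # box covers multiple people, so skip storing its appearance feature.
    return num_visible <= MAX_KEYPOINTS_PER_PERSON
```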

 

Experimental Results

  • As shown in Table 1, applying Cluster Self-Refinement (CSR) significantly improves performance over the baseline. In particular, AssA increases because of fewer false ID matches, as intended. Performance improves further with Enhanced Utilizing Pose Estimation (EUP).

  • We submitted our proposed system to the AI City Challenge Track 1 for public evaluation and won 3rd place out of 17 participating teams with a HOTA score of 60.93%.


Table 1. Results of the ablation study on CSR and EUP, which stand for Cluster Self-Refinement and Enhanced Utilizing Pose estimation, respectively.


Table 2. Public leaderboard for the Challenge Track 1

 

Conclusion

Since this study sought only high performance within a short development period, we used only heavy, high-performance models. For real-world use, we aim to replace them with more efficient models while developing a system that retains high performance.


If you have any further inquiries about this research, please feel free to reach out to us at the following email address: 📧 contact@nota.ai.

Furthermore, if you are interested in AI optimization technologies, you can visit our website at 🔗 netspresso.ai.
