Road Object Detection Robust to Distorted Objects at the Edge Regions of Images
Introduction
Fish-eye cameras, capable of capturing wide areas, enable efficient traffic monitoring with only a few cameras. Nevertheless, it remains challenging to detect objects reliably in images from such cameras. In this work, we discuss the key reasons why object detectors frequently make incorrect predictions in such images and propose methods to address them. More specifically, we address two issues: objects appearing smaller at the edges of images, and distorted non-target objects (e.g., street signs) being misrecognized as target objects (e.g., vehicles). Furthermore, we propose a road object detector that achieves high performance by additionally applying various techniques generally known to enhance detection performance. Our proposed detector achieved second place in Track 4 of the 2024 AI City Challenge with an F1 score of 0.6196.
This work will also be presented at the CVPR 2024 Workshop on the 8th AI City Challenge.
Our code is available on GitHub.
Key Messages of the Paper
In this work, we discuss why state-of-the-art object detectors frequently make incorrect predictions on images captured by fish-eye cameras.
We address two domain-specific issues: objects appearing smaller at the edges of images, and distorted non-target objects (e.g., street signs) being misrecognized as target objects (e.g., vehicles).
Significance/Importance of the Paper
By significantly improving the accuracy of road object detection using a fish-eye camera (from an F1 score of 0.4734 to 0.6196), we contribute to enhancing the feasibility of deploying fish-eye cameras in the field to cover wide areas with fewer cameras.
Summary of Methodology
To improve the accuracy of road object detection in images captured by fish-eye cameras, we use a combination of domain-specific techniques and methods that are generally known to be effective for object detection.
Domain-specific methods
We have observed two major challenges in object detection from road images captured with fish-eye cameras. The first is that objects appear smaller at the edges of the image. The second is that the distortion of object shapes at the edges can cause non-target objects to be mistakenly recognized as target objects.
To detect such small objects effectively, we propose to use a sliced inference technique called SAHI (Slicing Aided Hyper Inference) (Akyon et al., 2022). As shown in Figure 1, the original image is partitioned into slices of a predefined size (the red box in Figure 1), and each slice is resized to the input size of the detection model before inference. SAHI enables effective detection of small objects at the inference stage for the following reasons. When the original image is smaller than the model's input size, resizing a single slice to the input size enlarges objects more than resizing the entire image would. Conversely, when the original image is larger than the model's input size, the entire image would have to be shrunk to fit; a slice, however, is usually smaller than the input size, so resizing it to the input size still enlarges each object, and even when a slice is larger than the input size, its objects end up no smaller than they would after resizing the full original image. SAHI performs inference by sliding the slice window horizontally and vertically, in a manner similar to a convolution, and aggregates the predicted boxes from all slices.
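For illustration, the snippet below shows how sliced inference can be run with the open-source sahi package. The model type, checkpoint path, slice size, and overlap ratios are placeholders rather than the exact settings used in our detector, and attribute names may vary slightly across sahi versions.

```python
# Sketch of sliced inference with the sahi package (placeholder model/paths).
from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

# Wrap a supported detector checkpoint; the model type and path are illustrative.
detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",
    model_path="weights/detector.pt",
    confidence_threshold=0.3,
    device="cuda:0",
)

# Slide fixed-size slices (with overlap) over the image, run the detector on
# each resized slice, and merge per-slice predictions back into full-image
# coordinates.
result = get_sliced_prediction(
    "fisheye_frame.jpg",
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)

for pred in result.object_prediction_list:
    print(pred.category.name, pred.score.value, pred.bbox)
```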
To address the issue of distorted objects shown in Figure 2 (a), we propose a semi-supervised learning method in which our detector also learns non-target objects. To this end, we construct a training dataset in which pseudo labels are assigned to the most frequently observed objects other than the target objects, and then train the model on this data. In other words, we enhance the model's object discrimination capability by training it not only on the target objects but also on other objects. To assign pseudo labels to as many object categories as possible, we leverage the Co-DETR (Zong et al., 2023) model trained on the large vocabulary instance segmentation (LVIS) dataset (Gupta et al., 2019), which covers a total of 1,203 object categories. Since Co-DETR currently ranks first on this dataset, we expect it to minimize label noise in the training data when assigning pseudo labels to all objects. By training our model on this pseudo-labeled data, we can prevent failures such as street signs being mispredicted as cars, as shown in Figure 2 (b).
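As a rough illustration of the pseudo-labeling step, the sketch below runs an LVIS-pretrained detector over training images and keeps confident detections whose classes are not among the challenge targets. Here `run_lvis_detector` is a hypothetical wrapper around the Co-DETR inference code, and the class names and confidence threshold are illustrative.

```python
# Sketch of pseudo-label generation for non-target classes
# (hypothetical run_lvis_detector wrapper; threshold and class names are illustrative).

TARGET_CLASSES = {"bus", "bike", "car", "pedestrian", "truck"}  # challenge targets
SCORE_THRESHOLD = 0.5  # keep only confident pseudo labels to limit noise


def build_pseudo_labels(image_paths, run_lvis_detector):
    """run_lvis_detector(path) -> list of (class_name, score, [x, y, w, h])."""
    pseudo_annotations = []
    for path in image_paths:
        for class_name, score, box in run_lvis_detector(path):
            # Skip target classes (they keep their human ground-truth labels)
            # and low-confidence detections.
            if class_name in TARGET_CLASSES or score < SCORE_THRESHOLD:
                continue
            pseudo_annotations.append(
                {"image": path, "category": class_name, "bbox": box, "score": score}
            )
    return pseudo_annotations


# The returned pseudo annotations are merged with the original ground truth so
# the detector also learns what non-target objects (e.g., street signs) look like.
```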
General method for object detection
In this work, we additionally use methods that are generally known to be effective for object detection to further improve performance.
Data augmentation: Since objects in images captured by fish-eye cameras often appear rotated, rotation augmentation is applied during the training process.
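For example, rotation augmentation that keeps bounding boxes consistent can be set up with the albumentations library as sketched below; the rotation range, probability, and dummy data are illustrative, not necessarily the settings used in our training.

```python
# Sketch of rotation augmentation with albumentations (illustrative settings).
import albumentations as A
import cv2
import numpy as np

train_transform = A.Compose(
    [
        # Fish-eye footage contains objects at arbitrary orientations,
        # so allow rotations over the full range.
        A.Rotate(limit=180, border_mode=cv2.BORDER_CONSTANT, p=0.5),
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
)

image = np.zeros((1080, 1080, 3), dtype=np.uint8)  # placeholder frame
boxes = [[500, 520, 580, 600]]                     # [x_min, y_min, x_max, y_max]
labels = [2]                                       # placeholder class id

augmented = train_transform(image=image, bboxes=boxes, class_labels=labels)
rotated_image, rotated_boxes = augmented["image"], augmented["bboxes"]
```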
Histogram equalization: This technique transforms an input image whose pixel values occupy a narrow range into a higher-contrast output image with a wider range of pixel values (see Figure 3). In other words, it spreads out the pixel distribution of predominantly dark or bright images, making them brighter or darker as appropriate. Histogram equalization is applied only during the inference stage.
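A minimal example with OpenCV is shown below. The paper does not specify the exact variant, so this sketch assumes equalization of the luminance channel of the color frame at inference time.

```python
# Sketch of histogram equalization on the luminance channel (OpenCV).
import cv2


def equalize_frame(frame_bgr):
    """Equalize only the Y (luminance) channel so colors are preserved."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)


# Applied only at the inference stage, before the frame is passed to the detector.
frame = cv2.imread("fisheye_frame.jpg")
equalized = equalize_frame(frame)
```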
Ensembling detectors
In this work, the techniques discussed above are combined in different ways to create multiple detectors, as shown in Table 1. Weighted boxes fusion (WBF) (Solovyev et al., 2021) is employed as the ensemble approach to aggregate the predicted bounding boxes from the different detectors. As shown in Figure 4, WBF computes the average coordinates of multiple bounding boxes predicting the same object to generate a single bounding box, whose confidence score is the average confidence score of the boxes used to create it.
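To illustrate the fusion step, the snippet below uses the ensemble-boxes package released alongside the WBF paper. The detector weights, IoU threshold, and box values are illustrative; note that box coordinates must be normalized to [0, 1].

```python
# Sketch of weighted boxes fusion over two detectors (ensemble-boxes package).
from ensemble_boxes import weighted_boxes_fusion

# Per-detector predictions, with box coordinates normalized to [0, 1].
boxes_list = [
    [[0.10, 0.10, 0.30, 0.30], [0.50, 0.50, 0.70, 0.70]],  # detector A
    [[0.11, 0.12, 0.31, 0.29]],                             # detector B
]
scores_list = [[0.9, 0.6], [0.8]]
labels_list = [[0, 1], [0]]

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list,
    scores_list,
    labels_list,
    weights=[1, 1],    # equal trust in both detectors (illustrative)
    iou_thr=0.55,      # boxes with IoU above this are fused into one
    skip_box_thr=0.0,  # drop boxes below this confidence before fusion
)
```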
Experimental Results
The effectiveness of domain-specific methods: As shown in Table 2, sliced inference effectively addresses the problem of small objects in the edge regions of images. Additionally, assigning pseudo labels to non-target objects and training the model on them (i.e., semi-supervised learning) helps the model better distinguish detection targets from non-targets.
Leaderboard: As shown in Table 3, we were able to achieve high performance by combining domain-specific and general methods. We achieved 2nd place in Challenge Track 4.
Conclusion
In this work, we proposed methods to address the performance issues caused by distorted or small objects at the edges of images captured by fish-eye cameras. By additionally leveraging various general methods, we were able to rank highly in Challenge Track 4. Nevertheless, our proposed detector has the drawback of requiring very high computational cost at the inference stage, because we ensembled a large number of complex models to boost performance, which is impractical for deployment. To address this problem, we plan to compress the ensembled models into a lightweight single model through techniques such as knowledge distillation and network pruning.
If you have any further inquiries about this research, please feel free to reach out to us at the following email address: 📧 contact@nota.ai.
Furthermore, if you are interested in AI optimization technologies, please visit our website at 🔗 netspresso.ai.