Object Detection and Classification Using RetinaNet with Clustered Anchors and Multi-Subnets
Institute of Computer Science and Information Engineering
One-stage object detector
Deep convolutional neural network
Feature pyramid network
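Object detection and classification have long been a widely discussed topic in computer vision and have many practical applications across many fields. With the rise of deep learning in recent years, many methods that use deep convolutional neural networks for object detection and classification have been proposed. Among them, RetinaNet arguably strikes the most appropriate balance between precision and speed; however, RetinaNet focuses on improving the loss function and pays little attention to the network architecture itself. Building on RetinaNet, this research modifies its network architecture to improve its precision. Three improvements are proposed. The first is a modified Feature Pyramid Network. The advantage of the Feature Pyramid Network is that it obtains feature maps of multiple sizes from a single image, without directly rescaling the original image, in order to detect objects of different sizes. This research replaces the original element-wise addition with concatenation when combining two sets of feature maps, so that the feature information of both sets is fully preserved without mutual interference. The second is the use of the k-means algorithm to cluster the widths and heights of the ground-truth bounding boxes; the widths and heights of the resulting cluster centers are then used as the aspect ratios of the reference anchors, so that the anchor design fits the dataset more closely. The last is the use of multiple subnets to detect and classify objects on feature maps of different sizes separately. According to the experimental results, the three proposed methods do improve the mean average precision and the average recall. In addition, this research demonstrates a practical application to tool defect detection, in which patching is used to address the shortage of training data. In the testing phase, the entire stitched tool defect image is divided into a grid of cells, defect detection and classification are performed on each cell separately, and the per-cell results are finally mapped back to the original image, thereby achieving defect detection over the entire stitched tool defect image.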
Object detection and classification have long been a widely studied problem in computer vision, and many deep-learning-based methods have been proposed in recent years. Among these methods, RetinaNet arguably offers the best trade-off between detection precision and speed. RetinaNet introduced a novel loss function, Focal Loss, which forces the network to focus on hard training examples. This research instead places the emphasis on the network architecture itself.
Three modifications to the network architecture are proposed in order to achieve higher average precision and average recall. First, the Feature Pyramid Network is modified. The purpose of the Feature Pyramid Network is to produce multi-scale feature maps inside the network without explicitly resizing the input image; these feature maps are later used to detect objects of different scales. In this research, the element-wise addition inside the original Feature Pyramid Network is replaced by concatenation, a design that better preserves the feature information from both sets of feature maps. Second, the k-means algorithm is applied to the widths and heights of all ground-truth bounding boxes. The aspect ratios of the cluster centers are used as the anchor ratios, referred to as “clustered anchors” in this work, so that the anchors better suit the dataset. Finally, multiple subnets are attached to feature maps at different levels to detect and classify objects of different scales. According to the experimental results, the three modifications improve both average precision and average recall. In addition, this research presents a real-world application to tool defect detection using a patch-based method: the entire stitched image of a tool is first divided into patches, detection and classification are performed on each patch separately, and the per-patch results are then combined to form the final detection results.
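The clustered-anchor idea can be illustrated with a minimal sketch. The abstract does not state the number of clusters, the distance measure, or the initialisation, so everything below is an assumption made for illustration: the function name `cluster_anchor_ratios` is hypothetical, `k=3` is arbitrary, plain Euclidean distance is used, the centers are seeded by a farthest-point rule, and the ratio is taken as height divided by width.

```python
import numpy as np


def cluster_anchor_ratios(boxes_wh, k=3, iters=100):
    """Cluster (width, height) pairs with plain k-means.

    Returns the cluster centers and their height/width ratios,
    both sorted by ascending ratio. Illustrative sketch only.
    """
    boxes_wh = np.asarray(boxes_wh, dtype=float)

    # Farthest-point initialisation (an assumption): start from the first
    # box, then repeatedly add the box farthest from all chosen centers.
    centers = [boxes_wh[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(boxes_wh - c, axis=1) for c in centers], axis=0)
        centers.append(boxes_wh[d.argmax()])
    centers = np.array(centers)

    for _ in range(iters):
        # Assign every box to its nearest center (Euclidean distance).
        dists = np.linalg.norm(boxes_wh[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its boxes; keep it if the cluster is empty.
        new_centers = np.array([
            boxes_wh[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    ratios = centers[:, 1] / centers[:, 0]  # height / width (assumed convention)
    order = np.argsort(ratios)
    return centers[order], ratios[order]


# Toy ground-truth boxes: three wide, three square, three tall.
toy_boxes = [(100, 50), (104, 52), (96, 48),
             (60, 60), (62, 62), (58, 58),
             (40, 120), (42, 126), (38, 114)]
toy_centers, toy_ratios = cluster_anchor_ratios(toy_boxes, k=3)
# For this toy data the ratios come out as roughly 0.5, 1.0 and 3.0.
```

On a real dataset the resulting ratios would replace hand-picked anchor ratios; how many clusters to use and whether a different distance measure is preferable are choices this sketch leaves open.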
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation
1.2 Related Works
1.3 Contribution
Chapter 2 RetinaNet with Clustered Anchors and Multi-Subnets
2.1 ResNet-50
2.2 Modified Feature Pyramid Network
2.3 Clustered Anchors
2.4 Multi-Subnets
Chapter 3 Training and Inference Framework of Modified-RetinaNet
3.1 Training Process of Modified RetinaNet
3.2 Inference Process of Modified RetinaNet
Chapter 4 Experimental Results
4.1 COCO Dataset
4.2 COCO Detection Metrics: Average Precision and Average Recall
4.3 False Positive Analysis
4.4 False Negative Analysis
Chapter 5 Application on Tool Defect Detection
5.1 Tool Defect Dataset
5.2 Tool Defect Detection Framework
5.3 Tool Defect Detection Results
Chapter 6 Conclusion and Future Works