The Title of the Paper: TRASH DETECTION FOR COMPUTER VISION USING SCALED-YOLOv4 ON WATER SURFACE
Scaled-YOLOv4, introduced in 2020, is one of the best-performing object detection models, outclassing its peers on MS COCO test-dev. In this study, the proponents used Scaled-YOLOv4 as their object detection model to detect plastic and paper in the environment of the Pasig River, Philippines. The model's performance was tested using a dilapidated trash dataset. Object detection models usually have difficulty detecting objects because of deformation, occlusion, illumination conditions, and cluttered backgrounds. The proponents' Scaled-YOLOv4 model produced 63% average precision, with 67% precision for plastic and 59% precision for paper. The model can be used to detect trash materials found on the surface of the Pasig River.
CCS CONCEPTS • Computer Science • Artificial Intelligence • Machine Learning • Object Detection Model Additional Keywords and Phrases: Scaled-YOLOv4, Computer Vision, COCO Dataset
In Pasig City, Philippines, there is a river ferry service that is the only water-based transportation in Metro Manila. It spans Pinagbuhatan in Pasig City, Mandaluyong City, Makati City, and Intramuros in Manila City [1]. It is owned by a private company called SCC Nautical Transport Services Incorporated. It is more similar to a water taxi than a ferry, and other water vessels also use the Pasig River as a transportation route.
In previous years, the Pasig River was known as one of the most polluted rivers in the Philippines [2]. Trash on the water surface causes major damage to vessels passing through bodies of water [31], and repairing that damage costs a lot of money [30]. The river is now clean due to the recent efforts of the Pasig River Rehabilitation Commission [3]. A clean river will not necessarily stay clean, however, so a viable solution is automated trash detection on the surface of the water. This can be a way to minimize water debris and pollution. The detected trash can be documented on a weekly basis, and the local government can be notified if there is an alarming increase in trash on the surface of the water.
The problem with trash found on the water's surface is that the objects are deformed and their shapes change over time. For example, a plastic bottle or a paper cup can deteriorate into different forms, making object detection challenging. Many studies address object detection on the surface of the water using YOLOv3, Faster R-CNN, RetinaNet, and R-CNN [4, 10, 7, 9]. These object detectors produce acceptable results in terms of average precision.
The proponents will use Scaled-YOLOv4 since it is one of the best neural networks for object detection [6]. It produces an average precision of 55.8% on MS COCO test-dev [6]. The main focus of this study is to test the performance of the Scaled-YOLOv4 CSP configuration while using a dilapidated trash dataset. The Scaled-YOLOv4 object detection model will be used in the environment of the Pasig River, Philippines, specifically in Manila City from Santa Ana to Intramuros. The object detection model will only be used to detect two classes, plastic and paper, since these are the common types of trash found on the surface of the river.
1.1 Statement of the Problem
The main concern of this study is to test the performance of Scaled-YOLOv4-CSP configuration in detecting paper and plastic found on the surface of Pasig River, Philippines. Specifically, the study aims to answer the following questions:
1. How will the dilapidated trash dataset affect the performance of the Scaled-YOLOv4 model in detecting objects?
2. What will the Scaled-YOLOv4 model's two classes yield in terms of precision?
3. How will the environment hinder the Scaled-YOLOv4 model in detecting objects?
1.2 Objectives of the Study
The following are the objectives of the study in using Scaled-YOLOv4 in detecting paper and plastic on the surface of Pasig River, Philippines:
1. To assess the results of the Scaled-YOLOv4 model when it is using a dilapidated trash dataset.
2. To evaluate the produced average precision of each class in the Scaled-YOLOv4 model.
3. To determine what environmental factors will significantly hinder the Scaled-YOLOv4 model in detecting objects.
1.3 Significance of the Study
This study focuses on identifying trash materials found in the Pasig River, mainly in Manila City from Santa Ana to Intramuros. The proponents will use Scaled-YOLOv4, a state-of-the-art object detection model, which will be trained specifically to detect plastic and paper [6]. This study will be of significant benefit to the following:
FUTURE RESEARCHERS. This study will provide information on how the Scaled-YOLOv4 model detects paper and plastic using a dilapidated trash dataset. The data gathered will help determine the pros and cons of the overall training performance in detecting paper and plastic in the dilapidated trash dataset. The ideas presented may be used to conduct new research or to test the validity of other findings.
FUTURE DEVELOPERS. The presented data and results will enable future developers to take a different approach to training the Scaled-YOLOv4 model on a dilapidated trash dataset, informed by how the dataset and the environment hinder the performance of the model.
FUTURE CS STUDENTS. This study will be beneficial to computer science students who are leaning towards Artificial Intelligence. The information presented will give CS students insight into Scaled-YOLOv4 by providing the background, outcome, and data from using a dilapidated trash dataset, helping them gain knowledge and skills in the field of computer science.
CS PROFESSORS. The data and results of this study will help CS professors gain knowledge about the Scaled-YOLOv4 model, particularly regarding the average precision it achieves in classifying paper and plastic while using a dilapidated trash dataset.
1.4 Scope and Limitation
This study aims to use a Scaled-YOLOv4 object detection model on the surface of the Pasig River, Philippines, specifically in Manila City from Santa Ana to Intramuros. The Scaled-YOLOv4 object detection model will only detect two classes: paper and plastic. The proponents will provide the dilapidated trash dataset, consisting of 1000 images that contain plastic and paper. The model will use YOLOv4-CSP as its backbone, CSP-PANet as its neck, and YOLOv4-CSP as its head. YOLOv4-CSP is the optimal configuration for a single GPU, and it is used on medium-sized datasets. The Scaled-YOLOv4 model will be tested using video detection and image detection. The scope of this study is to test the performance of the Scaled-YOLOv4-CSP configuration when the dataset consists of dilapidated trash.
2 RELATED WORKS
This section contains several studies that discuss trash sorting using deep learning, object recognition, image classification, and damages to boats caused by water pollution. Several of these studies will contribute to the betterment of the research.
2.1 Deep Learning
The studies of Sousa et al. (2019) [14], Sandhya Devi et al. (2018) [16], Sakr et al. (2016) [19], Sudha et al. (2016) [20], and John et al. (2019) [18] all agree on the value of an automated recognition system that uses deep learning for waste detection to lessen manual work, and on utilizing artificial intelligence to classify the waste into different categories. Computer vision techniques are a very useful way to automate waste handling tasks. Sorting and recycling waste is known to benefit the environment and the economy, and automating the system makes it efficient. This approach aids in reducing pollution levels and focuses on the segregation of waste. Collin Ching (2019) [12] and Garcia (2015) [13] go deeper into the topic by explaining the algorithms behind object recognition with deep learning. The accuracy of Collin Ching's system reached 92.1%, and the efficiency of Garcia's system reached 98.33% with the use of the k-NN algorithm.
Other studies apply deep learning on different platforms. Mittal et al. (2016) [15] use a smartphone app to detect and localize regions containing garbage in user-clicked, unconstrained, geo-tagged real-world images. Their study uses the Garbage in Images dataset, and uses patch generation and the GarbNet model to optimize and improve the dataset. Ravindhiran et al. (2017) [21] integrate the algorithm into a robot, making it a house-friendly technology. Managing waste in individual homes might be relatively easy, even though it consumes a considerable amount of time. The study of Dwivedi et al. (2016) [17] discusses automated waste segregation that can be implemented widely in various places and municipal corporations, taking into consideration factors such as reduction in manpower, avoiding risk at hazardous places, improving accuracy, and increasing the speed of waste management.
2.2 YOLO (You Only Look Once)
The studies of Xuandong Xu (2018) [10] and Redmon et al. (2016) [11] discuss a new approach to object detection called You Only Look Once (YOLO). Compared to Faster R-CNN, YOLO runs a lot faster because of its simpler architecture. The YOLO architecture is similar to GoogLeNet because it took inspiration from it. It is a simple process in which a single convolutional network predicts multiple bounding boxes and their class probabilities.
Shinde et al. (2018) [21] and Bouchard (2020) [24] use YOLO as their approach to recognize specific patterns in real time. YOLO uses a single neural network that draws on the characteristics of the entire image to predict multiple boxes, each containing a specific object. Their study demonstrates that YOLO is an effective and comparatively fast method for recognition and localization on the Liris Human Activities dataset, and potentially on other datasets. Another study by Jonathan Hui (2018) [23] covers the different versions of the YOLO algorithm: YOLO, YOLOv2, and YOLOv3. Across these versions the function is the same: the network acts as both feature extractor and classifier for object detection.
Bochkovskiy et al. (2020) [5] used YOLOv4, and their study focuses on finding the optimal speed and accuracy for the object detector. The object detector that they proposed is faster in terms of FPS and more accurate. Another study by Wang et al. (2021) [6] used Scaled-YOLOv4, which improves on YOLOv4 in terms of average precision and speed.
2.3 Real Time Object Detection
The studies of Tripathi et al. (2018) [27] and Sai Chadalawada (2020) [28] use real-time object recognition to detect objects shown in a live feed. In one of the studies, a hybrid Convolutional Neural Network was proposed for more efficient object detection. The drawback of using a CNN is that it requires a large amount of data. Chadalawada's study uses YOLOv3 and concludes that it is the best algorithm for real-time detection. Another study by Devaki et al. (2019) [29] discusses various approaches to real-time object detection along with their advantages and disadvantages. They concluded that Faster R-CNN is a better option than the others they tried because it removes the bottleneck of running the Selective Search method on each image. Younis et al. (2020) [32] used a pre-trained deep learning model, MobileNet, for the Single Shot Multibox Detector (SSD), and the algorithm's precision is around 99.76% when it comes to vehicles.
2.4 Damages to Boats caused by Water Pollution
A waterway is a body of water that boats can navigate. It is important to keep waterways clean and free from water pollution. Maritime Safety Queensland (2018) [30] stated that garbage is hazardous to marine life and other users of waterways. Ropes and plastic material can get caught in propellers and block water intakes, causing major damage or even loss of income while a boat is out of service for repairs. The Division of Boating and Waterways (2018) [31] also stated that trash poses a serious threat to safety in waterways. Marine debris can wrap around boat propellers and clog boat intakes, causing costly engine damage and becoming a safety hazard.
2.5 Automated Waste Sorting
According to Kokoulin et al. (2020) [33], plastic is a recyclable material; however, just 10% of plastic waste is processed in Russia. Two methods for waste recognition are applied in two separation stages: spectrometry and computer vision. Municipal solid waste sorting systems that use near-infrared and visible-range spectrometers can substantially increase the effectiveness of the separation process in comparison with manual sorting. However, the quality of separation is still low, because the plastic cannot be separated by color using this method and a lot of impurities are included.
3.1 Machine Specifications
The proponents are going to use Google Colaboratory PRO to create their Scaled-YOLOv4 model. Google Colaboratory is a product of Google Research that allows developers to write and execute Python code. It offers pre-installed libraries, notebooks saved on the cloud, a collaboration feature, and free GPU and TPU access.
Fig. 1 Google Colab PRO GPU specification [23]
3.2 Theoretical Framework
The study is anchored on the theoretical support of Wang et al. (2021) [6]. They proposed a network scaling approach that modifies the depth, width, resolution, and structure of the network. Scaled-YOLOv4 has different scaling configurations, including YOLOv4-tiny and YOLOv4-large. The large configuration achieves a 55.5% average precision (73.4% AP50) on the MS COCO dataset at a speed of 16 FPS. The tiny configuration achieves a 22.0% average precision (42.0% AP50) at a speed of 443 FPS. Another configuration is YOLOv4-CSP, which achieves a 47.5% average precision. This is relevant to the study because it shows the superiority of Scaled-YOLOv4. The proponents used it as their object detection model because it performed well.
3.3 Conceptual Framework
The study proposes the utilization of Scaled-YOLOv4 for trash detection. The customized image dataset will be fed into the model. In the backbone, feature maps will be extracted from the image. The backbone of the model is YOLOv4-CSP, which is a convolutional neural network used for object detection. The neck sits between the backbone and the head, and it enhances the discriminability of the features. CSP (Cross Stage Partial) cuts down computation by 40% to reduce memory cost. The neck used is PAN (Path Aggregation Network), which enhances the process of instance segmentation by preserving spatial information. An SPP (Spatial Pyramid Pooling) module is inserted in the middle of the neck, increasing the receptive field without reducing the network's operation speed. The head is used for detection and classification. The configuration for the head will be YOLOv4-CSP.
Fig. 2 Conceptual Framework
3.4 Background of the COCO Dataset
Solowetz et al. (2020) [22] stated, "The COCO dataset is the gold standard benchmark for evaluating object detection models. The COCO (Common Objects in Context) dataset contains over 120,000 images for training and testing, with 80 object class labels commonly observed in everyday life. Generally, if a model does well on the COCO dataset, it is believed that the model will generalize well to new custom domains - in the spirit of YOLO9000, for 9000 objects." The proponents used the Scaled-YOLOv4 state-of-the-art object detection model.
Fig. 3 Comparison of Scaled-YOLOv4 and other state-of-the-art object detectors [6]
3.5 Proposed Method
Scaled-YOLOv4 is used in this study because it yields a higher average precision than other object detection models. On the MS COCO detection benchmark, Scaled-YOLOv4 tops EfficientDet in terms of average precision, as shown in Fig. 3. In Fig. 4, Scaled-YOLOv4 shows superiority over the other models with its 47.5% average precision [6].
Fig. 4 Comparison of Object Detectors [6]
3.6 Trash Detection
The proposed model should detect two types of dilapidated trash found on the surface of the Pasig River: paper and plastic. The dataset gathered by the proponents will be used to train the object detection model. This method can be applied to detect trash floating on the river's surface. Trash detection is challenging because the objects have deformities that change their state and size, making them hard to detect accurately. The proponents will assess the performance of the model in terms of AP (average precision), mAP@.5 (mean average precision at a 50% IoU threshold), mAP@.5:.95 (mean average precision averaged over IoU thresholds from 50% to 95%), mAP (mean average precision), recall, and precision. These metrics will determine whether the model's performance is acceptable, and they will be the basis for answering the research problem of the study. Precision measures what fraction of the predicted bounding boxes are actual objects, and recall measures what fraction of the actual objects are detected. The training loss will also be a basis for judging how the model performs after training for multiple epochs.
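To make the metrics above concrete, the following is a minimal Python sketch of IoU, precision, and recall for a single image and a single class. It is an illustration only, not the evaluation code of Scaled-YOLOv4: the box format `(x_min, y_min, x_max, y_max)`, the function names, and the greedy matching rule are assumptions, and the predictions are assumed to be already sorted by descending confidence.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds: List[Box], truths: List[Box],
                     thr: float = 0.5) -> Tuple[float, float]:
    """Match each prediction to at most one unmatched ground-truth box.

    A prediction is a true positive when its best remaining match
    has IoU >= thr; otherwise it is a false positive.
    """
    matched = set()
    tp = 0
    for p in preds:
        best_j, best_iou = -1, 0.0
        for j, t in enumerate(truths):
            if j in matched:
                continue
            v = iou(p, t)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp   # predicted boxes that are not actual objects
    fn = len(truths) - tp  # actual objects that were missed
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if truths else 0.0
    return precision, recall


# IoU thresholds averaged by mAP@.5:.95: 0.50, 0.55, ..., 0.95
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]
```

With two ground-truth boxes and three predictions, of which one matches exactly, one is shifted slightly, and one is spurious, the sketch counts two true positives at the 50% threshold but only one at a stricter 75% threshold. This is why mAP@.5:.95 is a harder metric than mAP@.5.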
3.7 Architecture of the Model
Fig. 4 Scaled-YOLOv4 Architecture [6]
YOLOv4-CSP is a convolutional neural network for object detection that extracts feature maps from the image. The purpose of using YOLOv4-CSP is to remove computational bottlenecks in the DenseNet and improve learning by passing on an unedited version of the feature map. Pooling is also applied in the network, which reduces the number of parameters and computations.
Fig. 5 Computational blocks of reversed Dark layer and reversed CSP dark layers [6].
Path Aggregation Network (PANet) is incorporated into the model because of its ability to preserve spatial information during instance segmentation. As an image passes through the neural network, its feature complexity increases while its spatial resolution decreases. PANet uses bottom-up path augmentation, which helps shorten the path by about ten layers. PANet uses the information from the Fully Convolutional Network and Fully Connected layers to provide a more accurate prediction. The PANet is then "CSP-ized" in order to reduce the memory cost by cutting the computation by 40%.
Spatial Pyramid Pooling maps an input of any size down to a fixed-size output. Feature maps that pass through it are partitioned into bins at several levels, from fine to coarse, and the features within each bin are pooled and aggregated. The result is a fixed-size output, which is then fed to the Fully Connected layers of the PANet.
Mish is a low-cost activation function that avoids saturation by being unbounded above. Saturation slows training down, so being unbounded above speeds up the training process. Being bounded below is another property, which helps with regularization. The activation function is non-monotonic, which helps preserve small negative values and stabilize the network's gradient flow. It is popular because it has outperformed other activation functions such as ReLU.
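The fixed-size property described above can be illustrated with a small Python sketch of pyramid max-pooling on a single-channel feature map. This is a simplified illustration of the general idea under stated assumptions, not the SPP module as implemented in Scaled-YOLOv4: the function name `spatial_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the use of max pooling are all choices made for the example.

```python
import math
from typing import List, Sequence


def spatial_pyramid_pool(fmap: List[List[float]],
                         levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool a 2D feature map into an n x n grid for each level n.

    The output length is sum(n * n for n in levels), regardless of
    the height and width of the input.
    """
    h, w = len(fmap), len(fmap[0])
    out: List[float] = []
    for n in levels:
        for i in range(n):
            # Row bounds of bin i: floor for the start, ceil for the end,
            # so every bin is non-empty even when h < n.
            r0 = (i * h) // n
            r1 = math.ceil((i + 1) * h / n)
            for j in range(n):
                c0 = (j * w) // n
                c1 = math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out
```

A 4 x 6 map and a 7 x 5 map both produce 1 + 4 + 16 = 21 values with the default levels, which is what allows inputs of different sizes to feed layers that expect a fixed-length input.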
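For reference, Mish is defined as mish(x) = x · tanh(softplus(x)), where softplus(x) = ln(1 + e^x). The short Python sketch below evaluates it on scalars to illustrate the properties listed above. The helper names are chosen for this example, and a real model would use the vectorized version provided by its deep learning framework.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Print a few values to show the shape of the function.
    for x in (-5.0, -3.0, -1.0, 0.0, 1.0, 5.0):
        print(f"mish({x:+.1f}) = {mish(x):+.4f}")
```

For large positive inputs the output approaches x (unbounded above), while for negative inputs it stays within a small band below zero (bounded below). The dip around x = -1 is what makes the function non-monotonic: mish(-1) is lower than mish(-3).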
The proponents will use YOLOv4-CSP from among the available head configurations, since it is intended for a medium-sized model. Object detection and classification take place in the head of the model, which also determines the model's object loss and classification loss.
3.8 Data Gathering
The photos were taken using a camera in the environment of the Pasig River, Philippines, because even though the river is already clean, people tend to throw their trash near it. The pictures of trash were not collected in a controlled environment because the amount of dilapidated trash in the Pasig River cannot be simulated. Some of the images were taken in the morning, and the others were taken in the afternoon. The images taken in the morning are affected by illumination conditions. Water lilies also cover some of the trash, which hinders the performance of the model. The images were taken at different angles to simulate various scenarios. Some of the photos come from the internet, using images from news broadcasting sites and from various local photographers who sent their pictures.