In our algorithm work, we train a variety of models. When a model performs poorly in practice, there are many possible causes: an unreasonable model structure, a poorly chosen loss function, or badly tuned hyperparameters. Beyond these, however, I believe the core reason is the quality of the data itself.
I believe every developer in the autonomous driving industry understands this deeply. In a CVPR presentation, the Lyft team put it plainly: "high-quality labeled data is the key". That is also the theme of this article.
Let's get an intuitive feel for the influence of annotation quality on model performance through an experiment.
Experiment topic: a comparative experiment on how labels of different quality affect model performance.
The experimental framework is as follows.
The left side of the figure shows the training process, and the right side shows the testing process.
The logic of the training part:
First, the images of the KITTI dataset are matched with the original labels to obtain an "original KITTI" dataset, and matched with the Graviti labels to obtain a "Graviti KITTI" dataset. The two datasets are then used to train two 2D object detection models, both using the classic Faster R-CNN architecture; we call them the original model and the Graviti model.
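The pairing step can be sketched as follows. This is a minimal illustration, not the exact pipeline used in the experiment: the function name and the dictionary layout (image ID mapped to its label records) are hypothetical.

```python
def build_dataset_variants(image_ids, original_labels, graviti_labels):
    """Pair each image with both label sources, skipping images
    that are missing from either annotation set.

    original_labels / graviti_labels: dicts mapping image ID -> label records.
    Returns two parallel (image_id, labels) lists, one per label source.
    """
    original_kitti, graviti_kitti = [], []
    for img in image_ids:
        if img in original_labels:
            original_kitti.append((img, original_labels[img]))
        if img in graviti_labels:
            graviti_kitti.append((img, graviti_labels[img]))
    return original_kitti, graviti_kitti
```

Each list then feeds an otherwise identical Faster R-CNN training run, so that any performance gap can be attributed to the labels alone.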
The logic of the testing part:
Two accurately labeled third-party datasets, Waymo and Cityscapes, are used to test the two models.
First, let's introduce the training and test sets used in this experiment.
(1) The Waymo dataset is a multi-sensor autonomous driving dataset collected by Waymo's self-driving vehicles. Its camera data is collected continuously, and one hundred scenes provide 2D box annotations.
(2) The KITTI dataset is a vision benchmark for autonomous driving scenes jointly released by the Karlsruhe Institute of Technology and the Toyota Technological Institute at Chicago. It includes sub-datasets such as 2D box annotations and 3D point cloud data; this experiment mainly uses KITTI's 2D box annotations.
(3) The Cityscapes dataset, published by the Cityscapes team, is dedicated to semantic understanding of urban street scenes. It covers street scenes from 50 German cities and provides 5,000 finely labeled images with combined semantic and instance segmentation annotations.
Let's take a look at the specific results.
(1) This is a visualization of the 2D box annotations in the Waymo test set. We can see that the annotation boxes fit the targets very well.
(2) This is a training sample from KITTI's 2D object detection set. The red boxes are KITTI's original labels, and the blue boxes are Graviti's annotations. You can see that the blue boxes are consistently more accurate than the red ones.
(3) The following figure is a Cityscapes sample. In the experiment, we need to convert each instance's pixel mask into a 2D box, as shown by the white boxes in the figure.
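The mask-to-box conversion can be sketched as below: take the extreme coordinates of the mask's nonzero pixels. The function name is our own; it is a straightforward implementation of the conversion described above.

```python
import numpy as np

def mask_to_box(mask):
    """Convert a binary instance mask (H, W) to a 2D box
    (x_min, y_min, x_max, y_max) in pixel coordinates."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None  # empty mask: no box to derive
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```

Applying this to every instance in a Cityscapes annotation yields the 2D boxes used for testing.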
The experimental results are as follows.
These are the test results of the two models on the Waymo dataset. There are boxes of three colors in the figure: red is the ground truth, blue is the prediction of the Graviti model, and green is the prediction of the original model.
From the image, we can see intuitively that for the two white cars, the blue boxes are closer to the red ground-truth boxes than the green ones are.
These are the test results of the two models on Cityscapes. Again, the red box is the ground truth, the blue box is the Graviti model's prediction, and the green box is the original model's prediction. Take the second car on the right, the silver SUV, as an example: at the front of the car, the blue box sits slightly inside the red box, while the green box extends further outside it.
The above are the visual test results; next we look at the experiment from a quantitative perspective, with the help of PR curves.
A brief introduction to the PR curve:
The vertical axis of a PR curve is precision, and the horizontal axis is recall. Together, these two metrics comprehensively evaluate the prediction quality of a model.
Precision answers "how many of the predicted 2D boxes are correct and valid". Of course, "correct" is a flexible notion; for example, we may decide that a prediction counts as correct as long as its IoU with the ground truth exceeds 0.5. Recall answers "how many of all the ground-truth targets are correctly predicted". If you are not familiar with the details of this chart, you can simply remember that the higher and further to the right the curve, the better the model.
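To make these definitions concrete, here is a simplified sketch of IoU and of precision/recall at a fixed IoU threshold. It ignores confidence scores and class labels (a real PR curve sweeps over confidence thresholds), and the greedy matching here is our own simplification.

```python
def iou(a, b):
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(predictions, ground_truths, iou_thresh=0.5):
    """Count a prediction as a true positive if it matches an
    as-yet-unmatched ground truth with IoU >= iou_thresh."""
    matched, tp = set(), 0
    for p in predictions:
        for i, g in enumerate(ground_truths):
            if i not in matched and iou(p, g) >= iou_thresh:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(predictions) if predictions else 0.0
    recall = tp / len(ground_truths) if ground_truths else 0.0
    return precision, recall
```

Raising `iou_thresh` from 0.5 to 0.75 makes fewer predictions count as correct, which is exactly why both curves shift when the threshold is tightened.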
In these charts, the red curves are the PR curves of the Graviti model on the different datasets, and the green curves are those of the original model.
Reading the four charts horizontally: the top two show the PR curves with the IoU threshold set to 0.5, and the bottom two with the threshold set to 0.75.
Take the top-left chart as an example: the red curve is higher and further to the right than the green curve, which means the Graviti model tests better than the original model.
Reading vertically: the left column shows the test results on the Cityscapes dataset, and the right column those on the Waymo dataset.
Take Cityscapes as an example: when the IoU threshold is raised from 0.5 to 0.75, that is, when the requirement is made stricter, both curves shift toward the lower-left corner.
This is easy to understand: it is like a teacher grading an exam more harshly, so every candidate's score naturally drops to some extent. However, we can also observe that the gap between the red and green curves widens, which indicates that when the IoU threshold is raised, the Graviti model's performance drops far less than the original model's.
Similar conclusions can be drawn on Waymo.
To sum up, both the visual results and the PR curves lead to consistent conclusions:
First, annotation quality directly affects model quality.
Better labels train better models!
Second, the more accurate the annotations, the closer the predictions are to the ground truth.
Even when the IoU threshold is set higher, the model can still achieve good performance!
Let's briefly summarize the problems encountered in this experiment.
1. Unifying formats across different datasets
Among the many well-known public datasets, hardly any two share the same annotation format.
The dataset formats used in this experiment are likewise diverse, for example:
Waymo's 2D box annotations are stored in TFRecord format, with boxes expressed as xywh;
KITTI's 2D box labels are in TXT format, with boxes expressed as xyxy;
Cityscapes' 2D boxes have to be derived manually from its segmentation annotations.
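Two of these conversions can be sketched as follows. The KITTI parser relies on the documented KITTI label layout, where fields 4-7 of each line are the 2D box; the xywh converter assumes (x, y) is the top-left corner, which is a stated assumption, since some datasets (Waymo among them) store the box center instead.

```python
def kitti_line_to_xyxy(line):
    """Extract the 2D box from one line of a KITTI label file.
    Fields 4-7 are left, top, right, bottom in pixel coordinates."""
    fields = line.split()
    return tuple(float(v) for v in fields[4:8])

def xywh_to_xyxy(x, y, w, h):
    """Convert a width/height box to corner format.
    Assumes (x, y) is the top-left corner; datasets that store
    the box center need an offset of w/2 and h/2 first."""
    return (x, y, x + w, y + h)
```

Once every dataset's boxes are in a single format, the same evaluation code can run against all of them.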
Of course, this is normal for an industry in its early stages, much like the variety of phone charging connectors on the market around the year 2000. Graviti is committed to finding a general data annotation format, so that we can be freed from tedious, complex data processing and focus more on the algorithm work itself.
2. Sample selection
After unifying the dataset formats, we need to filter the specific samples to use. For example, the Waymo data mentioned above is collected continuously, but we do not want to take all of the consecutive frames as the test set.
So in this experiment, we extract only part of the images, at intervals, as our test set; for example, keeping only one out of every three near-duplicate frames. Filtering the samples this way takes considerable time.
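The interval sampling itself is simple; a sketch with a hypothetical helper name:

```python
def subsample_frames(frame_names, stride=3):
    """Keep one frame out of every `stride` consecutive frames,
    sorting first so that neighbors are temporally adjacent."""
    return sorted(frame_names)[::stride]
```

The time-consuming part is not this step but deciding which frames are near-duplicates in the first place.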
3. Category unification
This experiment uses three public datasets, Waymo, KITTI, and Cityscapes, plus the Graviti KITTI annotation set.
The following table lists the label categories of the four datasets. From it we can see that the label types, the number of categories, and the classification criteria all differ. For example, Waymo's vehicle category covers KITTI's van and tram categories, and Cityscapes' bus category does not exactly match KITTI's van category.
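One common way to handle this is a per-dataset mapping into a shared taxonomy. The table below is a hypothetical example for KITTI only, not the exact correspondences used in the experiment; those were decided by inspecting each dataset's category definitions.

```python
# Hypothetical mapping from KITTI categories to a unified taxonomy.
KITTI_TO_UNIFIED = {
    "Car": "vehicle",
    "Van": "vehicle",
    "Truck": "vehicle",
    "Tram": "vehicle",
    "Pedestrian": "pedestrian",
    "Cyclist": "cyclist",
}

def unify_label(dataset_label, mapping, default="ignore"):
    """Map a dataset-specific label to the unified taxonomy;
    unmapped labels fall back to an 'ignore' bucket."""
    return mapping.get(dataset_label, default)
```

A separate mapping per dataset (Waymo, Cityscapes, Graviti KITTI) then lets all four label sets be evaluated against the same category list.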
4. Downloading datasets hosted abroad
I believe we all know the feeling: many foreign datasets require special tools to download, and the download speed is painfully slow. To address these pain points for algorithm engineers working with open datasets, we will launch an Open Datasets feature in late August, providing an open dataset index, China-based download mirrors, and dataset annotations with annotation visualization, so that you can quickly get a feel for each dataset's labels.
If you are interested in our products, you can visit the official website: http://www.graviti.cn/
You can also scan the QR code on the left to join our discussion group; we welcome your feedback. On the right is our official account, where you can follow more of our updates.