OpenOOD v1.5: A comprehensive and accurate out-of-distribution detection codebase and testing platform, with an online leaderboard and one-click evaluation
Out-of-distribution (OOD) detection is crucial for the reliable operation of open-world intelligent systems, but current OOD detection methods suffer from evaluation inconsistencies.
The previous release, OpenOOD v1, unified the evaluation of OOD detection, but it still had limitations in scalability and usability.
The development team has now released OpenOOD v1.5. Compared with the previous version, the new release significantly improves the accuracy, standardization, and user-friendliness of OOD detection evaluation.
Paper: https://arxiv.org/abs/2306.09301
OpenOOD Codebase: https://github.com/Jingkang50/OpenOOD
OpenOOD Leaderboard: https://zjysteven.github.io/OpenOOD/
It is worth noting that OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets such as ImageNet, investigates the important but underexplored full-spectrum OOD detection, and introduces new features including an online leaderboard and an easy-to-use evaluator.
This work also contributes in-depth analysis and insights drawn from comprehensive experimental results, enriching the knowledge base of OOD detection methods.
With these enhancements, OpenOOD v1.5 aims to drive the progress of OOD research and provide a more powerful and comprehensive evaluation benchmark for OOD detection research.
For a trained image classifier, a key capability for working reliably in the open world is detecting unknown, out-of-distribution (OOD) samples.
For example, suppose we train a cat-vs-dog classifier on a set of cat and dog photos. For in-distribution (ID) samples, i.e., cat and dog images, we naturally expect the classifier to assign them to the correct category.
For out-of-distribution (OOD) samples, i.e., any images other than cats and dogs (such as airplanes or fruits), we hope the model can detect that they are unknown, novel objects/concepts, and therefore refuse to assign them to any of the in-distribution categories.
This problem is out-of-distribution detection (OOD detection). It has attracted widespread attention in recent years, with new work emerging constantly. However, while the field is expanding rapidly, it has become difficult to track and measure its progress, for several reasons.
The rapid development of various deep learning tasks is inseparable from unified test datasets (just as CIFAR and ImageNet serve image classification, and PASCAL VOC and COCO serve object detection).
Unfortunately, the field of OOD detection has long lacked a unified, widely adopted OOD dataset. As a result, when we look back at the experimental settings of existing work, the OOD data used is highly inconsistent (for example, with CIFAR-10 as the ID data, some works use MNIST and SVHN as OOD, while others use CIFAR-100 and Tiny ImageNet). Under such circumstances, a direct and fair comparison of all methods is very difficult.
In addition to OOD detection, terms such as Open-Set Recognition (OSR) and Novelty Detection also frequently appear in the literature.
They essentially address the same problem, with only minor differences in experimental settings. However, the different terminology has led to unnecessary fragmentation among methods. For example, OOD detection and OSR were once regarded as two independent tasks, and methods from different branches were rarely compared against each other, even though they were solving the same problem.
In many works, researchers directly use samples from the OOD test set to tune hyperparameters or even train models. Such a practice overestimates a method's true OOD detection capability.
The above problems are obviously detrimental to the orderly development of the field. We urgently need a unified benchmark and platform to test and evaluate existing and future OOD detection methods.
OpenOOD was created to address these challenges. Its first version took an important step forward, but its limited scale and usability left room for improvement.
Therefore, in the new version, OpenOOD v1.5, we have further strengthened and upgraded it, aiming to provide a comprehensive, accurate, and easy-to-use testing platform for researchers.
In summary, OpenOOD has the following important features and contributions:
The codebase decouples and modularizes model architectures, data preprocessing, post-processing, training, testing, and so on, to facilitate reuse and further development. Currently, OpenOOD implements nearly 40 state-of-the-art OOD detection methods for image classification tasks.
With just a few lines of code, OpenOOD's evaluator reports the OOD detection performance of a given classifier and post-processor on a specified ID dataset (see the sketch below).
The corresponding OOD data is determined and provided internally by the evaluator, which ensures consistency and fairness across tests. The evaluator supports both standard OOD detection and full-spectrum OOD detection scenarios (more on this later).
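For illustration, here is a minimal sketch of how the evaluator is typically invoked, following the usage pattern shown in the OpenOOD repository. Exact argument names and defaults may differ in your installed version, and the checkpoint path is a placeholder.

```python
import torch
from openood.evaluation_api import Evaluator
from openood.networks import ResNet18_32x32

# Load a CIFAR-10 classifier (checkpoint path is a placeholder).
net = ResNet18_32x32(num_classes=10)
net.load_state_dict(torch.load('./cifar10_resnet18.ckpt'))
net.cuda().eval()

# The evaluator prepares the ID and OOD splits internally, so every
# method is tested against exactly the same data.
evaluator = Evaluator(
    net,
    id_name='cifar10',          # ID dataset that defines the benchmark
    data_root='./data',
    preprocessor=None,          # default preprocessing for the ID dataset
    postprocessor_name='msp',   # e.g. MSP; other post-processors plug in the same way
    batch_size=200,
    num_workers=2,
)

metrics = evaluator.eval_ood()  # standard OOD detection
print(metrics)                  # near-/far-OOD metrics such as AUROC and FPR@95
```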
Using OpenOOD, we compared the performance of nearly 40 OOD detection methods on four ID datasets: CIFAR-10, CIFAR-100, ImageNet-200, and ImageNet-1K, and compiled the results into a public leaderboard. We hope this helps everyone keep track of the most effective and promising methods in the field.
Based on the comprehensive experimental results of OpenOOD, we provide many new findings in the paper. For example, although it seems to have little to do with OOD detection, data augmentation can actually effectively improve the performance of OOD detection, and this improvement is orthogonal and complementary to the improvement brought by specific OOD detection methods.
In addition, we found that existing methods perform unsatisfactorily on full-spectrum OOD detection, which will be an important problem for the field to solve going forward.
This section briefly and informally describes the goals of standard and full-spectrum OOD detection. For a more detailed and formal description, please refer to our paper.
[Figure: semantic shift (horizontal axis) vs. covariate shift (vertical axis), with panels (a)-(d) covering ID, covariate-shifted ID, and OOD samples]
First, some background. In the image classification scenario considered here, the in-distribution (ID) data is defined by the corresponding classification task. For example, for CIFAR-10 classification, the ID distribution corresponds to its 10 semantic categories.
The concept of OOD is defined relative to ID: images belonging to any semantic category other than the ID categories are out-of-distribution (OOD) images. At the same time, we need to distinguish the following two types of distribution shift.
Semantic shift: the distribution changes at the semantic level, corresponding to the horizontal axis of the figure above. For example, the training categories are cats and dogs, while the test categories are airplanes and fruits.
Covariate shift: the distribution changes at the surface statistical level while the semantics remain unchanged, corresponding to the vertical axis of the figure above. For example, training uses clean, natural photos of cats and dogs, while testing uses noisy or hand-drawn images of cats and dogs.
With this background and the figure above in mind, standard and full-spectrum OOD detection are easier to understand.
Standard OOD detection. Goal (1): train a classifier on the ID distribution so that it classifies ID data accurately. It is assumed here that there is no covariate shift between the test ID data and the training ID data.
Goal (2): based on the trained classifier, design an OOD detection method that distinguishes ID from OOD for any sample. In the figure above, this corresponds to distinguishing (a) from (c) and (d).
Full-spectrum OOD detection. Goal (1): similar to standard OOD detection, except that covariate shift is taken into account: regardless of whether a test ID image exhibits covariate shift relative to the training images, the classifier must still assign it to the correct ID category (for example, the cat-vs-dog classifier should not only classify "clean" cat and dog images accurately, but also generalize to noisy or blurry cat and dog images).
Goal (2): covariate-shifted ID samples are also considered; together with normal (non-covariate-shifted) ID samples, they must be distinguished from OOD samples. In the figure above, this corresponds to distinguishing (a) and (b) from (c) and (d).
Attentive readers may have noticed that goal (1) of full-spectrum OOD detection actually corresponds to another very important research topic: out-of-distribution generalization (OOD generalization).
It needs to be clarified that OOD in OOD generalization refers to samples with covariate shift, while OOD in OOD detection refers to samples with semantic shift.
Both kinds of shift are very common in the real world. However, existing OOD generalization and standard OOD detection each consider only one of them and ignore the other.
In contrast, full-spectrum OOD detection naturally considers both shifts in the same scenario, more accurately reflecting what we expect from an ideal classifier in the open world.
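In OpenOOD v1.5 the same evaluator can run either protocol. Below is a minimal sketch, assuming the fsood switch exposed in the repository's evaluation API (the argument name may differ in your installed version); it reuses the evaluator object from the earlier example.

```python
# Standard OOD detection: the test ID data has no covariate shift.
standard_metrics = evaluator.eval_ood(fsood=False)

# Full-spectrum OOD detection: covariate-shifted ID sets (e.g. corrupted or
# stylized versions of the ID classes) are added to the ID side and must
# NOT be flagged as OOD.
fsood_metrics = evaluator.eval_ood(fsood=True)
```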
In version 1.5, OpenOOD uniformly and comprehensively tests nearly 40 methods on 6 benchmarks (4 for standard OOD detection and 2 for full-spectrum OOD detection).
The implemented methods and datasets are described in the paper, and all experiments can be reproduced with the OpenOOD codebase. Here we discuss the findings drawn directly from the comparison results.
[Table: OOD detection performance of the evaluated methods across the benchmark datasets]
From the table above, it is not hard to see that no method consistently delivers outstanding performance across all benchmark datasets.
For example, the post-hoc inference methods ReAct and ASH perform well on the large-scale ImageNet benchmark but show no advantage over other methods on CIFAR.
Conversely, some training-time methods that add constraints during training, such as RotPred and LogitNorm, outperform post-hoc methods on the small datasets but are unremarkable on ImageNet.
As shown in the table above, although data augmentation techniques are not specifically designed for OOD detection, they can effectively improve OOD detection performance. Even more surprisingly, the gains from data augmentation and the gains from specific OOD post-processing methods amplify each other.
Take AugMix as an example. When combined with the simplest MSP post-processor, it reaches 77.49% near-OOD detection performance on ImageNet-1K, only 1.47% higher than the same post-processor trained with plain cross-entropy loss and no data augmentation.
However, when AugMix is combined with the more advanced ASH post-processor, the corresponding detection performance is 3.99% higher than the cross-entropy baseline and reaches 82.16%, the highest in our tests. These results suggest that combining data augmentation with post-processing has great potential for further improving OOD detection.
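Since MSP comes up repeatedly as the simplest baseline post-processor, here is a minimal generic sketch of the idea (not the exact OpenOOD implementation): the OOD score is the classifier's maximum softmax probability, and samples scoring below a threshold are flagged as OOD.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def msp_score(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Maximum Softmax Probability (MSP): higher score = more confidently ID."""
    logits = model(x)
    return F.softmax(logits, dim=1).max(dim=1).values

# Usage sketch: flag inputs whose MSP falls below a threshold as OOD.
# The threshold is typically chosen on ID data, e.g. so that 95% of ID
# samples are retained.
# scores = msp_score(net, images)
# is_ood = scores < threshold
```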
The results clearly show that when the scenario switches from standard OOD detection to full-spectrum OOD detection (that is, covariate-shifted ID images are added to the test ID data), the performance of most methods degrades significantly (detection rates drop by more than 10%).
This means that current methods tend to mark covariate-shifted ID images, whose semantics have not actually changed, as OOD.
This behavior runs contrary to human perception (and to the goal of full-spectrum OOD detection): suppose a human annotator is labeling cat and dog pictures and is shown a noisy, blurry picture of a cat or dog; he/she should still recognize it as a cat/dog, i.e., as in-distribution (ID) data rather than unknown out-of-distribution (OOD) data.
Overall, current methods cannot effectively solve full-spectrum OOD detection, and we believe this will be an important open problem for the field.
There are many other findings not listed here, for example that data augmentation remains effective for full-spectrum OOD detection. Once again, we welcome everyone to read our paper.
We hope that OpenOOD's codebase, evaluator, leaderboard, benchmark datasets, and detailed test results can bring researchers together to advance the field, and we look forward to everyone using OpenOOD to develop and test OOD detection methods.
We also welcome contributions to OpenOOD in any form, including but not limited to providing feedback, adding the latest methods to the OpenOOD codebase and leaderboard, and extending future versions of OpenOOD.
Reference: https://arxiv.org/abs/2306.09301