The first binary neural network BNext with an accuracy of more than 80% on ImageNet came out: the five-year journey of -1 and +1
Two years ago, when MeliusNet came out, Machine Heart published a technical article, "The first binary neural network to beat MobileNet: the three-year arduous journey of -1 and +1," reviewing the development history of BNNs. At the time, XNOR.AI, the startup founded around the early BNN work XNOR-Net, had just been acquired by Apple, and many people wondered whether this low-power, high-performance binary neural network technology would soon open up broad application prospects.
In the two years since, however, it has been difficult to learn more about how Apple, which keeps the technology strictly confidential, is applying BNNs, and no other particularly eye-catching application cases have appeared in academia or industry. On the other hand, as the number of terminal devices skyrockets, edge AI applications and markets are growing rapidly: an estimated 50 to 125 billion edge devices will be in use by 2030, and the edge computing market is projected to reach US$60 billion. Several application areas are currently popular: AIoT, the metaverse, and robotic terminal devices. Related industries are accelerating the deployment of these technologies, and AI capabilities are already embedded in many of their core technical links, such as the widespread use of AI in 3D reconstruction, video compression, and real-time scene perception for robots. Against this background, the industry's demand for energy-efficient, low-power AI technology, software tools, and hardware acceleration at the edge has become increasingly urgent.
Currently, two main bottlenecks restrict the application of BNNs: first, the inability to effectively close the accuracy gap with traditional 32-bit deep learning models; second, the lack of high-performance algorithm implementations on different hardware platforms; the speedups reported in machine learning papers often do not materialize on the GPU or CPU you actually use. The second bottleneck may stem from the first: because BNNs cannot yet reach satisfactory accuracy, they fail to attract broad attention from practitioners in systems and hardware acceleration and optimization, while the machine learning algorithm community usually cannot develop high-performance hardware code on its own. Achieving both high accuracy and strong acceleration in BNN applications or accelerators will therefore require collaboration between developers from these two different fields.
For example, the Meta recommendation system model DLRM uses 32-bit floating-point numbers to store its weights and activation parameters, and the model is approximately 2.2 GB in size. A binary version of the model, with only a small reduction in accuracy (
The second significant advantage of BNNs is their extremely efficient computation. Using only 1 bit, i.e., two states, to represent each variable means that all operations can be completed with bit operations: with AND gates, XOR gates, and similar operations, traditional multiplications and additions can be replaced entirely. Bit operations are the basic units of a circuit, and readers familiar with circuit design will know that shrinking the area of multiply-accumulate units and reducing off-chip memory access are the most effective ways to cut power consumption; BNNs have unique advantages on both the memory and the compute side. WRPN [1] demonstrated that, on customized FPGAs and ASICs, BNNs can achieve up to 1,000x power savings compared with full precision. The more recent BoolNet [2] demonstrated a BNN structural design that uses almost no floating-point operations and maintains a purely binary information flow, achieving an excellent power-accuracy trade-off in ASIC simulation.
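To make the bit-operation claim concrete, here is a minimal, self-contained Python sketch (not from the paper) of how a dot product between two {-1, +1} vectors reduces to an XNOR followed by a popcount; the bit encoding and function name are our own illustrative choices.

```python
# Illustrative sketch: a dot product of two {-1, +1} vectors computed with
# XNOR and popcount instead of multiply-add operations.

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """a_bits/b_bits encode n values where bit=1 means +1 and bit=0 means -1."""
    xnor = ~(a_bits ^ b_bits) & ((1 << n) - 1)   # 1 wherever the signs agree
    matches = bin(xnor).count("1")               # popcount
    return 2 * matches - n                       # +1 per match, -1 per mismatch

# Example: a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]
a, b = 0b1011, 0b1101
print(binary_dot(a, b, 4))   # -> 0, same as the full-precision dot product
```

On real hardware the same idea maps one XNOR gate and a popcount tree onto what would otherwise be a full multiply-accumulate array, which is where the area and power savings come from.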
Researchers including Nianhui Guo and Haojin Yang from the Hasso Plattner Institute in Germany have now proposed the BNext model, the first BNN to achieve top-1 classification accuracy above 80% on the ImageNet dataset:
Figure 1 Performance comparison of SOTA BNNs on ImageNet
Paper address: https://arxiv.org/pdf/2211.12933.pdf
The authors first used Loss Landscape visualizations to compare the large gap in optimization friendliness between current mainstream BNN models and 32-bit models (Figure 2), and argued that the rough loss landscape of BNNs is one of the main reasons the research community has not pushed BNN performance further.
Based on this hypothesis, the authors use novel structural designs to improve the optimization friendliness of BNNs, constructing a binary neural network architecture with a smoother loss landscape to reduce the difficulty of optimizing high-accuracy BNN models. Specifically, they emphasize that model binarization severely limits the feature patterns available for forward propagation, forcing binary convolutions to extract and process information only within a restricted feature space, and that the optimization difficulties caused by this restricted feed-forward mode can be effectively alleviated through structural design at two levels: (1) building flexible, consecutive convolutional feature calibration modules to improve the model's adaptability to binary representations; (2) exploring efficient bypass structures to alleviate the information bottleneck caused by feature binarization in feed-forward propagation.
Figure 2 Visual comparison of loss landscapes of popular BNN architectures (2D contour view)
Based on the above analysis, the authors propose BNext, the first binary neural network architecture to exceed 80% accuracy on the ImageNet image classification task; the overall network architecture is shown in Figure 4. The authors first design a basic binary processing unit based on the Info-Recoupling (Info-RCP) module. To address the information bottleneck between adjacent convolutions, a preliminary calibration of the binary convolution's output distribution is performed by introducing additional Batch Normalization and PReLU layers. The authors then construct a second, dynamic distribution-calibration stage based on an inverted residual structure and a Squeeze-and-Expand branch. As shown in Figure 3, compared with the traditional Real2Binary calibration structure, the additional inverted residual structure fully accounts for the feature gap between the input and output of the binary unit, avoiding a suboptimal calibration based entirely on input information. This two-stage dynamic distribution calibration effectively reduces the difficulty of feature extraction in the subsequent adjacent binary convolution layers.
Figure 3 Comparison of convolution module designs
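For readers who want a code-level picture, below is a rough PyTorch sketch of the kind of binary processing unit described above: a binary convolution with a straight-through gradient estimator, Batch Normalization and PReLU calibration, plus a Squeeze-and-Expand-style recalibration that looks at both the unit's input and output. This is our illustrative rendering under those assumptions, not the authors' reference implementation, and all module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize to {-1, +1}; pass gradients through inside [-1, 1]."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)
    @staticmethod
    def backward(ctx, g):
        (x,) = ctx.saved_tensors
        return g * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        w = SignSTE.apply(self.weight)        # binarize weights
        a = SignSTE.apply(x)                  # binarize activations
        return F.conv2d(a, w, self.bias, self.stride, self.padding)

class InfoRCPSketch(nn.Module):
    """Binary conv -> BN -> PReLU, then an SE-style recalibration that sees
    both the block input and output (a stand-in for the two-stage calibration)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.conv = BinaryConv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.PReLU(channels)
        self.squeeze = nn.Linear(2 * channels, channels // reduction)
        self.expand = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        y = self.act(self.bn(self.conv(x)))
        # pool input and output, derive per-channel scales from both
        s = torch.cat([x.mean(dim=(2, 3)), y.mean(dim=(2, 3))], dim=1)
        scale = torch.sigmoid(self.expand(F.relu(self.squeeze(s))))
        return y * scale[:, :, None, None] + x   # real-valued bypass keeps information flowing

x = torch.randn(2, 32, 16, 16)
print(InfoRCPSketch(32)(x).shape)   # torch.Size([2, 32, 16, 16])
```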
Second, the authors propose an enhanced binary Basic Block module combined with element-wise attention (ELM-Attention). They build the Basic Block by stacking multiple Info-RCP modules, and add an extra Batch Normalization layer and continuous residual connections to each Info-RCP module to further alleviate the information bottleneck between different Info-RCP modules. Based on an analysis of how bypass structures affect binary model optimization, the authors propose using an element-wise matrix multiplication branch to calibrate the output distribution of the first 3x3 Info-RCP module of each Basic Block. This additional spatial attention mechanism lets the Basic Block fuse and distribute forward information more flexibly, improving the smoothness of the model's loss landscape. As shown in Figures 2.e and 2.f, the proposed design significantly improves loss-landscape smoothness.
Figure 4 BNext architecture design. "Processor" represents the Info-RCP module, "BN" the Batch Normalization layer, "C" the basic width of the model, and "N" and "M" the depth scaling parameters of the model's different stages.
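The enhanced Basic Block can be sketched in a similarly hedged way: the snippet below stacks a few processing units with extra Batch Normalization and residual connections, and gates the output of the first unit with an element-wise attention map derived from the block input. Again, this illustrates the idea rather than the BNext code; the placeholder unit and the 1x1-convolution attention branch are our assumptions.

```python
import torch
import torch.nn as nn

class BasicBlockSketch(nn.Module):
    """Hypothetical rendering of a BNext-style Basic Block: stacked processing
    units with an element-wise attention branch gating the first unit's output.
    `unit` stands in for the Info-RCP module sketched earlier."""
    def __init__(self, channels, depth=2, unit=None):
        super().__init__()
        make_unit = unit or (lambda c: nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1, bias=False),
            nn.BatchNorm2d(c), nn.PReLU(c)))
        self.units = nn.ModuleList(make_unit(channels) for _ in range(depth))
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(depth))
        # attention map produced from the block input, same shape as the features
        self.attn = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):
        a = self.attn(x)                      # element-wise weights in (0, 1)
        out = x
        for i, (unit, bn) in enumerate(zip(self.units, self.bns)):
            out = bn(unit(out)) + out         # continuous residual connections
            if i == 0:
                out = out * a                 # element-wise calibration of the first unit
        return out

print(BasicBlockSketch(32)(torch.randn(1, 32, 8, 8)).shape)  # torch.Size([1, 32, 8, 8])
```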
Table 1 The BNext series. "Q" indicates the quantization settings of the input layer, the SE branch, and the output layer.
The authors combined the above structural designs with the popular MobileNetV1 baseline model and, by varying the depth and width scaling factors, constructed four BNext models of different complexity (Table 1): BNext-Tiny, BNext-Small, BNext-Middle, and BNext-Large.
Due to the relatively rough loss landscape, current binary model optimization generally relies on finer supervision signals, such as those provided by knowledge distillation, to escape widespread suboptimal convergence. The BNext authors are the first to consider how the large gap between the prediction distributions of the teacher model and the binary student model affects optimization, and they point out that selecting teachers solely by model accuracy leads to counter-intuitive overfitting of the student. To solve this problem, they propose knowledge complexity (KC) as a new teacher-selection metric, which takes into account both the effectiveness of the teacher's output soft labels and the complexity of the teacher's parameters.
As shown in Figure 5, the authors used knowledge complexity to measure and rank popular full-precision model families such as ResNet, EfficientNet, and ConvNeXt. With BNext-T as the student model, they preliminarily verified the effectiveness of this metric, and the ranking results were used for teacher selection in the subsequent knowledge distillation experiments.
Figure 5 Counter-intuitive overfitting effect and the impact of knowledge complexity under different teacher selections
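The paper's exact knowledge-complexity formula is not reproduced here; the toy sketch below only illustrates the idea of ranking candidate teachers by a score that weighs soft-label usefulness (approximated by published top-1 accuracy) against parameter count. The scoring function and the concrete numbers are illustrative assumptions, not values from the paper.

```python
# Illustrative only: rank candidate teachers by a knowledge-complexity proxy
# that trades off accuracy against parameter complexity.
import math
from dataclasses import dataclass

@dataclass
class Teacher:
    name: str
    top1_acc: float     # approximate published ImageNet top-1, as a fraction
    params_m: float     # parameters in millions

def kc_score(t: Teacher) -> float:
    # hypothetical proxy: accuracy discounted by (log-)parameter complexity
    return t.top1_acc / math.log(1.0 + t.params_m)

candidates = [
    Teacher("ResNet-50", 0.761, 25.6),
    Teacher("EfficientNet-B4", 0.829, 19.3),
    Teacher("ConvNeXt-T", 0.821, 28.6),
]
for t in sorted(candidates, key=kc_score, reverse=True):
    print(f"{t.name:>16s}  KC-proxy = {kc_score(t):.3f}")
```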
On this basis, the authors further consider the optimization problems caused by the gap in prediction distributions early in training when distilling from a strong teacher, and propose Diversified Consecutive KD. As shown below, they modulate the objective function during optimization by combining the knowledge of a strong teacher and a weak teacher. On top of this, a knowledge-boosting strategy is introduced: multiple predefined candidate teachers are used to switch the weak teacher at even intervals during training, so that the combined knowledge complexity is guided from weak to strong in a curriculum-like manner, reducing the optimization interference caused by differences in prediction distributions.
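A hedged sketch of what such a combined strong/weak-teacher objective might look like is given below; the weighting scheme, temperature, and the epoch-based schedule for switching the weak teacher are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consecutive_kd_loss(student_logits, strong_logits, weak_logits, alpha=0.5, T=2.0):
    """Illustrative combined distillation objective: KL to a strong teacher
    plus KL to the currently selected weak teacher."""
    s = F.log_softmax(student_logits / T, dim=1)
    kd = lambda t: F.kl_div(s, F.softmax(t / T, dim=1), reduction="batchmean") * T * T
    return alpha * kd(strong_logits) + (1 - alpha) * kd(weak_logits)

# toy usage: pick this epoch's weak teacher from a predefined schedule (weak -> strong)
weak_teachers = ["teacher_small", "teacher_medium", "teacher_large"]   # placeholder names
epoch, epochs = 100, 512
current_weak = weak_teachers[min(epoch * len(weak_teachers) // epochs, len(weak_teachers) - 1)]
print("weak teacher for this epoch:", current_weak)

logits = torch.randn(8, 1000)
loss = consecutive_kd_loss(logits, torch.randn(8, 1000), torch.randn(8, 1000))
print(loss.item())
```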
In terms of optimization techniques, the BNext authors take full account of the gains data augmentation can bring to the optimization of modern high-accuracy models, and provide the first analysis of how existing popular data augmentation strategies may affect binary model optimization. The experimental results show that existing data augmentation methods are not fully suitable for binary models, which provides ideas for designing binary-specific data augmentation strategies in future research.
Based on the proposed architecture design and optimization methods, the authors validated their approach on the large-scale image classification task ImageNet-1k. The experimental results are shown in Figure 6.
Figure 6 Comparison of SOTA BNN methods based on ImageNet-1k.
Compared with existing methods, BNext-L pushes the performance boundary of binary models on ImageNet-1k to 80.57% for the first time, a roughly 10% accuracy improvement over most existing methods. Compared with Google's PokeBNN, BNext-M is 0.7% higher with a similar number of parameters. The authors also emphasize that PokeBNN's optimization relies on far greater computing resources, such as a batch size of up to 8192 and 720 epochs of TPU training, whereas BNext-L was trained for only 512 epochs with a conventional batch size of 512, which reflects the effectiveness of BNext's structural design and optimization method. In comparisons based on the same baseline model, both BNext-T and BNext-18 achieve large accuracy improvements. Compared with full-precision models such as RegNetY-4G (80.0%), BNext-L demonstrates matching visual representation learning capability while using only a limited parameter budget and computational complexity, leaving rich room for imagination for downstream vision models built on a binary feature extractor and deployed at the edge.
The BNext authors mention in the paper that they and their collaborators are actively implementing and verifying the runtime efficiency of this high-accuracy BNN architecture on GPU hardware, and plan to extend it to other, broader hardware platforms in the future. In the editor's view, however, perhaps the more important significance of this work is that it reshapes the imagination around the application potential of BNNs, restoring the community's confidence in them and attracting the attention of more geeks in the systems and hardware fields. In the long term, as more and more applications migrate from cloud-centric computing paradigms to decentralized edge computing, the massive number of future edge devices will need more efficient AI technology, software frameworks, and hardware computing platforms. Today's mainstream AI models and computing architectures, however, are not designed and optimized for edge scenarios. Therefore, until the answer for edge AI is found, BNNs will remain an important option, full of technical challenges and enormous potential.