Edge training with less than 256KB of memory, at less than one-thousandth the cost of PyTorch
When it comes to neural network training, most people first think of GPU servers in the cloud. Because of its huge memory overhead, traditional training is usually performed in the cloud, while the edge platform is only responsible for inference. However, this design makes it difficult for AI models to adapt to new data: the real world is a dynamic, changing, and evolving scenario, so how can a single round of training cover every situation?
To enable the model to continuously adapt to new data, can we perform training on the edge (on-device training) so that the device keeps learning on its own? In this work, we implement on-device training with less than 256KB of memory, at an overhead of less than 1/1000 of PyTorch, while matching the accuracy of cloud training on the visual wake words (VWW) task. This technique enables models to adapt to new sensor data: users can enjoy customized services without uploading data to the cloud, thereby protecting privacy.
On-device training allows pre-trained models to adapt to new environments after deployment. By training and adapting locally on the device, the model can continuously improve its results and be customized for the user. For example, fine-tuning a language model lets it learn from the user's input history, and adapting a vision model lets a smart camera continuously recognize new objects. By moving training closer to the terminal rather than the cloud, we can effectively improve model quality while protecting user privacy, especially when handling private information such as medical data and input history.
However, training on small IoT devices is fundamentally different from cloud training and is very challenging. First, the SRAM of AIoT devices (MCUs) is usually limited (around 256KB). At this memory level, even inference is difficult, let alone training. Second, existing low-cost, high-efficiency transfer learning algorithms, such as training only the last-layer classifier (last FC) or learning only the bias terms, often yield unsatisfactory accuracy and cannot be used in practice; moreover, existing deep learning frameworks are unable to translate the theoretical savings of these algorithms into measured savings. Finally, modern deep training frameworks (PyTorch, TensorFlow) are designed for cloud servers, and training even a small model (MobileNetV2-w0.35) requires a large amount of memory even with the batch size set to 1. Therefore, we need to co-design the algorithm and the system to achieve training on smart terminal devices.
We found that on-device training faces two unique challenges: (1) the model on the edge device is quantized. A truly quantized graph (shown in the figure below) is difficult to optimize because of its low-precision tensors and lack of batch normalization layers; (2) the limited hardware resources (memory and computation) of small devices do not allow full backpropagation, whose memory usage can easily exceed the SRAM limit of a microcontroller (by more than an order of magnitude); yet if only the last layer is updated, the final accuracy is inevitably unsatisfactory.
To cope with the difficulty of optimization, we propose Quantization-Aware Scaling (QAS), which automatically scales the gradients of tensors with different bit precisions (shown in the figure below, left). QAS automatically matches the scales of gradients and parameters and stabilizes training without requiring additional hyperparameters. On 8 datasets, QAS achieves accuracy consistent with floating-point training (figure below, right).
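To make the scaling concrete, here is a minimal NumPy sketch of the idea behind QAS; the tensor names, shapes, and scale values are illustrative, not taken from the released code. An int8 weight stored as W ≈ s·W̄ has a gradient w.r.t. W̄ that is s times the floating-point gradient, so multiplying it by s⁻² restores the parameter-to-gradient norm ratio of floating-point training.

```python
import numpy as np

def qas_scaled_gradient(grad_w_bar, scale):
    """Quantization-aware scaling (sketch).

    For a weight quantized as W ~= scale * W_bar (W_bar stored in int8),
    the chain rule gives dL/dW_bar = scale * dL/dW, so the ratio
    ||W_bar|| / ||dL/dW_bar|| is scale**-2 times the floating-point
    ratio ||W|| / ||dL/dW||.  Multiplying the gradient by scale**-2
    restores that ratio and stabilizes SGD on the quantized graph.
    """
    return grad_w_bar * scale ** -2

# Hypothetical example: a 16x16 int8 weight with per-tensor scale 0.02.
scale = 0.02
w_bar = np.random.randint(-127, 128, size=(16, 16)).astype(np.float32)
grad_w_bar = np.random.randn(16, 16).astype(np.float32) * scale
lr = 0.01
w_bar -= lr * qas_scaled_gradient(grad_w_bar, scale)  # plain SGD update
```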
To reduce the memory footprint required for backpropagation, we propose Sparse Update, which skips the gradient computation of less important layers and sub-tensors. We develop an automatic method based on contribution analysis to find the optimal update scheme (a sketch of the idea follows). Compared with previous schemes such as bias-only and last-k-layers update, our searched sparse update schemes save 4.5x to 7.5x memory, with even higher average accuracy on 8 downstream datasets.
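As a rough illustration of how such a scheme can be searched, the sketch below greedily selects which tensors to update by their measured accuracy contribution per byte of extra training memory. The candidate list and its numbers stand in for offline contribution-analysis profiling and are entirely hypothetical; the actual search in the paper is more involved.

```python
def search_sparse_update(candidates, memory_budget):
    """Greedy contribution-based search (sketch, not the released tool).

    candidates: list of (name, delta_accuracy, extra_memory_bytes),
        measured offline by enabling the update of one layer/sub-tensor
        at a time (contribution analysis).
    Returns the subset of tensors to update within the memory budget.
    """
    # Rank candidates by accuracy gained per byte of training memory.
    ranked = sorted(candidates, key=lambda c: c[1] / c[2], reverse=True)
    scheme, used = [], 0
    for name, gain, mem in ranked:
        if used + mem <= memory_budget:
            scheme.append(name)
            used += mem
    return scheme

# Hypothetical profiling results: (tensor, accuracy gain, memory cost).
candidates = [
    ("block12.weight", 1.8, 40_000),
    ("block11.bias",   0.6,  2_000),
    ("classifier",     2.5, 10_000),
]
print(search_sparse_update(candidates, memory_budget=50_000))
```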
To convert the theoretical savings of the algorithm into measured numbers, we designed the Tiny Training Engine (TTE): it moves automatic differentiation to compile time and uses code generation (codegen) to reduce runtime overhead. It also supports graph pruning and operator reordering to achieve real savings and speedups. Compared with full update, sparse update effectively reduces peak memory by 7-9x, and reordering further improves the total memory savings to 20-21x. Compared with TF-Lite, the optimized kernels and sparse update in TTE improve the overall training speed by 23-25x.
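The reordering idea can be sketched as follows: instead of materializing all gradients and then running a separate optimizer step, the backward pass applies each weight update as soon as its gradient is produced, so only one gradient buffer is alive at a time. This is a simplified sketch with a made-up `Layer` interface, not TTE's actual API; a real engine would also propagate activation gradients and prune the subgraphs of frozen weights at compile time.

```python
import numpy as np

class Layer:
    """Minimal stand-in for one compiled layer (illustrative only)."""
    def __init__(self, size, trainable):
        self.weight = np.random.randn(size).astype(np.float32)
        self.trainable = trainable

    def compute_grad(self):
        # Placeholder: a real engine would run this layer's backward op.
        return np.random.randn(*self.weight.shape).astype(np.float32)

def backward_with_reordering(layers, lr=0.01):
    """Conventional training keeps every gradient buffer alive until the
    optimizer step; reordering fuses each update into the backward pass,
    so each gradient buffer is freed immediately after use."""
    for layer in reversed(layers):       # backward pass: last layer first
        if not layer.trainable:          # sparse update: skip frozen layers
            continue                     # (their weight gradients are pruned)
        grad = layer.compute_grad()
        layer.weight -= lr * grad        # apply the update immediately...
        del grad                         # ...then release the gradient buffer

backward_with_reordering([Layer(256, trainable=False), Layer(256, trainable=True)])
```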
In this article, we propose the first training solution that runs on a microcontroller (using only 256KB of SRAM and 1MB of Flash). Our system-algorithm co-design greatly reduces the memory required for training (1000x less than PyTorch) and the training time (20x faster than TF-Lite), while achieving higher accuracy on downstream tasks. Tiny Training can empower many interesting applications: mobile phones can customize language models based on users' email/input history, smart cameras can continuously recognize new faces and objects, and AI scenarios without Internet access (such as agriculture, marine, and industrial assembly lines) can also keep learning. Through our work, small edge devices can perform not only inference but also training. Throughout this process, personal data is never uploaded to the cloud, so there is no privacy risk; at the same time, the AI model can continuously learn on its own to adapt to a dynamically changing world!