
Edge training with less than 256KB of memory, at less than one-thousandth the cost of PyTorch

WBOY · 2023-04-08 13:11:03

When it comes to neural network training, most people picture GPU servers on a cloud platform. Because of its huge memory overhead, training is traditionally performed in the cloud, while the edge platform is only responsible for inference. However, such a design makes it difficult for an AI model to adapt to new data: the real world is a dynamic, ever-changing environment, so how can a single round of training cover every scenario?

To let models continuously adapt to new data, can we perform training on the edge (on-device training), so that the device keeps learning on its own? In this work, we implement on-device training with less than 256KB of memory, at less than 1/1000 of the overhead of PyTorch, while matching the accuracy of cloud training on the visual wake words (VWW) task. This technique lets a model adapt to new sensor data, so users can enjoy customized services without uploading data to the cloud, which protects privacy.


  • Website: https://tinytraining.mit.edu/
  • Paper: https://arxiv.org/abs/2206.15472
  • Demo: https://www.bilibili.com/video/BV1qv4y1d7MV
  • Code: https://github.com/mit-han-lab/tiny-training

Background

On-device training allows a pre-trained model to adapt to new environments after deployment. By training and adapting locally on the device, the model can continuously improve its results and be customized for the user. For example, fine-tuning a language model lets it learn from the user's input history, and adjusting a vision model lets a smart camera continuously recognize new objects. By bringing training closer to the end device rather than the cloud, we can effectively improve model quality while protecting user privacy, especially when handling private information such as medical data and input history.

However, training on small IoT devices is fundamentally different from cloud training and is very challenging. First, the SRAM of an AIoT device (MCU) is usually limited (e.g., 256KB). This amount of memory is barely enough for inference, let alone training. Second, existing low-cost, high-efficiency transfer learning algorithms, such as training only the last-layer classifier (last FC) or learning only the bias terms, often deliver unsatisfactory accuracy and cannot be used in practice; moreover, existing deep learning frameworks are unable to translate the theoretical savings of these algorithms into measured savings. Finally, modern deep learning frameworks (PyTorch, TensorFlow) are usually designed for cloud servers, and training even a small model (MobileNetV2-w0.35) requires a large amount of memory even with a batch size of 1. Therefore, we need to co-design the algorithm and the system to achieve training on smart end devices.
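For concreteness, the two lightweight baselines mentioned above can be sketched in a few lines of PyTorch. This is a rough sketch with an illustrative model and layer names, not the code used in the paper:

```python
# A minimal PyTorch-style sketch of the two lightweight transfer learning
# baselines (last-FC-only and bias-only training). Model choice and layer
# names are illustrative only.
import torch
import torchvision

def freeze_all_but_classifier(model):
    """Baseline 1: update only the final classifier (last FC)."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.classifier.parameters():
        p.requires_grad = True

def freeze_all_but_bias(model):
    """Baseline 2: update only the bias terms (plus the classifier)."""
    for name, p in model.named_parameters():
        p.requires_grad = name.endswith(".bias") or name.startswith("classifier")

model = torchvision.models.mobilenet_v2(num_classes=2)  # e.g., 2 classes for VWW
freeze_all_but_bias(model)
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01)
```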


Methods and Results

We found that on-device training faces two unique challenges: (1) the model on the edge device is quantized. A truly quantized graph is difficult to optimize due to its low-precision tensors and the lack of batch normalization layers; (2) the limited hardware resources (memory and computation) of small devices do not allow full backpropagation, whose memory usage can easily exceed the microcontroller's SRAM by more than an order of magnitude, while updating only the last layer leads to inevitably unsatisfactory accuracy.
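To make the first challenge concrete, the sketch below shows what a layer in a truly quantized graph looks like: int8 weights and activations with per-tensor scales, an int32 accumulator, and no batch normalization (it has already been folded into the weights). The operator layout and all numbers are illustrative only, not taken from the paper:

```python
# Schematic int8 linear layer from a truly quantized graph (illustrative only):
# weights/activations are int8 with per-tensor scales, accumulation is int32,
# and batch norm has already been folded away, so training must operate
# directly on these low-precision tensors.
import numpy as np

def quantize_int8(x, scale):
    # zero points are omitted for brevity
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_linear(x_q, w_q, b_q, s_x, s_w, s_y):
    acc = x_q.astype(np.int32) @ w_q.T.astype(np.int32) + b_q   # int32 accumulator
    y = acc * (s_x * s_w / s_y)                                  # rescale to output scale
    return np.clip(np.round(y), -128, 127).astype(np.int8)

s_x, s_w, s_y = 0.05, 0.01, 0.1                                  # made-up scales
x_q = quantize_int8(np.random.randn(1, 16), s_x)
w_q = quantize_int8(0.1 * np.random.randn(8, 16), s_w)
b_q = np.round(0.01 * np.random.randn(8) / (s_x * s_w)).astype(np.int32)  # bias scale = s_x * s_w
y_q = int8_linear(x_q, w_q, b_q, s_x, s_w, s_y)
```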


To cope with the optimization difficulty, we propose Quantization-Aware Scaling (QAS), which automatically scales the gradients of tensors with different bit precisions. QAS automatically matches the scales of gradients and parameters and stabilizes training without requiring any additional hyperparameters. On 8 datasets, QAS achieves accuracy consistent with floating-point training.
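As a rough illustration of the idea (not the released implementation): a weight stored in the quantized domain as w_q = w / s_w receives a raw gradient that is s_w times the floating-point gradient, so its gradient-to-weight ratio is off by a factor of s_w²; compensating the gradient by 1/s_w² restores the floating-point ratio without any new hyperparameters. The sketch below assumes exactly this compensation, which may differ from the released code:

```python
# A minimal sketch of the QAS idea under the assumption stated above
# (gradient compensated by the inverse square of the quantization scale).
import numpy as np

def qas_sgd_step(w_q, grad_q, s_w, lr=0.01):
    """One SGD step on an int8-domain weight tensor with quantization-aware scaling."""
    grad_scaled = grad_q / (s_w ** 2)          # match gradient scale to weight scale
    w_q = w_q - lr * grad_scaled               # update in the quantized domain
    return np.clip(np.round(w_q), -128, 127)   # keep weights int8-representable

# toy example with made-up numbers
s_w = 0.01
w_fp = 0.1 * np.random.randn(8, 16)            # hypothetical floating-point weights
w_q = np.round(w_fp / s_w)                     # int8-domain weights
grad_q = s_w * 1e-3 * np.random.randn(8, 16)   # gradient as seen in the quantized graph
w_q = qas_sgd_step(w_q, grad_q, s_w)
```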


To reduce the memory footprint required for backpropagation, we propose Sparse Update, which skips the gradient computation of less important layers and sub-tensors. We develop an automatic method based on contribution analysis to find the optimal update scheme. Compared with previous bias-only and last-k-layers updates, the sparse update schemes we search for save 4.5x to 7.5x memory, and their average accuracy on 8 downstream datasets is even higher.
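The selection itself can be pictured as a budgeted search: each candidate layer (or sub-tensor) has an estimated accuracy contribution and an extra memory cost, and the goal is to pick the combination that maximizes total contribution under the memory budget. The greedy sketch below, with made-up numbers, only illustrates this framing; it is not the exact search procedure used in the paper:

```python
# Schematic contribution-analysis search for a sparse update scheme:
# greedily pick the layers with the best accuracy-gain-per-byte until the
# memory budget is exhausted. Numbers and the heuristic are illustrative.
def search_sparse_update(candidates, budget_bytes):
    """candidates: list of (layer_name, accuracy_gain, extra_memory_bytes)."""
    scheme, used = [], 0
    for name, gain, cost in sorted(candidates, key=lambda c: c[1] / c[2], reverse=True):
        if used + cost <= budget_bytes:
            scheme.append(name)
            used += cost
    return scheme, used

candidates = [                      # made-up contribution/cost estimates
    ("classifier",        1.2,  5_000),
    ("block13.pointwise", 0.8, 40_000),
    ("block12.depthwise", 0.5, 10_000),
    ("block2.pointwise",  0.1, 60_000),
]
scheme, used = search_sparse_update(candidates, budget_bytes=60_000)
print(scheme, used)  # ['classifier', 'block12.depthwise', 'block13.pointwise'] 55000
```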


To convert the theoretical savings of the algorithms into measured savings, we designed the Tiny Training Engine (TTE): it moves the work of automatic differentiation to compile time and uses code generation to reduce runtime overhead. It also supports graph pruning and reordering to achieve real memory savings and speedups. Compared with a full update, sparse update effectively reduces peak memory by 7-9x, and with reordering the total memory savings can be further improved to 20-21x. Compared with TF-Lite, the optimized kernels and sparse update in TTE improve the overall training speed by 23-25x.
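The memory effect of pruning and reordering can be pictured with a toy bookkeeping example (purely conceptual, not the TTE code generator): pruning removes the gradient computation of frozen tensors at compile time, and reordering applies each update as soon as its gradient is produced, so only one gradient buffer is alive at a time. All layer names and sizes below are made up:

```python
# Toy bookkeeping sketch of the pruning + reordering idea (not the TTE codegen):
# frozen layers contribute no gradient buffers at all, and with reordering each
# gradient buffer is released right after its in-place update, so the peak is
# one buffer instead of the sum of all of them.
def peak_grad_memory(layer_grad_bytes, trainable, reordered):
    live, peak = 0, 0
    for name, size in reversed(layer_grad_bytes):
        if name not in trainable:      # pruned at compile time: no gradient op emitted
            continue
        live += size                   # gradient buffer materialized
        peak = max(peak, live)
        if reordered:                  # update applied immediately, buffer freed
            live -= size
    return peak

layers = [("conv1", 40_000), ("block1", 80_000), ("block2", 80_000), ("fc", 10_000)]
trainable = {"block2", "fc"}
print(peak_grad_memory(layers, trainable, reordered=False))  # 90000: all selected grads held
print(peak_grad_memory(layers, trainable, reordered=True))   # 80000: one buffer at a time
```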


Conclusion

In this article, we propose the first training solution implemented on a microcontroller (using only 256KB SRAM and 1MB Flash). Our system-algorithm co-design greatly reduces the memory required for training (1000x less than PyTorch) and the training time (20x faster than TF-Lite), while achieving higher accuracy on downstream tasks. Tiny Training can empower many interesting applications: mobile phones can customize language models based on users' emails and input history, smart cameras can continuously recognize new faces and objects, and AI scenarios without Internet connectivity (such as agriculture, marine, and industrial assembly lines) can also keep learning. Through our work, small end devices can perform not only inference but also training. Throughout this process, personal data never leaves the device, so there is no privacy risk, and the AI model can continuously learn on its own to adapt to a dynamically changing world!

