Ant Group NextEvo fully open-sources AI Infra technology, enabling "autonomous driving" for large model training

Recently, NextEvo, the AI innovation R&D department of Ant Group, announced that it has fully open-sourced its AI Infra technology, which can greatly improve the efficiency of large model training. According to the reported figures, the technology raises the share of effective training time to more than 95% and automates the training process, which should noticeably speed up AI research and development.


Picture: Ant Group’s automated distributed deep learning system DLRover is now fully open source

DLRover is a technical framework designed for large-scale distributed training. In many enterprises today, training jobs run in complex, constantly changing hybrid deployment clusters. DLRover is built to handle such environments with ease, however rough the "terrain".

The rapid development of large model technology in 2023 set off explosive growth in engineering practice. Managing data efficiently, optimizing training and inference efficiency, and making full use of existing computing power have become key issues.

Training a model at the 100-billion-parameter scale, such as GPT-3, once on a single GPU would take 32 years, so making full use of computing power during training is essential. There are two ways to do this. First, squeeze more performance out of the GPUs already purchased so that they reach their full potential. Second, put previously unused computing resources, such as CPU and memory, to work, which can be done through heterogeneous computing platforms.
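To put the 32-year single-GPU figure in perspective, here is a rough back-of-the-envelope sketch in Python. The function name, the GPU counts, and the `scaling_efficiency` parameter are illustrative assumptions rather than figures from the article; the sketch simply divides the single-GPU time by the number of GPUs and by an assumed efficiency factor.

```python
def wall_clock_days(single_gpu_years: float, num_gpus: int,
                    scaling_efficiency: float) -> float:
    """Estimate wall-clock training time in days.

    single_gpu_years   -- time to train once on one GPU (32 in the article)
    num_gpus           -- hypothetical cluster size
    scaling_efficiency -- hypothetical fraction (0, 1] of ideal linear speed-up
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    total_days = single_gpu_years * 365
    return total_days / (num_gpus * scaling_efficiency)


if __name__ == "__main__":
    # Hypothetical cluster sizes and efficiencies, for illustration only.
    for gpus, eff in [(1, 1.0), (1000, 1.0), (1000, 0.5)]:
        days = wall_clock_days(32, gpus, eff)
        print(f"{gpus:>5} GPUs at {eff:.0%} efficiency: {days:,.1f} days")
```

Under these assumptions, halving the efficiency doubles the wall-clock time, which is why every percentage point of utilization matters at this scale.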

DLRover has recently integrated the Flash Checkpoint (FCP) solution for checkpoint management during model training. Traditional checkpointing is slow: frequent checkpoints cut into the available training time, while infrequent checkpoints lose too much progress on recovery. With FCP, when training a 100-billion-parameter model, the training time wasted on checkpoints was reduced by a factor of about 5, and the persistence time by a factor of about 70. This raises the effective training time from 90% to 95%, a significant improvement in training efficiency with DLRover.
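The trade-off between frequent and infrequent checkpoints can be illustrated with a simple first-order model. This is only a sketch under stated assumptions, not a description of how FCP works: the formula, the function names, and all numbers (save cost, failure interval, candidate checkpoint intervals) are hypothetical. It assumes each checkpoint blocks training for a fixed time and that each failure loses, on average, half a checkpoint interval of work plus a restart cost.

```python
def effective_fraction(interval_s: float, save_block_s: float,
                       mtbf_s: float, restart_s: float = 0.0) -> float:
    """Approximate share of wall-clock time spent on useful training.

    interval_s   -- training time between two checkpoints
    save_block_s -- time training is blocked by one checkpoint save
    mtbf_s       -- assumed mean time between failures
    restart_s    -- assumed fixed cost to restart after a failure
    """
    # Share of time spent blocked on saving checkpoints.
    save_overhead = save_block_s / (interval_s + save_block_s)
    # Share of time lost to recomputing work after failures.
    failure_loss = (interval_s / 2 + restart_s) / mtbf_s
    return max(0.0, 1.0 - save_overhead - failure_loss)


def best_interval(save_block_s: float, mtbf_s: float, candidates_s):
    """Pick the candidate interval with the highest effective fraction."""
    return max(candidates_s,
               key=lambda i: effective_fraction(i, save_block_s, mtbf_s))


if __name__ == "__main__":
    day = 24 * 3600.0
    candidates = [60, 300, 900, 3600, 4 * 3600]  # hypothetical intervals (s)
    for save_cost in (300.0, 5.0):               # hypothetical slow vs fast save
        interval = best_interval(save_cost, day, candidates)
        frac = effective_fraction(interval, save_cost, day)
        print(f"save cost {save_cost:>5.0f}s -> best interval {interval}s, "
              f"effective fraction {frac:.3f}")
```

In this model, a cheaper save makes shorter intervals worthwhile, so less work is lost per failure and the effective training fraction rises.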

Three new optimizer technologies have also been integrated. The optimizer is a core component of machine learning that updates neural network parameters to minimize the loss function. Among them, Ant's AGD (Auto-switchable optimizer with Gradient Difference of adjacent steps) is 1.5 times faster than the traditional AdamW in large model pre-training tasks. AGD has been used in multiple scenarios within Ant Group with remarkable results, and the related paper was accepted at NeurIPS '23.
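To make the role of an optimizer concrete, the following is a minimal scalar sketch of the AdamW update rule, the baseline named above. It is not AGD and not DLRover code; the function names and the toy loss (x - 3)^2 are assumptions chosen for illustration.

```python
import math


def adamw_step(param: float, grad: float, state: dict, lr: float = 1e-3,
               beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
               weight_decay: float = 0.01) -> float:
    """Apply one AdamW update to a single scalar parameter."""
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and its square.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Decoupled weight decay, applied directly to the parameter.
    param -= lr * weight_decay * param
    # Adaptive gradient step.
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param


def minimize_quadratic(steps: int = 500, lr: float = 0.1) -> float:
    """Minimize the toy loss (x - 3)^2 starting from x = 0."""
    x = 0.0
    state = {"t": 0, "m": 0.0, "v": 0.0}
    for _ in range(steps):
        grad = 2.0 * (x - 3.0)  # derivative of (x - 3)^2
        x = adamw_step(x, grad, state, lr=lr, weight_decay=0.0)
    return x


if __name__ == "__main__":
    print(f"x after optimization: {minimize_quadratic():.4f}")
```

Weight decay is set to zero in the toy run so that the minimum of the loss stays at x = 3; in real training it acts as a regularizer.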


Figure: In large model pre-training tasks, AGD is 1.5 times faster than AdamW

As an automated distributed deep learning system, DLRover's "autonomous driving" modules also include Atorch, a PyTorch distributed training extension library. For models with hundreds of billions of parameters trained on thousands of GPUs, it can bring the computing power utilization of training to 60%, helping developers squeeze more out of their hardware.

DLRover applies the idea of “ML for System” to make distributed training more intelligent. The goal is a system that frees developers entirely from resource configuration so they can focus on model training itself. Even when no resource configuration is supplied, DLRover can still provide an optimal resource configuration for each training job.

It is understood that Ant Group continues to invest in artificial intelligence technology. Recently, Ant Group established an internal AI innovation R&D department, NextEvo, which is responsible for all core technology R&D for Ant AI, including all R&D work on the Bailing large model. Its work covers core technologies such as AI algorithms, AI engineering, NLP, and AIGC, as well as technology R&D and product innovation in areas such as multimodal large models and digital humans.

At the same time, Ant Group has accelerated its pace of open-sourcing, filling related domestic technology gaps and promoting the rapid development of the artificial intelligence industry.

DLRover open source address: https://www.php.cn/link/cf372cbe6eae54c6a6dfb3ebbcdc3404


Statement
This article is reproduced from 机器之心. In case of any infringement, please contact admin@php.cn for removal.