


A massive array of arithmetic circuits powers NVIDIA GPUs to enable unprecedented acceleration of AI, high-performance computing, and computer graphics. Improving the design of these arithmetic circuits is therefore critical to improving GPU performance and efficiency. What if AI learned to design these circuits? In a recent NVIDIA paper, "PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning," researchers demonstrated not only that AI can design these circuits from scratch, but also that the AI-designed circuits are smaller and faster than those designed by state-of-the-art electronic design automation (EDA) tools.
Paper: https://arxiv.org/pdf/2205.07000.pdf
The latest NVIDIA Hopper GPU architecture contains nearly 13,000 instances of AI-designed circuits. In Figure 1 below, the 64b adder circuit designed by PrefixRL (left) is 25% smaller than the circuit designed by a state-of-the-art EDA tool (right).
Arithmetic circuits in computer chips are composed of networks of logic gates (such as NAND, NOR, and XOR) and wires. An ideal circuit should have the following attributes:
- Small: a smaller area means more circuits can be packed onto the chip;
- Fast: lower delay improves chip performance;
- Low power: the circuit consumes less energy.
In this NVIDIA study, the researchers focused on circuit area and delay; they found that power consumption was closely correlated with area for the circuits of interest. Circuit area and delay are often competing properties, so the goal is to find the Pareto frontier of designs that trade them off effectively. In short, the researchers want the minimum-area circuit at every delay.
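The Pareto frontier the researchers target can be illustrated with a short sketch. This is a minimal, illustrative Python filter (not from the paper): a design is Pareto-optimal if no other design is at least as good in both area and delay.

```python
def pareto_frontier(designs):
    """Keep designs not dominated in both area and delay (smaller is better)."""
    frontier = []
    for d in designs:
        dominated = any(
            o["area"] <= d["area"] and o["delay"] <= d["delay"] and o != d
            for o in designs
        )
        if not dominated:
            frontier.append(d)
    return frontier

# Toy designs with made-up (area, delay) numbers for illustration.
designs = [
    {"name": "A", "area": 10.0, "delay": 5.0},
    {"name": "B", "area": 12.0, "delay": 4.0},
    {"name": "C", "area": 11.0, "delay": 6.0},  # dominated by A in both metrics
]
print([d["name"] for d in pareto_frontier(designs)])  # → ['A', 'B']
```

Training agents with different area-delay trade-off weights, as described later in the article, is one way to populate such a frontier.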
In PrefixRL, the researchers focus on a popular class of arithmetic circuits: parallel prefix circuits. Various important circuits in the GPU, such as adders, incrementers, and encoders, are prefix circuits, and they can be defined at a higher level as prefix graphs.
The question then becomes: can an AI agent design good prefix graphs? The state space of all prefix graphs is enormous, O(2^(n^n)), and cannot be explored with brute-force methods. Figure 2 below shows one iteration of PrefixRL with a 4b circuit example.
The researchers used a circuit generator to convert each prefix graph into a circuit of wires and logic gates. These generated circuits are then optimized by a physical synthesis tool, which applies optimizations such as gate sizing, duplication, and buffer insertion.
Because of these physical synthesis optimizations, the final circuit properties (delay, area, and power) do not translate directly from the original prefix-graph properties (such as level count and node count). This is why the AI agent learns to design prefix graphs while optimizing the properties of the final circuits generated from those graphs.
The researchers treat arithmetic circuit design as a reinforcement learning (RL) task, in which an agent is trained to optimize the area and delay properties of arithmetic circuits. For prefix circuits, they designed an environment where the RL agent can add or remove nodes in the prefix graph, after which the environment performs the following steps:
- The prefix graph is legalized to always maintain a correct prefix-sum computation;
- A circuit is generated from the legalized prefix graph;
- The circuit is optimized with a physical synthesis tool;
- The area and delay of the circuit are measured.
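The environment loop above can be sketched as follows. Everything here is a hypothetical stand-in: the prefix-graph encoding, `legalize`, and `synthesize` are toy placeholders for the real legalization pass and physical synthesis tool, and the reward is simply the unweighted improvement in area plus delay.

```python
# Toy sketch of one PrefixRL-style environment step; not NVIDIA's actual API.
class PrefixEnv:
    def __init__(self):
        # Toy prefix graph: nodes encoded as (msb, lsb) spans.
        self.nodes = {(1, 0), (2, 0), (3, 0)}
        self.prev_area, self.prev_delay = 100.0, 10.0

    def legalize(self, nodes):
        # Stand-in: a real legalizer restores a valid prefix-sum structure.
        return nodes

    def synthesize(self, nodes):
        # Stand-in for physical synthesis: fewer nodes here means a smaller
        # but slower circuit, mimicking the area-delay tension.
        area = 20.0 + 25.0 * len(nodes)
        delay = 15.0 - 1.5 * len(nodes)
        return area, delay

    def step(self, action, node):
        self.nodes = (self.nodes | {node}) if action == "add" else (self.nodes - {node})
        self.nodes = self.legalize(self.nodes)
        area, delay = self.synthesize(self.nodes)
        # Reward is the improvement (decrease) in area and delay.
        reward = (self.prev_area - area) + (self.prev_delay - delay)
        self.prev_area, self.prev_delay = area, delay
        return self.nodes, reward

env = PrefixEnv()
_, r = env.step("remove", (3, 0))  # removing a node shrinks area, raises delay
```

In the real system, the synthesize step alone takes tens of seconds, which is why reward calculation is later decoupled from the environment loop.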
In the following animation, the RL agent builds the prefix graph step by step by adding or deleting nodes. At each step, the agent is rewarded with improvements in circuit area and latency.
(The original article shows this as an interactive animation.)
Fully convolutional Q-learning agent
The researchers use the Q-learning algorithm to train the circuit-design agent. As shown in Figure 3 below, they decompose the prefix graph into a grid representation in which each grid element maps uniquely to a prefix node. This grid representation is used for both the input and output of the Q-network: each element of the input grid indicates whether a node is present, and each element of the output grid gives the Q-values for adding or removing that node.
The researchers use a fully convolutional neural network architecture because the input and output of the Q-learning agent are grid representations. The agent predicts Q-values for the area and delay properties separately, because the area and delay rewards are separately observable during training.
Figure 3: 4b prefix graph representation (left) and fully convolutional Q-learning agent architecture (right).
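The grid representation can be illustrated with a small numpy sketch. The single random "convolution" below is only a placeholder for the trained fully convolutional Q-network; it shows how an n×n node-presence grid maps to a (2, n, n) output of add/remove Q-values, and why a fully convolutional design works for any grid size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
grid = np.zeros((n, n))          # grid[i, j] = 1 if prefix node (i, j) exists
grid[np.tril_indices(n)] = 1.0   # toy graph: all lower-triangular nodes present

def conv2d_same(x, kernels):
    """'Same'-padded 3x3 convolution: one output channel per kernel."""
    padded = np.pad(x, 1)
    out = np.zeros((len(kernels),) + x.shape)
    for c, k in enumerate(kernels):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                out[c, i, j] = np.sum(padded[i:i + 3, j:j + 3] * k)
    return out

# Untrained placeholder weights: channel 0 ~ Q(add), channel 1 ~ Q(remove).
kernels = rng.normal(size=(2, 3, 3))
q_values = conv2d_same(grid, kernels)
print(q_values.shape)  # → (2, 4, 4): one add/remove Q-value per grid cell
```

Because the layer is convolutional, the same weights apply to any n, which matches the paper's use of one grid-in, grid-out architecture across circuit widths.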
Raptor for distributed training
PrefixRL is computationally demanding: physical synthesis requires 256 CPUs per GPU, and training the 64b task took over 32,000 GPU hours. NVIDIA developed Raptor, an in-house distributed reinforcement learning platform that takes full advantage of NVIDIA hardware and can run reinforcement learning at this industrial scale (Figure 4 below).
Raptor has features that improve training scalability and speed, such as job scheduling, custom networking, and GPU-aware data structures. In the context of PrefixRL, Raptor enables a hybrid allocation of work across CPUs, GPUs, and Spot Instances. The networking in this reinforcement learning application is diverse and benefits from the following:
- Raptor's ability to switch to NCCL for point-to-point transfers, sending model parameters directly from the learner GPU to inference GPUs;
- Redis for asynchronous, smaller messages such as rewards and statistics;
- A JIT-compiled RPC to handle high-volume, low-latency requests such as uploading experience data.
Finally, Raptor provides GPU-aware data structures, such as a replay buffer with multi-threaded servicing that receives experiences from multiple workers, batches the data in parallel, and preloads it onto the GPU.
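A replay buffer with a multi-threaded ingest path, loosely modeled on this description, might look like the sketch below. The structure and names are illustrative, not Raptor's actual implementation, and a real buffer would also batch and preload the sampled data onto the GPU.

```python
import queue
import random
import threading

class ReplayBuffer:
    """Toy replay buffer fed by a background ingest thread."""

    def __init__(self, capacity=10000):
        self.buffer, self.capacity = [], capacity
        self.lock = threading.Lock()
        self.inbox = queue.Queue()  # experiences arriving from workers

    def ingest_worker(self):
        # Runs in a background thread: drains experiences until a None sentinel.
        while True:
            item = self.inbox.get()
            if item is None:
                return
            with self.lock:
                self.buffer.append(item)
                if len(self.buffer) > self.capacity:
                    self.buffer.pop(0)  # drop oldest experience

    def sample(self, batch_size):
        with self.lock:
            return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer()
t = threading.Thread(target=buf.ingest_worker)
t.start()
for i in range(100):
    buf.inbox.put(("state", "action", float(i)))  # toy experience tuples
buf.inbox.put(None)  # shutdown signal
t.join()
batch = buf.sample(32)
```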
Figure 4 below shows that the PrefixRL framework supports concurrent training and data collection, using NCCL to efficiently send the latest parameters to the actors.
Figure 4: Researchers use Raptor for decoupled parallel training and reward calculation to overcome circuit synthesis delays.
Reward Calculation
The researchers use a trade-off weight w (in the range [0, 1]) to combine the area and delay objectives. They train agents with various weights to obtain a Pareto frontier of designs that balance the area-delay trade-off.
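A minimal sketch of such a scalarized reward, assuming a simple weighted combination of area and delay improvements; the exact scaling used in the paper may differ.

```python
def reward(prev_area, prev_delay, area, delay, w):
    """Blend area and delay improvements with trade-off weight w in [0, 1]."""
    return w * (prev_area - area) + (1 - w) * (prev_delay - delay)

# A step that shrinks area from 100 to 90 but raises delay from 10 to 12:
# w = 1 rewards only area reduction; w = 0 rewards only delay reduction.
r_area_only = reward(100.0, 10.0, 90.0, 12.0, w=1.0)   # → 10.0
r_delay_only = reward(100.0, 10.0, 90.0, 12.0, w=0.0)  # → -2.0
```

Sweeping w across [0, 1] and training one agent per value is what traces out the Pareto frontier of area-delay trade-offs.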
Physical synthesis optimization in the RL environment can generate a variety of solutions trading off area and delay. The researchers drive the physical synthesis tool with the same trade-off weight used to train the corresponding agent.
Performing physical synthesis optimization in the reward-calculation loop has the following advantages:
- The RL agent learns to directly optimize the final circuit properties for the target technology node and library;
- The RL agent includes the peripheral logic of the target arithmetic circuit during physical synthesis, thereby jointly optimizing the performance of the target arithmetic circuit and its peripheral logic.
However, physical synthesis is a slow process (~35 seconds for a 64b adder), which can significantly slow down RL training and exploration.
The researchers decouple reward calculation from state updates, because the agent needs only the current prefix-graph state to take an action, not the circuit synthesis result or previous rewards. Thanks to Raptor, they offload the lengthy reward calculation to a pool of CPU workers that perform physical synthesis in parallel, while the actor agents keep stepping through the environment without waiting.
When a CPU worker returns a reward, the transition is inserted into the replay buffer. Synthesis rewards are cached to avoid redundant calculation when a state is encountered again.
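The caching idea can be sketched as follows. Here `slow_synthesis` is a toy stand-in for the ~35-second physical synthesis run, and the cache key is an order-independent encoding of the prefix-graph state; the names are illustrative, not from the paper.

```python
synthesis_calls = 0

def slow_synthesis(state):
    """Stand-in for a ~35 s physical synthesis invocation."""
    global synthesis_calls
    synthesis_calls += 1
    return 20.0 + 25.0 * len(state)  # toy area estimate

cache = {}

def cached_reward(state):
    # Hashable, order-independent key for a set of (msb, lsb) node spans.
    key = tuple(sorted(state))
    if key not in cache:
        cache[key] = slow_synthesis(state)
    return cache[key]

cached_reward({(1, 0), (2, 0)})
cached_reward({(2, 0), (1, 0)})  # same state, different order: cache hit
```

With the cache, revisiting a previously synthesized prefix graph costs a dictionary lookup instead of another synthesis run.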
Results and Outlook
Figure 5 below shows the area and delay of 64b adder circuits designed with PrefixRL, which Pareto-dominate the adder circuits from a state-of-the-art EDA tool.
The best PrefixRL adders achieve 25% lower area than the EDA tool's adders at the same delay. The prefix graphs that map to Pareto-optimal adder circuits after physical synthesis optimization have irregular structures.
Figure 5: Arithmetic circuits designed by PrefixRL are smaller and faster than circuits designed by a state-of-the-art EDA tool.
(Left) circuit architectures; (right) properties of the corresponding 64b adder circuits.
To the researchers' knowledge, this is the first method to use a deep reinforcement learning agent to design arithmetic circuits. NVIDIA envisions it as a blueprint for applying AI to real-world circuit design problems: constructing action spaces and state representations, building RL agent models, optimizing for multiple competing objectives, and overcoming slow reward calculation.
