
How artificial intelligence can make hardware develop better

王林 · 2023-04-13 08:13:02


Computer hardware was a stagnant market for many years. The dominant x86 microprocessor architecture has reached the limits of the performance gains that can be achieved through miniaturization, so manufacturers have focused mainly on packing more cores into each chip.

For the fast-moving fields of machine learning and deep learning, the GPU has been the savior. Originally designed for graphics rendering, GPUs can contain thousands of small cores, which makes them ideal for the parallel processing that AI training requires.

The essence of artificial intelligence is that it benefits from parallel processing, and about ten years ago it was discovered that GPUs, designed to display pixels on a screen, are well suited to the task because they are parallel processing engines into which a large number of cores can be packed.

That’s good news for Nvidia, which saw its market capitalization surge from less than $18 billion in 2015 to $735 billion before the market contracted last year. Until recently, the company had virtually the entire market to itself. But many competitors are trying to change that.

So far, AI workloads have run mostly on Nvidia's GPUs, but users are looking for technologies that can take them to the next level. As high-performance computing and AI workloads continue to converge, we will see a wider variety of accelerators emerge.

Accelerating the development of new hardware

The big chip manufacturers are not standing still. Three years ago, Intel acquired Israeli chipmaker Habana Labs and made the company the focus of its artificial intelligence development efforts.

The Gaudi2 training-optimized processor and Greco inference processor that Habana launched last spring are said to be at least twice as fast as Nvidia's flagship A100.

In March this year, Nvidia launched its H100 accelerator GPU with 80 billion transistors and support for the company's high-speed NVLink interconnect. It features a dedicated engine that can accelerate the execution of Transformer-based models used in natural language processing by six times compared to the previous generation. Recent tests using the MLPerf benchmark show that H100 outperforms Gaudi2 in most deep learning tests. Nvidia is also seen as having an advantage in its software stack.

Many users choose GPUs because they gain access to a rich, centralized software ecosystem. A large part of Nvidia's success is that it has built its strategy around that ecosystem.

Hyperscale cloud computing companies entered the field even before the chipmakers. Google LLC's Tensor Processing Unit, an application-specific integrated circuit, launched in 2016 and is now in its fourth generation. Amazon Web Services launched its machine learning inference accelerator in 2018, claiming it offers more than twice the performance of GPU-accelerated instances.

Last month, the company announced the general availability of cloud instances based on its Trainium chips, saying that for training deep learning models they deliver comparable performance at up to 50% lower cost than GPU-based EC2 instances. Both companies' efforts are focused mainly on delivery through cloud services.

While the established market leaders focus on incremental improvements, many of the more interesting innovations are happening at startups building AI-specific hardware. AI-focused startups attracted the majority of the $1.8 billion that venture capitalists invested in chip startups last year, more than double the amount in 2017.

They are chasing a market that could bring huge gains. The global artificial intelligence chip market is expected to grow from US$8 billion in 2020 to nearly US$195 billion by 2030.

Smaller, Faster, Cheaper

Few startups are trying to displace the x86 CPU, largely because the leverage in doing so is relatively small. The chips themselves are no longer the bottleneck; communication between chips is.

The CPU performs low-level operations such as managing files and assigning tasks, but a purely CPU-centric approach no longer scales. Because the CPU has to handle everything from opening files to managing memory caches, its design must be general-purpose, and that makes it poorly suited to the massively parallel matrix arithmetic that AI model training requires.
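To make that concrete, here is a minimal, illustrative NumPy sketch (not tied to any chip mentioned in this article) of the matrix arithmetic in question. Every output element is an independent dot product, which is why the work spreads naturally across thousands of accelerator cores but crawls through a serial loop on a general-purpose core:

```python
# Illustrative sketch: the core of model training is dense matrix multiplication,
# where every output element is an independent dot product.
import numpy as np

def naive_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Serial triple loop -- roughly what a single general-purpose core does."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(m):          # each (i, j) pair is independent work,
        for j in range(n):      # which is exactly what GPUs and accelerators
            for p in range(k):  # spread across thousands of cores
                out[i, j] += a[i, p] * b[p, j]
    return out

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
# a @ b calls an optimized, parallel BLAS routine; results should agree
assert np.allclose(naive_matmul(a, b), a @ b, atol=1e-3)
```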

Most activity in the market revolves around coprocessor accelerators, application-specific integrated circuits, and, to a lesser extent, field-programmable gate arrays that can be fine-tuned for specific uses.

Everyone is following Google's lead in developing coprocessors that work alongside the CPU and target specific parts of the AI workload by hard-coding algorithms into the processor rather than running them as software.

Acceleration equation

One startup is developing so-called graph streaming processors for edge computing scenarios such as self-driving cars and video surveillance. The fully programmable chipset takes on many of the functions of a CPU but is optimized for task-level parallelism and streaming execution, drawing only 7 watts of power.

The architecture is based on a graph data structure, in which relationships between objects are represented as connected nodes and edges. Every machine learning framework works in terms of such graphs, and the chip's design preserves the same semantics, so entire graphs, even those containing custom nodes, can be executed directly, and anything parallel within them can be accelerated.
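As an illustration only (this is not the vendor's API), the sketch below shows the idea in Python: a model is a set of operation nodes joined by edges, and any nodes whose inputs are already computed form a "wave" that can execute in parallel:

```python
# Illustrative computation graph: nodes are operations, edges are data dependencies.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    op: str
    inputs: list = field(default_factory=list)  # names of upstream nodes

# A tiny two-branch graph: both conv branches depend only on "input",
# so a graph-based accelerator can schedule them concurrently.
graph = [
    Node("input", "placeholder"),
    Node("conv_a", "conv2d", ["input"]),
    Node("conv_b", "conv2d", ["input"]),
    Node("concat", "concat", ["conv_a", "conv_b"]),
    Node("output", "softmax", ["concat"]),
]

def ready(done: set) -> list:
    """Nodes whose inputs are all finished -- these can run in parallel."""
    return [n for n in graph if n.name not in done and all(i in done for i in n.inputs)]

done = set()
while len(done) < len(graph):
    wave = ready(done)
    print("parallel wave:", [n.name for n in wave])
    done.update(n.name for n in wave)
```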

This graph-based architecture addresses some of the capacity limitations of GPUs and CPUs and adapts more flexibly to different types of AI tasks. It also allows developers to push more processing to the edge for better inference: if 80% of the processing can be done at the edge, companies can save a great deal of time and cost.

These applications bring intelligence closer to the data and enable rapid decision-making. Most of them target inference, the in-field execution of trained AI models, rather than the far more computationally intensive task of training.
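For readers less familiar with that distinction, the following minimal PyTorch sketch (a toy model, assuming PyTorch is installed; not any vendor's stack) contrasts a training step with an inference call. Training runs a forward pass, backpropagation, and a weight update, while inference is a single forward pass with gradient tracking disabled, which is why it is so much cheaper to run at the edge:

```python
# Toy example: one training step vs. one inference call.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # toy model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# --- training step (compute-heavy: gradients for every weight) ---
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()      # backpropagation: the expensive part accelerators chase
opt.step()

# --- inference (what edge chips mostly run) ---
model.eval()
with torch.no_grad():  # no gradient bookkeeping, so less memory and compute
    prediction = model(torch.randn(1, 16)).argmax(dim=1)
```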

One company is developing a chip that uses in-memory computing to reduce latency and the need for external memory. Its artificial intelligence platform is intended to provide the flexibility to run multiple neural networks at once while maintaining high accuracy.

Its data processing unit line is a massively parallel processor array: a scalable 80-core processor that can execute dozens of tasks in parallel. The key innovation is the tight integration of a tensor coprocessor into each processing element, along with support for direct tensor data exchange between elements, which avoids memory-bandwidth bottlenecks. Because pre- and post-processing run on the same processing elements, AI applications can be accelerated efficiently.

Another company focuses on deep learning inference with a thumbnail-sized chipset that it claims can perform 26 trillion operations per second while consuming less than 3 watts of power. This is achieved in part by decomposing each layer of a trained deep learning model into its required computing elements and integrating them on a chip purpose-built for deep learning.

On-chip memory further reduces overhead: the entire network resides inside the chip with no external memory, so the chip can be smaller and use less energy. It can run deep learning models on high-definition imagery in near real time, allowing a single device to perform automatic license plate recognition across four lanes simultaneously.

Current Development of Hardware

Some startups are taking more of a moonshot approach, aiming to redefine AI model training and the entire platform it runs on.

For example, one AI processor optimized for machine learning can manage up to 350 trillion processing operations per second, with nearly 9,000 concurrent threads and 900 megabytes of in-processor memory. The integrated computing system, called the Bow-2000 IPU Machine, is said to be capable of 1.4 petaflops.

What sets it apart is its three-dimensional stacked chip design, which packs nearly 1,500 parallel processing cores into a single chip, each capable of running completely different operations. That differs from the widely used GPU architecture, which prefers to run the same operation across large blocks of data.

As another example, some companies are tackling the interconnect, the wiring that links components within integrated circuits. As processors approach their theoretical maximum speeds, the paths that move the bits become an increasingly serious bottleneck, especially when multiple processors access memory simultaneously. Today the chip itself is no longer the bottleneck; the interconnect is.

One such chip uses nanophotonic waveguides in an artificial intelligence platform that is said to combine high speed and large bandwidth in a low-energy package. It is essentially an optical communications layer that can connect multiple processors and accelerators.

The quality of AI results depends on being able to support very large, complex models while delivering very high-throughput responses, and this approach makes both achievable. It applies to anything that can be expressed as linear algebra, which covers most applications of artificial intelligence.

Expectations are also extremely high for integrated hardware and software platforms. Enterprises have seized on them as platforms that can run artificial intelligence and other data-intensive applications anywhere from the data center to the edge.

One such hardware platform uses custom 7-nanometer chips designed for machine learning and deep learning. Its reconfigurable dataflow architecture runs an AI-optimized software stack, and the hardware is designed to minimize memory accesses, reducing interconnect bottlenecks.

The processor can be reconfigured to suit either AI or high-performance computing (HPC) workloads. It is designed to handle large-scale matrix operations at a high performance level, a plus for clients whose workloads change.

Although CPUs, GPUs and even FPGAs are well suited for deterministic software such as transactional systems and ERP, machine learning algorithms are probabilistic, meaning the results are not known in advance. This requires a completely different hardware infrastructure.

The platform minimizes interconnect issues by attaching 1TB of high-speed double-data-rate synchronous memory to the processor and essentially masking the latency of the DDR controller with on-chip memory that is 20 times faster. Because this is transparent to the user, it allows language models with higher parameter counts, and the highest-resolution images, to be trained without tiling or downsampling.

Tiling is an image analysis technique that reduces the demand for computing power by splitting an image into smaller chunks, analyzing each chunk, and then recombining the results. Downsampling trains a model on a random subset of the training data to save time and computing resources. The result is a system that is not only faster than GPU-based systems but also capable of solving larger problems.
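For readers unfamiliar with tiling, the short NumPy sketch below illustrates the idea with made-up sizes (a hypothetical 256-pixel tile and a stand-in per-tile analyze step, not any vendor's pipeline): the image is processed one bounded-size chunk at a time, so memory use per step is limited by the tile rather than the full frame:

```python
# Illustrative tiling: split an image into fixed-size tiles, process each
# independently, and stitch the outputs back together.
import numpy as np

TILE = 256  # hypothetical tile edge length in pixels

def analyze(tile: np.ndarray) -> np.ndarray:
    """Placeholder for per-tile inference; here it just normalizes the tile."""
    return (tile - tile.mean()) / (tile.std() + 1e-6)

def tiled_process(image: np.ndarray) -> np.ndarray:
    h, w = image.shape
    out = np.zeros_like(image, dtype=np.float32)
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            patch = image[y:y + TILE, x:x + TILE]   # edge tiles may be smaller
            out[y:y + TILE, x:x + TILE] = analyze(patch.astype(np.float32))
    return out

image = np.random.randint(0, 255, (1024, 1536), dtype=np.uint8)  # toy "high-res" frame
result = tiled_process(image)  # memory per step is bounded by one tile, not the whole image
```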

Summary

With so many companies chasing solutions to the same problems, a shakeout is inevitable, but no one expects it to come soon. GPUs will be around for a long time and will probably remain the most cost-effective solution for AI training and inference projects that don't require extreme performance.

Nevertheless, as models at the high end of the market become larger and more complex, there is an increasing need for functionally specific architectures. Three to five years from now, we will likely see a proliferation of GPUs and AI accelerators, which is the only way we can scale to meet demand at the end of this decade and beyond.

Leading chipmakers are expected to continue doing what they do well and gradually build on existing technologies. Many companies will also follow Intel's lead and acquire startups focused on artificial intelligence. The high-performance computing community is also focusing on the potential of artificial intelligence to help solve classic problems such as large-scale simulations and climate modeling.

The high-performance computing ecosystem is always looking for new technologies it can absorb to stay ahead of the curve, and it is exploring what artificial intelligence can bring to the table. Lurking behind the scenes is quantum computing, a technology that remains more theoretical than practical but has the potential to revolutionize computing.

Regardless of which new architecture gains traction, the surge in artificial intelligence has undoubtedly reignited interest in the potential for hardware innovation to open up new frontiers in software.


Statement: This article is reproduced from 51cto.com. If there is any infringement, please contact admin@php.cn to request removal.