Like TensorFlow, will NVIDIA's CUDA monopoly be broken?

Over the past decade, the landscape of machine learning software development has changed significantly. Many frameworks have sprung up, but most rely heavily on NVIDIA's CUDA and achieve their best performance on NVIDIA GPUs. However, with the arrival of PyTorch 2.0 and OpenAI Triton, NVIDIA's dominance in this field is starting to crack.

Google once enjoyed major advantages in machine learning model architecture, training, and optimization, but it now struggles to fully capitalize on them. On the hardware side, it has long been difficult for other AI hardware companies to weaken NVIDIA's dominance. But with the emergence of PyTorch 2.0 and OpenAI Triton, the default software stack for machine learning models will no longer be NVIDIA's closed-source CUDA.

TensorFlow vs. PyTorch

A similar competition played out among machine learning frameworks. A few years ago, the framework ecosystem was quite fragmented, but TensorFlow was the front-runner. On the surface, Google seemed firmly in control of the machine learning framework industry: it designed the TPU, an AI application-specific accelerator, around TensorFlow, giving it a first-mover advantage.

However, it now appears that PyTorch has won, and Google failed to translate its first-mover advantage into a dominant position in the emerging ML industry. Google seems somewhat isolated in the machine learning community these days because it does not use PyTorch and GPUs, relying instead on its own software stack and hardware. In fact, Google has developed a second machine learning framework, JAX, which competes directly with TensorFlow, a typical piece of "Google behavior."

Some believe that, with the rise of large language models, especially OpenAI's models and the many language models built on the OpenAI API, Google's dominance in search and natural language processing is waning. Perhaps this view is too pessimistic; after all, most current models are still built on the transformer architecture that Google developed.

So why has PyTorch won? The main reason is that PyTorch offers greater flexibility and usability than TensorFlow. The key difference between the two is that PyTorch uses Eager mode rather than Graph mode.

Eager mode is essentially standard scripted execution: it behaves no differently from ordinary Python code. This makes the code easier to debug and understand, because users can see intermediate results and observe how the model runs.

In contrast, Graph mode has two phases. The first phase defines a computational graph, in which nodes represent operations or variables and edges represent the data flowing between them. The second phase is the deferred execution of an optimized version of that graph.

This two-stage approach makes code harder to understand and debug, because the user cannot see what is happening until graph execution finishes. It is analogous to "interpreted" vs. "compiled" languages, such as Python vs. C; Python is easier to debug largely because it is interpreted.
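
To make the contrast concrete, here is a toy sketch (illustrative code, not from the original article): in eager mode the model is just ordinary Python.

```python
import torch

def layer(x, w):
    h = x @ w            # in eager mode this line executes immediately...
    print(h.mean())      # ...so intermediates can be printed or inspected
    return torch.relu(h)

out = layer(torch.randn(8, 16), torch.randn(16, 4))   # step through it in pdb

# A graph-mode framework would instead record the matmul and relu as nodes in
# a computational graph first and execute an optimized version of that graph
# later, so the intermediate values are not directly visible while the graph
# is being defined.
```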

While TensorFlow now also uses Eager mode by default, the research community and most large tech companies choose to use PyTorch.

Machine learning training components

If machine learning model training is reduced to its simplest form, two main factors determine training time:

    • Compute (FLOPS): running dense matrix multiplications within each layer;
    • Memory bandwidth.

In the past, the dominant factor in machine learning training time was compute: waiting for the system to finish its matrix multiplications. As NVIDIA's GPUs have continued to evolve, this has quickly ceased to be the main issue.

NVIDIA leveraged Moore's Law to improve FLOPS by orders of magnitude, with tensor cores and lower-precision floating-point formats as the main architectural changes. In comparison, far less has changed on the memory side.

In 2018, the state-of-the-art model was BERT and the state-of-the-art GPU was the NVIDIA V100; even then, matrix multiplication was already ceasing to be the main lever for improving model performance. Since then, models have grown by three to four orders of magnitude in parameter count, while the fastest GPUs have grown by only one order of magnitude in FLOPS.

Even in 2018, purely compute-bound workloads accounted for 99.8% of FLOPS but only 61% of runtime. Normalization and pointwise ops use only 1/250 and 1/700 of the FLOPS of matrix multiplication, respectively, yet they consume nearly 40% of model runtime.
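
A rough back-of-the-envelope calculation shows why. The sketch below (illustrative numbers only, assuming fp16 operands and square matrices) compares the arithmetic intensity, i.e. FLOPs per byte of memory traffic, of a matmul and a pointwise op.

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved), assuming
# fp16 operands (2 bytes/element) and square n x n tensors. Illustrative only.
def matmul_intensity(n):
    flops = 2 * n**3              # multiply-accumulate across the inner dim
    bytes_moved = 3 * 2 * n**2    # read A, read B, write C
    return flops / bytes_moved

def pointwise_intensity(n):
    flops = n * n                 # ~1 FLOP per element (e.g. a scale or add)
    bytes_moved = 2 * 2 * n**2    # read input, write output
    return flops / bytes_moved

print(matmul_intensity(4096))     # ~1365 FLOPs/byte: easily compute-bound
print(pointwise_intensity(4096))  # 0.25 FLOPs/byte: memory-bandwidth-bound
```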

Memory wall

As model sizes keep soaring, large language models (LLMs) need more than 100 GB of memory just for their weights. The product recommendation networks deployed by Baidu and Meta need tens of terabytes of memory to store their massive embedding tables. Most of the time in large-model training and inference is spent not on matrix multiplication but on waiting for data to arrive. The obvious question is why architects don't put more memory closer to the compute, and the answer is equally obvious: cost.

The closest, fastest pool of shared memory is usually SRAM on the same die. Some machine learning ASICs try to use huge SRAM pools to hold model weights, but even Cerebras' roughly $5,000,000 wafer-scale chip has only 40 GB of SRAM, nowhere near enough to hold the weights of a 100B-parameter model.

NVIDIA designed its chips with far less on-chip memory: 40 MB on the A100 and 50 MB on the H100. 1 GB of SRAM on TSMC's 5nm process would take roughly 200 square millimeters of silicon, and the associated control logic and fabric would push that past 400 square millimeters. Given that the A100 costs over $10,000 and the H100 closer to $20,000, this approach is financially infeasible. Even ignoring NVIDIA's roughly 75% margin on data center GPUs, SRAM still works out to around $100/GB of cost in a finished product.

Moreover, the cost of on-chip SRAM will not fall much with traditional Moore's Law process shrinks: the same 1 GB of memory on TSMC's next-generation 3nm process actually costs more. 3D SRAM will help bring SRAM costs down somewhat, but only temporarily.

The next step down the memory hierarchy is tightly coupled off-chip DRAM. DRAM has an order of magnitude higher latency than SRAM (~100 ns vs. ~10 ns), but it is also much cheaper. DRAM followed Moore's Law for decades; when Gordon Moore coined the term, Intel's main business was DRAM, and his predictions about transistor density and cost largely held for DRAM until about 2009. Since 2012, however, DRAM cost has barely improved.

Demand for memory, however, has only grown. DRAM now accounts for around 50% of total server cost, forming the so-called "memory wall." Comparing NVIDIA's 2016 P100 to the latest H100, memory capacity has grown 5x (16 GB → 80 GB) while FP16 performance has grown 46x (21.2 TFLOPS → 989.5 TFLOPS).

Memory capacity is an important bottleneck, but so is memory bandwidth, and bandwidth gains are usually won through parallelism. While commodity DRAM costs only a few dollars per GB today, to get the massive bandwidth machine learning needs, NVIDIA uses HBM, a stack of 3D-layered DRAM dies that requires more expensive packaging. HBM runs roughly $10-20/GB, including packaging and yield costs.

These cost constraints on memory bandwidth and capacity show up clearly in NVIDIA's A100 GPU: without extensive optimization, it achieves only very low FLOPS utilization.

Even with heavy optimization, large language models reach only about 60% FLOPS utilization. Much of the time is spent waiting for data from other compute units or memory, or recomputing results just in time to relieve memory bottlenecks.

From the A100 to the H100, FLOPS grows by more than 6x while memory bandwidth grows by only 1.65x, leading many to worry that H100 utilization will be low. The A100 already needed a lot of tricks to get around the memory wall, and the H100 needs even more.

The H100 brings distributed shared memory and L2 multicast to the Hopper architecture. The idea is to let data in one SM be written directly into the SRAM (shared memory/L1 cache) of another SM, which effectively enlarges the cache and reduces the bandwidth needed for DRAM reads and writes. Future architectures will keep cutting the number of operations sent to memory to minimize the impact of the memory wall. It is worth noting that larger models tend to achieve higher utilization, since required FLOPS scale roughly as the cube of the parameter count, while memory bandwidth and capacity requirements tend to scale quadratically.

Operator fusion

If all the time is spent on memory transfers (i.e., the workload is memory-bandwidth-bound), adding more GPU FLOPS will not help. Conversely, if all the time is spent executing large matmuls, then even rewriting the model logic in C to cut overhead will not help.

PyTorch beat TensorFlow thanks to the flexibility and usability of Eager mode, but moving to Eager mode is not pure upside. In eager execution, each operation reads its inputs from memory, computes, and writes its result back to memory before the next operation runs. Without extensive optimization, this dramatically inflates memory bandwidth requirements.

For models executed in Eager mode, one of the main optimizations is therefore operator fusion: instead of writing each intermediate result to memory, a fused operation computes multiple functions in a single pass, minimizing memory reads and writes. Operator fusion improves operator scheduling and reduces memory bandwidth and memory capacity costs.
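
As a toy illustration (not a real fused kernel, just a sketch of the memory-traffic argument), consider a chain of pointwise ops in eager mode versus what a single fused kernel would do:

```python
import torch

def unfused(x):
    # Eager mode: each line is a separate kernel. Every intermediate tensor
    # is written to memory and read back by the next op.
    a = torch.relu(x)
    b = a * 2.0
    return b + 1.0

# A fused kernel would compute relu(x) * 2 + 1 element by element in a single
# pass: one read of x and one write of the result. For N fp16 elements the
# unfused chain moves roughly 6 * 2 * N bytes, the fused version roughly
# 2 * 2 * N bytes, about a 3x cut in memory traffic for identical FLOPs.
x = torch.randn(4096, 4096)
y = unfused(x)
```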

This kind of optimization usually means writing custom CUDA kernels, which is far harder than writing simple Python scripts. Over time, PyTorch has steadily implemented more and more fused operators, many of which simply combine several common operations into a single, more complex function.

These added operators make it easier to build models in PyTorch and make Eager mode faster thanks to fewer memory reads and writes. The downside is that PyTorch has ballooned to more than 2,000 operators within a few years.

One could call software developers lazy, but honestly, who hasn't been? Once they get used to a new operator in PyTorch, they keep using it. Developers may not even realize the performance benefit; they keep using the operator simply because it saves them from writing more code.

Moreover, not all operators can be fused. Deciding which operations to fuse and which to assign to specific compute resources at the chip and cluster level takes considerable time. While fusion strategies are broadly similar, where operators get fused can differ substantially across architectures.

NVIDIA was the king

This growing, default set of operators helps NVIDIA, because each operator is quickly optimized for its architecture and for no other hardware. If an AI hardware startup wants to fully implement PyTorch, it must support a growing list of 2,000+ operators at high performance.

Because squeezing out maximum performance takes so much skill, training large models on GPUs with high FLOPS utilization demands an ever higher level of talent. Eager mode execution plus operator fusion also means that the software, techniques, and models being developed are constantly pushed to fit the compute-to-memory-bandwidth ratios of current-generation GPUs.

Everyone building a machine learning chip is constrained by the same memory wall. ASICs are bound to support the most commonly used frameworks, the default development patterns, GPU-optimized PyTorch code, and a mix of NVIDIA and external libraries. In that environment, an architecture that sheds the GPU's various non-compute baggage in favor of more FLOPS and a stricter programming model makes little sense.

However, ease of use comes first. The only way to break this vicious cycle is to make the software that runs models on NVIDIA GPUs transfer to other hardware as easily and seamlessly as possible. As model architectures stabilize and the abstractions from PyTorch 2.0, OpenAI Triton, and MLOps companies such as MosaicML become the default, the architecture and economics of a chip solution start to become the biggest purchase drivers, rather than the ease of use that NVIDIA's superior software provides.

PyTorch 2.0

A few months ago the PyTorch Foundation was established, spinning PyTorch out of Meta. Alongside the move to an open development and governance model, PyTorch 2.0 was released in early beta and became generally available in March. The release brings many changes, but the biggest is the addition of a compilation stack that supports a graph execution model. This shift makes it easier to make proper use of a wide range of hardware.

PyTorch 2.0 improves training performance on the NVIDIA A100 by 86% and CPU inference performance by 26%, significantly cutting the compute time and cost of training models. These benefits extend to GPUs and accelerators from AMD, Intel, Tenstorrent, Luminous Computing, Tesla, Google, Amazon, Microsoft, Marvell, Meta, Graphcore, Cerebras, SambaNova, and more.
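
A minimal sketch of the opt-in API (assuming a recent PyTorch 2.x install; the model and sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# One line opts in: TorchDynamo captures graphs and lowers them through
# PrimTorch and TorchInductor; anything it cannot handle falls back to eager.
compiled_model = torch.compile(model)

x = torch.randn(64, 1024)
loss = compiled_model(x).sum()
loss.backward()   # training works too, via AOT Autograd
```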

For hardware that is currently unoptimized, PyTorch 2.0 leaves even more room for performance improvement. Meta and others contribute so heavily to PyTorch because they want to reach higher FLOPS utilization with less effort on their multi-billion-dollar GPU training clusters. They are also motivated to make their software stacks more portable to other hardware, which introduces competition into the machine learning space.

Thanks to better APIs, PyTorch 2.0 also advances distributed training with support for data parallelism, sharding, pipeline parallelism, and tensor parallelism. In addition, it supports dynamic shapes natively across the stack, which, among many other benefits, makes it much easier to handle varying sequence lengths for LLMs. This is the first time a major compiler has supported dynamic shapes from training through inference.
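
Here is a minimal sketch of how dynamic shapes can be requested (assuming a recent PyTorch 2.x install; the toy model is made up for illustration):

```python
import torch

@torch.compile(dynamic=True)   # ask the compiler to keep sizes symbolic
def attention_scores(tokens):
    return torch.softmax(tokens @ tokens.transpose(-1, -2), dim=-1)

# Different sequence lengths can reuse the same compiled artifact instead of
# recompiling once per shape (subject to guards).
for seq_len in (128, 512, 1024):
    attention_scores(torch.randn(1, seq_len, 64))
```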

PrimTorch

Writing a high-performance PyTorch backend that fully supports all 2,000+ operators is hard for any machine learning ASIC other than NVIDIA GPUs. PrimTorch reduces the operator count to roughly 250 primitive operators while keeping usability unchanged for PyTorch end users. This makes non-NVIDIA PyTorch backends much simpler and more accessible to implement, so custom hardware and systems vendors can bring up their software stacks more easily.
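
To get a feel for what "primitive operators" means, here is a purely conceptual sketch, not PyTorch's actual decomposition table: a composite op such as softmax written in terms of a handful of simpler ops.

```python
import torch

# Conceptual sketch only: a composite op expressed through a few simpler ops
# (max, sub, exp, sum, div). PrimTorch does something analogous at the
# framework level, mapping the 2,000+ user-facing operators onto ~250
# primitives that a backend has to implement.
def softmax_from_simpler_ops(x, dim=-1):
    shifted = x - x.amax(dim=dim, keepdim=True)    # subtract max for stability
    num = torch.exp(shifted)
    return num / num.sum(dim=dim, keepdim=True)

x = torch.randn(4, 10)
assert torch.allclose(softmax_from_simpler_ops(x), torch.softmax(x, dim=-1), atol=1e-6)
```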

TorchDynamo

Moving to graph mode requires a robust way to define graphs. Meta and the PyTorch team tried to make this shift for about five years, but every prior solution had significant shortcomings. They finally cracked the problem with TorchDynamo, which can ingest any PyTorch user script, including ones that call external third-party libraries, and generate an FX graph.

Dynamo lowers all complex operators to the roughly 250 PrimTorch primitives. Once the graph is formed, unused operators are discarded, and the graph determines which intermediate results need to be stored or written to memory and which can be fused away. This greatly reduces overhead inside the model while remaining "seamless" to the user.
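
A minimal sketch of how a backend plugs into this pipeline (assuming a recent PyTorch 2.x install; the "backend" here does nothing but print the captured FX graph and run it eagerly):

```python
import torch

# A do-nothing "backend": it prints the captured FX graph and then runs it
# eagerly. A hardware vendor would hand the GraphModule to its own compiler
# here instead.
def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    print(gm.graph)        # the lowered ops Dynamo captured
    return gm.forward      # return a callable that executes the graph

@torch.compile(backend=inspect_backend)
def f(x, w):
    return torch.relu(x @ w) + 1.0

f(torch.randn(8, 16), torch.randn(16, 4))
```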

Of 7,000 PyTorch models tested, TorchDynamo worked on more than 99%, including models from OpenAI, HuggingFace, Meta, NVIDIA, Stability.AI, and others, with no changes to the original code. The 7,000 models were sampled from the most popular PyTorch projects on GitHub.

Google's TensorFlow/JAX and other graph-mode execution pipelines often require users to restructure their models to fit the compiler architecture so that the graph can be captured. Dynamo changes this with partial graph capture, guarded graph capture, and just-in-time recapture.

Partial graph capture lets models contain unsupported or non-Python constructs: when a graph cannot be generated for part of the model, a graph break is inserted, and the unsupported constructs run in eager mode between the partial graphs.

Guarded graph capture checks whether a captured graph is still valid to execute; a "guard" is a condition whose change requires recompilation. This matters because running the same code repeatedly should not trigger repeated recompilation. Just-in-time recapture re-captures the graph when a previously captured graph is no longer valid to execute.
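
The toy function below (made up for illustration, assuming a recent PyTorch 2.x install) shows the kind of code that forces a graph break: a Python branch on a tensor value cannot be traced, so Dynamo splits the graph around it.

```python
import torch

def f(x):
    y = torch.sin(x) * 2
    if y.sum() > 0:   # Python branch on a tensor value: Dynamo cannot trace
        return y + 1  # past it, so a graph break is inserted here and the
    return y - 1      # branch itself runs in eager mode

compiled = torch.compile(f)
compiled(torch.randn(8))

# torch._dynamo.explain can report the captured graphs, graph breaks, and the
# guards that trigger recompilation; its exact signature varies by release.
```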

PyTorch's goal is a unified front end with a smooth UX that uses Dynamo to generate graphs. The user experience is unchanged, but performance can improve significantly, and captured graphs can be executed more efficiently in parallel across large amounts of compute.

Dynamo and AOT Autograd then pass the optimized FX graph to TorchInductor, PyTorch's native compiler level. Hardware companies can also feed the graph into their own backend compilers.

TorchInductor

TorchInductor is a Python-native deep learning compiler that generates fast code for multiple accelerators and backends. Inductor takes FX graphs containing the roughly 250 operators and lowers them to about 50. It then enters a scheduling phase, where operators are fused and memory planning is decided.

Inductor then moves to "Wrapper Codegen," which generates code that runs on the CPU, GPU, or other AI accelerators. Wrapper Codegen replaces the interpreter portion of the compiler stack and can call kernels and allocate memory. The backend code generation leverages OpenAI Triton for GPUs and outputs PTX code; for CPUs it generates C++ (which also works on non-Intel CPUs).

More hardware will be supported in the future, but the key point is that Inductor greatly reduces the work a compiler team must do when building a compiler for its AI hardware accelerator. The generated code is also better optimized, and memory bandwidth and capacity requirements drop significantly.
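
One way to peek at what Inductor actually emits is sketched below. This assumes a recent PyTorch build; the TORCH_COMPILE_DEBUG flag and the torch_compile_debug/ output directory are implementation details that may change between releases.

```python
import os

# Assumption: recent PyTorch builds dump Inductor's generated code (Triton
# kernels on GPU, C++ on CPU) when this debug flag is set before importing
# torch; the flag name and output layout may differ between releases.
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

@torch.compile(backend="inductor")   # Inductor is the default backend
def fused_op(x):
    return torch.nn.functional.gelu(x * 3.0 + 1.0)

fused_op(torch.randn(1024, 1024))
# The scheduled, fused kernels emitted by Wrapper Codegen land in a
# torch_compile_debug/ directory next to the script.
```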

What researchers need is not a compiler that supports only GPUs, but one that supports a wide range of hardware backends.

OpenAI Triton

OpenAI Triton is highly disruptive to NVIDIA's closed-source machine learning software. Triton takes in Python directly or feeds through the PyTorch Inductor stack, the latter being the most common use. It converts the input to an LLVM intermediate representation and then generates code. On NVIDIA GPUs it emits PTX directly, skipping NVIDIA's closed-source CUDA libraries (such as cuBLAS) in favor of open-source libraries (such as CUTLASS).

CUDA is widely used by those who specialize in accelerated computing, but far less well known among machine learning researchers and data scientists. Using it efficiently is challenging and requires a deep understanding of the hardware architecture, which slows development. As a result, machine learning experts often depend on CUDA specialists to modify, optimize, and parallelize their code.

Triton bridges this gap, letting a high-level language achieve performance comparable to low-level code. The Triton kernels themselves are quite readable to the typical ML researcher, which matters enormously for usability. Triton automates memory coalescing, shared memory management, and scheduling within SMs. It is not particularly useful for matrix multiplication, which is already done very efficiently, but it shines for expensive pointwise operations and for cutting the overhead of more complex fused operations.
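
For a sense of what Triton code looks like to an ML researcher, here is (approximately) the vector-add kernel from the Triton tutorials, lightly annotated; it assumes the triton package and an NVIDIA GPU.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)  # coalescing and SM
    return out                                       # scheduling are automatic

a = torch.randn(10_000, device="cuda")
b = torch.randn(10_000, device="cuda")
assert torch.allclose(add(a, b), a + b)
```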

OpenAI Triton currently only officially supports NVIDIA GPUs, but that will change in the near future, with multiple other hardware vendors to be supported. Other hardware accelerators can integrate directly with Triton's LLVM IR, which dramatically shortens the time needed to build an AI compiler stack for new hardware.

NVIDIA's massive software organization lacked the foresight to take its enormous advantage in ML hardware and software and become the default compiler for machine learning. Its lack of focus on usability is what allowed outsiders at OpenAI and Meta to create a software stack that is portable to other hardware.

Original link: https://www.semianalysis.com/p/nvidiaopenaitritonpytorch​
