Create efficient deep learning data pipelines with Ray

WBOY
2023-11-02 20:17:15

The GPUs used to train deep learning models are powerful but expensive. To keep a GPU fully utilized, developers need an efficient data pipeline that delivers the next mini-batch as soon as the GPU is ready to compute the next training step. Ray can significantly improve the efficiency of such a pipeline.

1. The structure of the training data pipeline

First, let's take a look at the pseudocode for model training:

    for step in range(num_steps):
        sample, target = next(dataset)  # step 1
        train_step(sample, target)      # step 2

In step 1, the samples and labels of the next mini-batch are fetched. In step 2, they are passed to the train_step function, which copies them to the GPU, performs a forward and a backward pass to compute the loss and gradients, and lets the optimizer update the model's weights.

Let's look at step 1 more closely. When the dataset is too large to fit in memory, step 1 fetches the next mini-batch from disk or over the network. Step 1 also includes a certain amount of preprocessing: input data must be converted into numeric tensors, or collections of tensors, before being fed to the model. In some cases further transformations are applied to the tensors before they reach the model, such as normalization, rotation about an axis, or random shuffling.

If the workflow is executed strictly in sequence, that is, step 1 followed by step 2, the model always has to wait for the I/O and preprocessing of the next batch. The GPU is not used efficiently: it sits idle while the next mini-batch is being loaded.

To solve this problem, the data pipeline can be treated as a producer-consumer problem. The data pipeline produces mini-batches and writes them to a bounded buffer. The model/GPU consumes mini-batches from the buffer, performs the forward and backward passes, and updates the model weights. If the data pipeline can produce mini-batches as quickly as the model/GPU consumes them, training will be very efficient.
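The producer-consumer arrangement can be sketched with the Python standard library alone. This is only an illustration under simplifying assumptions: the names producer and train, the sleep durations, and the integer "batches" are hypothetical stand-ins, not part of the original article.

```python
import queue
import threading
import time


def producer(buffer, num_batches):
    """Simulated data pipeline: prepare a mini-batch, then write it to the buffer."""
    for step in range(num_batches):
        time.sleep(0.01)   # stand-in for disk/network I/O and preprocessing
        buffer.put(step)   # blocks while the bounded buffer is full
    buffer.put(None)       # sentinel: no more batches


def train(num_batches=5, buffer_size=2):
    """Consume mini-batches from the buffer and return them in arrival order."""
    buffer = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(target=producer, args=(buffer, num_batches), daemon=True)
    worker.start()

    consumed = []
    while True:
        batch = buffer.get()   # waits only if the producer has fallen behind
        if batch is None:
            break
        time.sleep(0.01)       # stand-in for the forward/backward pass
        consumed.append(batch)
    worker.join()
    return consumed


if __name__ == "__main__":
    print(train())
```

The bounded queue gives back-pressure: a fast producer cannot fill memory with unconsumed batches, and a fast consumer simply blocks on get() until data arrives.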

2. The TensorFlow tf.data API

The TensorFlow tf.data API provides a rich set of functions for building efficient data pipelines. It uses background threads to fetch mini-batches so that the model does not have to wait. Prefetching alone is not enough, however. If mini-batches are produced more slowly than the GPU consumes them, you need parallelization to speed up reading and transforming the data. To this end, TensorFlow provides interleave, which uses multiple threads to read data in parallel, and parallel map, which uses multiple threads to transform the data.
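As a rough sketch of how these pieces combine, assuming TensorFlow 2.x is installed: make_shard and preprocess below are hypothetical stand-ins (not from the article) for reading one file and transforming one sample, and the tiny integer ranges replace real data.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE


def make_shard(shard_id):
    # Hypothetical stand-in for reading one file: each shard yields 4 integers.
    start = shard_id * 4
    return tf.data.Dataset.range(start, start + 4)


def preprocess(x):
    # Hypothetical per-sample transformation.
    return tf.cast(x, tf.float32) / 10.0


def build_pipeline(num_shards=3, batch_size=4):
    return (
        tf.data.Dataset.range(num_shards)
        # Read several shards in parallel.
        .interleave(make_shard, cycle_length=num_shards, num_parallel_calls=AUTOTUNE)
        # Transform samples in parallel.
        .map(preprocess, num_parallel_calls=AUTOTUNE)
        .batch(batch_size)
        # Prepare upcoming batches in the background while the current one is consumed.
        .prefetch(AUTOTUNE)
    )


if __name__ == "__main__":
    for batch in build_pipeline():
        print(batch.numpy())
```

Passing AUTOTUNE lets the tf.data runtime choose the degree of parallelism and the prefetch buffer size instead of hard-coding them.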

Because these APIs are based on multi-threading, they can be limited by Python's Global Interpreter Lock (GIL), which allows only one thread to execute Python bytecode at a time. If your pipeline uses pure TensorFlow code, you generally do not suffer from this limitation, because the TensorFlow core execution engine works outside the scope of the GIL. However, if a third-party library you use does not release the GIL, or if you perform heavy computation in Python itself, relying on multi-threading to parallelize the pipeline is not feasible.
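The effect of the GIL on a pure-Python transformation can be observed with a small experiment. The sketch below is illustrative only: cpu_bound_transform and run_parallel are hypothetical names, and the timings depend on the machine and on running a standard CPython build with the GIL enabled.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def cpu_bound_transform(n):
    """Pure-Python loop: the thread running it holds the GIL while it computes."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_parallel(executor_cls, num_tasks=4, n=200_000):
    """Run num_tasks transforms on the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=num_tasks) as pool:
        results = list(pool.map(cpu_bound_transform, [n] * num_tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    for cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        _, elapsed = run_parallel(cls, n=2_000_000)
        print(f"{cls.__name__}: {elapsed:.2f}s")
```

On a multi-core machine the thread pool typically takes about as long as running the tasks one after another, while the process pool can spread them across cores. That gap is the motivation for moving the pipeline to multiple processes.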

3. Use multiple processes to parallelize the data pipeline

Consider the following generator function that simulates loading and performing some calculations to generate mini-batches of data samples and labels.

    import time
    import numpy as np

    def data_generator():
        for _ in range(10):
            # Simulate fetching from disk/network
            time.sleep(0.5)
            # Simulate computation
            for _ in range(10000):
                pass
            yield (
                np.random.random((4, 1000000, 3)).astype(np.float32),
                np.random.random((4, 1)).astype(np.float32),
            )

Next, use the generator in a dummy training pipeline and measure the average time it takes to generate mini-batches of data.

    import tensorflow as tf

    generator_dataset = tf.data.Dataset.from_generator(
        data_generator,
        # float32 matches the arrays the generator actually yields
        output_types=(tf.float32, tf.float32),
        output_shapes=((4, 1000000, 3), (4, 1)),
    ).prefetch(tf.data.experimental.AUTOTUNE)

    st = time.perf_counter()
    times = []
    for _ in generator_dataset:
        en = time.perf_counter()
        times.append(en - st)
        # Simulate a training step
        time.sleep(0.1)
        st = time.perf_counter()

    print(np.mean(times))

The average time observed was about 0.57 seconds (measured on a Mac laptop with an Intel Core i7 processor). If this were a real training loop, GPU utilization would be quite low: the GPU would spend only 0.1 seconds computing and then sit idle for 0.57 seconds waiting for the next batch of data.

To speed up data loading, you can use a multi-process generator.

    from multiprocessing import Process, Queue, cpu_count

    def producer(q):
        # Defined at module level so it can be pickled when processes start with "spawn"
        for _ in range(10):
            # Simulate fetching from disk/network
            time.sleep(0.5)
            # Simulate computation
            for _ in range(10000000):
                pass
            q.put((
                np.random.random((4, 1000000, 3)).astype(np.float32),
                np.random.random((4, 1)).astype(np.float32),
            ))
        q.put("DONE")

    def mp_data_generator():
        queue = Queue(cpu_count() * 2)
        num_parallel_processes = cpu_count()
        producers = []
        for _ in range(num_parallel_processes):
            p = Process(target=producer, args=(queue,))
            p.start()
            producers.append(p)

        done_counts = 0
        # The source text is cut off after "while done_counts <". The rest of this
        # loop is an assumed completion: drain the queue until every producer is done.
        while done_counts < num_parallel_processes:
            msg = queue.get()
            if isinstance(msg, str) and msg == "DONE":
                done_counts += 1
            else:
                yield msg
        for p in producers:
            p.join()

Now, if we measure the time spent waiting for the next mini-batch, we get an average of 0.08 seconds. That is almost 7 times faster, but ideally this time would be close to 0.

Profiling shows that a lot of time is spent deserializing data. In the multi-process generator, the producer processes return large NumPy arrays, which must be serialized (pickled) and then deserialized in the main process. So how can large arrays be passed between processes more efficiently?

4. Use Ray to parallelize the data pipeline

This is where Ray comes into play. Ray is a framework for distributed computing in Python. It comes with a shared-memory object store for transferring objects efficiently between processes. In particular, NumPy arrays in the object store can be shared between workers on the same node without any serialization or deserialization. Ray also makes it easy to scale data loading across multiple machines, and it uses Apache Arrow to serialize and deserialize large arrays efficiently.

Ray comes with a utility function, from_iterators, that creates parallel iterators. Developers can use it to wrap the data_generator generator function.

    import ray

    def ray_generator():
        num_parallel_processes = cpu_count()
        return ray.util.iter.from_iterators(
            [data_generator] * num_parallel_processes
        ).gather_async()

With ray_generator, the measured time spent waiting for the next mini-batch is 0.02 seconds, which is 4 times faster than the multi-process generator.
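To put the three measurements in perspective, the short calculation below estimates GPU utilization for each approach. It assumes the benchmark's model of a step, a 0.1-second simulated training step that starts only after the wait for the next mini-batch ends. The function name gpu_utilization is hypothetical. Under that assumption, utilization rises from roughly 15% to about 56% with multiple processes and about 83% with Ray.

```python
def gpu_utilization(compute_s, wait_s):
    """Fraction of each step the GPU spends computing rather than waiting."""
    return compute_s / (compute_s + wait_s)


COMPUTE_S = 0.1  # simulated training step used in the benchmark loop
WAITS = {
    "tf.data generator": 0.57,
    "multi-process generator": 0.08,
    "Ray generator": 0.02,
}

if __name__ == "__main__":
    for name, wait_s in WAITS.items():
        print(f"{name}: {gpu_utilization(COMPUTE_S, wait_s):.0%}")
```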

Statement:
This article is reproduced from 51cto.com.