Someone finally explained distributed machine learning clearly
Distributed machine learning, also called distributed learning, refers to algorithms and systems that use multiple computing nodes (also called worker nodes, or workers) to perform machine learning or deep learning. It aims to improve performance, protect privacy, and scale to larger training data sets and larger models.
Federated learning can be regarded as a special type of distributed learning. It addresses some of the difficulties encountered in distributed machine learning and makes it possible to build privacy-preserving artificial intelligence applications and products.
In recent years, the rapid development of new technologies has led to unprecedented growth in the amount of data. Machine learning algorithms are increasingly used to analyze data sets and build decision-making systems for problems whose complexity makes hand-written algorithmic solutions infeasible, such as controlling a self-driving car, recognizing speech, or predicting consumer behavior (see Khandani et al. 2010).
In some cases, long run times for model training on a single machine prompt solution designers to use distributed systems to increase parallelism and total I/O bandwidth, because the training data required for complex applications can easily reach terabytes.
In other cases, when the data itself is distributed or its volume is too large to be stored on a single machine, a centralized solution may not even be possible. For example, large enterprises perform transaction processing on data stored in different locations, or the data volume is too large to be moved and centralized.
In order for these types of data sets to be accessible as training data for machine learning problems, algorithms must be selected and implemented that are capable of parallel computing, adapt to multiple data distributions, and possess fault recovery capabilities.
In recent years, machine learning technology has been widely used. Although various competing methods and algorithms have emerged, the data representations they use are structurally very similar. Most computation in machine learning workloads consists of basic transformations of vectors, matrices, or tensors, which are well-known problems in linear algebra.
Optimizing these operations has been a highly active research direction in the field of high-performance computing (HPC) for decades. As a result, some technologies and libraries from the HPC community (e.g., BLAS or MPI) have been successfully adopted by the machine learning community and integrated into its systems.
At the same time, the HPC community has identified machine learning as an emerging high-value workload and has begun to apply HPC methods to machine learning.
Coates et al. trained a network with 1 billion parameters in just three days on their commodity off-the-shelf high-performance computing (COTS HPC) system.
You et al. proposed in 2017 to optimize the training of neural networks on Intel's Knights Landing, a chip designed for high-performance computing applications.
Kurth et al. demonstrated in 2017 how deep learning problems (such as extracting weather patterns) can be optimized and scaled on large-scale parallel HPC systems.
Yan et al. proposed in 2016 that the challenge of scheduling deep neural network applications on cloud computing infrastructure can be addressed by modeling workload requirements with techniques borrowed from HPC, such as lightweight profiling.
Li et al. studied in 2017 the resilience of deep neural networks to hardware errors when running on accelerators, which are often deployed in major high-performance computing systems.
As with other large-scale computing challenges, there are two fundamentally different and complementary ways to accelerate workloads: adding more resources to a single machine (vertical scaling, for example through the continuous improvement of GPU/TPU compute cores) or adding more nodes to the system (horizontal scaling, which is lower in cost).
The lines between traditional supercomputers, grids and clouds are increasingly blurred, especially when it comes to optimal execution environments for demanding workloads such as machine learning. For example, GPUs and accelerators are more common in major cloud data centers. Therefore, parallelization of machine learning workloads is critical to achieve acceptable performance at scale. However, when transitioning from centralized solutions to distributed systems, distributed computing faces serious challenges in terms of performance, scalability, failure resilience, or security.
Since each algorithm has its own communication patterns, designing a general system that can efficiently distribute ordinary machine learning is a challenge. Although there are currently many different concepts and implementations of distributed machine learning, we will introduce a common architecture that covers the entire design space. Generally speaking, machine learning problems can be divided into a training phase and a prediction phase (see Figure 1-5).
▲Figure 1-5 Machine learning structure. During the training phase, the ML model is optimized using training data and adjusting hyperparameters. The trained model is then deployed into the system to provide predictions for new input data
The training phase trains a machine learning model by feeding it a large amount of training data and updating the model with a commonly used ML algorithm, such as an evolutionary algorithm (EA), a rule-based machine learning algorithm (for example, decision trees and association rules), a topic model (TM), matrix factorization, or an algorithm based on stochastic gradient descent (SGD).
In addition to choosing an appropriate algorithm for a given problem, we also need to perform hyperparameter tuning for the chosen algorithm. The final result of the training phase is to obtain a trained model. The prediction phase is to deploy the trained model in practice. A trained model receives new data (as input) and generates predictions (as output).
While the training phase of a model is typically computationally intensive and requires large datasets, inference can be performed with less computing power. The training phase and prediction phase are not mutually exclusive. Incremental learning combines the training phase and the prediction phase, using new data in the prediction phase to continuously train the model.
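The interplay of prediction and continued training in incremental learning can be illustrated with a minimal sketch. It assumes a one-dimensional linear model updated by single-sample SGD; the function names (`predict`, `update`, `serve_and_learn`) and the learning rate are illustrative choices, not something the source prescribes.

```python
def predict(w, b, x):
    """Prediction phase: apply the current model to a new input."""
    return w * x + b


def update(w, b, x, y, lr=0.05):
    """One SGD step on the squared error of a single (x, y) pair."""
    err = predict(w, b, x) - y
    return w - lr * err * x, b - lr * err


def serve_and_learn(stream, w=0.0, b=0.0, lr=0.05):
    """Answer each request first, then keep training on the observed label."""
    predictions = []
    for x, y in stream:
        predictions.append(predict(w, b, x))  # serve with the current model
        w, b = update(w, b, x, y, lr)         # then learn from the new sample
    return w, b, predictions
```

Each incoming sample is first used for a prediction and only afterwards for a model update, so the model improves while it is deployed.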
When it comes to distribution, we can divide the problem across all machines in two different ways, namely data or model parallelism (see Figure 1-6). Both methods can also be applied simultaneously.
▲Figure 1-6 Parallelism in distributed machine learning. Data parallelism trains multiple instances of the same model on different subsets of the training data set, while model parallelism distributes the parallel paths of a single model across multiple nodes
In the data-parallel approach, the data is partitioned into as many parts as there are worker nodes in the system, and all worker nodes then apply the same algorithm to different partitions. The same model is available to all worker nodes (either through centralization or through replication), so a single consistent output is produced naturally. This approach can be used with every ML algorithm that assumes the data samples are independent and identically distributed (i.e., most ML algorithms).
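A minimal sketch of the data-parallel idea follows. It makes several assumptions that are not in the source: the model is a one-dimensional linear regression, the workers are simulated sequentially in one process, and the helper names (`make_data`, `partition`, `local_gradient`, `train_data_parallel`) are hypothetical.

```python
import random


def make_data(n=200, seed=0):
    """Synthetic samples from y = 3x - 0.5 (illustrative data only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 3.0 * x - 0.5) for x in xs]


def partition(data, num_workers):
    """Split the data into one disjoint shard per worker (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w, b, shard):
    """Work done by one worker: gradient sums over its own shard only."""
    gw = gb = 0.0
    for x, y in shard:
        err = (w * x + b) - y
        gw += err * x
        gb += err
    return gw, gb, len(shard)


def train_data_parallel(data, num_workers=4, lr=0.5, steps=200):
    shards = partition(data, num_workers)
    w = b = 0.0  # the same model is replicated to every worker
    for _ in range(steps):
        # Simulated sequentially here; real workers would run concurrently.
        results = [local_gradient(w, b, shard) for shard in shards]
        n = sum(r[2] for r in results)
        # Aggregate the per-shard gradients into one consistent update.
        w -= lr * sum(r[0] for r in results) / n
        b -= lr * sum(r[1] for r in results) / n
    return w, b
```

Because the per-shard gradient sums add up to the gradient over the full data set, the aggregated update matches what a single machine would compute, which is why the replicas stay consistent.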
In the model-parallel approach, an exact copy of the entire data set is processed by the worker nodes, each of which operates on a different part of the model. The model is therefore the aggregation of all model parts. Model parallelism cannot be applied automatically to every machine learning algorithm, because model parameters usually cannot be partitioned.
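For a model whose parameters can be partitioned, the model-parallel idea can be sketched as follows. The example assumes a linear model whose weight vector is split by feature index; each simulated worker sees every sample but holds only its slice of the weights. The names `split_indices`, `ModelPart`, and `train_model_parallel` are hypothetical.

```python
def split_indices(num_features, num_workers):
    """Assign a contiguous range of feature indices to each worker."""
    base, extra = divmod(num_features, num_workers)
    ranges, start = [], 0
    for k in range(num_workers):
        end = start + base + (1 if k < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


class ModelPart:
    """One worker's slice of the weight vector of a linear model."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.weights = [0.0] * (hi - lo)

    def partial_score(self, x):
        # Contribution of this slice to the dot product w . x
        return sum(w * xi for w, xi in zip(self.weights, x[self.lo:self.hi]))

    def apply_gradient(self, x, err, lr):
        for j in range(len(self.weights)):
            self.weights[j] -= lr * err * x[self.lo + j]


def predict(parts, x):
    """The full model output is the aggregation of all parts."""
    return sum(part.partial_score(x) for part in parts)


def train_model_parallel(data, num_features, num_workers=2, lr=0.1, epochs=200):
    parts = [ModelPart(lo, hi) for lo, hi in split_indices(num_features, num_workers)]
    for _ in range(epochs):
        for x, y in data:  # every worker processes the same sample
            err = predict(parts, x) - y
            for part in parts:
                part.apply_gradient(x, err, lr)
    return parts
```

The only value the parts need to exchange per sample is the scalar prediction error, which hints at why the communication pattern depends so strongly on how the model can be split.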
One option is to train different instances of the same or similar models and to aggregate the outputs of all trained models using ensemble methods (such as bagging or boosting).
The final architectural decision is the topology of the distributed machine learning system. The nodes that make up a distributed system need to be connected through a specific architectural pattern in order to work together, which is a common task in distributed systems. However, the choice of pattern affects the roles that nodes can play, the amount of communication between nodes, and the failure resilience of the entire deployment.
Figure 1-7 shows four possible topologies, consistent with Baran's general classification of distributed communication networks. A centralized architecture (Figure 1-7a) uses a strictly hierarchical approach to aggregation, which occurs at a single central location. Decentralized architectures allow intermediate aggregation, either with a replicated model that is continuously updated when the aggregate is broadcast to all nodes, as in a tree topology (Figure 1-7b), or with a partitioned model sharded across multiple parameter servers (Figure 1-7c). A fully distributed architecture (Figure 1-7d) consists of a network of independent nodes that assemble the solution together, and no node is assigned a specific role.
▲Figure 1-7 Distributed machine learning topology
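To make the two extremes of this classification concrete, the following sketch contrasts aggregation at a single central location with aggregation in a fully distributed network. It is only an illustration under stated assumptions: each node holds a single number standing in for its local model update, the fully distributed case uses a ring in which every node repeatedly averages with its two neighbours (a simple gossip scheme chosen for this example, not one named in the source), and the function names are hypothetical.

```python
def centralized_average(values):
    """Centralized topology: one central node collects and averages all values."""
    return sum(values) / len(values)


def gossip_round(values):
    """Fully distributed topology on a ring: every node averages its value
    with those of its two neighbours; no node has a special role."""
    n = len(values)
    return [
        (values[(i - 1) % n] + values[i] + values[(i + 1) % n]) / 3.0
        for i in range(n)
    ]


def gossip_average(values, rounds=100):
    """Repeat neighbour averaging; all nodes drift towards the global mean."""
    for _ in range(rounds):
        values = gossip_round(values)
    return values
```

The centralized version needs one aggregation step but depends on a single node, whereas the ring version needs many rounds of local communication but keeps working on purely local exchanges, which mirrors the trade-off between communication and failure resilience described above.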
The development of distributed machine learning has also created a need for privacy protection, which gives it some overlap with federated learning. Common privacy-preserving techniques, such as secure multi-party computation, homomorphic encryption, and differential privacy, are gradually being adopted in distributed machine learning as well. In general, federated learning is an effective method for collaboratively training machine learning models using distributed resources.
Federated learning is a distributed machine learning approach in which multiple users collaborate to train a model while keeping the original data dispersed and not moved to a single server or data center. In federated learning, raw data or data generated by secure processing of raw data is used as training data. Federated learning only allows the transmission of intermediate data between distributed computing resources while avoiding the transmission of training data. Distributed computing resources refer to end-users’ mobile devices or multiple organizations’ servers.
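The exchange of intermediate data instead of training data can be sketched with federated averaging, a widely used aggregation scheme that is chosen here for illustration and is not named in the source. The sketch assumes a one-dimensional linear model, simulates the clients in a single process, and uses hypothetical function names; only model parameters and sample counts cross the client boundary.

```python
def local_train(w, b, data, lr=0.1, epochs=5):
    """Runs on a client: the raw (x, y) pairs are only read here."""
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b  # only the updated parameters are sent back


def federated_round(global_w, global_b, clients):
    """Server side: combines client parameters, weighted by sample count.

    `clients` is a list of local data sets; in this simulation the server
    loop calls local_train directly, standing in for a network round trip.
    """
    total = sum(len(data) for data in clients)
    new_w = new_b = 0.0
    for data in clients:
        w, b = local_train(global_w, global_b, data)
        share = len(data) / total
        new_w += share * w
        new_b += share * b
    return new_w, new_b


def federated_training(clients, rounds=50):
    w = b = 0.0
    for _ in range(rounds):
        w, b = federated_round(w, b, clients)
    return w, b
```

Weighting each client by its number of samples means clients with more data have proportionally more influence on the shared model; other weighting schemes are possible.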
Federated learning brings the code to the data instead of the data to the code, which technically addresses the basic issues of privacy, ownership, and data locality. In this way, federated learning enables multiple users to collaboratively train a model while complying with legal constraints on data.
This article is excerpted from "Federated Learning: Detailed Explanation of Algorithms and System Implementation" (ISBN: 978-7-111-70349-5), and is published with the permission of the publisher.