
Someone finally explained distributed machine learning clearly


Distributed machine learning, also called distributed learning, refers to algorithms and systems that use multiple computing nodes (also called worker nodes, or workers) to perform machine learning or deep learning. It aims to improve performance, protect privacy, and scale to larger training data and larger models.

Federated learning can be regarded as a special type of distributed learning. It can further address some of the difficulties encountered in distributed machine learning and thereby support building privacy-preserving artificial intelligence applications and products.

1. Development history of distributed machine learning

In recent years, the rapid development of new technologies has led to unprecedented growth in the amount of data, and machine learning algorithms are increasingly used to analyze data sets and build decision-making systems. For problems as complex as controlling a self-driving car, recognizing speech, or predicting consumer behavior (see Khandani et al. 2010), explicit algorithmic solutions are not feasible.

In some cases, the long run time of training a model on a single machine steers solution designers toward distributed systems, which provide more parallelism and greater total I/O bandwidth; the training data required for complex applications can easily reach the terabyte scale.

In other cases, when the data itself is distributed or the volume is too large to be stored on a single machine, a centralized solution is not even desirable. For example, large enterprises perform transaction processing on data stored in different locations, or the data volume is too large to be moved and centralized.

For such data sets to be usable as training data for machine learning problems, the selected and implemented algorithms must be capable of parallel computation, able to adapt to multiple data distributions, and equipped with failure recovery capabilities.

In recent years, machine learning technology has been widely used. Although a variety of competing methods and algorithms have emerged, the data representations they use are structurally very similar. Most of the computation in machine learning consists of basic transformations of vectors, matrices, or tensors, which are standard problems in linear algebra.

The need to optimize these operations has been a highly active research direction in the field of high-performance computing (HPC) for decades. Therefore, some technologies and libraries from the HPC community (e.g., BLAS or MPI) have been successfully adopted by the machine learning community and integrated into its systems.

At the same time, the HPC community has identified machine learning as an emerging high-value workload and has begun to apply HPC methods to machine learning.

Coates et al. trained a network with 1 billion parameters in just three days on their commodity off-the-shelf high-performance computing (COTS HPC) system.

You et al. proposed in 2017 to optimize the training of neural networks on Intel's Knights Landing, a chip designed for high-performance computing applications.

Kurth et al. demonstrated in 2017 how deep learning problems (such as extracting weather patterns) can be optimized and scaled on large-scale parallel HPC systems.

Yan et al. proposed in 2016 that the challenge of scheduling deep neural network applications on cloud computing infrastructure can be addressed by modeling workload requirements with lightweight profiling and other techniques borrowed from HPC.

Li et al. in 2017 studied the resilience of deep neural networks to hardware errors when running on accelerators, which are often deployed in major high-performance computing systems.

As with other large-scale computing challenges, there are two fundamentally different and complementary ways to accelerate workloads: adding more resources to a single machine (vertical scaling, e.g., ever more powerful GPU/TPU compute cores) and adding more nodes to the system (horizontal scaling, which is low cost).

The lines between traditional supercomputers, grids, and clouds are increasingly blurred, especially when it comes to the optimal execution environment for demanding workloads such as machine learning; for example, GPUs and accelerators are now common in major cloud data centers. Parallelizing machine learning workloads is therefore critical to achieving acceptable performance at scale. However, when transitioning from centralized solutions to distributed systems, distributed computing faces serious challenges in terms of performance, scalability, failure resilience, and security.

2. Overview of Distributed Machine Learning

Since each algorithm has unique communication patterns, designing a general system that can efficiently distribute regular machine learning algorithms is a challenge. Although a variety of different concepts and implementations of distributed machine learning currently exist, we will introduce a common architecture that covers the entire design space. Generally speaking, machine learning problems can be divided into a training phase and a prediction phase (see Figure 1-5).


▲Figure 1-5 Machine learning structure. During the training phase, the ML model is optimized on training data and its hyperparameters are tuned. The trained model is then deployed in a system to provide predictions for new input data

The training phase involves training a machine learning model by feeding it a large amount of training data and updating the model with commonly used ML algorithms, such as evolutionary algorithms (EA), rule-based machine learning algorithms (for example, decision trees and association rules), topic models (TM), matrix factorization, and algorithms based on stochastic gradient descent (SGD).

In addition to choosing an appropriate algorithm for a given problem, we also need to tune the hyperparameters of the chosen algorithm. The final result of the training phase is a trained model. In the prediction phase, the trained model is deployed in practice: it receives new data as input and generates predictions as output.

While the training phase of a model is typically computationally intensive and requires large datasets, inference can be performed with less computing power. The training phase and prediction phase are not mutually exclusive. Incremental learning combines the training phase and the prediction phase, using new data in the prediction phase to continuously train the model.
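As a minimal illustration of the two phases and of incremental learning, the sketch below trains a linear model with SGD on a toy dataset, uses it for prediction, and then continues training on newly arriving data. The data, learning rate, and model form are all hypothetical choices made for the example, not anything prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for the training set (hypothetical).
X = rng.normal(size=(1000, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)   # model parameters
lr = 0.01         # learning rate (a hyperparameter to tune)

def sgd_step(w, xb, yb, lr):
    """One SGD update on a mini-batch: w <- w - lr * gradient of the MSE loss."""
    grad = 2 * xb.T @ (xb @ w - yb) / len(yb)
    return w - lr * grad

# Training phase: iterate over mini-batches of the training data.
for epoch in range(20):
    for i in range(0, len(X), 32):
        w = sgd_step(w, X[i:i + 32], y[i:i + 32], lr)

# Prediction phase: the trained model maps new inputs to outputs.
X_new = rng.normal(size=(3, 5))
print("predictions:", X_new @ w)

# Incremental learning: keep updating the deployed model as new data arrives.
X_stream = rng.normal(size=(32, 5))
y_stream = X_stream @ true_w + 0.1 * rng.normal(size=32)
w = sgd_step(w, X_stream, y_stream, lr)
```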

When it comes to distribution, we can divide the problem across all machines in two different ways, namely data or model parallelism (see Figure 1-6). Both methods can also be applied simultaneously.


▲Figure 1-6 Parallelism in distributed machine learning. Data parallelism trains multiple instances of the same model on different subsets of the training data set, while model parallelism distributes the parallel paths of a single model across multiple nodes

In the data-parallel approach, the data is partitioned into as many shards as there are worker nodes in the system, and all worker nodes then apply the same algorithm to different data subsets. The same model is available to all worker nodes (either through centralization or through replication), so a single consistent output is naturally produced. This method can be used with every ML algorithm that assumes independent and identically distributed (i.i.d.) data samples, i.e., most ML algorithms.
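A minimal sketch of data parallelism under these assumptions: each simulated worker holds a different shard of the data, computes a gradient for the same shared model, and the gradients are averaged before a single update. The worker count, model, and data are hypothetical; a real system would run the workers on separate machines and exchange gradients with a communication library.

```python
import numpy as np

rng = np.random.default_rng(1)
n_workers = 4

# Full training set, partitioned into one shard per worker (data parallelism).
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)
X_shards = np.array_split(X, n_workers)
y_shards = np.array_split(y, n_workers)

w = np.zeros(5)   # the same model replicated on every worker
lr = 0.01

def local_gradient(w, Xs, ys):
    """Gradient of the MSE loss computed by one worker on its own shard only."""
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

for step in range(200):
    # Each worker runs the same algorithm on its own data partition ...
    grads = [local_gradient(w, Xs, ys) for Xs, ys in zip(X_shards, y_shards)]
    # ... and the gradients are aggregated (here: averaged) into one update,
    # so all model replicas stay consistent.
    w -= lr * np.mean(grads, axis=0)

print("learned weights:", np.round(w, 2))
```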

In the model-parallel approach, each worker node processes an exact copy of the entire data set but operates on a different part of the model; the model is therefore the aggregation of all the model parts. Model-parallel methods cannot be applied automatically to every machine learning algorithm, because model parameters usually cannot be partitioned.
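A minimal sketch of model parallelism under the same assumptions: a two-layer network is split so that "worker A" holds the first layer and "worker B" holds the second, each processing the full input batch; only the intermediate activations (and, during training, their gradients) would cross the worker boundary. The layer sizes and the two-worker split are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical two-layer network, partitioned by layer across two workers.
W1 = rng.normal(size=(5, 16)) * 0.1   # held by worker A
W2 = rng.normal(size=(16, 1)) * 0.1   # held by worker B

def forward_worker_a(X):
    """Worker A: applies its part of the model to the full input batch."""
    return np.maximum(X @ W1, 0.0)    # ReLU activation

def forward_worker_b(hidden):
    """Worker B: receives activations from A and applies its part of the model."""
    return hidden @ W2

X_batch = rng.normal(size=(8, 5))     # every worker sees the same data
hidden = forward_worker_a(X_batch)    # would be sent over the network to worker B
outputs = forward_worker_b(hidden)    # the full model is the aggregation of both parts
print(outputs.shape)                  # (8, 1)
```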

One option in that case is to train different instances of the same or similar models and aggregate the outputs of all trained models using ensemble methods such as bagging or boosting; a minimal sketch of this idea follows below.
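Picking up the ensemble option just mentioned, here is a toy bagging-style sketch: several instances of the same model are trained on bootstrap samples and their predictions are averaged. The base learner and data are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def fit_linear(Xs, ys):
    """Least-squares fit standing in for any base learner."""
    return np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Bagging: each (possibly remote) trainer fits on its own bootstrap sample.
models = []
for _ in range(10):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(fit_linear(X[idx], y[idx]))

# Aggregation: average the outputs of all trained model instances.
X_new = rng.normal(size=(4, 5))
pred = np.mean([X_new @ w for w in models], axis=0)
print(pred)
```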

The final architectural decision is the topology of the distributed machine learning system: the nodes that make up the distributed system must be connected through a specific architectural pattern in order to fulfill a common task. The choice of pattern affects the roles nodes can play, the degree of communication between nodes, and the failure resilience of the entire deployment.

Figure 1-7 shows four possible topologies, consistent with Baran's general classification of distributed communication networks. The centralized architecture (Figure 1-7a) uses a strictly hierarchical approach in which aggregation happens at a single central location. Decentralized structures allow intermediate aggregation, either with a replicated model that is continuously updated as aggregates are broadcast to all nodes (as in a tree topology, Figure 1-7b) or with a partitioned model sharded across multiple parameter servers (Figure 1-7c). A fully distributed architecture (Figure 1-7d) consists of a network of independent nodes that work out the solution together, with no specific role assigned to any single node.


▲Figure 1-7 Distributed machine learning topology
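To make the centralized and parameter-server patterns of Figure 1-7 concrete, here is a toy in-process sketch: a parameter-server object holds the shared parameters, workers pull the current values, compute local gradients, and push them back for aggregation. The class, method names, and data are invented for illustration; real deployments shard the parameters across multiple server nodes and communicate over the network.

```python
import numpy as np

class ParameterServer:
    """Toy centralized aggregation point; real systems shard this across servers."""
    def __init__(self, dim):
        self.w = np.zeros(dim)

    def pull(self):
        return self.w.copy()

    def push(self, grads, lr=0.01):
        # Aggregate the workers' gradients at the single central location.
        self.w -= lr * np.mean(grads, axis=0)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0])
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

ps = ParameterServer(dim=5)
for step in range(200):
    w = ps.pull()                                # workers pull the replicated model
    grads = [2 * Xs.T @ (Xs @ w - ys) / len(ys)  # each computes a local gradient
             for Xs, ys in shards]
    ps.push(grads)                               # the server aggregates and updates

print(np.round(ps.pull(), 2))
```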

3. The common development of distributed machine learning and federated learning

The development of distributed machine learning has also created a need for privacy protection, which results in some overlap with federated learning. Common privacy-preserving techniques, such as secure multi-party computation, homomorphic encryption, and differential privacy, are gradually being adopted in distributed machine learning as well. In general, federated learning is an effective method for collaboratively training machine learning models with distributed resources.

Federated learning is a distributed machine learning approach in which multiple users collaborate to train a model while the original data remains dispersed and is never moved to a single server or data center. In federated learning, the raw data, or data derived from it through secure processing, is used as training data. Federated learning allows only intermediate data to be transmitted between the distributed computing resources and avoids transmitting the training data itself. The distributed computing resources are end users' mobile devices or the servers of multiple organizations.

Federated learning brings the code to the data instead of the data to the code, technically addressing the basic issues of privacy, ownership, and data locality. In this way, federated learning enables multiple users to collaboratively train a model while satisfying legal data constraints.
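A minimal sketch in the spirit of federated averaging, under the assumption that each client holds its own local data that never leaves the client, trains locally for a few steps, and transmits only model weights (intermediate data), which the server averages. All names and data are hypothetical; production federated learning additionally involves client sampling, secure aggregation, and privacy mechanisms such as those mentioned above.

```python
import numpy as np

rng = np.random.default_rng(5)

def make_client_data():
    """Each client generates and holds its own raw data locally (never transmitted)."""
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=200)
    return X, y

clients = [make_client_data() for _ in range(5)]

def local_train(w, X, y, lr=0.01, steps=10):
    """Client-side training: the code is brought to the data."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * 2 * X.T @ (X @ w - y) / len(y)
    return w   # only the updated weights leave the client

w_global = np.zeros(5)
for rnd in range(20):
    # The server broadcasts the global model; clients train on their local data.
    local_weights = [local_train(w_global, X, y) for X, y in clients]
    # The server aggregates the intermediate data (weights), not the raw data.
    w_global = np.mean(local_weights, axis=0)

print("global model:", np.round(w_global, 2))
```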

This article is excerpted from "Federated Learning: Detailed Explanation of Algorithms and System Implementation" (ISBN: 978-7-111-70349-5), and is published with the permission of the publisher. ​

