Over the past year, large models have developed very rapidly. The stacking of computing power and data has given models a general-purpose structure and the ability to answer questions, bringing people to the stage of artificial intelligence they have long dreamed of. For example, when chatting with a large language model, you feel you are facing not a stiff robot but a flesh-and-blood person, which opens up far more room for imagination. Human-computer interaction originally required the keyboard and mouse, conveying our instructions to the machine in formatted ways. Now people can interact with computers through language, and machines can understand what we mean and respond.
To keep up with this trend, many technology companies have begun to focus on large model research. 2023 is considered the first year of artificial intelligence, much as the launch of the iPhone opened a new era of mobile Internet. The real breakthrough this time lies in the application of large-scale computing power and massive data.
From the perspective of model structure, the Transformer architecture has actually been around for quite some time. The GPT model was in fact published earlier than BERT, but due to the computing power limitations of the time, GPT was far less effective, so BERT became popular first and was applied to tasks such as translation with very good results. This year, however, the focus has shifted to GPT, and the reason behind it is very high computing power. Thanks to the efforts of hardware manufacturers and progress in packaging and memory chips, we can now stack very high computing power, which enables a deeper understanding of more data and brings breakthrough results in AI. With strong support from the underlying platform, algorithm engineers can develop and iterate models more conveniently and efficiently, driving rapid model evolution.
The general model development cycle is shown in the figure below:
Many people think model training is the most critical step, but in fact a large amount of data must be collected, cleaned, and managed before training even begins. This process involves many checks, such as whether there is dirty data and whether the statistical distribution of the data is representative. After the model is produced, it must be tested and verified; this too is a verification of the data, since the data provides feedback on the model's effectiveness.
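For illustration, here is a minimal sketch in Python (using pandas, with hypothetical column names) of the kind of pre-training checks described above: dirty-data detection and a rough look at whether the label distribution is representative.

```python
import pandas as pd

def validate_dataset(df: pd.DataFrame, label_col: str = "label") -> dict:
    """Minimal pre-training data checks: dirty values and label distribution."""
    report = {}
    # Dirty data: missing values and exact duplicate rows.
    report["null_ratio"] = df.isna().mean().to_dict()
    report["duplicate_rows"] = int(df.duplicated().sum())
    # Distribution: is any label class severely under-represented?
    label_freq = df[label_col].value_counts(normalize=True)
    report["label_distribution"] = label_freq.to_dict()
    report["min_class_ratio"] = float(label_freq.min())
    return report

# Example: flag the dataset if any class falls below 1% of samples.
df = pd.DataFrame({"text": ["a", "b", "b"], "label": [0, 1, 1]})
report = validate_dataset(df)
if report["min_class_ratio"] < 0.01:
    print("Warning: label distribution may not be representative")
```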
Better machine learning is 80% data plus 20% model, and the focus should be on the data.
This also reflects the evolution of model development: it was originally model-centric, but has now become data-centric.
In the early days of deep learning, supervised learning was the main paradigm, and the most important thing was labeled data. Labeled data is divided into two sets: training data and validation data. The training data is used to train the model, and the validation data is used to check whether the model gives good results. Labeling is very expensive because it requires human annotators, so to improve model performance, a great deal of time and manpower went into the model structure, improving generalization through structural changes and reducing overfitting. This is the model-centric development paradigm.
With the accumulation of data and computing power, unsupervised learning gradually came into use. With massive data, the model can autonomously discover the relationships that exist in the data. At this point, development enters a data-centric paradigm.
In the data-centric development paradigm, model structures are similar, basically stacks of Transformers, so more attention is paid to how data is used. Using data involves a great deal of cleaning and comparison, which takes a lot of time because the volumes are so large. How precisely the data is controlled determines how fast the model converges and iterates.
Alibaba Cloud has always emphasized the integration of AI and big data. We therefore built a platform with very strong infrastructure, including high-bandwidth GPU clusters for high-performance AI computing power and CPU clusters for cost-effective storage and data management. On top of this we built an integrated big data and AI PaaS platform, which includes a big data platform, an AI platform, a high-computing-power platform, a cloud-native platform, and more. The engine layer includes streaming computing, the big data offline computing engine MaxCompute, and the AI platform PAI.
In the service layer sit the large model application platform Bailian and the open source model community ModelScope. Alibaba has been actively promoting model community sharing, hoping that the Model-as-a-Service concept will inspire more users with AI needs to build AI applications quickly on top of these models' basic capabilities.
The following two cases illustrate why big data and AI need to be linked.
In a large model question answering system, one starts from the base model, embeds the target documents, and stores the embedding results in a vector database. The number of documents can be very large, so embedding requires batch-processing capability, and the base model's inference service is itself resource-intensive, depending of course on how big the base model is and how it is parallelized. All generated embeddings are poured into the vector database. At query time, the query must also be vectorized, and then, through vector retrieval, knowledge potentially relevant to the question is extracted from the vector database. This demands very good inference service performance.
After retrieving the vectors, the documents they represent are used as context to constrain the large model, which answers on that basis. The resulting answers are far better than what users could find by searching on their own, and they come in natural language.
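A minimal sketch of this retrieval-augmented question answering loop is shown below. The embedding model, vector store, and LLM are all hypothetical placeholders; in practice the embedding step runs as a batch job on the big data platform, and the vector database and inference service are production systems.

```python
import numpy as np

# Hypothetical components: a real system would plug in a trained embedding
# model served in batch, a production vector database, and an LLM service.
def embed(texts):
    """Placeholder embedding model (random vectors stand in for real ones)."""
    rng = np.random.default_rng(0)
    return rng.random((len(texts), 768))

class VectorStore:
    """Toy in-memory vector database with cosine-similarity search."""
    def __init__(self, docs):
        self.docs = docs
        self.vecs = embed(docs)              # offline step: batch-embed all docs
    def search(self, query_vec, k=3):
        sims = self.vecs @ query_vec / (
            np.linalg.norm(self.vecs, axis=1) * np.linalg.norm(query_vec))
        return [self.docs[i] for i in np.argsort(-sims)[:k]]

def answer(question, store, llm):
    q_vec = embed([question])[0]             # the query is vectorized too
    context = "\n".join(store.search(q_vec)) # retrieve related knowledge
    prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
    return llm(prompt)                       # the context constrains the model
```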
The above process requires both an offline distributed big data platform to generate embeddings quickly and an AI platform for large model training and serving, connected into one pipeline, to form a large model question answering system.
Another example is personalized recommendation. These models often require high timeliness because everyone's interests and tastes change. To capture these changes, a streaming computing system analyzes the data collected in the app, and the extracted features are continuously fed into online model learning: whenever new data arrives, the model is updated, and the new model then serves customers. This scenario therefore requires streaming computing capabilities as well as model serving and training capabilities.
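The sketch below illustrates the online-learning half of this loop, assuming a logistic click model updated one event at a time; the event stream and features are hypothetical stand-ins for what a streaming engine such as Flink would produce.

```python
import numpy as np

class OnlineRecommender:
    """Minimal online-learning scorer: the model updates on every event."""
    def __init__(self, n_features, lr=0.01):
        self.w = np.zeros(n_features)
        self.lr = lr
    def predict(self, x):
        # Serve with the latest weights (logistic click-probability score).
        return 1.0 / (1.0 + np.exp(-self.w @ x))
    def update(self, x, clicked):
        # One SGD step per incoming event; no offline retraining needed.
        self.w -= self.lr * (self.predict(x) - clicked) * x

# In production the (features, click) events arrive from the streaming
# system; a random generator stands in for that stream here.
rng = np.random.default_rng(0)
model = OnlineRecommender(n_features=8)
for _ in range(100):
    x, clicked = rng.random(8), rng.integers(0, 2)
    model.update(x, clicked)     # the model refreshes as interests change
```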
Through the above cases, we can see that the combination of AI and big data has become an inevitable development trend. Based on this concept, we first need to have a workspace that can manage the big data platform and the AI platform together. This is why the AI workspace was born.
This AI workspace supports Flink clusters, the offline computing cluster MaxCompute, AI platforms, container service computing platforms, and more.
Unifying big data and AI is only the first step; what matters more is connecting them in a workflow. Workflows can be built in many ways, such as through an SDK, a graphical UI, or a written spec. The nodes in the workflow can be big data processing nodes or AI processing nodes, so complex processes can be connected end to end.
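As a sketch, a workflow spec of this kind might look like the following; the node types and field names are hypothetical, not any platform's actual format.

```python
# Hypothetical workflow spec mixing big data nodes and AI nodes; actual
# platforms define their own SDKs and spec formats, this only shows the idea.
workflow = {
    "name": "doc-qa-pipeline",
    "nodes": [
        {"id": "clean", "type": "maxcompute_sql",      # big data node
         "sql": "INSERT OVERWRITE TABLE docs_clean "
                "SELECT id, text FROM docs_raw WHERE text IS NOT NULL"},
        {"id": "embed", "type": "batch_inference",     # AI node
         "model": "text-embedding", "input": "docs_clean"},
        {"id": "train", "type": "model_training",      # AI node
         "input": "docs_clean"},
    ],
    "edges": [("clean", "embed"), ("embed", "train")], # the connected flow
}
```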
To further improve efficiency and reduce costs, serverless cloud-native services are needed. What serverless means is described in detail in the image above. Cloud native has many levels, from share nothing (the non-cloud approach) to share everything (the fully cloud approach). The higher the level, the higher the degree of resource sharing and the lower the unit computing cost, but the greater the pressure on the system.
Over the past two years, the big data and database fields have gradually moved toward serverless, also for cost reasons. Originally, even services on the cloud, such as cloud databases, existed as instances, and behind each instance lay the shadow of resources: how many CPUs and cores the instance had. The first level of the gradual shift to serverless is single-tenant computing, which means setting up a cluster on the cloud and deploying big data or database platforms in it. The cluster is single-tenant in that it shares physical machines with other users; the physical machines are virtualized into virtual machines on which the big data platform is built. This is single-tenant computing, single-tenant storage, and single-tenant management and control. What users get are elastic ECS machines on the cloud, but big data management, operations, and maintenance must be handled by themselves. EMR is the classic solution of this kind.
Gradually we move from single-tenant storage to shared storage, which is the data lake solution. Data lives in a more shared big data system; computation dynamically spins up a cluster, and when the computation finishes the cluster goes away, but the data does not, because it sits on reliable remote storage. This is shared storage. Typical examples are the data lake DLF and serverless EMR solutions.
The most extreme form is share everything. With BigQuery or Alibaba Cloud's MaxCompute, you see only a platform and some virtualized project management: the user submits a query, and the platform performs metering and billing based on that query.
This brings many benefits. For example, many nodes in big data computation require no user code, because they are built-in operators such as join and aggregation. Since these are deterministic operators that have been rigorously tested and contain no malicious code or arbitrary UDF code, they do not need a relatively heavy sandbox, and the virtualization overhead can be eliminated.
The benefit of UDFs is flexibility: they make it possible to process rich data and scale well when data volumes are large. The challenge UDFs bring, however, is the need for security and isolation.
Both Google's BigQuery and MaxCompute are based on the share-everything architecture. We believe that only through continuous technical improvement can resources be used more efficiently and computing costs be reduced, so that more companies can afford to work with data at this scale and put it to use in model training.
Because of share everything, we can not only manage big data and AI in a unified way through the workspace and connect them through PAI-flow, but also schedule them uniformly under the share-everything model. In this way, the R&D costs of enterprise AI and big data are further reduced.
There is a lot of work to be done here. K8S scheduling itself is oriented toward microservices, which poses great challenges for big data, because big data scheduling granularity is very fine: many tasks live only a few seconds to tens of seconds, so overall scheduling pressure increases by several orders of magnitude. We mainly need to solve how to scale this scheduling capability on K8S. The Koordinator open source project we launched aims to improve scheduling capability and integrate big data and AI into the K8S ecosystem.
Another important task is multi-tenant security isolation: how to implement multi-tenancy in the K8S service and control layers, and how to implement network-level multi-tenancy through overlays, so that multiple users can be served on one K8S cluster while each user's data and resources remain effectively isolated.
Alibaba Cloud has launched a container service called ACS, which uses the two technologies introduced above to expose all resources through containers, enabling users to work with the big data platform and the AI platform seamlessly. It is multi-tenant and supports the scheduling requirements of big data, which are several orders of magnitude higher than those of microservices and AI and must be handled well. On this basis, ACS helps customers manage their resources well.
Enterprises have many demands and need to manage resources more carefully. For example, an enterprise is divided into departments and sub-teams. When building large models, resources are split across many directions, with each team pursuing divergent innovation to find the scenarios where the base model can be applied well. But at certain moments, the enterprise wants to concentrate its forces and pool all computing power and resources to train the next iteration of the base model. To solve this problem, we introduced multi-level quota management: when a higher-priority task arrives, a higher level can merge and consolidate all the sub-quotas below it.
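A minimal sketch of the idea, with GPUs as a hypothetical resource unit: each quota node owns some capacity, and a higher level can consolidate everything below it when a big job arrives.

```python
class Quota:
    """Sketch of a multi-level quota node (hypothetical resource unit: GPUs)."""
    def __init__(self, name, gpus=0):
        self.name, self.gpus, self.children = name, gpus, []
    def add(self, child):
        self.children.append(child)
        return child
    def consolidate(self):
        # Merge all sub-quotas upward so one high-priority job gets the pool.
        for child in self.children:
            child.consolidate()
            self.gpus += child.gpus
            child.gpus = 0

org = Quota("org")
org.add(Quota("team-recsys", gpus=64))   # teams innovate divergently...
org.add(Quota("team-nlp", gpus=128))
org.consolidate()                        # ...then pool for the next base model
print(org.gpus)                          # 192
```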
AI scenarios in fact have many particularities. Synchronous computation is often required, and it is very latency-sensitive; AI computing density is high, and the demands on the network are very high. To keep the computing power fed, data must be supplied and gradient information exchanged, and with model parallelism even more is exchanged. In these cases, to ensure communication does not become the weak link, topology-aware scheduling is needed.
For example, in the all-reduce phase of model training, random scheduling produces many connections that cross port switches, whereas finely controlled placement keeps cross-switch connections very clean, so latency can be well guaranteed because there are no conflicts at the upper-layer switches.
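The placement idea can be sketched in a few lines: group workers by their leaf switch so that ring all-reduce neighbors sit next to each other, leaving only a handful of links that cross the upper-layer switches. Worker and switch IDs here are hypothetical.

```python
from collections import defaultdict

def topology_aware_order(workers):
    """Order workers so ring all-reduce neighbors share a leaf switch.
    `workers` is a list of (worker_id, switch_id) pairs."""
    by_switch = defaultdict(list)
    for worker, switch in workers:
        by_switch[switch].append(worker)
    order = []
    for switch in sorted(by_switch):   # keep each switch's workers adjacent
        order.extend(by_switch[switch])
    # In the resulting ring, only the boundaries between switch groups
    # cross the upper-layer switch, instead of nearly every hop.
    return order

workers = [("w0", "s1"), ("w1", "s0"), ("w2", "s1"), ("w3", "s0")]
print(topology_aware_order(workers))   # ['w1', 'w3', 'w0', 'w2']
```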
With these optimizations, performance can be greatly improved. How to surface topology-aware scheduling to the manager of the entire platform is also an issue to consider as AI expands data platform management.
The previous sections covered the management of resources and platforms; the management of data is equally crucial. What we have long worked on is the data warehouse system: data governance, data quality, and so on. To connect the data system with the AI system, the data warehouse must provide an AI-friendly data link. For example, AI development happens in the Python ecosystem, so the question is how the data side can be used through a Python SDK. The most popular data structure in Python is the pandas-style DataFrame, so we can wrap the big data engine's client side in a pandas-like interface, allowing all AI developers familiar with Python to use the data platform well. This is the philosophy behind MaxFrame, the framework we launched on MaxCompute this year.
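The sketch below only illustrates the general idea of hiding a big data engine behind a pandas-style client that builds queries lazily; it is not MaxFrame's actual API, and the table and column names are hypothetical.

```python
# Hypothetical pandas-style client over a big data engine. Filtering builds
# a lazy query plan locally; the engine executes the SQL remotely.
class EngineFrame:
    def __init__(self, table, filters=None):
        self.table = table
        self.filters = filters or []
    def __getitem__(self, condition):
        # Each filter returns a new frame, like pandas boolean indexing.
        return EngineFrame(self.table, self.filters + [condition])
    def to_sql(self):
        where = " AND ".join(self.filters) if self.filters else "1=1"
        return f"SELECT * FROM {self.table} WHERE {where}"

logs = EngineFrame("user_logs")
clicks = logs["event = 'click'"]       # feels like pandas filtering
print(clicks.to_sql())                 # SELECT * FROM user_logs WHERE event = 'click'
```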
Data processing systems are in many cases highly cost-sensitive, and data warehouses are sometimes kept on higher-density storage systems. So that such a system is not wasted, many GPUs are deployed alongside it; these high-density clusters place heavy demands on the network and the GPUs, and the two systems tend toward storage-compute separation. Our data system leans toward governance and management, while the computing system leans toward computation, connected remotely even though both are managed under one K8S. To avoid waiting for data during computation, we built the dataset acceleration DataSetAcc, which is essentially a data cache: it seamlessly connects to the data on remote storage nodes and, behind the scenes, pulls it into local memory or SSD for the algorithm engineers' computation.
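A minimal sketch of such a cache, with hypothetical paths and a local file copy standing in for the remote fetch: a background thread prefetches data so the GPU never waits on remote storage.

```python
import os, queue, shutil, threading

class DatasetCache:
    """Sketch of a DataSetAcc-style cache (hypothetical paths): a background
    thread pulls remote files to local SSD before training needs them."""
    def __init__(self, local_dir="/ssd/cache"):
        self.local_dir = local_dir
        self.pending = queue.Queue()
        threading.Thread(target=self._fetch_loop, daemon=True).start()
    def prefetch(self, remote_path):
        self.pending.put(remote_path)          # enqueue ahead of training
    def local_path(self, remote_path):
        return os.path.join(self.local_dir, os.path.basename(remote_path))
    def _fetch_loop(self):
        while True:
            remote = self.pending.get()
            local = self.local_path(remote)
            if not os.path.exists(local):
                shutil.copy(remote, local)     # stands in for a remote fetch
```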
Through the above methods, the AI and big data platforms can be organically combined, enabling new work. For example, when supporting model training for general-purpose model series, a great deal of data needs cleaning, because Internet data contains heavy duplication, so deduplicating data through a big data system is critical. Precisely because the two systems are organically combined, it is easy to clean the data on the big data platform and feed the results into model training immediately.
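As a toy illustration of the deduplication step, the sketch below drops exact duplicates by content hash; a web-scale pipeline would run the same idea as a distributed job on the big data platform and typically add near-duplicate detection such as MinHash.

```python
import hashlib

def dedup(texts):
    """Drop exact duplicates by normalized content hash."""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.strip().lower().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    return unique

corpus = ["Hello world.", "hello world.", "Training data matters."]
print(dedup(corpus))   # ['Hello world.', 'Training data matters.']
```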
The preceding sections mainly covered how big data supports AI model training. Conversely, AI technology can also assist data insight, moving toward a BI + AI data processing model.
In data processing, AI can help data analysts build analyses more easily. Originally they might have to write SQL and learn how to interact with tools and data systems. The AI era has changed human-computer interaction: one can now talk to data systems in natural language. For example, a Copilot-style programming assistant can generate SQL and help complete the various steps of the data development process, greatly improving development efficiency.
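A sketch of what such SQL generation looks like; `llm` here is a hypothetical text-completion callable, not any specific product's API, and the schema is made up.

```python
def nl_to_sql(question, schema, llm):
    """Copilot-style SQL generation from a natural language question."""
    prompt = (
        f"Table schema:\n{schema}\n"
        f"Write one SQL query answering: {question}\n"
        "Return only the SQL."
    )
    return llm(prompt)

schema = "sales(region STRING, amount DOUBLE, dt DATE)"
# nl_to_sql("total sales per region last month", schema, llm)
# -> e.g. SELECT region, SUM(amount) FROM sales
#         WHERE dt >= DATE '2024-01-01' GROUP BY region
```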
In addition, AI can power data insights. For a given piece of data, questions such as how many unique keys it has and which visualization method suits it can all be answered with AI. AI can observe and understand data from many angles, enabling automatic data exploration, intelligent data queries, chart generation, one-click analysis reports, and more. This is an intelligent analysis service.
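A toy sketch of automatic data exploration in this spirit: detect candidate unique keys and suggest a chart type per column with simple heuristics (a real service would use a model rather than fixed rules).

```python
import pandas as pd

def explore(df: pd.DataFrame):
    """Find candidate unique keys and suggest a chart type per column."""
    n = len(df)
    for col in df.columns:
        distinct = df[col].nunique()
        if distinct == n:
            hint = "candidate unique key"
        elif pd.api.types.is_numeric_dtype(df[col]):
            hint = "histogram"            # continuous values: show distribution
        elif distinct <= 10:
            hint = "bar or pie chart"     # low cardinality: show categories
        else:
            hint = "top-N bar chart"
        print(f"{col}: {distinct} distinct values -> {hint}")

explore(pd.DataFrame({"id": [1, 2, 3], "city": ["a", "b", "a"]}))
```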
Driven by big data and AI, recent years have brought some very gratifying technological developments. To stay ahead of this trend, big data and AI must be linked; only when the two complement each other can we achieve faster AI iteration and better data understanding.