Richard Sutton: Experience is the ultimate data of AI, four stages leading to the development of real AI
Introduction: The development of strong artificial intelligence has been a topic of concern in recent years. Letting AI learn from human perception and behavior rather than simply from labeled data has become the focus of many researchers. In particular, how to use the kind of everyday experience that humans acquire to inspire and build artificial intelligence that can adapt to different environments and interact with the external world has become a new avenue of exploration in some fields.
Richard Sutton, known as the father of reinforcement learning, recently proposed the idea of using experience to inspire the development of AI. He divided AI's progression from using data to using experience into four stages, and proposed a direction for building real AI (Real AI) in the future. On May 31, 2022, Richard Sutton delivered a keynote speech entitled "The Increasing Role of Sensorimotor Experience in AI" at the 2022 BAAI Conference (Beijing Academy of Artificial Intelligence), summarizing and looking ahead at methods for using experience to inspire the development of AI.
Speaker introduction: Richard Sutton, one of the founders of modern computational reinforcement learning, is a distinguished research scientist at DeepMind, a professor in the Department of Computing Science at the University of Alberta, a fellow of the Royal Society, the Royal Society of Canada, the Association for the Advancement of Artificial Intelligence, and the Alberta Machine Intelligence Institute (AMII), and a CIFAR researcher.
Sutton notes that an intelligent agent interacts with the external world: it sends actions to the world and receives perceptions (feedback) from it. This kind of interaction, which constitutes experience, is the normal setting in reinforcement learning. It is also the normal approach when an agent tries to predict the external world. However, it is uncommon in supervised learning, which is currently the most common type of machine learning. Supervised learning does not involve ordinary experience (Ordinary Experience); the model learns from special training data that differs from ordinary experience. In fact, at runtime, supervised learning systems do not learn at all.
So experience is the data of interaction, and it is the agent's way of communicating with the outside world. Experience has no meaning except in relation to other experience. There is one exception: reward, which is conveyed by a special signal. Reward represents what is good, and the agent seeks to maximize it.
In his speech, Sutton raised a core question: in what terms can intelligence ultimately be explained, objective terms or experiential terms? The former include things in the external world, outside the agent, such as states, goals, people, locations, relationships, spaces, actions, and distances. The latter include things inside the agent, such as perceptions, actions, rewards, and time steps. Sutton believes that although researchers usually think in objective concepts when communicating and writing papers, more attention should now be paid to the experience generated by the interaction between agents and the external world.
To further explain the importance of experience to intelligent agents, Richard Sutton proposed that, as experience has gradually come to be valued, AI has gone through four stages: Agenthood, Reward, Experiential State, and Predictive Knowledge. Through these four stages, AI gradually makes more use of experience and becomes more grounded, learnable, and scalable.
Agenthood means having (gaining) experience. Perhaps surprisingly, early AI systems really had no experience whatsoever. In the early period of artificial intelligence (1954-1985), most AI systems were used only to solve problems or answer questions. They had no perception and could not act. Robots were an exception, but traditional systems had only a starting state and a goal state, like the stacked building blocks in the picture below.
To reach the goal state, the solution is a sequence of actions that is guaranteed to take the AI from the starting state to the goal state. There is no perception or action in this, because the entire external world is known, deterministic, and closed, so the AI has no need to perceive or act. Researchers know what will happen, so they only need to construct a plan that solves the problem and let the AI execute it; the humans already know the plan will work.
Over the past 30 years, artificial intelligence research has focused on building intelligent agents. This shift can be seen in the fact that standard AI textbooks now treat the concept of the agent as foundational. For example, the 1995 edition of "Artificial Intelligence: A Modern Approach" states that the unifying theme of the entire book is the concept of an intelligent agent. From this perspective, the problem of AI is to describe and build agents that receive percepts from the environment and take actions. As research has developed, the standard, modern approach has become to build an agent that can interact with the outside world. Sutton believes that AI can be viewed from this perspective.
Reward means describing the goals of AI in experiential terms. It is currently proposed as an effective way to formulate all of an AI's goals, and it is the approach put forward by Sutton and his collaborators.
Reward is currently regarded as a sufficient hypothesis: intelligence and its associated abilities can be understood as serving the maximization of reward. Hence the saying that reward is enough for the agent.
However, Sutton acknowledges that this idea is often challenged. The objection is that reward is not enough to achieve intelligence: reward is just a number, a scalar, which seems insufficient to explain the goal of intelligence. A goal that comes from outside the mind and is expressed as a single number seems too small, too reductive, even too demeaning. Humans like to imagine bigger goals, such as taking care of their families, saving the world, world peace, and making the world a better place. People like to think their goals are grander than maximizing their own happiness and comfort.
Just as some researchers find reward an unsatisfying way to formulate goals, researchers have also found advantages in formulating goals through reward. The goals that reward formulates may seem too small, but people can make progress with them: such goals are well and clearly defined, and easy to learn from. Formulating goals entirely through experience, by contrast, remains a challenge.
Sutton believes that it is challenging to imagine formulating goals entirely through experience. Looking back at history, AI was not originally interested in reward, and to some extent it still is not. Whether in early problem-solving systems or in the latest edition of the AI textbook, the goal is still defined as a world state (World State) to be achieved rather than in experiential terms. Such a goal may still be a specific arrangement of "building blocks" rather than a perceptual outcome to be achieved.
Of course, the latest textbooks do contain chapters on reinforcement learning and note that these AIs use a reward mechanism. In addition, reward is already a common way of formulating goals, and it can be formalized with Markov decision processes. Even for researchers who criticize reward as inadequate for formulating goals (such as Yann LeCun), reward is at least the "cherry" on top of the "cake" of intelligence, and it is very important.
In the next two stages, Sutton introduces how to understand the external world in experiential terms, but before doing so he first explains what experience refers to.
As shown in the sequence in the figure below (not real data), at each time step the system receives sensory signals and also emits action signals. A sensory signal may lead to some actions, and those actions may lead to the next sensory signal. At any time, the system needs to attend to its recent actions and recent signals so that it can determine what will happen next and what it should do.
As shown in the figure, this is the array of input and output signals of a running agent program. The first column is the time step; each step can be thought of as an instant of 0.1 or 0.01 seconds. The action signal columns are binary, shown in gray and white. Then come the sensory signal columns: the first four are binary (also gray and white), the next four take one of four values from 0 to 3, shown in red, yellow, blue, and green, and the last column is a continuous variable representing reward. In the figure, the researchers removed the numbers and left only the colors to make it easier to look for patterns. Sutton believes that knowledge is the understanding of patterns found in the data of sensorimotor experience.
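To make the layout of such an experience stream concrete, here is a minimal sketch in Python. It is not the data from the talk: the `ToyWorld` class, its random dynamics, the two-bit action, and the rule that one sensory bit copies the last action bit are all assumptions invented for illustration. Only the column layout (time step, binary actions, four binary signals, four four-valued signals, a continuous reward) follows the description above.

```python
import random
from dataclasses import dataclass


@dataclass
class Step:
    """One row of the experience stream."""
    t: int
    action: tuple      # binary action bits (gray/white in the figure)
    sensation: tuple   # 4 binary signals followed by 4 signals in {0, 1, 2, 3}
    reward: float      # continuous reward signal


class ToyWorld:
    """Hypothetical environment; its dynamics are made up for this sketch."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # Assumed regularity: the first binary signal copies the last action bit.
        binary = (action[-1],) + tuple(self.rng.randint(0, 1) for _ in range(3))
        colors = tuple(self.rng.randint(0, 3) for _ in range(4))
        reward = self.rng.uniform(-1.0, 1.0)
        return binary + colors, reward


def run(num_steps, seed=0):
    """Let a random agent interact with the toy world and record every step."""
    rng = random.Random(seed)
    world = ToyWorld(seed)
    stream = []
    for t in range(num_steps):
        action = (rng.randint(0, 1), rng.randint(0, 1))
        sensation, reward = world.step(action)
        stream.append(Step(t, action, sensation, reward))
    return stream
```

Calling `run(20)` yields twenty rows that could be drawn as colored cells; any regularity an agent finds in them has to be expressed purely in terms of these columns.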
In this case, Sutton listed four typical patterns:
1. The last bit of the action is the same as the sensory signal immediately following it. If the action at a certain time step is white, the first sensory signal thereafter is also white, and the same is true for gray.
2. When a red pixel appears, the next time step shows a green pixel. Looking at a wider range of data, one finds that after red and green pixels appear one after another, a blue pixel appears every other time step.
3. The last three columns of data often contain long runs of the same color. Once a color begins, it persists for many time steps, eventually forming stripes, such as long runs of red, green, or blue.
4. What the AI predicts often cannot be observed immediately, so a return value (Return) is added to the data to represent a prediction of the reward to come. The green band in the box indicates that the upcoming rewards will be more green than red. This represents the current prediction of the reward.
The special shaded area represents the weighting function. There are green and red bands in the shaded area of the weighting function. Here, rewards that arrive sooner are given higher weight in the return. As the return is moved along in time, one can see how the color and value of the prediction track the rewards that actually arrive. This return value is a prediction, and it can be learned from experience.
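The weighting described above can be written down in a few lines. The sketch below assumes the common exponential weighting, in which a reward arriving k steps later is multiplied by `gamma ** k`; the function name `discounted_returns` and the value of `gamma` are illustrative choices, not something specified in the talk.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G[t] = rewards[t] + gamma * rewards[t+1] + gamma**2 * rewards[t+2] + ...

    rewards[t] is taken to be the reward that follows time step t, so sooner
    rewards get higher weight than later ones (assumed exponential weighting).
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    # Work backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns
```

For example, `discounted_returns([1.0, 0.0, 2.0], gamma=0.5)` gives `[1.5, 1.0, 2.0]`: the reward of 2 counts fully at the last step but only a quarter as much two steps earlier.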
Sutton believes that this return is essentially learned not from events that have already occurred but from the temporal-difference (TD) signal. The most important such prediction is the value function: in this case, the predicted return is in fact a value function, representing the sum of future rewards. For a general form that can refer to the future values of more complex signals, one can use General Value Functions (GVFs). A general value function can predict any signal, not just reward; it can use any temporal envelope, not just an exponential one. It can also be conditioned on any policy, and it can predict a very large number and a wide range of things. Of course, Sutton notes that how hard a prediction is to compute depends on the form of the thing being predicted. When using general value functions for prediction, the predicted quantity needs to be expressed in a form that is easy to learn and computationally efficient.
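As a rough illustration of learning such a prediction from the temporal-difference signal, the following sketch applies tabular TD(0) to a general value function. It assumes discrete states, a `cumulant` function standing in for "any signal, not just reward", and a state-dependent continuation function `gamma` standing in for the temporal envelope; the policy is implicit in whatever behavior generated the transitions. All names are hypothetical, and this is a textbook-style sketch rather than the method used in the talk.

```python
def td0_gvf(transitions, cumulant, gamma, num_states, alpha=0.1):
    """Tabular TD(0) estimate of a general value function.

    transitions: iterable of (state, next_state) pairs from the agent's experience
    cumulant:    function (state, next_state) -> signal being accumulated
    gamma:       function next_state -> continuation weight in [0, 1)
    """
    v = [0.0] * num_states
    for s, s_next in transitions:
        # TD error: one-step target minus the current prediction.
        target = cumulant(s, s_next) + gamma(s_next) * v[s_next]
        v[s] += alpha * (target - v[s])
    return v


# Usage: a 3-state cycle 0 -> 1 -> 2 -> 0; the signal is 1 on entering state 0.
experience = [(t % 3, (t + 1) % 3) for t in range(5000)]
values = td0_gvf(
    experience,
    cumulant=lambda s, s_next: 1.0 if s_next == 0 else 0.0,
    gamma=lambda s_next: 0.5,
    num_states=3,
)
```

In this toy cycle the predictions can be checked by hand: solving the three one-step equations gives 2/7, 4/7, and 8/7 for states 0, 1, and 2, and the learned `values` approach those numbers.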
When it comes to the word "state", what many studies mean is the world state (World State), which is an objective concept. Such a state is a symbolic description of the objective world that matches the situation of the world itself, for example the positions of building blocks (C is on A). More recently, some researchers (such as Judea Pearl) proposed probabilistic graphical models, which represent probability distributions over world states. There are probabilistic relationships between events, as in "It's raining outside; is the grass wet?"
Another kind of state is the belief state (Belief State). In this concept, the state is a probability distribution over discrete world states, and the corresponding framework is called POMDPs (partially observable Markov decision processes): there are hidden state variables, only some of which are observable, and the problem can be modeled as a Markov decision process.
The above are all objective states, far removed from experience; they are early attempts to describe the state of the world.
In contrast, there is the experiential state. Sutton believes that the experiential state is a state of the world defined entirely in terms of experience. It is a summary of past experience that is useful for predicting and controlling the experience to come.
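For readers unfamiliar with belief states, the sketch below shows the standard Bayes-filter update over discrete hidden states. The matrices `transition` and `obs_likelihood`, the fixed (implicit) action, and the rain/wet-grass numbers in the usage lines are all assumptions chosen for illustration.

```python
def belief_update(belief, transition, obs_likelihood, observation):
    """One Bayes-filter step for a discrete belief state.

    belief[s]            : current probability of hidden state s
    transition[s][s2]    : P(next state s2 | state s), for one fixed action
    obs_likelihood[s2][o]: P(observation o | next state s2)
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correct: weight by how well each state explains the observation.
    unnormalized = [obs_likelihood[s2][observation] * predicted[s2]
                    for s2 in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [p / total for p in unnormalized]


# Usage: states (0 = raining, 1 = dry), observations (0 = wet grass, 1 = dry grass).
stay = [[1.0, 0.0], [0.0, 1.0]]          # weather assumed not to change
likelihood = [[0.9, 0.1], [0.2, 0.8]]    # made-up observation probabilities
posterior = belief_update([0.5, 0.5], stay, likelihood, observation=0)
```

Seeing wet grass moves the belief in rain from 0.5 to 9/11. Note that every quantity here is defined in terms of hidden world states, which is what makes the belief state an objective rather than experiential notion.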
This approach of summarizing past experience to predict the future is already reflected in research. For example, in Atari games, a standard reinforcement learning task, researchers use the last four video frames to construct an experiential state and then predict subsequent behavior. Some methods based on LSTM networks can also be thought of as making predictions from a kind of experiential state.
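The four-frame construction mentioned above is simple enough to sketch directly. The `FrameStack` class below is a hypothetical helper, not code from any particular Atari agent; frames can be any objects (here strings stand in for images).

```python
from collections import deque


class FrameStack:
    """Keeps the k most recent frames as a simple experiential state."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def reset(self, first_frame):
        # At the start of an episode, fill the stack with copies of the first frame.
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.state()

    def push(self, frame):
        # Appending to a full deque silently drops the oldest frame.
        self.frames.append(frame)
        return self.state()

    def state(self):
        return tuple(self.frames)


stack = FrameStack(k=4)
stack.reset("f0")
for name in ["f1", "f2", "f3", "f4", "f5"]:
    state = stack.push(name)
# state == ("f2", "f3", "f4", "f5")
```

The state is built only from what the agent has observed, with no reference to hidden game variables, which is why it counts as experiential.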
The experiential state can also be updated recursively. The experiential state is a function that summarizes what happened in the past. Since the AI needs to access the experiential state at every moment to predict the next event, its update is recursive: the current moment accesses only the experiential state of the previous moment, and that state is itself a summary of all events that occurred before it. At the next moment, only the experiential state of the current moment is accessed, and it too is a summary of everything that has occurred in the past.
The following figure shows how the agent's experiential state is constructed. The red arrows indicate the agent's basic working signals, including sensation, action, and reward. The blue arrows mark the flow of the experiential state (representation), which is output by the perception component; that component is responsible for updating the experiential state at each time step. The updated state is then used to select actions or to make other updates.
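The recursive form can be sketched as a function that takes the previous state, the last action, and the new observation, and returns the new state. The particular update below (an exponentially decaying trace of numeric actions and observations) is only an assumed example; in practice the update function would itself be learned.

```python
def update_state(prev_state, action, observation, decay=0.5):
    """Recursive update: new state depends only on the previous state and the
    newest action and observation, never on the raw history."""
    features = (float(action), float(observation))
    return tuple(decay * p + (1.0 - decay) * f
                 for p, f in zip(prev_state, features))


def summarize(history, initial_state=(0.0, 0.0)):
    """Fold a whole history of (action, observation) pairs into one state."""
    state = initial_state
    for action, observation in history:
        state = update_state(state, action, observation)
    return state


history = [(1, 0), (0, 1), (1, 1)]
state_now = summarize(history)          # (0.625, 0.75)
```

Updating incrementally gives the same result as reprocessing the full history: `update_state(summarize(history[:-1]), 1, 1)` equals `summarize(history)`, so the agent never needs to store the past itself.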
Knowledge, such as "Joe Biden is the President of the United States", "The Eiffel Tower is in Paris", etc., is a description of the external objective world and is not empirical. However, knowledge like "It is expected to take X hours to do something" is empirical knowledge. There is a huge difference between empirical knowledge and objective knowledge, which is also a challenging point for AI research.
Previous AI research tended to treat knowledge as something objective, although some recent research has looked at the problem from an experiential perspective. Early AI systems had no experience and therefore could not make predictions. More modern AI still treats knowledge as objective. The more advanced form is the probabilistic graphical model, but in many cases it describes the probabilistic relationship between two things that happen at the same time, whereas prediction should concern sequences of events over time.
Prediction based on sequences of events is knowledge with clear semantics: if something is predicted to happen, the AI can compare the prediction with the actual outcome. This kind of predictive model can be considered a new kind of world knowledge, namely predictive knowledge. Among forms of predictive knowledge, Sutton believes that the most cutting-edge are the General Value Function and the Option Model.
Sutton divides world knowledge into two categories: knowledge about the world state, and knowledge about world state transitions. An example of the latter is a predictive model of the world. The world model here need not take a low-level form such as a Markov decision process or difference equations; it can operate on abstract states extracted from the experiential state. Because predictions are conditioned on entire courses of behavior, in the option model the agent follows a policy until a termination condition is met. Using an option's transition model, it is possible to predict the state that results from carrying out that behavior. Taking daily life as an example, someone who wants to go into the city will predict the distance and time to the city center; for behavior that exceeds a certain threshold (such as walking toward the city for 10 minutes), they will further predict the resulting state, such as being tired.
With models of temporally extended behavior like this, the scale of the knowledge that can be represented becomes very large. For example, one can predict the state of the world that results from one behavior, then predict the outcome of the next behavior from that state, and so on.
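The chaining of predictions over extended behaviors can be illustrated with a small sketch. Here each option model is reduced to a function from a state to a predicted resulting state and a predicted duration; the `OptionModel` class, the dictionary-based state, and the walking and resting numbers are all hypothetical and chosen only to echo the going-to-the-city example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class OptionModel:
    """Predicts the outcome of running an option until it terminates."""
    name: str
    predict: Callable[[Dict], Tuple[Dict, float]]  # state -> (next state, minutes)


def walk_to_city(state):
    # Made-up prediction: a 10-minute walk ends in the city and leaves you tired.
    return {**state, "location": "city", "tired": True}, 10.0


def rest(state):
    # Made-up prediction: a 5-minute rest removes tiredness.
    return {**state, "tired": False}, 5.0


def rollout(state, models):
    """Chain option models: each prediction becomes the input to the next."""
    total_minutes = 0.0
    for model in models:
        state, minutes = model.predict(state)
        total_minutes += minutes
    return state, total_minutes


plan = [OptionModel("walk", walk_to_city), OptionModel("rest", rest)]
final_state, minutes = rollout({"location": "home", "tired": False}, plan)
# final_state == {"location": "city", "tired": False}, minutes == 15.0
```

Because each model jumps over a whole stretch of behavior, a short chain of predictions can cover a long span of time, which is what allows this kind of knowledge to scale.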
Summarizing the role of experience in the development of AI research, Sutton said that experience is the basis of world knowledge. Human beings understand and influence the world through perception and action. Experience is the only way humans obtain information and take action, and it is inseparable from them. Unfortunately, because experience is so subjective and personal, humans still do not like to think and express themselves in experiential terms. Experience seems too alien, counterintuitive, fleeting, and complex. It is also subjective and private, and it is almost impossible to communicate to others or to verify.
Sutton believes that experience is very important for AI for the following reasons. First, experience comes from the AI's ordinary operation, so obtaining it is free and automatic. At the same time, the field of AI has large amounts of data available for computation, so experience provides a path to understanding the world. If every fact about the world is experiential, then AI can learn its understanding of the world from experience and verify it against experience.
In summary, Sutton believes that over the past 70 years of AI development, AI has gradually placed more value on experience: gaining experience, setting goals in terms of experience, and obtaining state and knowledge from experience. At each stage, the experiential approach, though less familiar to humans, has become more important, and it has the advantages of being grounded, learnable, and scalable.
Sutton believes that AI has not yet completed stages three and four in its use of experience, but that this trend will go further and further. He believes that reducing everything to experience is a feasible path to true AI. Although very challenging, this is a picture in which intelligence is achieved by making sense of the stream of data. Finally, Sutton condensed the four stages of focusing on sensorimotor experience into a slogan: "Data drives artificial intelligence, and experience is the ultimate data. If you can make good use of experience, we can promote the development of artificial intelligence more quickly and effectively."