
'Father of Machine Learning' Mitchell writes: How AI accelerates scientific development and how the United States seizes opportunities

王林 (Original) · 2024-07-29


Editor | ScienceAI

Recently, Tom M. Mitchell, a professor at Carnegie Mellon University known as the "Father of Machine Learning", wrote a new AI for Science white paper addressing the questions: How can artificial intelligence accelerate scientific development, and how can the U.S. government help achieve this goal?


ScienceAI has compiled the full text of the original white paper without changing its original meaning. The content is as follows.

The field of artificial intelligence has recently made significant progress, including large language models such as GPT, Claude, and Gemini, raising the possibility that one of AI's most positive impacts may be to greatly accelerate research advances in a variety of scientific fields, from cell biology to materials science to weather and climate modeling to neuroscience. Here we briefly summarize this AI-and-science opportunity and what the U.S. government can do to seize it.


The Opportunity for Artificial Intelligence and Science

The vast majority of scientific research in almost all fields today can be classified as "lone ranger" science.

In other words, a scientist and their research team of a dozen or so researchers come up with an idea, conduct experiments to test it, write up and publish the results, perhaps share their experimental data on the Internet, and then repeat the process.

Other scientists can consolidate these results by reading the published papers, but this process is error-prone and extremely inefficient for several reasons:

(1) Individual scientists cannot possibly read all of the articles published in their field, and are therefore partially blind to other relevant studies; (2) experiments described in journal publications necessarily omit many details, making it difficult for others to replicate the results and build on them; (3) analysis of a single experimental data set is often performed in isolation, failing to incorporate data (and therefore valuable information) from related experiments conducted by other scientists.

Over the next ten years, artificial intelligence can help scientists overcome all three of these problems.

AI can transform this "lone ranger" scientific research model into a "community scientific discovery" model. In particular, AI can be used to create a new type of computer research assistant that helps human scientists overcome these problems by:

  • Analyzing complex data sets (including those built from many experiments conducted in multiple laboratories) rather than conducting isolated analyses on a single, much smaller, and less representative data set. Basing analyses on data sets orders of magnitude larger than any human could examine enables more comprehensive and accurate results.
  • Using large language models such as GPT to read and digest every relevant publication in the field, helping scientists form new hypotheses based not only on experimental data from their own and other laboratories, but also on the hypotheses and arguments in the published research literature, leading to better-informed hypotheses than would be possible without such natural-language AI tools.
  • Creating "foundation models" and training them on many different types of experimental data collected by many labs and scientists, thereby bringing the field's growing knowledge together into a single, computer-executable model. These executable foundation models serve the same purpose as equations such as f = ma: they predict certain quantities from other observed quantities. Unlike classical equations, however, they can capture empirical relationships among hundreds of thousands of variables rather than just a handful (see the sketch after this list).
  • Automating or semi-automating the design of new experiments and their robotic execution, thereby accelerating relevant new experiments and improving the reproducibility of scientific results.
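
To make the "executable foundation model" idea above concrete, here is a minimal sketch (ours, not the white paper's) of the usage pattern it describes: a small neural network trained on pooled experimental measurements that can then be queried like an equation, predicting some quantities from others. The architecture, variable counts, and random placeholder data are illustrative assumptions only.

```python
# Minimal sketch: a foundation model as an "executable equation".
# Given values of some observed variables, predict the remaining ones.
# All shapes and data below are illustrative placeholders.
import torch
import torch.nn as nn

N_INPUT_VARS = 512    # observed quantities (stand-in for many thousands)
N_OUTPUT_VARS = 128   # quantities to predict

class ExecutableModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_INPUT_VARS, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Linear(1024, N_OUTPUT_VARS),
        )

    def forward(self, x):
        # Like f = ma, but relating hundreds of variables at once.
        return self.net(x)

model = ExecutableModel()

# Pooled experimental data from many labs (random placeholders here).
measurements = torch.randn(10_000, N_INPUT_VARS)
targets = torch.randn(10_000, N_OUTPUT_VARS)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(measurements), targets)
    loss.backward()
    optimizer.step()

# Once trained, the model is "executable": query it like an equation.
with torch.no_grad():
    prediction = model(torch.randn(1, N_INPUT_VARS))
```

The design point is the usage pattern rather than the architecture: train once on data pooled from many laboratories, then query the result the way one would apply f = ma.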


What scientific breakthroughs might this paradigm shift in scientific practice bring?

Here are a few examples:

  • Reducing the development time and cost of new vaccines for emerging disease outbreaks by a factor of ten.
  • Accelerating materials research, potentially leading to breakthrough products such as room-temperature superconductors and thermoelectric materials that convert heat into electricity without producing emissions.
  • Combining a never-before-attempted volume and diversity of cell biology experimental data to form a "foundation model" of human cell function, making it possible to quickly simulate the results of many potential experiments before undertaking the more expensive step of conducting in vivo experiments in the laboratory.
  • Combining experimental data from across neuroscience (from single-neuron data to whole-brain fMRI imaging) to build a "foundation model" of the human brain at multiple levels of detail, integrating data of unprecedented scale and diversity into a model that predicts the neural activity the brain uses to encode different types of thoughts and emotions, how those thoughts and emotions are evoked by different stimuli, the effects of drugs on neural activity, and the effectiveness of different treatments for mental disorders.
  • Improving our ability to predict the weather, both by tailoring forecasts to highly localized areas (e.g., individual farms) and by extending how far into the future forecasts remain accurate.


What can the US government do to seize this opportunity?

Translating this opportunity into reality requires several elements:

Large amounts of experimental data

One lesson from text-based foundation models is that the more data they are trained on, the more capable they become. Experienced scientists know equally well the value of more, and more diverse, experimental data. To achieve progress of many orders of magnitude in science, and to train the kinds of foundation models we want, we need very significant advances in our ability to share and jointly analyze diverse datasets contributed by the entire scientific community.
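
This "more data, more capable" relationship has been observed empirically for language models as a rough power law in training-set size. As a minimal illustration (with made-up numbers, not data from the white paper), one can fit such a curve and extrapolate:

```python
# Minimal sketch: fitting a power-law "scaling curve" relating
# training-set size to model error. All numbers are made up.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (dataset_size, validation_loss) measurements.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = np.array([4.1, 3.2, 2.6, 2.1, 1.8])

def power_law(n, a, alpha, c):
    # loss ~ a * n**(-alpha) + c, the form reported in LLM scaling studies
    return a * n ** (-alpha) + c

params, _ = curve_fit(power_law, sizes, losses, p0=(10.0, 0.1, 1.0))
a, alpha, c = params
print(f"fitted exponent alpha = {alpha:.3f}")

# Extrapolate: predicted loss with 10x more training data.
print(f"predicted loss at 1e11 examples: {power_law(1e11, *params):.2f}")
```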

The ability to access scientific publications and read them with computers

A key part of the opportunity here is to change the current situation: whereas a scientist is unlikely to read even 1% of the relevant publications in their field, a computer can read 100% of them, summarize them and their relevance to current scientific questions, and provide a conversational interface for discussing their content and implications. This requires not only access to the online literature, but also AI research to build such a "literature assistant."
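
As one illustration of what such a literature assistant might look like (our sketch, not part of the white paper), the core loop is: embed every paper, retrieve those most relevant to a question, and have a language model summarize them. The `embed` and `generate` helpers below are hypothetical stand-ins for real embedding and language models:

```python
# Minimal sketch of a "literature assistant": index every paper in a
# field, retrieve the ones most relevant to a question, and summarize.
# embed() and generate() are hypothetical stand-ins for real models.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical text-embedding model; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    """Hypothetical LLM call; a real system would query an actual model."""
    return f"[summary generated from a prompt of {len(prompt)} chars]"

papers = {  # paper_id -> full text (placeholder corpus)
    "smith2023": "Protein folding dynamics under thermal stress ...",
    "lee2024": "A foundation model for single-cell transcriptomics ...",
}
index = {pid: embed(text) for pid, text in papers.items()}

def ask(question: str, top_k: int = 1) -> str:
    q = embed(question)
    # Rank papers by cosine similarity to the question.
    ranked = sorted(index, key=lambda pid: -float(q @ index[pid]))
    context = "\n\n".join(papers[pid] for pid in ranked[:top_k])
    return generate(f"Question: {question}\n\nRelevant papers:\n{context}")

print(ask("What is known about protein folding under heat stress?"))
```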

Computing and Network Resources

Text-based foundation models such as GPT and Gemini are known for the vast computing resources consumed during their development, and developing foundation models in different scientific fields will also require substantial computing. However, the computational demands of many AI-for-science efforts are likely to be much smaller than those of training LLMs such as GPT, and can therefore be met with investments similar to those already being made by government research labs.

For example, AlphaFold, the AI model that has revolutionized protein analysis for drug design, required far less training computation than text-based foundation models like GPT and Gemini. To support data sharing we need high-capacity computer networks, but the current Internet already provides a sufficient starting point for transferring large experimental data sets. The hardware cost of supporting AI-driven scientific advances is therefore likely to be quite low compared to the potential benefits.

New Machine Learning and AI Methods

Current machine learning methods are extremely useful for discovering statistical regularities in data sets too large for humans to examine (for example, AlphaFold was trained on large numbers of protein sequences and their carefully measured 3D structures). A key part of the new opportunity is to extend current machine learning methods, which discover statistical correlations in data, in two important directions: (1) moving from finding correlations to finding causal relationships in the data, and (2) moving from learning only from large structured datasets to learning from both large structured datasets and the research literature at large; that is, learning, as human scientists do, from experimental data together with the hypotheses and arguments others have expressed in natural language. The recent emergence of LLMs with advanced capabilities for digesting, summarizing, and reasoning about large text collections could provide the basis for this new class of machine learning algorithms.
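
To illustrate direction (1) with a toy example of ours (not from the white paper): observational correlation alone cannot distinguish "X causes Y" from a hidden common cause, but a planned intervention can, which is exactly why pairing machine learning with automated experimentation matters:

```python
# Toy sketch: correlation vs. causation. A hidden confounder Z drives
# both X and Y, so X and Y correlate even though X does not cause Y.
# An intervention (setting X directly) reveals the truth.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational data: Z -> X and Z -> Y; there is no X -> Y effect.
z = rng.normal(size=n)
x = z + 0.1 * rng.normal(size=n)
y = z + 0.1 * rng.normal(size=n)
print("observational corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 3))

# Interventional data: an experiment sets X independently of Z.
x_do = rng.normal(size=n)            # do(X): break the Z -> X link
y_do = z + 0.1 * rng.normal(size=n)  # Y still depends only on Z
print("interventional corr(X, Y):", round(np.corrcoef(x_do, y_do)[0, 1], 3))

# High correlation in observational data, near zero under intervention:
# evidence that X does not cause Y, discoverable only by experiment.
```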

What should the government do? The key is to support the four elements above and to rally the scientific community around exploring new AI-based methods for advancing their research. Accordingly, the government should consider the following actions:


Explore specific opportunities in specific areas of science. Fund multi-institutional research teams in many scientific areas to articulate visions and preliminary results demonstrating how AI can be used to significantly accelerate progress in their fields, and what would be needed to scale this approach. This work should not be funded as grants to individual institutions, because the greatest advances may come from integrating data and research from many scientists at many institutions. Instead, it is likely to be most effective if carried out by teams of scientists drawn from many institutions, who propose opportunities and approaches that inspire engagement from the scientific community at large.

Accelerate the creation of new experimental datasets to train new foundation models, and make the data available to the entire community of scientists:

  • Create data sharing standards to enable one scientist to conveniently use experimental data created by different scientists, and lay the foundation for national data resources in each relevant scientific field. Note that there have been previous successes in developing and using such standards that can provide a starting template for standards efforts (e.g., the success of data sharing during the Human Genome Project).

  • Create and support data-sharing websites for every relevant field. Just as GitHub has become the go-to site for software developers to contribute, share, and reuse code, a GitHub for scientific datasets could serve as both a data repository and a search engine for discovering the data sets most relevant to a specific topic, hypothesis, or planned experiment.

  • Study how to build incentive mechanisms to maximize data sharing. Currently, scientific fields vary widely in the extent to which individual scientists share their data and the extent to which for-profit organizations use their data for basic scientific research. Building a large, shareable national data resource is integral to the scientific opportunity for AI, and building a compelling incentive structure for data sharing will be key to success.

  • Where appropriate, fund the development of automated laboratories (e.g., robotic laboratories for chemistry and biology experiments that many scientists can use via the Internet) to conduct experiments efficiently and generate data in a standard format. A major benefit of creating such laboratories is that they will also promote the development of standards that precisely specify the experimental procedures followed, thereby increasing the reproducibility of experimental results. Just as we can benefit from a GitHub for datasets, we can also benefit from a similar GitHub for sharing, modifying, and reusing components of experimental protocols (a minimal sketch of a machine-readable protocol follows below).
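
As a small illustration of what a machine-readable protocol standard might look like (our sketch; the schema and field names are invented for the example), a protocol can be expressed as structured data that a robotic lab or another scientist can rerun exactly:

```python
# Minimal sketch: an experimental protocol as structured, shareable data.
# The schema and field names below are illustrative assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Step:
    action: str                  # e.g., "dispense", "incubate", "measure"
    params: dict = field(default_factory=dict)

@dataclass
class Protocol:
    name: str
    version: str
    steps: list[Step] = field(default_factory=list)

protocol = Protocol(
    name="cell-viability-assay",
    version="1.0.0",
    steps=[
        Step("dispense", {"reagent": "media", "volume_uL": 100}),
        Step("incubate", {"temp_C": 37, "duration_min": 30}),
        Step("measure", {"instrument": "plate_reader", "wavelength_nm": 490}),
    ],
)

# Serialize to a standard format for sharing ("a GitHub for protocols").
print(json.dumps(asdict(protocol), indent=2))
```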


Creating a new generation of artificial intelligence tools requires:

  • Funding basic AI research aimed specifically at methods for scientific research. This should include developing "foundation models" in a broad sense, as tools to accelerate research in different fields and to accelerate the shift from "lone ranger" science to a more powerful "community scientific discovery" paradigm.

  • Specifically supporting research on AI systems that read the research literature, critique stated hypotheses and suggest improvements, and help scientists draw on results from the scientific literature in ways directly relevant to their current questions.

  • Specifically supporting research that extends machine learning from the discovery of correlations to the discovery of causation, especially in settings where new experiments can be planned and executed to test causal hypotheses.

  • Specifically supporting research that extends machine learning algorithms from taking only big data as input to taking both large experimental data sets and a field's full research literature as input, in order to connect the statistical regularities found in experimental data with the hypotheses, explanations, and arguments discussed in the literature.

Related content:

https://x.com/tommmitchell/status/1817297827003064715
https://docs.google.com/document/d/1ak_XRk5j5ZHixHUxXeqaiCeeaNxXySOlH1kIeEH3DXE/edit?pli=1

