
When GPT-4 learns to read pictures and texts, a productivity revolution is unstoppable

青灯夜游 · 2023-03-31 22:38

Researchers from academia and industry recently held in-depth discussions around intelligent image and text processing technology and its applications across multiple scenarios.

"It's too complicated!"

After experiencing the back-to-back bombardment of GPT-4 and Microsoft 365 Copilot, I believe many people share this feeling.

Compared with GPT-3.5, GPT-4 has improved significantly in many respects. On a simulated bar exam, for instance, it climbed from roughly the bottom 10% of test takers to roughly the top 10%. Of course, most people have little feel for these professional exams. But one picture makes clear how dramatic the improvement is:


Source: a Weibo post by Tang Jie, professor in the Department of Computer Science at Tsinghua University. Link: https://m.weibo.cn/detail/4880331053992765

This is a physics problem, and GPT-4 is asked to solve it step by step from the picture and the accompanying text, an ability that GPT-3.5 (here referring to the model ChatGPT relied on before the upgrade) did not have. On the one hand, GPT-3.5 was trained only on text, so it cannot understand the picture in the question. On the other hand, its problem-solving ability was also weak; it could be stumped by a simple "chickens and rabbits in one cage" puzzle. This time, both problems seem to have been solved beautifully.

Just as everyone was marveling at this, Microsoft dropped another bombshell: GPT-4's capabilities have been integrated into a new application called Microsoft 365 Copilot. With its powerful image and text processing capabilities, Microsoft 365 Copilot can not only help you draft all kinds of documents, but also easily turn documents into PPT slides and automatically summarize Excel data into charts...

When GPT-4 learns to read pictures and texts, a productivity revolution is unstoppable

From the technology's debut to the product launch, OpenAI and Microsoft gave the public only two days to catch up. Seemingly overnight, a new productivity revolution had arrived.

With changes happening so fast, both academia and industry are, to varying degrees, in a state of confusion and FOMO (fear of missing out). Right now, everyone wants the answer to one question: what can we do in this wave, and what opportunities are there? From the demos Microsoft released, one clear entry point emerges: intelligent image and text processing.

In real-world scenarios, a great deal of work across industries involves image and text processing: organizing unstructured data into charts, writing reports based on charts, extracting useful information from masses of mixed text and images, and so on. Because of this, the impact of this revolution may be far more profound than many people imagine. A blockbuster paper recently released by OpenAI and the Wharton School estimates this impact: about 80% of the U.S. workforce may have at least 10% of their work tasks affected by the introduction of GPTs, and about 19% of workers may see at least 50% of their tasks affected. It is foreseeable that a large share of that work involves image and text intelligence.

Given such an entry point, what research or engineering efforts are worth exploring? At a recent CSIG Enterprise Tour event, hosted by the China Society of Image and Graphics (CSIG) and co-organized by Hehe Information and the CSIG Technical Committee on Document Image Analysis and Recognition, researchers from academia and industry held in-depth discussions around "intelligent image and text processing technology and multi-scenario applications", which may offer some inspiration to researchers and practitioners in this field.

Processing images and text starts with low-level vision

As mentioned earlier, GPT-4's image and text processing capabilities are stunning. Beyond the physics question above, OpenAI's technical report cites other examples, such as asking GPT-4 to interpret figures from papers:

When GPT-4 learns to read pictures and texts, a productivity revolution is unstoppable

However, before this technology can be widely deployed, a great deal of groundwork may still be needed, and low-level vision is part of it.

The defining characteristic of low-level vision is simple: the input is an image, and the output is also an image. Image preprocessing, filtering, restoration, and enhancement all fall into this category.
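
To make the "image in, image out" pattern concrete, here is a minimal sketch using OpenCV; the filenames and filter parameters are placeholders chosen for illustration. It denoises a scanned page and then sharpens it with classic unsharp masking:

```python
import cv2

# Minimal "image in, image out" example: denoise a scanned page,
# then sharpen it. Filenames and parameters are placeholders.
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Non-local means denoising; h controls filter strength.
denoised = cv2.fastNlMeansDenoising(img, h=10)

# Unsharp masking: subtract a blurred copy to boost edges.
blurred = cv2.GaussianBlur(denoised, (0, 0), sigmaX=3)
sharpened = cv2.addWeighted(denoised, 1.5, blurred, -0.5, 0)

cv2.imwrite("scan_enhanced.png", sharpened)
```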

"The theories and methods of underlying vision are widely used in many fields, such as mobile phones, medical image analysis, security monitoring, etc. Enterprises and institutions that value the quality of images and video content must pay attention to the direction of underlying vision. Research. If the underlying vision is not done well, many high-level vision systems (such as detection, recognition, and understanding) cannot be truly implemented." Hehe Information Image Algorithm R&D Director Guo Fengjun said during the CSIG Enterprise Tour event sharing .

How should we understand this remark? A few examples help:


Unlike the ideal conditions shown in OpenAI's and Microsoft's demos, real-world document images come in challenging forms, such as deformation, shadows, and moiré patterns, all of which make subsequent recognition and understanding harder. The goal of Guo Fengjun's team is to solve these problems at the very first stage.

To this end, they divided the task into several modules: region-of-interest (RoI) extraction, deformation correction, image restoration (such as shadow and moiré removal), and quality enhancement (such as sharpening and clarity improvement).
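
Because every such module maps an image to an image, the modules compose naturally. The sketch below is a toy illustration of that composition, with hand-picked kernel sizes, not Hehe Information's actual implementation; shadow removal here uses the classic background-division trick:

```python
import cv2
import numpy as np

def remove_shadow(gray):
    # Estimate the slowly varying illumination (including soft shadows)
    # with a large morphological closing, then divide it out.
    background = cv2.morphologyEx(gray, cv2.MORPH_CLOSE,
                                  np.ones((31, 31), np.uint8))
    return cv2.divide(gray, background, scale=255)

def binarize(gray):
    # Adaptive thresholding as a simple quality-enhancement stage.
    return cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 31, 15)

# Each stage maps an image to an image, so modules chain freely;
# RoI extraction and dewarping would slot in ahead of these stages.
PIPELINE = [remove_shadow, binarize]

def process(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    for stage in PIPELINE:
        img = stage(img)
    return img
```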

Combining these technologies enables some very interesting applications. After years of exploration, the modules have achieved quite good results, and the related technology has been applied in the company's intelligent text recognition product "Scanner".

From characters to tables to full documents: reading images and text step by step

Once the image has been cleaned up, the next step is to recognize its content. This is very fine-grained work, sometimes done at the level of individual characters.

In many real-world scenarios, characters do not necessarily appear in standardized printed form, which makes character recognition challenging.


Take education as an example. As a teacher, you would certainly want AI to grade all your students' homework for you and, at the same time, summarize how well students have mastered each piece of knowledge, ideally also flagging wrong answers and typos and offering correction suggestions. Du Jun, associate professor at the National Engineering Laboratory for Speech and Language Information Processing at the University of Science and Technology of China, is working in this area.

Specifically, they built a radical-based system for Chinese character recognition, generation, and evaluation: compared with modeling whole characters, the number of radical combinations is far smaller. Recognition and generation are jointly optimized, somewhat like the way literacy and handwriting reinforce each other when students learn. And whereas most previous evaluation work focused on the grammar level, Du Jun's team designed a method that finds typos directly in the image and explains the errors in detail, which is very useful in scenarios such as intelligent grading.
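
The following toy sketch shows the intuition behind radical-based modeling; the decomposition table and lookup-style decoding are simplified assumptions for illustration, not Du Jun's actual system. Describing each character as a short sequence over a small alphabet of structure operators and radicals shrinks the space a recognizer must predict over, compared with tens of thousands of whole-character classes:

```python
# Hypothetical ideographic-description-style decompositions:
# a structure operator followed by component radicals.
RADICAL_SEQ = {
    "好": ["⿰", "女", "子"],   # left-right: 女 + 子
    "安": ["⿱", "宀", "女"],   # top-bottom: 宀 + 女
    "明": ["⿰", "日", "月"],   # left-right: 日 + 月
}

# Invert the table: a predicted radical sequence maps back to a character.
SEQ_TO_CHAR = {tuple(seq): ch for ch, seq in RADICAL_SEQ.items()}

def decode(predicted_seq):
    """Map a radical sequence (e.g. from a seq2seq recognizer) to a character."""
    return SEQ_TO_CHAR.get(tuple(predicted_seq), "?")

print(decode(["⿰", "日", "月"]))  # -> 明

# The prediction alphabet is tiny compared with whole-character classes.
alphabet = {tok for seq in RADICAL_SEQ.values() for tok in seq}
print(len(alphabet), "radical/structure tokens for", len(RADICAL_SEQ), "characters")
```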


Beyond text, recognizing and processing tables is a major difficulty: you must not only recognize the contents of the cells, but also work out the structural relationships between them, and some tables may not even have borders. To this end, Du Jun's team designed a "split, then merge" method: the table image is first split into a lattice of basic grid cells, which are then corrected through merging.


Du Jun's team's "split, then merge" table recognition method.
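
The sketch below illustrates the "split, then merge" idea in its simplest possible form; the merge decisions are hard-coded here, whereas in the real method they would come from a learned model. Start from a full lattice of basic cells, then merge neighbors whose shared border is judged to be absent:

```python
# Toy "split, then merge" table recovery (illustrative only).
# Step 1 (split): assume line detection produced a 3 x 4 lattice of cells.
rows, cols = 3, 4

# Step 2 (merge): these borders between neighboring cells are judged absent,
# i.e. the top row is really one header spanning all four columns.
missing_borders = {((0, 0), (0, 1)), ((0, 1), (0, 2)), ((0, 2), (0, 3))}

def merged_cells():
    """Union-find over basic grid cells: merge neighbors with missing borders."""
    parent = {(r, c): (r, c) for r in range(rows) for c in range(cols)}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in missing_borders:
        parent[find(a)] = find(b)

    groups = {}
    for cell in parent:
        groups.setdefault(find(cell), []).append(cell)
    return list(groups.values())

for group in merged_cells():
    print(sorted(group))
# The top row collapses into one spanning header cell; other cells stay atomic.
```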

Of course, all of this work ultimately feeds into document structuring and understanding at the full-document level. In real-world settings, most documents a model faces are longer than one page (a paper, for example). In this direction, Du Jun's team focuses on classifying document elements across pages and restoring document structure across pages, although these methods still have limitations in complex multi-layout scenarios.
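
To see what cross-page restoration involves, here is a deliberately simple heuristic sketch; it illustrates the problem, not the team's method, and the block fields and rules are assumptions. It decides whether the last block on one page and the first block on the next belong to the same logical paragraph:

```python
# Toy heuristic for cross-page paragraph restoration (illustrative only).
SENTENCE_END = tuple("。！？.!?")

def should_merge(last_block, first_block):
    # Merge if the earlier block ends mid-sentence and the later one
    # does not open a new structure (heading, table, caption, ...).
    ends_mid_sentence = not last_block["text"].rstrip().endswith(SENTENCE_END)
    starts_new_structure = first_block["type"] in {"heading", "table", "caption"}
    return ends_mid_sentence and not starts_new_structure

page1_tail = {"type": "paragraph", "text": "Experiments show that the method"}
page2_head = {"type": "paragraph", "text": "achieves state-of-the-art results."}
print(should_merge(page1_tail, page2_head))  # True: one logical paragraph
```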


Large models, multimodality, world models... where does the future lie?

Once we reach document-level image and text processing and understanding, we are in fact not far from GPT-4. "After the multimodal GPT-4 came out, we have also been thinking about whether we could do something in these areas," Du Jun said at the event. Many researchers and practitioners in this field presumably share that thought.

The goal of the GPT series has always been to improve generality and ultimately reach artificial general intelligence (AGI). The powerful image and text understanding GPT-4 demonstrated this time is an important part of that general capability. For anyone hoping to build a model with similar capabilities, OpenAI has provided some reference points, but it has also left many mysteries unsolved.

First, the success of GPT-4 shows that the route of large multimodal models is feasible. But which problems should large-model research focus on, and how can the staggering compute requirements of multimodal models be met? These are the challenges now facing researchers.

On the first question, Qiu Xipeng, professor at the School of Computer Science at Fudan University, offered some directions worth considering. From information OpenAI has previously disclosed, we know that ChatGPT depends on several key techniques, including in-context learning, chain of thought, and learning from instructions. Qiu Xipeng pointed out in his talk that many open questions remain in these directions: where do these abilities come from, how can they be further improved, and how can they be used to transform existing learning paradigms? He also discussed the capabilities a conversational large language model should be designed around, and research directions for aligning such models with the real world.
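
For readers unfamiliar with these terms, the sketch below shows what the three techniques look like at the prompt level. The prompts are illustrative examples only, not tied to any particular model or API; the chain-of-thought example deliberately reuses the "chickens and rabbits in one cage" puzzle mentioned earlier:

```python
# In-context learning: a few demonstrations in the prompt, no gradient updates.
few_shot = """Translate English to French.
sea otter -> loutre de mer
cheese -> fromage
peppermint ->"""

# Chain of thought: demonstrations spell out intermediate reasoning steps,
# nudging the model to reason step by step on the new problem.
chain_of_thought = """Q: A farm has 3 cages, each holding 2 chickens and 1 rabbit.
How many legs in total?
A: Each cage has 2*2 + 1*4 = 8 legs, so 3 cages have 3*8 = 24 legs.
The answer is 24.

Q: A farm has 5 cages, each holding 1 chicken and 2 rabbits.
How many legs in total?
A:"""

# Learning from instructions: the task is stated directly in natural language.
instruction = "Summarize the following meeting notes in three bullet points:\n..."
```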


On the second question, Ji Rongrong, Nanqiang Distinguished Professor at Xiamen University, contributed an important idea. He believes there is a natural connection between language and vision, and that joint learning across the two is the general trend. But in the face of this wave, the strength of any single university or laboratory is negligible. So, starting from Xiamen University where he works, he is trying to persuade researchers to pool their compute and form a network to build large multimodal models. In fact, at an event some time ago, Academician E Weinan, who focuses on AI for Science, expressed a similar view, hoping that all sectors would "dare to pool resources in directions of original innovation."

However, will the path GPT-4 has taken necessarily lead to artificial general intelligence? Some researchers are skeptical, and Turing Award winner Yann LeCun is one of them. He argues that today's large models have staggering appetites for data and compute yet learn very inefficiently (self-driving cars being one example). He has therefore proposed a theory called the "world model" (an internal model of how the world works), arguing that learning a world model, which can be understood as running a simulation of the real world, may be the key to reaching AGI. At the event, Professor Yang Xiaokang of Shanghai Jiao Tong University shared his team's work in this direction. Specifically, his team focuses on a world model of visual intuition (because visual intuition carries a large amount of information), attempting to model vision, intuition, and the perception of time and space. He also emphasized how important the intersection of mathematics, physics, information and cognitive science, and computer science is for this kind of research.
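
As a loose, toy-level illustration of the world-model idea, unrelated to LeCun's or Yang Xiaokang's actual architectures and with made-up dynamics (A_true, B_true) standing in for the real world: an agent can fit a transition function from experience and then "imagine" rollouts with it instead of acting in the real environment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth dynamics the agent doesn't know: s' = A s + B a (+ noise).
A_true = np.array([[0.9, 0.1], [0.0, 0.95]])
B_true = np.array([[0.0], [0.5]])

S = rng.normal(size=(1000, 2))          # observed states
U = rng.normal(size=(1000, 1))          # actions taken
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(1000, 2))

# Fit the "world model" by least squares on (state, action) -> next state.
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def imagine(s, actions):
    """Roll the learned model forward: simulate without touching the world."""
    trajectory = [s]
    for a in actions:
        s = np.hstack([s, a]) @ W
        trajectory.append(s)
    return trajectory

traj = imagine(np.array([1.0, 0.0]), [np.array([0.1])] * 5)
print(traj[-1])  # predicted state after 5 imagined steps
```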

"Caterpillars extract nutrients from food and then turn into butterflies. People have extracted billions of clues to understand that GPT-4 is the butterfly for humans." The day after GPT-4 was released , Geoffrey Hinton, the father of deep learning, tweeted.


For now, no one can say how big a storm this butterfly will stir up. But one thing is certain: it is not yet a perfect butterfly, and the AGI puzzle is not yet complete. Every researcher and practitioner still has opportunities.

