
Conveniently trained the biggest ViT in history? Google upgrades visual language model PaLI: supports 100+ languages


Much of the recent progress in natural language processing has come from large-scale language models. Each new release pushes parameter counts and training data to new highs, and sweeps the existing benchmark leaderboards.

For example, in April this year Google released the 540-billion-parameter language model PaLM (Pathways Language Model), which surpassed human performance on a series of language and reasoning evaluations and performed especially well in few-shot learning scenarios. PaLM is widely seen as the direction for the next generation of language models.


Similarly, visual language models benefit from brute-force scale: performance improves as the model grows.

Of course, a visual language model built only for multi-task vision is not very general; it should also support input and output in multiple languages.

Recently, Google extended PaLM into PaLI (Pathways Language and Image model), a model with both multilingual and image-understanding capabilities. It supports 100+ languages and performs a variety of applications across vision, language, and multimodal tasks, such as visual question answering, image captioning, object detection, image classification, OCR, and text reasoning.


Paper link: https://arxiv.org/abs/2209.06794

The model is trained on a public image collection that includes automatically crawled annotations in 109 languages, referred to in the paper as the WebLI dataset.

PaLI models pre-trained on WebLI achieve state-of-the-art performance on multiple image and language benchmarks, such as COCO-Captions, TextCaps, VQAv2, OK-VQA, and TextVQA, and also surpass previous models on multilingual visual captioning and visual question answering benchmarks.

Model Architecture

One goal of PaLI is to study whether language and vision models behave the same way as they scale in terms of performance, and in particular to probe the scalability of language-image models.

The architectural design of the model is therefore kept deliberately simple, mainly for the convenience of experiments, and especially for reusability and scalability.


The model consists of a Transformer encoder that processes input text and an autoregressive Transformer decoder that generates output text.

When processing images, the input to the Transformer encoder also includes "visual words", i.e. visual tokens produced by a ViT from the image.
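A minimal sketch of this input construction, assuming hypothetical shapes and helper names (the real implementation lives in T5X/Flaxformer and differs in detail):

```python
import numpy as np

def build_encoder_input(patch_embeddings: np.ndarray,
                        text_token_embeddings: np.ndarray) -> np.ndarray:
    """Concatenate ViT patch embeddings ("visual words") with text token
    embeddings so a single Transformer encoder can attend over both.

    patch_embeddings:      (num_patches, d_model) from the vision backbone
    text_token_embeddings: (num_text_tokens, d_model) from the text embedder
    """
    assert patch_embeddings.shape[1] == text_token_embeddings.shape[1]
    return np.concatenate([patch_embeddings, text_token_embeddings], axis=0)

# Example: 256 image patches and 32 text tokens, both projected to d_model = 512.
encoder_input = build_encoder_input(np.zeros((256, 512)), np.zeros((32, 512)))
print(encoder_input.shape)  # (288, 512)
```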

A key design decision in PaLI is reuse: the researchers seed the model with the weights of previously trained single-modal vision and language models (such as mT5-XXL and large ViTs). This reuse both transfers the capabilities of single-modal training and saves computational cost.
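The sketch below illustrates the idea of seeding a multimodal model from single-modal checkpoints. The checkpoint paths, parameter-tree keys, and the `load_checkpoint` helper are all hypothetical; PaLI's actual checkpoint handling is done through the T5X and BigVision tooling.

```python
def init_from_unimodal_checkpoints(load_checkpoint, multimodal_params):
    """Seed a multimodal parameter tree from single-modal checkpoints.

    `load_checkpoint` is an assumed helper returning a parameter dict;
    the paths and key names below are purely illustrative.
    """
    vit_params = load_checkpoint("gs://.../vit_e_checkpoint")    # vision tower
    mt5_params = load_checkpoint("gs://.../mt5_xxl_checkpoint")  # text encoder-decoder

    multimodal_params["vision_encoder"] = vit_params
    multimodal_params["text_encoder"] = mt5_params["encoder"]
    multimodal_params["text_decoder"] = mt5_params["decoder"]
    return multimodal_params
```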

The visual component of the model uses the largest ViT architecture to date, ViT-e, which has the same structure and training recipe as the 1.8-billion-parameter ViT-G model but is scaled up to 4 billion parameters.

Although scaling laws have been studied in both the vision and language fields, scaling behavior in combined vision-language models has received less attention, and enlarging the visual backbone may yield saturating gains on classification tasks.

The researchers confirmed this as well: ViT-e is only slightly better than ViT-G on ImageNet, yet it brings large improvements on PaLI's visual-language tasks. For example, ViT-e outperforms ViT-G by nearly 3 CIDEr points on COCO captioning. This hints at room for using even larger ViT backbones in visual-language tasks in the future.


The researchers adopted mT5 as the language backbone, using pre-trained mT5-Large (1 billion parameters) and mT5-XXL (13 billion parameters) to initialize PaLI's language encoder-decoder, and then continued mixed training on many language tasks, including pure language understanding tasks. This also helps avoid catastrophic forgetting of mT5's language understanding and generation capabilities.

The result is three PaLI models of different sizes.


Dataset of 109 languages

Scaling studies in deep learning show that the larger the model, the larger the training dataset needs to be.

So, to comprehensively study and unlock the potential of language-image pre-training, the researchers crawled a large amount of image and text data from the Internet and constructed a new dataset, WebLI, which includes 12 billion alt-texts and 10 billion images across 109 languages.


In addition to the web-text annotations, the researchers used the Cloud Vision API to run OCR on the images, yielding 29 billion image-OCR data pairs.
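A hypothetical per-image record for a WebLI-style corpus might look like the following; the field names are illustrative and not the actual schema used by the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class WebLIExample:
    """One image with its crawled alt-texts (keyed by language code) and the
    OCR text extracted from the image itself."""
    image_bytes: bytes
    alt_texts: Dict[str, List[str]] = field(default_factory=dict)  # e.g. {"en": [...], "de": [...]}
    ocr_texts: List[str] = field(default_factory=list)             # text detected inside the image

example = WebLIExample(
    image_bytes=b"...",
    alt_texts={"en": ["a dog playing in the park"], "ja": ["公園で遊ぶ犬"]},
    ocr_texts=["PARK RULES"],
)
```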


Near-deduplication was applied against the training, validation, and test splits of 68 common vision and vision-language datasets to avoid data leakage into downstream evaluation tasks.
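A toy sketch of this leakage filter is shown below. For brevity it uses an exact content hash as the image signature; real near-deduplication would compare perceptual hashes or embedding similarities rather than exact bytes.

```python
import hashlib
from typing import Iterable, List

def image_signature(image_bytes: bytes) -> str:
    # Placeholder signature; swap in a perceptual hash or embedding for true near-dedup.
    return hashlib.sha256(image_bytes).hexdigest()

def filter_eval_overlap(train_images: Iterable[bytes],
                        eval_images: Iterable[bytes]) -> List[bytes]:
    """Drop training images whose signature matches any image from the
    train/validation/test splits of the held-out evaluation datasets."""
    eval_signatures = {image_signature(img) for img in eval_images}
    return [img for img in train_images
            if image_signature(img) not in eval_signatures]
```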


To further improve data quality, the researchers also scored each image-alt-text pair by cross-modal similarity and tuned a threshold so that only the top 10% of images are retained; in total, about 1 billion images are used to train PaLI.
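The sketch below approximates this filtering step. The article describes tuning a similarity threshold so that roughly 10% of images survive; here that is approximated by keeping the top-scoring fraction, and `score_fn` stands in for an unspecified cross-modal similarity model.

```python
def filter_by_crossmodal_similarity(examples, score_fn, keep_fraction=0.10):
    """Keep only the highest-scoring (image, alt-text) pairs, where `score_fn`
    is an assumed cross-modal similarity scorer (e.g. a dual-encoder model)."""
    scored = sorted(examples, key=score_fn, reverse=True)
    return scored[:int(len(scored) * keep_fraction)]

# Toy usage with a stand-in scorer (not a real similarity model).
kept = filter_by_crossmodal_similarity(
    examples=[("img_0", "a red bus"), ("img_1", "lorem ipsum"), ("img_2", "two cats")],
    score_fn=lambda pair: len(pair[1]),
    keep_fraction=0.34,
)
print(kept)
```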

Training large models

Since vision-language tasks are multimodal, the model needs a range of semantic processing capabilities and faces different objectives. For example, some tasks require local object localization to be solved accurately, while others need more global semantic information.

Similarly, some language tasks may require long answers, while others call for compact ones.

To reconcile these competing objectives, the researchers leveraged the richness of the WebLI pre-training data and introduced a pre-training task mixture to prepare the model for a variety of downstream applications.

To make the model versatile enough to solve many tasks, all tasks are cast into a single common API (input: image + text; output: text), which enables knowledge sharing between multiple image and language tasks and is also shared with the pre-training setup.

The pre-training objectives are projected into the same API as a weighted mixture, with the dual goals of preserving the ability to reuse model components and training the model to perform new tasks.
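A minimal sketch of sampling from such a weighted mixture follows. The task names and weights are purely illustrative and are not the actual PaLI mixture proportions; the point is that every task shares the same (image + text prompt → text) interface, so sampling a task only changes how the prompt and target are built.

```python
import random

# Hypothetical pre-training mixture; weights are illustrative, not from the paper.
TASK_MIXTURE = {
    "captioning": 0.35,
    "ocr_transcription": 0.25,
    "visual_question_answering": 0.20,
    "split_captioning_completion": 0.10,
    "text_only_span_corruption": 0.10,   # pure-language task to retain mT5 skills
}

def sample_task(rng: random.Random) -> str:
    tasks, weights = zip(*TASK_MIXTURE.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_task(rng) for _ in range(5)])
```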

The model is implemented with the open-source T5X and Flaxformer frameworks and trained with Flax in JAX; the ViT-e visual component uses the open-source BigVision framework. Token embeddings from the language part and patch embeddings from the visual part are concatenated and fed jointly into the multimodal encoder-decoder, whose encoder is initialized from mT5-XXL pre-training. During PaLI training, the weights of the visual component are frozen and only the weights of the multimodal encoder-decoder are updated.
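One common way to express "frozen vision tower, trainable encoder-decoder" in JAX is to partition the optimizer with optax. The sketch below is a minimal illustration under assumed parameter names and a stand-in loss; it is not PaLI's actual training loop.

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical parameter tree: a frozen vision tower plus a trainable
# multimodal encoder-decoder.
params = {
    "vision_encoder": {"w": jnp.ones((4, 4))},
    "encoder_decoder": {"w": jnp.ones((4, 4))},
}

# Label each leaf, then give the frozen branch a zero update and the
# trainable branch an Adafactor-style update.
labels = {"vision_encoder": {"w": "frozen"}, "encoder_decoder": {"w": "trainable"}}
optimizer = optax.multi_transform(
    {"trainable": optax.adafactor(learning_rate=1e-3),
     "frozen": optax.set_to_zero()},
    labels,
)
opt_state = optimizer.init(params)

def loss_fn(p):
    # Stand-in loss; a real step would run the model on a batch of examples.
    return jnp.sum(p["vision_encoder"]["w"] ** 2) + jnp.sum(p["encoder_decoder"]["w"] ** 2)

grads = jax.grad(loss_fn)(params)
updates, opt_state = optimizer.update(grads, opt_state, params)
params = optax.apply_updates(params, updates)  # vision_encoder weights stay unchanged
```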

In the experiments, the researchers compared PaLI on common visual-language benchmarks, where the PaLI models achieve state-of-the-art results, even surpassing the very large models proposed in earlier work.


For example, the 17-billion-parameter PaLI outperforms the 80-billion-parameter Flamingo model on several VQA and image-captioning tasks.

PaLI also maintains good performance on language-only and vision-only tasks, even though these were not its main training objectives.

The researchers also examined how the image and language components interact as the model is scaled, and where the largest gains come from.

The conclusion is that jointly scaling both components yields the best performance. In particular, scaling the visual component, which requires relatively few additional parameters, is critical; scaling also matters for better performance on multilingual tasks.


Evaluating PaLI on the Crossmodal-3600 benchmark across 35 languages shows that the multilingual captioning task benefits greatly from scaling up the PaLI models.


To avoid creating or reinforcing unfair bias in large language and image models, transparency is needed about the data used and how the models use it, along with testing model fairness and conducting responsible data analysis; the paper therefore provides a Data Card and a Model Card.


