


ICLR 2024 Oral: Noisy correspondence learning in long videos, single-GPU training takes only 1 day
Paper title: Multi-granularity Correspondence Learning from Long-term Noisy Videos
Paper address: https://openreview.net/pdf?id=9Cu8MRmhq2
Project address: https://lin-yijie.github.io/projects/Norton
Code address: https://github.com/XLearning-SCU/2024-ICLR-Norton
Coarse-grained NC (Clip-Caption). Coarse-grained NC falls into two categories, asynchronous and irrelevant, which differ in whether the video clip or caption can be matched to some existing caption or clip. "Asynchronous" refers to temporal misalignment between a video clip and its caption, such as t1 in Figure 2: the order of the narration does not match the order of the actions, because the narrator may describe an action before or after it is actually performed. "Irrelevant" refers to meaningless captions that cannot be aligned with any video clip (such as t2 and t6), or to irrelevant video clips. According to research by the Oxford Visual Geometry Group [5], only about 30% of the clip-caption pairs in the HowTo100M dataset are visually aligned, and only 15% are naturally well aligned.

Fine-grained NC (Frame-Word). For a given video clip, only part of the caption may be relevant to it. In Figure 2, the caption t5 "Sprinkle sugar on it" is strongly related to the visual content v5, but the action "Observe the glaze peeling off" is not. Irrelevant words or video frames may hinder the extraction of key information and thereby degrade the alignment between clips and captions.
For fine-grained NC. The researchers use the log-sum-exp approximation as a soft-maximum operator to identify key words and key frames in frame-to-word and word-to-frame alignment. This extracts the important information through fine-grained interaction and aggregates it into a clip-caption similarity.

For coarse-grained asynchronous NC. The researchers use the optimal transport distance as the distance metric between video clips and captions. Given a clip-caption similarity matrix, whose dimensions are the number of clips and the number of captions, the optimal transport objective is to maximize the overall alignment similarity. This naturally handles complex alignment cases such as asynchronous timing or one-to-many matches (for example, t3 corresponding to both v4 and v5).
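The soft-maximum aggregation for fine-grained NC can be illustrated with a small sketch. This is not the authors' implementation: the function names `soft_max_lse` and `clip_caption_similarity`, the sharpness parameter `alpha`, and the choice to average the frame-to-word and word-to-frame directions are all assumptions made for illustration.

```python
import math


def soft_max_lse(values, alpha=10.0):
    """Smooth approximation of max(values) via log-sum-exp.

    Computes (1 / alpha) * log(sum(exp(alpha * v))). A larger alpha
    (hypothetical parameter) brings the result closer to the hard maximum.
    """
    m = max(values)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(alpha * (v - m)) for v in values)) / alpha


def clip_caption_similarity(frame_word_sim, alpha=10.0):
    """Aggregate a frame-by-word similarity matrix into one score.

    frame_word_sim[i][j] is the similarity between frame i and word j.
    """
    # Frame-to-word: each frame softly selects its best-matching word.
    frame_to_word = [soft_max_lse(row, alpha) for row in frame_word_sim]
    # Word-to-frame: each word softly selects its best-matching frame.
    word_to_frame = [soft_max_lse(col, alpha) for col in zip(*frame_word_sim)]
    # Averaging the two directions is an assumption of this sketch.
    f2w = sum(frame_to_word) / len(frame_to_word)
    w2f = sum(word_to_frame) / len(word_to_frame)
    return 0.5 * (f2w + w2f)


# Toy example: 2 frames x 3 words; the third word matches neither frame.
toy = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1]]
print(clip_caption_similarity(toy, alpha=10.0))
```

Because each frame or word contributes mainly through its best match, a word that matches no frame adds only a small score in the word-to-frame direction and is ignored in the frame-to-word direction.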


For coarse-grained irrelevant NC. Inspired by SuperGlue [6] in feature matching, the researchers design an adaptive alignable prompt bucket that tries to filter out irrelevant clips and captions. The prompt bucket is one extra row and one extra column, all filled with the same value, appended to the similarity matrix; this value acts as the similarity threshold that decides whether a clip or caption can be aligned. The prompt bucket integrates seamlessly into the Sinkhorn solver for optimal transport.
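A minimal sketch of how a bucket row and column can be combined with Sinkhorn iterations is given below. It is an illustration under stated assumptions, not the paper's code. The helper names `add_prompt_bucket`, `sinkhorn` and `assign` are hypothetical. The bucket value is a fixed number here, whereas the article describes it as adaptive. The marginals (mass 1 per clip and per caption, with the bucket absorbing the remainder) follow a SuperGlue-style dustbin convention chosen for this example. The regularisation strength `eps` and the iteration count `n_iters` are arbitrary.

```python
import math


def add_prompt_bucket(sim, bucket_value):
    """Append one extra column and one extra row filled with bucket_value."""
    padded = [list(row) + [bucket_value] for row in sim]
    padded.append([bucket_value] * (len(sim[0]) + 1))
    return padded


def sinkhorn(sim, row_mass, col_mass, eps=0.1, n_iters=200):
    """Entropy-regularised optimal transport that favours high similarity.

    Returns a plan Q with Q[i][j] = u[i] * exp(sim[i][j] / eps) * v[j],
    where u and v are scaled alternately to match the row and column masses.
    """
    K = [[math.exp(s / eps) for s in row] for row in sim]
    n, m = len(K), len(K[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def assign(plan):
    """Best column for each real clip (last row is the bucket row).

    An index equal to the last column means the clip fell into the bucket.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in plan[:-1]]


# Toy example: 3 clips x 2 captions; the third clip matches no caption well.
sim = [[0.9, 0.1],
       [0.1, 0.8],
       [0.1, 0.1]]
padded = add_prompt_bucket(sim, bucket_value=0.5)
n_clips, n_caps = len(sim), len(sim[0])
row_mass = [1.0] * n_clips + [float(n_caps)]   # bucket row absorbs unmatched captions
col_mass = [1.0] * n_caps + [float(n_clips)]   # bucket column absorbs unmatched clips
plan = sinkhorn(padded, row_mass, col_mass, eps=0.05)
print(assign(plan))
```

With these toy numbers, the first two clips should keep their captions, while the third clip, whose similarities all lie below the bucket value of 0.5, should be routed to the bucket column.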










