


Want to convert a document image into Markdown?
In the past, this task required multiple steps: text recognition, layout detection and ordering, formula and table processing, text cleanup, and so on.
Now a single-sentence instruction is enough: the multimodal large model Vary outputs the result directly, end to end.
It handles large blocks of Chinese or English text, document images that contain formulas, and screenshots of mobile pages. It can even convert a table in an image into LaTeX format.
Of course, as a multimodal large model, it also has to retain its general-purpose capabilities.
Vary shows great potential and a very high ceiling. OCR no longer needs a lengthy pipeline: the model outputs results end to end, and the user's prompt can select different output formats such as LaTeX, Word, or Markdown.
With strong language priors, this architecture can also avoid mistakes on words that OCR tends to misread, such as "leverage" and "dupole". For blurry documents, language priors are likewise expected to yield stronger OCR results.
The project drew the attention of many netizens and sparked widespread discussion as soon as it was released. One of them exclaimed after seeing it, "It's so awesome!"
How is this effect achieved?
Inspired by large models
Currently, almost all large multimodal models use CLIP as the vision encoder, or visual vocabulary. Indeed, CLIP, trained on 400M image-text pairs, has strong vision-text alignment capabilities and can cover image encoding for most everyday tasks.
But for dense and fine-grained perception tasks, such as document-level OCR and chart understanding, especially in non-English scenarios, CLIP shows obvious encoding inefficiency and out-of-vocabulary problems.
When a pure NLP large model (such as LLaMA) moves from English to Chinese (a "foreign language" for the model), the original vocabulary encodes Chinese inefficiently, so the text vocabulary must be expanded to achieve good results.
This very property inspired the research team.
Multimodal large models built on the CLIP visual vocabulary now face the same problem: when they encounter a "foreign-language image", such as a page of a paper densely packed with text, they struggle to tokenize the image efficiently.
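To make the text-side analogy concrete, here is a minimal sketch of why an unexpanded vocabulary is inefficient. It is not LLaMA's actual tokenizer: the greedy longest-match rule, the byte-fallback token format, and both toy vocabularies are assumptions chosen purely for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization with a UTF-8 byte fallback (toy scheme)."""
    max_len = max((len(tok) for tok in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Out of vocabulary: fall back to one token per UTF-8 byte.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: an English-centric one and an expanded one.
base_vocab = {"OCR", "doc", "ument", " "}
expanded_vocab = base_vocab | {"文档", "识别"}

text = "文档识别 OCR"
before = tokenize(text, base_vocab)
after = tokenize(text, expanded_vocab)
print(len(before), len(after))  # 14 4
```

With the base vocabulary, each Chinese character falls back to three byte tokens, so the same string costs 14 tokens instead of 4. The article's point is that a CLIP-only visual vocabulary hits an analogous inefficiency on text-dense images.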
Vary is a solution to this problem: it efficiently expands the visual vocabulary without rebuilding the original one.
Unlike existing methods that directly use an off-the-shelf CLIP vocabulary, Vary works in two stages:
In the first stage, a small decoder-only network is used to generate a powerful new visual vocabulary in an autoregressive manner.
In the second stage, the new vocabulary is fused with the CLIP vocabulary to efficiently train the LVLM and give it new features. Trained on documents, charts, and other data, Vary greatly improves its fine-grained visual perception.
While retaining its vanilla multimodal capabilities, it also gains end-to-end understanding of Chinese and English images, formula screenshots, and charts.
In addition, the research team noted that page content that might otherwise require thousands of text tokens is, when fed in as a document image, compressed by Vary into 256 image tokens, which leaves more room for imagination in further page analysis and summarization.
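The fusion step and the token budget described above can be sketched roughly as follows. The article does not say how the two vocabularies are fused, so token-wise concatenation along the feature axis is an assumption here, as are the toy feature widths, the random stand-in encoders, and the 2,000-token page used for the ratio. Only the figure of 256 image tokens comes from the text.

```python
import random

NUM_IMAGE_TOKENS = 256  # image tokens per page, as stated in the article
CLIP_DIM = 8            # toy feature width (hypothetical)
NEW_VOCAB_DIM = 8       # toy feature width (hypothetical)


def toy_encoder(num_tokens, dim, seed):
    """Stand-in for a vision encoder: one feature vector per image token."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(dim)] for _ in range(num_tokens)]


def fuse_vocabularies(clip_tokens, new_tokens):
    """Concatenate the two branches token by token along the feature axis."""
    if len(clip_tokens) != len(new_tokens):
        raise ValueError("both branches must produce the same number of tokens")
    return [clip_vec + new_vec for clip_vec, new_vec in zip(clip_tokens, new_tokens)]


clip_tokens = toy_encoder(NUM_IMAGE_TOKENS, CLIP_DIM, seed=0)
new_tokens = toy_encoder(NUM_IMAGE_TOKENS, NEW_VOCAB_DIM, seed=1)
fused = fuse_vocabularies(clip_tokens, new_tokens)

print(len(fused), len(fused[0]))  # 256 16

# Token budget: a hypothetical page of 2,000 text tokens versus 256 image tokens.
text_tokens_on_page = 2000  # assumed figure, for illustration only
compression_ratio = text_tokens_on_page / NUM_IMAGE_TOKENS
print(f"{compression_ratio:.1f}x fewer tokens")  # 7.8x fewer tokens
```

Concatenating per token keeps the sequence length at 256 regardless of how many branches contribute, which is why the token budget for the language model does not grow when the new vocabulary is added in this sketch.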
Vary's code and model are now open source, and a web demo is also available for anyone interested to try.
This article covered Megvii's open-source multimodal large model Vary, which supports document-level OCR in both Chinese and English.



