
The Latest from Oxford University | Nearly 400 Papers Summarized! A New Survey on Large Language Models and the 3D World

WBOY (Original)
2024-06-02

Preface & the Author's Personal Understanding

With the development of large language models (LLMs), their integration with 3D spatial data (3D LLMs) has progressed rapidly, providing unprecedented capabilities for understanding and interacting with physical space. This article provides a comprehensive overview of how LLMs process, understand, and generate 3D data. We highlight the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and broad world knowledge, and underline their potential to advance spatial understanding and interaction in embodied artificial intelligence (AI) systems. Our survey covers various 3D data representations, from point clouds to Neural Radiance Fields (NeRF). We analyze their integration with LLMs for tasks such as 3D scene understanding, captioning, question answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also briefly reviews other related methods that combine 3D and language, revealing significant progress while emphasizing the need to exploit the full potential of 3D LLMs. Through this survey, we aim to chart a path for future research to explore and extend the capabilities of 3D LLMs in understanding and interacting with the complex 3D world.

Open source link: https://github.com/ActiveVisionLab/Awesome-LLM-3D


Related Background

This section provides basic background on 3D representations, large language models (LLMs), 2D vision-language models (VLMs), and vision foundation models (VFMs).

3D Representation

The choice of 3D representation used to describe, model, and understand our world is a crucial topic for following the current progress of 3D LLMs, and it is a fundamental research area in computer vision. The field has grown enormously in recent years thanks to advances in deep learning, computing resources, and the availability of 3D data. We briefly introduce the most common 3D representations currently in use.

Point cloud: Use a set of data points in space to represent a three-dimensional shape, and store the position of each point in a three-dimensional Cartesian coordinate system. In addition to storing the location, other information about each point can be stored (e.g. color, normal). Point cloud-based methods are known for their low storage footprint but lack surface topology information. Typical sources for obtaining point clouds include lidar sensors, structured light scanners, time-of-flight cameras, stereo views, photogrammetry, etc.
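To make this concrete, a point cloud is simply an N×3 array of coordinates, optionally extended with per-point attributes such as color or normals. Below is a minimal NumPy sketch; the array shapes, attribute choices, and channel layout are illustrative assumptions rather than a fixed standard.

```python
import numpy as np

# N points, each with an (x, y, z) position in a Cartesian frame.
num_points = 1_000
xyz = np.random.rand(num_points, 3).astype(np.float32)

# Optional per-point attributes: RGB color in [0, 1] and unit normals.
rgb = np.random.rand(num_points, 3).astype(np.float32)
normals = np.random.randn(num_points, 3).astype(np.float32)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# The point cloud is just the concatenation of these per-point channels.
point_cloud = np.concatenate([xyz, rgb, normals], axis=1)  # shape (N, 9)
print(point_cloud.shape)  # (1000, 9)
```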

Voxel Grid: It is composed of unit cubes in three-dimensional space, similar to the pixel representation in two-dimensional space. Each voxel minimally encodes occupancy information (binary or probabilistically), but can additionally encode the distance to the surface, as in a signed distance function (SDF) or a truncated signed distance function (TSDF). However, when high-resolution detail is required, the memory footprint can become excessive.
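To illustrate the occupancy/TSDF distinction described above, the sketch below builds a small voxel grid around a sphere and stores both a binary occupancy volume and a truncated signed distance. The grid resolution, extent, and truncation distance are arbitrary illustrative choices.

```python
import numpy as np

res = 64                      # voxels per axis; memory grows as O(res^3)
extent = 1.0                  # grid spans [-1, 1]^3
trunc = 0.1                   # TSDF truncation distance

# Voxel center coordinates.
coords = np.linspace(-extent, extent, res)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")

# Signed distance to a sphere of radius 0.5 centered at the origin
# (negative inside the surface, positive outside).
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

occupancy = (sdf <= 0.0)                     # binary occupancy grid
tsdf = np.clip(sdf, -trunc, trunc) / trunc   # TSDF normalized to [-1, 1]

print(occupancy.shape, int(occupancy.sum()), tsdf.min(), tsdf.max())
```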

Polygon mesh: Composed of vertices and faces, polygon meshes can compactly describe complex three-dimensional shapes. However, their unstructured and non-differentiable nature makes it challenging to integrate them with neural networks in end-to-end differentiable pipelines. Some solutions, such as gradient-approximation-based methods, rely on handcrafted gradient calculations; others, such as differentiable rasterizers, may produce inaccurate renderings such as blurred content.

In recent years, the 3D research community has shown increasing interest in neural fields, which differ from the traditional representations above in that they do not rely on geometric primitives. Neural fields are mappings from spatial coordinates to scene properties (such as occupancy, color, or radiance), but unlike voxel grids the mapping is a learned function, typically a multi-layer perceptron. In this way, neural fields implicitly learn compact, continuous, and differentiable representations of 3D shapes and scenes.
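The "learned mapping from coordinates to scene properties" can be as small as a coordinate MLP. The PyTorch sketch below (layer widths chosen arbitrarily) maps a 3D point to an occupancy probability; occupancy networks and DeepSDF-style models are essentially this idea plus conditioning on shape features, which is omitted here for brevity.

```python
import torch
import torch.nn as nn

class OccupancyField(nn.Module):
    """Maps a 3D coordinate to the probability that it lies inside the shape."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (B, 3) query points -> (B, 1) occupancy probability.
        return torch.sigmoid(self.net(xyz))

field = OccupancyField()
queries = torch.rand(4096, 3) * 2 - 1   # query points in [-1, 1]^3
occ = field(queries)                    # continuous and differentiable in xyz
print(occ.shape)                        # torch.Size([4096, 1])
```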

One family of neural field methods focuses on implicit surface representations. Occupancy networks encode shapes as a continuous 3D occupancy function represented by a neural network, which estimates occupancy probabilities from 3D point locations and features extracted from point clouds, low-resolution voxels, or images. DeepSDF-style networks instead use a neural network to regress the signed distance value at a given 3D coordinate. Recent methods, such as NeuS and NeuS2, further improve surface reconstruction fidelity and efficiency for both static and dynamic objects.

Another group of methods, Neural Radiance Fields (NeRF), has shown powerful photorealistic rendering of the 3D world. These methods use positional encoding to capture scene detail and an MLP to predict the radiance values (color and opacity) at sample points along camera rays. However, because the MLP must infer color and occupancy for every sample point in space, including points in empty space, rendering requires significant computational resources. There is therefore a strong incentive to reduce the computational overhead of NeRF for real-time applications.
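The positional encoding mentioned above maps each coordinate to a bank of sinusoids so that the MLP can represent high-frequency detail. Below is a minimal sketch of the frequency encoding used in NeRF-style models; the number of frequency bands is a typical but arbitrary choice.

```python
import math
import torch

def positional_encoding(x: torch.Tensor, num_bands: int = 10) -> torch.Tensor:
    """Map coordinates of shape (B, D) to (B, D * 2 * num_bands) sin/cos features."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=x.dtype)   # 1, 2, 4, ..., 2^(L-1)
    angles = x[..., None] * freqs * math.pi                  # (B, D, num_bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                         # (B, D * 2 * num_bands)

pts = torch.rand(8, 3)                  # sample points along camera rays
features = positional_encoding(pts)     # fed to the MLP that predicts color/density
print(features.shape)                   # torch.Size([8, 60])
```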

Hybrid representation attempts to combine NeRF technology with traditional volume-based methods to promote high-quality real-time rendering. For example, combining voxel grids or multi-resolution hash grids with neural networks significantly reduces NeRF training and inference times.

3D Gaussian splatting is a variation on point clouds in which each point carries additional information representing the radiance emitted in the region of space around it as an anisotropic 3D Gaussian "splat". These 3D Gaussians are typically initialized from SfM point clouds and optimized using differentiable rendering. 3D Gaussian splatting achieves state-of-the-art novel-view synthesis at a fraction of NeRF's computational cost by relying on efficient rasterization instead of ray tracing.

LLM

Traditional natural language processing (NLP) encompasses a wide range of tasks that enable systems to understand, generate, and manipulate text. Early NLP approaches relied on rule-based systems, statistical models, and early neural architectures such as recurrent neural networks. Recently introduced large language models (LLMs), which adopt the Transformer architecture and are trained on massive text corpora, have achieved unprecedented performance and triggered a new wave of interest in the field. Since the focus of this article is 3D LLMs, we provide the relevant LLM background here; for an in-depth treatment of LLMs, we refer readers to recent surveys in that area.

LLM Structure

In the context of LLM, "encoder-decoder" and "decoder-only" architectures are mainly used for NLP tasks.

  • Encoder-decoder architectures;
  • Decoder-only architectures;
  • Tokenization: Tokenization is a preprocessing step that breaks input text into a sequence of tokens, the basic data units of language models. The vocabulary of tokens is finite, and each token can correspond to a word, a subword, or a single character. During inference, input text is converted into a token sequence and fed to the model, which predicts output tokens that are then converted back into text. Tokenization strongly affects the performance of language models because it shapes how the model perceives text. Various techniques are used, such as word-level tokenization, subword tokenization (e.g., byte pair encoding, WordPiece, SentencePiece), and character-level tokenization; a minimal word-level sketch follows below.
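The sketch below shows the encode/decode round trip with a toy word-level vocabulary. Real LLM tokenizers (BPE, WordPiece, SentencePiece) learn subword vocabularies from data, which this toy example does not do; the corpus and token ids are purely illustrative.

```python
# Toy word-level tokenizer: real LLMs use learned subword vocabularies,
# but the encode/decode round trip follows the same idea.
corpus = "the cat sat on the mat"
vocab = {word: idx for idx, word in enumerate(sorted(set(corpus.split())))}
inv_vocab = {idx: word for word, idx in vocab.items()}
UNK = len(vocab)  # id reserved for out-of-vocabulary words

def encode(text: str) -> list[int]:
    return [vocab.get(w, UNK) for w in text.split()]

def decode(token_ids: list[int]) -> str:
    return " ".join(inv_vocab.get(i, "<unk>") for i in token_ids)

ids = encode("the cat sat")
print(ids)           # [4, 0, 3] with this toy vocabulary
print(decode(ids))   # "the cat sat"
```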

LLM Emergent Abilities

One major difference between LLMs and traditional non-LLM methods is the presence of emergent abilities: capabilities that appear in large models but not in smaller ones. The term "emergent abilities" refers to new, complex capabilities that arise as LLMs grow in size and complexity. These abilities allow them to deeply understand and generate natural language, solve problems in various domains without task-specific training, and adapt to new tasks through in-context learning. Below we introduce several common emergent abilities within the scope of LLMs.

In-context learning refers to the ability of LLMs to understand and respond to new tasks or queries based solely on the context provided in the prompt, without explicit retraining or fine-tuning. The landmark papers (GPT-2/GPT-3) demonstrated in-context learning in a few-shot setting, where the model is given several task examples in the prompt and then asked to process a new example without prior explicit training. State-of-the-art LLMs such as GPT-4 exhibit extraordinary in-context learning, understanding complex instructions and performing a wide range of tasks, from simple translation to code generation and creative writing, all based on the context provided in the prompt.

Reasoning in LLMs, often elicited via "chain-of-thought" prompting, involves the model generating intermediate steps or reasoning paths when handling complex problems or questions. This approach allows the LLM to break a task into smaller, manageable parts, promoting a more structured and interpretable solution process. To achieve this, training involves datasets containing various problem-solving tasks, logic puzzles, and data designed to simulate reasoning under uncertainty. Current state-of-the-art LLMs typically exhibit advanced reasoning capabilities once model sizes exceed roughly 60B to 100B parameters.

Instruction following refers to the model's ability to understand and execute commands specified by the user. This includes parsing the instruction, understanding its intent, and generating an appropriate response or action. Adapting this ability to new tasks typically requires instruction tuning on a dataset containing a variety of instructions paired with correct responses or actions. Techniques such as supervised learning, reinforcement learning from human feedback, and interactive learning can further improve performance.

LLM Fine-tuning

In the context of 3D LLMs, LLMs are either used directly in their pre-trained state or fine-tuned to suit new multi-modal tasks. However, fine-tuning all of an LLM's parameters poses significant computational and memory challenges due to the large number of parameters involved. Parameter-efficient fine-tuning (PEFT) has therefore become increasingly popular for adapting LLMs to specific tasks by updating only a relatively small subset of model parameters rather than retraining the entire model. The following paragraphs describe four common PEFT methods used with LLMs.

Low-Rank Adaptation (LoRA) and its variants update parameters via low-rank matrices. Mathematically, the forward pass of LoRA during fine-tuning can be expressed as h = W0x + BAx, where W0 is the frozen LLM weight and BA is a low-rank update parameterized by the newly introduced matrices A and B learned during fine-tuning. This approach has several clear benefits. Only A and B are optimized, significantly reducing the computational overhead of gradient computation and parameter updates. Once fine-tuning is complete and the weights are merged, there is no additional inference cost compared to the original model, since h = (W0 + BA)x. Furthermore, there is no need to keep multiple copies of the LLM for different tasks; only the lightweight LoRA matrices need to be stored per task, reducing the storage footprint.
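Below is a minimal PyTorch sketch of the LoRA forward pass described above. The rank, scaling factor, and initialization are illustrative hyperparameters; production implementations (e.g., the peft library) add dropout, per-module configuration, and other details not shown here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x, with W0 frozen and only A, B trained."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)              # W0 stays frozen
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))    # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    @torch.no_grad()
    def merge(self) -> None:
        # After fine-tuning, fold BA into W0 so inference cost matches the original layer.
        self.base.weight += self.scale * (self.B @ self.A)

layer = LoRALinear(1024, 1024)
out = layer(torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 1024])
```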

Layer Freeze: Freeze selected layers of the pre-trained model while updating other layers during training. This typically applies to layers closer to the model input or output, depending on the nature of the task and the model architecture. For example, in the 3D-LLM method, all layers except input and output embeddings can be frozen to mitigate the risk of overfitting on task-specific datasets, retain pre-trained general knowledge and reduce the parameters that need to be optimized.

Prompt tuning guides an LLM to perform specific tasks by framing the task in the prompt, adjusting the model's inputs rather than its parameters as in traditional fine-tuning. Manual prompt engineering is the most intuitive approach, but even experienced prompt engineers can find it difficult to discover the best prompt. Another family of approaches automates prompt generation and optimization: one popular method searches for the best discrete input prompt text (hard prompts), while alternatively, optimization methods can be applied directly to continuous prompt embeddings (soft prompts).
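The sketch below illustrates soft prompt tuning: a handful of trainable prompt embeddings are prepended to the input token embeddings while the LLM itself stays frozen. The embedding dimension, prompt length, and dummy inputs are illustrative assumptions standing in for a real decoder-only LLM.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable virtual tokens prepended to the (frozen) LLM's input embeddings."""
    def __init__(self, num_virtual_tokens: int = 20, embed_dim: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (B, T, D) embeddings of the user's actual input tokens.
        batch = token_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)     # (B, P + T, D)

soft_prompt = SoftPrompt()
dummy_inputs = torch.randn(2, 16, 4096)    # stand-in for embedded input text
extended = soft_prompt(dummy_inputs)
print(extended.shape)                       # torch.Size([2, 36, 4096])
# During tuning, only soft_prompt.prompt receives gradients; the LLM weights stay frozen.
```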

Adapter fine-tuning customizes the model architecture for specific tasks by adding or removing layers or modules. This can include integrating new data modalities, such as combining visual information with textual data. The core idea is to insert small neural network modules (adapters) between the layers of a pre-trained model; during fine-tuning, only the parameters of these adapter modules are updated while the original model weights remain unchanged.

2D Vision-Language Models

Vision-language models (VLMs) are a family of models designed to capture and exploit the relationship between text and images or videos, enabling interactive tasks across the two modalities. Most VLMs have Transformer-based architectures: through attention modules, visual and textual content condition on each other to achieve mutual interaction. In the following paragraphs, we briefly introduce applications of VLMs to discriminative and generative tasks.

Discriminative tasks involve predicting certain characteristics of the data. VLMs such as CLIP and ALIGN have shown extraordinary zero-shot transferability to unseen data in image classification. Both models consist of two modules: a visual encoder and a text encoder. Given an image and its category, CLIP and ALIGN are trained by maximizing the similarity between the image embedding and the text embedding of the sentence "a photo of a {image category}". Zero-shot transfer is achieved at inference time by replacing "{image category}" with candidate categories and searching for the sentence that best matches the image. These two works inspired numerous follow-ups that further improved image classification accuracy, and the learned knowledge can also be transferred to other tasks, including object detection, image segmentation, document understanding, and video recognition.
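The zero-shot recipe above reduces to comparing an image embedding with the text embeddings of templated candidate labels. The sketch below mocks the two encoders with random outputs purely to show the scoring logic; in practice they would be pre-trained CLIP/ALIGN image and text encoders, which are not loaded here.

```python
import torch
import torch.nn.functional as F

# Stand-ins for pre-trained CLIP/ALIGN encoders (assumed, not the real models):
# each maps its input into a shared 512-d embedding space.
def encode_image(image: torch.Tensor) -> torch.Tensor:
    return torch.randn(image.shape[0], 512)

def encode_text(prompts: list[str]) -> torch.Tensor:
    return torch.randn(len(prompts), 512)

categories = ["cat", "dog", "car"]
prompts = [f"a photo of a {c}" for c in categories]   # template from the text above

image = torch.randn(1, 3, 224, 224)                   # dummy input image
img_emb = F.normalize(encode_image(image), dim=-1)
txt_emb = F.normalize(encode_text(prompts), dim=-1)

# Zero-shot prediction = category whose prompt embedding is most similar to the image.
similarity = img_emb @ txt_emb.T                      # cosine similarities, shape (1, 3)
pred = categories[similarity.argmax(dim=-1).item()]
print(similarity, pred)
```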

Generative tasks use VLMs to generate text or images from input data. By leveraging large-scale training data, a single VLM can often perform multiple image-to-text generation tasks, such as image captioning and visual question answering (VQA). Notable examples include SimVLM, BLIP, and OFA, among others. More powerful VLMs, such as BLIP-2, Flamingo, and LLaVA, can handle multi-turn dialogue and reasoning grounded in input images. With the introduction of diffusion models, text-to-image generation has also become a focus of the research community: trained on large numbers of image-text pairs, diffusion models can generate high-quality images from text input. This capability has been extended to generating videos, 3D scenes, and dynamic 3D objects. Beyond generation, existing images can also be edited via text prompts.

Vision Foundation Models

Vision foundation models (VFMs) are large neural networks designed to extract image representations diverse and expressive enough to be deployed directly on a variety of downstream tasks, mirroring the role that pre-trained LLMs play in downstream NLP tasks. One notable example is DINO, which uses a self-supervised teacher-student training scheme. The learned representations achieve good results in both image classification and semantic image matching, and DINO's attention weights can serve as segmentation masks for the semantic components of the observed scene. Follow-up works such as iBOT and DINOv2 further improve the representations by introducing a masked image modeling (MIM) loss. SAM is a Transformer-based image segmentation model trained on a dataset of 1.1 billion segmentation masks and exhibits strong zero-shot transfer. DINO (Zhang et al.), not to be confused with DINO (Caron et al.), adopts a DETR-like architecture and mixed query selection for object detection; the follow-up work Grounding DINO introduces text supervision to improve accuracy. Stable Diffusion is a text-to-image generator that has also been used as a feature extractor for "real" images by running a single diffusion step on a clean or artificially noised image and extracting intermediate features or attention masks. These features have recently been exploited for segmentation and image matching, thanks to the size and diversity of the training sets used for diffusion models and to observed emergent properties of diffusion features, such as zero-shot correspondences between images.

Tasks

3D Captioning (3D → Text)
  • Object-Level Captioning
  • Scene-Level Captioning
  • 3D Dense Captioning

3D Grounding (3D + Text → 3D Position)
  • Single-Object Grounding
  • Multi-Object Grounding

3D Conversation (3D + Text → Text)
  • 3D Question Answering (3D-QA)
  • 3D Situated Question Answering (3D-SQA)
  • 3D Dialogue

3D Embodied Agents (3D + Text → Action)
  • 3D Task Planning
  • 3D Navigation
  • 3D Manipulation

Text-to-3D Generation (Text → 3D)
  • 3D Object Generation
  • 3D Scene Generation
  • 3D Editing
3D TASKS WITH LLMS

3D scene understanding tasks have been widely studied. At the core of scene understanding is identifying and classifying all objects in a given three-dimensional environment, a process called semantic or instance-level understanding. This stage is crucial because it forms the basis for more nuanced interpretations. Higher-level scene understanding then focuses on spatial understanding, i.e., constructing spatial scene graphs and modeling the semantics of object relationships. Going a step further, one can predict potential interactions such as affordances and scene changes, and understand the broader context of a scene, such as its functionality and aesthetic style. 3D data also presents unique challenges that do not exist in 2D, such as the relatively high cost of obtaining and labeling 3D data, sparse 3D data structures that are not uniformly dense or aligned to a grid, and the need to reconcile multiple (possibly occluded) viewpoints. To address this, researchers have harnessed the power of language to embed semantics and relationships in the 3D world. Recent efforts to integrate large language models (LLMs) with 3D data show that leveraging the inherent strengths of LLMs, namely zero-shot learning, in-context learning, step-by-step reasoning, and extensive world knowledge, holds promise for achieving multi-level understanding and interaction.


How do LLMs process 3D scene information?

Traditional LLMs are limited to text as input and output, so the ability to ingest 3D information is the central concern of all 3D-LLM methods. The general idea is to map 3D object or scene information into the language space so that the LLM can understand and process these 3D inputs. Specifically, this usually involves two steps: (i) using a pre-trained 3D encoder to process the chosen 3D representation and produce raw 3D features; and (ii) employing an alignment module to convert these 3D features into 3D tokens that the LLM can process, analogous to the tokenization described earlier. The pre-trained LLM can then consume these aligned 3D tokens when generating output.
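A minimal sketch of this two-step recipe is shown below: a (frozen, pre-trained) 3D encoder produces per-patch features, and a small alignment module projects them into the LLM's token-embedding space so they can be interleaved with text tokens. The encoder here is a random placeholder; actual systems use, e.g., Point-BERT or PointNet++, and often a Q-Former-style alignment module rather than the single linear layer assumed here.

```python
import torch
import torch.nn as nn

class Toy3DEncoder(nn.Module):
    """Placeholder for a pre-trained 3D encoder (e.g., Point-BERT / PointNet++)."""
    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3) -> K per-scene "3D patch" features (random stand-in).
        return torch.randn(points.shape[0], 32, 768)      # (B, K, feature_dim)

class AlignmentModule(nn.Module):
    """Projects 3D features into the LLM token-embedding space."""
    def __init__(self, feat_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)                           # (B, K, llm_dim) "3D tokens"

encoder, aligner = Toy3DEncoder(), AlignmentModule()
point_cloud = torch.rand(1, 8192, 3)
tokens_3d = aligner(encoder(point_cloud))    # ready to concatenate with text embeddings
text_tokens = torch.randn(1, 24, 4096)       # embedded instruction text (stand-in)
llm_input = torch.cat([tokens_3d, text_tokens], dim=1)
print(llm_input.shape)                       # torch.Size([1, 56, 4096])
```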

As mentioned earlier, given the diversity of 3D representations, there are multiple ways to obtain 3D features. As shown in the "3D Geometry" column of Table 1, point clouds are the most common due to their simplicity and compatibility with various pre-trained 3D encoders, making them a popular choice for multi-task and multi-modal learning methods. Multi-view images are also frequently used because 2D feature extraction is mature, so 3D feature extraction only requires an additional 2D-to-3D lifting scheme. RGB-D data, easily obtained with depth cameras, is often used in 3D embodied-agent systems to extract viewpoint-related information for navigation and understanding. 3D scene graphs are a more abstract representation that excels at modeling the presence of objects and their relationships and at capturing high-level scene information; they are frequently used for 3D scene classification and planning tasks. NeRFs are currently less used in 3D-LLM methods; we believe this is due to their implicit nature, which makes them harder to tokenize and to integrate with feed-forward neural networks.

LLMs for Enhancing 3D Task Performance

LLMs trained on large amounts of data have been shown to acquire commonsense knowledge about the world. The world knowledge and reasoning capabilities of LLMs have been explored to enhance the understanding of 3D scenes and to reformulate the pipelines of several 3D tasks. In this section, we focus on methods that aim to use LLMs to improve the performance of existing approaches on 3D vision-language tasks. When applying LLMs to 3D tasks, their use falls into two groups: knowledge-enhanced and reasoning-enhanced methods. Knowledge-enhanced methods exploit the vast world knowledge embedded in LLMs to provide contextual insight, fill knowledge gaps, or enhance semantic understanding of the 3D environment. Reasoning-enhanced methods, in contrast, rely less on world knowledge and instead leverage the LLM's ability to reason step by step, providing better generalization to more complex 3D challenges. The following two paragraphs describe each of these groups.

  • Knowledge-enhanced approaches: There are several ways to leverage the LLM's world knowledge. Chen et al. use an LLM for 3D room classification from RGB-D images, where the knowledge embedded in the LLM determines the room category from the categories of objects the room contains. The approach first builds a scene graph from Matterport3D data containing nodes for regions and objects, with object nodes linked to room nodes. Key objects are then selected to form a query for each room type. The LLM scores descriptions built from the selected objects, and the highest-scoring room type is predicted as the room label. Spatial information such as object size or location can also be provided.
  • Reasoning-enhanced approaches: Beyond world knowledge, the reasoning capabilities of LLMs also help with other 3D tasks, especially visual grounding in complex 3D scenes with detailed geometry and many objects. In this setting, textual descriptions of objects should include both their appearance and their spatial relationships to surrounding items. Ordinary grounding methods often struggle here because they cannot interpret such detailed textual descriptions. LLM-Grounder, Transcribe3D, and Zero-shot 3DVG address this by using the LLM's reasoning ability to analyze the textual description and generate a sequence of instructions for locating the object with existing grounding toolboxes.

LLMs for 3D Multi-Task Learning

Many works focus on using the instruction-following and in-context learning capabilities of LLMs to unify multiple 3D tasks in a single language space. By using different text prompts to denote different tasks, these studies aim to turn the LLM into a unified conversational interface. Implementing multi-task learning with an LLM usually involves several key steps, starting with the construction of 3D-text data pairs, which requires crafting task instructions in text form and defining the output format for each task. Next, the 3D data (usually point clouds) are fed to a 3D encoder to extract 3D features. An alignment module is then used to (i) align the 3D features with the LLM's text embeddings at multiple levels (object level, relationship level, and scene level) and (ii) translate the 3D features into tokens the LLM can interpret. Finally, an appropriate training strategy must be chosen, such as single-stage or multi-stage 3D-language alignment training followed by multi-task instruction fine-tuning.


In the remainder of this section, we will explore these aspects in detail. We also summarize the scope and capabilities of each method reviewed in this section in Table 2.

  • Data for Multi-Task Learning: As shown in Table 2, we group tasks into four categories: captioning, grounding, question answering (QA), and embodied-agent tasks (i.e., planning, navigation, and manipulation), and the text output of each task follows a predefined format. For captioning and QA tasks, the output is plain text and is not restricted to a specific format. The output of a grounding task is a 3D bounding box, usually the center coordinates of the referred object and its 3D size; typically these values are normalized to the range 0-255, which limits the range of tokens the LLM needs to predict (see the sketch after this list). For planning, the model outputs a sequence of steps for performing a task in text form, whereas for navigation the output is a sequence of spatial coordinates, and for manipulation the output is a textual sequence of actions. Existing methods follow these conventions to build their multi-task instruction fine-tuning datasets.
  • Training an LLM for multiple 3D tasks: The first step in training an LLM for multiple 3D tasks is obtaining meaningful 3D features, where the extraction method varies with the type of 3D scene. For single-object point clouds, PointLLM, Chat-3D, and GPT4Point use Point-BERT to extract 3D object features. For indoor scenes, LEO uses PointNet++ for feature extraction, while Chat-3D v2 and 3DMIT segment the scene and use Uni3D to extract features for each segment. MultiPLY integrates the extracted object features into a scene graph to represent the entire scene. 3D-LLM and Scene-LLM lift features from 2D multi-view images into 3D representations: 3D-LLM extracts 2D semantic features with Mask2Former or SAM, while Scene-LLM follows ConceptFusion to fuse global information and local details, mapping pixel-wise CLIP features into point-wise 3D features. For outdoor 3D scenes, LiDAR-LLM uses VoxelNet to extract 3D voxel features.
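As referenced in the list above, grounding outputs are typically bounding boxes whose coordinates are discretized into a small integer range so they can be emitted as ordinary text tokens. Below is a minimal sketch of that normalization; the 0-255 range follows the convention described above, while the scene bounds and box values are illustrative.

```python
import numpy as np

def box_to_tokens(center, size, scene_min, scene_max, bins: int = 256):
    """Normalize a 3D box (center + size) into integer tokens in [0, bins - 1]."""
    scene_min, scene_max = np.asarray(scene_min), np.asarray(scene_max)
    extent = scene_max - scene_min
    norm_center = (np.asarray(center) - scene_min) / extent   # -> [0, 1]
    norm_size = np.asarray(size) / extent                      # -> [0, 1]
    values = np.concatenate([norm_center, norm_size])
    return np.clip(np.round(values * (bins - 1)), 0, bins - 1).astype(int)

tokens = box_to_tokens(center=[1.2, 0.4, 0.9], size=[0.6, 0.8, 1.0],
                       scene_min=[0, 0, 0], scene_max=[5, 5, 3])
print(tokens)   # six integers in [0, 255], ready to be emitted as text tokens
```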

LLMs as 3D Multi-Modal Interfaces

In addition to exploring 3D multi-task learners, some recent studies combine information from different modalities to further improve model capabilities and enable new forms of interaction. Besides text and 3D scenes, multi-modal 3D LLMs can also take 2D images, audio, or touch information about the scene as input.

Most works aim to build a common representation space across modalities. Since some existing works already provide pre-trained encoders that map text, images, or audio to a common space, several works choose to learn a 3D encoder that aligns 3D embeddings with the embedding spaces of these pre-trained encoders for other modalities. JM3D-LLM learns a 3D point cloud encoder that aligns the point cloud embedding space with the text-image embedding space of SLIP; it renders image sequences of point clouds and builds hierarchical text trees during training to achieve fine-grained alignment. Point-Bind similarly learns a 3D encoder aligned with ImageBind to unify the embedding spaces of images, text, audio, and point clouds, enabling different task heads to handle tasks such as retrieval, classification, and generation across modalities. A notable limitation, however, is that this approach only suits small, object-level scenes, since it is computationally expensive for 3D encoders to process large scenes with millions of points. Furthermore, most pre-trained multi-modal encoders such as CLIP are designed for single-object images and are not suited to large-scale scenes with multiple objects and local detail.

Large scenes, in contrast, require more careful designs to incorporate multiple modalities. ConceptFusion builds an enhanced feature map that fuses the global information and local details of each constituent image of a large scene, using pre-trained feature extractors that are already aligned with different modalities, including text and audio; it then maps the feature map onto the scene's point cloud using traditional SLAM methods. MultiPLY uses a representation similar to ConceptGraphs: it identifies all salient objects in the scene, obtains a global embedding for each, and builds a scene graph. The resulting scene embedding is aligned with Llama's embedding space, and embeddings of other modalities, including audio, temperature, and haptics, are mapped into the same space using linear projections. All embeddings are tokenized and fed to the LLM. Compared with the object-level methods above, these scene-level methods reduce cost by relying on pre-trained encoders to bridge modality gaps rather than learning new encoders from scratch.

LLMs for Embodied Agents

The planning, tool-use, and decision-making capabilities of LLMs can be used to create 3D embodied agents. These capabilities enable LLMs to make intelligent decisions, including navigating 3D environments, interacting with objects, and selecting appropriate tools to perform specific tasks. This section describes how 3D embodied agents perform planning, navigation, and manipulation tasks.

  • 3D Task Planning: For embodied agents, "task planning" refers to generating the steps needed to perform a specific task, given a task description and a 3D environment. Task planning is often a prerequisite for navigation and manipulation, since planning accuracy directly affects the performance of downstream tasks. LEO and LLM-Planner use LLMs to generate step-by-step plans and adjust them dynamically based on environmental perception. LEO emphasizes scene-aware planning grounded in the current scene configuration, while LLM-Planner uses GPT-3 to divide planning into high-level sub-goals and low-level actions, re-planning when the agent gets stuck during task execution. 3D-VLA combines 3D perception, reasoning, and action through a generative world model, focusing on enhancing planning by predicting future state representations such as goal images and point clouds.
  • 3D Navigation: 3D navigation refers to an embodied agent's ability to move and position itself in a 3D environment, typically based on visual input and verbal instructions. Each of the methods described, LEO, Agent3D-Zero, LLM-Planner, and NaviLLM, implements 3D navigation in a different way. LEO, for instance, processes egocentric 2D images and object-centric 3D point clouds together with textual instructions.
  • 3D Object Manipulation: In the context of 3D embodied agents, manipulation refers to their ability to physically interact with objects, from moving an object to complex sequences such as assembling parts or opening doors. The core idea for enabling LLMs to perform manipulation tasks is to tokenize action sequences: action tokens must first be defined so that the LLM can generate actions conditioned on the task and 3D scene context (a minimal sketch of this idea follows below). Platforms such as CLIPort, or the motion-planning module of a robotic arm, then translate these tokenized actions into physical actions executed by the agent.
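The sketch below illustrates the action-tokenization idea: a continuous end-effector command is discretized into a fixed number of bins per dimension, and each bin is given a reserved token string that the LLM can emit alongside ordinary text. The bin count, action dimensions, limits, and token naming are illustrative assumptions, not the scheme of any specific system.

```python
import numpy as np

BINS = 256
ACTION_LOW = np.array([-1.0, -1.0, -1.0, -3.14])   # e.g. (dx, dy, dz, d_yaw) limits
ACTION_HIGH = np.array([1.0, 1.0, 1.0, 3.14])

def action_to_tokens(action: np.ndarray) -> list[str]:
    """Map a continuous action to reserved text tokens like '<act_137>'."""
    norm = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)       # -> [0, 1]
    ids = np.clip(np.round(norm * (BINS - 1)), 0, BINS - 1).astype(int)
    return [f"<act_{i}>" for i in ids]

def tokens_to_action(tokens: list[str]) -> np.ndarray:
    """Invert the mapping so a downstream controller can execute the action."""
    ids = np.array([int(t.strip("<act_>")) for t in tokens])        # toy parse
    return ACTION_LOW + ids / (BINS - 1) * (ACTION_HIGH - ACTION_LOW)

tokens = action_to_tokens(np.array([0.1, -0.2, 0.05, 0.5]))
print(tokens)                      # e.g. ['<act_140>', '<act_102>', ...]
print(tokens_to_action(tokens))    # approximately recovers the original action
```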

LLMs for 3D Generation

Traditionally, 3D modeling is a complex and time-intensive process with a high barrier to entry, requiring detailed attention to geometry, texture, and lighting to achieve realistic results. In this section, we take a closer look at the integration of LLMs with 3D generation technologies, showing how language provides a way to generate contextualized objects within a scene and offers innovative solutions for 3D content creation and manipulation.

  • Object-level Generation: ShapeGPT uses a shape-specific 3D VQ-VAE to quantize 3D shapes into discrete "shape word" tokens, which allows shape data to be integrated, together with text and images, into multi-modal inputs to the T5 language model. This multi-modal representation enables T5 to learn cross-modal interactions such as text-to-shape generation and shape editing/completion. GPT4Point uses a two-stream approach: point cloud geometry is aligned to text via a Point-QFormer and then fed into coupled LLM and diffusion branches for text understanding and high-fidelity 3D object generation consistent with the text input.
  • Scene-scale Generation: Holodeck and GALA-3D employ multi-stage pipelines that gradually refine an initial rough 3D scene layout from text into a detailed, realistic 3D environment. Holodeck uses specialized modules to create the basic layout, select materials, and add elements such as doors and windows based on GPT-4's spatial reasoning and layout/style suggestions. It then populates the layout with Objaverse assets matching GPT-4's textual descriptions, and an optimizer arranges these objects according to spatial-relationship constraints obtained from GPT-4 to encourage realistic object placement and interaction.
  • Procedural Generation and Manipulation: LLMR, 3D-GPT, and SceneCraft adopt modular architectures with specialized components/agents for interactive 3D world creation and code generation from natural language. LLMR consists of distinct components that generate code to build scenes in Unity, understand existing scene objects and their properties for modification, identify the functionality required to execute instructions, and evaluate the final code quality. Similarly, 3D-GPT has components for interpreting instructions and determining the required generation functions, enriching the description with detailed modeling attributes, and converting the enriched description into Python code for the Blender API. Collectively, these approaches demonstrate task decomposition and specialization of LLM components to handle instruction interpretation, function mapping, and robust code generation.

3D TASKS WITH VLMS

Open-Vocabulary 3D Scene Understanding

Open-vocabulary 3D scene understanding aims to identify and describe scene elements using natural-language descriptions rather than predefined category labels. OpenScene adopts a zero-shot approach, predicting dense features for 3D scene points that are co-embedded in a shared feature space with CLIP's text and image pixel embeddings, enabling task-agnostic training and open-vocabulary queries to identify objects, materials, affordances, activities, and room types. CLIP-FO3D follows a similar approach, modifying CLIP to extract dense pixel features from 3D scenes projected into point clouds and then training a 3D model via distillation to transfer CLIP's knowledge. Semantic Abstraction extracts relevancy maps from CLIP as abstract object representations to generalize to new semantics, vocabularies, and domains. Open-Fusion combines the SEEM vision-language model with TSDF 3D mapping, leveraging region-based embeddings and confidence maps for real-time open-vocabulary scene creation and querying.

Text-Driven 3D Generation

Here we survey text-to-3D generation methods that use 2D VLMs and guidance from text-to-image diffusion models via differentiable rendering. Early works such as DreamFields, CLIP-Mesh, CLIP-Forge, and Text2Mesh explored CLIP-guided zero-shot 3D generation.

DreamFusion introduced Score Distillation Sampling (SDS), in which the parameters of a 3D representation are optimized so that its renderings from any viewpoint look highly realistic, as judged by a pre-trained 2D diffusion model; it uses the text-to-image Imagen model to optimize a NeRF representation via SDS. Magic3D proposes a two-stage framework: generate a coarse model with a low-resolution diffusion prior and a sparse 3D hash grid, then optimize a textured 3D mesh model using an efficient differentiable renderer and a high-resolution latent diffusion model. Fantasia3D uses a hybrid DMTet representation and spatially varying BRDFs to disentangle geometry and appearance. ProlificDreamer introduces Variational Score Distillation (VSD), a particle-based framework that treats 3D parameters as random variables to improve fidelity and diversity. Dream3D leverages explicit 3D shape priors and text-to-image diffusion models to enhance text-guided 3D synthesis. MVDream adopts a multi-view consistent diffusion model that can be fine-tuned on a small amount of data for personalized generation. Text2NeRF combines NeRF representations with pre-trained text-to-image diffusion models to generate diverse indoor and outdoor 3D scenes from language. Beyond generating geometry and appearance jointly, some research also explores synthesizing textures for a given, fixed geometry.
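For reference, the SDS gradient from DreamFusion can be written as below, where $g(\theta)$ is the differentiable rendering of the 3D representation with parameters $\theta$, $\hat{\epsilon}_\phi$ is the pre-trained diffusion model's noise prediction conditioned on the text prompt $y$ at timestep $t$, and $w(t)$ is a timestep-dependent weight (notation lightly adapted here):

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}\big(\phi, \mathbf{x} = g(\theta)\big)
  \approx \mathbb{E}_{t,\epsilon}\!\left[
    w(t)\,\big(\hat{\epsilon}_\phi(\mathbf{x}_t; y, t) - \epsilon\big)\,
    \frac{\partial \mathbf{x}}{\partial \theta}
  \right]
```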

End-to-End Architectures for 3D Vision & Language

Transformer models pre-trained on large 3D-text datasets learn powerful joint representations that connect the visual and linguistic modalities. 3D-VisTA is a Transformer model that uses self-attention to jointly model 3D visual and text data, enabling effective pre-training on objectives such as masked language/object modeling and scene-text matching. UniT3D uses a unified Transformer approach, combining a PointGroup 3D detection backbone, a BERT text encoder, and a multi-modal fusion module, jointly pre-trained on synthesized 3D-language data. SpatialVLM takes a different route, jointly training a VLM on a large synthetic 3D spatial-reasoning dataset, improving performance on 3D spatial visual question answering and supporting applications such as chain-of-thought reasoning for robotics. Multi-CLIP pre-trains a 3D scene encoder to align scene features with CLIP's text and image embeddings, aiming to transfer CLIP's knowledge to improve 3D understanding for tasks such as visual question answering.

Dataset

(The original survey summarizes the 3D-language datasets used by the reviewed methods in tables; the table images are omitted here.)

Challenges and Future Opportunities

Despite progress in integrating LLM with 3D data, challenges in data representation, computational efficiency, and benchmarking remain, requiring innovative solutions.

The choice of representation has a large impact on the performance of 3D vision-language models. Currently, point clouds are predominantly used to represent both indoor (e.g., vertices of meshes) and outdoor (e.g., lidar point clouds) environments due to their simplicity and neural-network compatibility, yet they struggle to capture the fine detail that is critical for accurate, rich spatial models. Developing new 3D scene representations that bridge the gap between spatial information and language more effectively could unlock new levels of understanding and interaction. Finding innovative ways to encode linguistic and semantic information in 3D representations, for instance by distilling language and semantic embeddings into them, could help bridge these two modalities.

Both 3D data processing and the computational requirements of LLMs pose significant challenges. As the complexity of 3D environments and the size of language models increase, scalability remains a concern. Advances in LLM architectures designed for adaptability and computational efficiency could significantly broaden their applicability to 3D tasks.

Improved benchmarks are critical to comprehensively evaluate and improve the capabilities of multi-modal LLMs on 3D tasks. The limited scope of current benchmarks, especially for 3D reasoning, hinders the assessment of spatial reasoning skills and the development of 3D decision-making and interaction systems. Furthermore, the metrics currently in use do not fully capture the capabilities of LLMs in 3D environments; developing task-specific metrics that more accurately measure performance across different 3D tasks is crucial. Finally, the granularity of current scene understanding benchmarks is too coarse, limiting in-depth evaluation of complex 3D environments. A more diverse set of tasks is required.

Safety and ethical implications must be considered when using LLMs for 3D understanding. LLMs can hallucinate and output inaccurate or unsafe information, leading to incorrect decisions in critical 3D applications. They also often fail in unpredictable and hard-to-explain ways, and they may inherit social biases present in their training data, disadvantaging certain groups when making predictions in real-world 3D scenes. It is crucial that LLMs be used prudently in 3D environments, employing strategies to create more inclusive datasets, robust evaluation frameworks for bias detection and correction, and mechanisms to minimize hallucination, in order to ensure fair and accountable outcomes.

Conclusion

This article provides an in-depth exploration of the integration of LLMs with 3D data. The survey systematically reviews the methods, applications, and emergent capabilities of LLMs in processing, understanding, and generating 3D data, highlighting the transformative potential of LLMs across a range of 3D tasks. From enhancing spatial understanding and interaction in three-dimensional environments to advancing the capabilities of embodied artificial intelligence systems, LLMs play a key role in advancing the field.

Key findings include identifying LLM’s unique strengths, such as zero-shot learning, advanced reasoning, and broad world knowledge, that help bridge the gap between textual information and spatial interpretation. The paper demonstrates LLM integration with 3D data for a wide range of tasks. Exploring other 3D visual language methods with LLM reveals rich research prospects aimed at deepening our understanding of the 3D world.

Additionally, the survey highlights significant challenges such as data representation, model scalability, and computational efficiency, demonstrating that overcoming these obstacles is critical to fully realizing the potential of LLM in 3D applications. In conclusion, this survey not only provides a comprehensive overview of the current state of 3D tasks using LLM, but also lays the foundation for future research directions. It calls for collaboration to explore and expand LLM's capabilities in understanding and interacting with complex 3D worlds, paving the way for further advances in the field of spatial intelligence.

