ICML 2024 | Complex compositional 3D scene generation: an LLM-driven conversational framework for controllable 3D generation and editing

WBOY

Jul 31, 2024, 08:12 PM

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has published more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The first author and corresponding author of this paper are both from the VDIG (Visual Data Interpreting and Generation) Laboratory at the Wangxuan Institute of Computer Technology, Peking University. The first author is doctoral student Zhou Xiaoyu, and the corresponding author is doctoral supervisor Wang Yongtao. In recent years, the VDIG laboratory has published a number of representative results at top venues such as IJCV, CVPR, AAAI, ICCV, ICML, and ECCV, has repeatedly won first and second place in major computer vision competitions at home and abroad, and collaborates extensively with well-known universities and research institutions at home and abroad.

In recent years, Text-to-3D methods for single objects have made a series of breakthroughs, but generating controllable, high-quality complex multi-object 3D scenes from text still faces huge challenges. Previous methods have major flaws in the complexity, geometric quality, texture consistency, multi-object interaction, controllability and editability of the generated scene.

Recently, the VDIG research team from the Wangxuan Institute of Computer Technology at Peking University and its collaborators announced their latest work, GALA3D. Targeting the generation of complex multi-object 3D scenes, this work proposes an LLM-guided controllable generation framework that can produce high-quality, highly consistent 3D scenes with multiple objects and complex interactions, and supports controllable editing through conversational interaction. The paper has been accepted by ICML 2024.

Paper title: GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting
Paper link: https://arxiv.org/pdf/2402.07207
Paper code: https://github.com/VDIGPKU/GALA3D
Project website: https://gala3d.github.io/

GALA3D is a high-quality Text-to-3D framework for complex compositional scene generation and controllable editing. Given a text description from the user, GALA3D generates the corresponding 3D scene, with multiple objects and complex interactions, in a zero-shot manner. While keeping the generated 3D scene closely aligned with the text, GALA3D performs well in scene quality, complex multi-object interaction, and geometric consistency. In addition, GALA3D supports user-friendly end-to-end generation and controllable editing, allowing ordinary users to customize and edit 3D scenes through conversation. Based on the dialogue with the user, GALA3D can carry out a range of controllable edits on complex 3D scenes, such as changing the layout, embedding digital assets, and changing the decoration style.

Method introduction

The overall architecture of GALA3D is shown in the figure below:

[Figure: overall architecture of GALA3D]

GALA3D uses large language models (LLMs) to generate an initial layout and proposes a layout-guided generative 3D Gaussian representation to construct complex 3D scenes. It optimizes the shape and distribution of the 3D Gaussians through adaptive geometry control, producing 3D scenes with consistent geometry, texture, and scale, and with precise interactions. In addition, GALA3D proposes a compositional optimization mechanism that combines conditional diffusion priors with text-to-image models to collaboratively generate multi-object 3D scenes in a consistent style, while iteratively optimizing the initial layout prior extracted from the LLMs to obtain a more realistic and accurate spatial layout. Extensive quantitative experiments and qualitative studies show that GALA3D achieves strong results in text-to-3D complex scene generation, surpassing existing text-to-3D scene methods.

a. Scene layout prior based on LLMs

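Large language models demonstrate excellent natural language understanding and reasoning capabilities. This work further explores the reasoning and layout generation capabilities of LLMs for complex 3D scenes. Obtaining a reasonably good layout prior without manual design helps reduce the cost of scene modeling and generation. To this end, we use LLMs (such as GPT-3.5) to extract the instances in the text input and their spatial relationships, and to generate the corresponding layout prior. However, there is a gap between the 3D spatial layout and layout prior interpreted by LLMs and the actual scene, which typically results in floating or interpenetrating objects and in combinations of objects with badly mismatched proportions. We therefore further propose a layout refinement module that adjusts and optimizes the coarse layout generated above through a vision-based diffusion prior and layout-guided generative 3D Gaussians.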
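
The failure modes described above (floating and interpenetrating objects) can be illustrated with a small check over a coarse layout. This is only a sketch under stated assumptions: the `Box` structure, the axis-aligned boxes, the example coordinates, and the `find_issues` helper are hypothetical and are not taken from the GALA3D code or from the LLM output format it uses.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical layout entry: one axis-aligned box per instance (z is up)."""
    name: str
    center: tuple  # (x, y, z)
    size: tuple    # (width, depth, height)

    def bounds(self):
        # Per-axis (low, high) extents of the box.
        return [(c - s / 2, c + s / 2) for c, s in zip(self.center, self.size)]


def overlap_volume(a, b):
    """Volume of the intersection of two boxes (0.0 if they do not intersect)."""
    volume = 1.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a.bounds(), b.bounds()):
        extent = min(a_hi, b_hi) - max(a_lo, b_lo)
        if extent <= 0:
            return 0.0
        volume *= extent
    return volume


def rests_on(box, support, tol):
    """True if the bottom of `box` touches the top of `support` and they overlap in x and y."""
    bx, by, bz = box.bounds()
    sx, sy, sz = support.bounds()
    touching = abs(bz[0] - sz[1]) <= tol
    overlap_x = min(bx[1], sx[1]) > max(bx[0], sx[0])
    overlap_y = min(by[1], sy[1]) > max(by[0], sy[0])
    return touching and overlap_x and overlap_y


def find_issues(layout, ground_z=0.0, tol=1e-3):
    """Report floating boxes and interpenetrating pairs in a coarse layout."""
    issues = []
    for i, a in enumerate(layout):
        on_ground = a.bounds()[2][0] <= ground_z + tol
        supported = any(rests_on(a, b, tol) for b in layout if b is not a)
        if not on_ground and not supported:
            issues.append(("floating", a.name))
        for b in layout[i + 1:]:
            if overlap_volume(a, b) > tol:
                issues.append(("penetrating", a.name, b.name))
    return issues


# Made-up coarse layout of the kind an LLM might return for "a lamp on a table next to a chair".
layout = [
    Box("table", (0.0, 0.0, 0.4), (1.2, 0.8, 0.8)),
    Box("chair", (0.5, 0.0, 0.45), (0.5, 0.5, 0.9)),
    Box("lamp", (0.3, 0.0, 1.2), (0.2, 0.2, 0.5)),
]
print(find_issues(layout))
# [('penetrating', 'table', 'chair'), ('floating', 'lamp')]
```

A check like this only detects the problems; in the article, correcting them is the job of the layout refinement module.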

b. Layout refinement

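GALA3D uses a layout optimization module based on a diffusion prior to optimize the layout prior generated by the LLMs above. Specifically, we add gradient-based optimization of the spatial layout of the layout-guided 3D Gaussians to the 3D generation process, and adjust the spatial position, rotation angle, and size ratio of the LLM-generated layout through ControlNet. The figure shows the 3D scene and layout before and after optimization. The optimized layout has more accurate spatial positions and scales, and the interactions among the multiple objects in the 3D scene are more plausible.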
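
The idea of treating layout parameters as optimizable variables can be sketched with plain gradient descent. This is a toy illustration, not the authors' method: the real objective comes from a ControlNet-conditioned diffusion prior, whereas `toy_loss` below is a hand-written stand-in, and the names `refine_layout`, `numerical_grad`, `TABLE_TOP`, and `LAMP_HEIGHT` are assumptions made for the example.

```python
def numerical_grad(loss, params, eps=1e-5):
    """Central-difference gradient of `loss` with respect to a list of floats."""
    grad = []
    for i in range(len(params)):
        hi = params[:i] + [params[i] + eps] + params[i + 1:]
        lo = params[:i] + [params[i] - eps] + params[i + 1:]
        grad.append((loss(hi) - loss(lo)) / (2 * eps))
    return grad


def refine_layout(loss, params, lr=0.1, steps=200):
    """Plain gradient descent on the layout parameters."""
    params = list(params)
    for _ in range(steps):
        grad = numerical_grad(loss, params)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params


TABLE_TOP = 0.8    # assumed height of the supporting surface
LAMP_HEIGHT = 0.5  # assumed unscaled height of the lamp's box


def toy_loss(params):
    """Stand-in objective: lamp sits on the table, keeps unit scale, has zero yaw."""
    z, scale, yaw = params
    bottom = z - scale * LAMP_HEIGHT / 2
    return (bottom - TABLE_TOP) ** 2 + (scale - 1.0) ** 2 + yaw ** 2


# Start from a floating, oversized, slightly rotated lamp: [z, scale, yaw].
refined = refine_layout(toy_loss, [1.2, 2.0, 0.3])
print([round(p, 3) for p in refined])
```

With this stand-in loss the parameters settle near z = 1.05, scale = 1, and yaw = 0, that is, the lamp resting on the table top. In GALA3D the gradients would instead be backpropagated from the diffusion-based objective through the rendered scene.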

[Figure: 3D scene and layout before and after layout optimization]

c. Layout-guided generative 3D Gaussian representation

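We introduce 3D layout constraints into the 3D Gaussian representation for the first time and propose layout-guided generative 3D Gaussians for complex text-to-3D scenes. The layout-guided 3D Gaussian representation contains multiple semantically extracted instance objects, and the layout prior of each instance object can be parameterized as follows: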

[Equation: parameterization of the layout prior of each instance object]

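Here N denotes the total number of instance objects in the scene. Specifically, the 3D Gaussians of each instance are optimized through adaptive geometry control to obtain an instance-level 3D Gaussian representation of the object. We then combine the Gaussians of the multiple objects into the whole scene according to their relative positions, generating global 3D Gaussians guided by the layout, and render the whole scene through global Gaussian splatting.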
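
Combining instance-level Gaussians into a global scene amounts to mapping each Gaussian from its instance's local frame into scene coordinates. The sketch below assumes a similarity transform per instance (uniform scale, rotation about the vertical axis, translation), so that a mean μ maps to sRμ + t and a covariance Σ maps to (sR) Σ (sR)ᵀ. The dictionary keys and function names are hypothetical, and the actual GALA3D layout parameterization may differ.

```python
import math


def yaw_matrix(theta):
    """Rotation about the vertical (z) axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def to_world(gaussian, layout):
    """Map one Gaussian (mean, covariance) from instance-local to scene coordinates."""
    mean, cov = gaussian
    rot = yaw_matrix(layout["yaw"])
    lin = [[layout["scale"] * r for r in row] for row in rot]  # A = s * R
    world_mean = [m + t for m, t in zip(matvec(lin, mean), layout["position"])]
    world_cov = matmul(matmul(lin, cov), transpose(lin))       # A @ cov @ A^T
    return world_mean, world_cov


def compose_scene(instances):
    """Concatenate the transformed Gaussians of every instance into one global list."""
    scene = []
    for gaussians, layout in instances:
        scene.extend(to_world(g, layout) for g in gaussians)
    return scene


# One instance with a single Gaussian, rotated by 90 degrees, doubled in size, shifted along x.
local = ([1.0, 0.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 9.0]])
placement = {"position": [10.0, 0.0, 0.0], "yaw": math.pi / 2, "scale": 2.0}
scene = compose_scene([([local], placement)])
mean, cov = scene[0]
print([round(x, 6) for x in mean])
print([[round(x, 6) for x in row] for row in cov])
```

The resulting global list is what a Gaussian splatting renderer would consume for the whole scene; here the mean lands at (10, 2, 0) and the covariance becomes diag(16, 4, 36), up to rounding.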

d. Adaptive geometry control

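To better control the spatial distribution and geometric shape of the 3D Gaussians during generation, we propose an adaptive geometry control method for generative 3D Gaussians. First, given a set of initial Gaussians, GALA3D uses a set of density distribution functions to constrain the spatial positions of the Gaussian ellipsoids so that the 3D Gaussians stay within the layout bounds. Next, Gaussians near the layout surface are sampled to fit the distribution functions. We then propose using shape regularization to control the geometric shape of the 3D Gaussians. During 3D generation, adaptive geometry control continuously optimizes the distribution and geometry of the Gaussians, producing multi-object 3D scenes with finer texture detail and more regular geometry. Adaptive geometry control also improves the controllability and consistency of the layout-guided generative 3D Gaussians.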
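
The two ingredients named above, a positional constraint that keeps Gaussians inside the layout and a shape regularizer, can be sketched as penalty terms. The article does not give the functional forms, so the hinge penalty on box violations and the axis-ratio penalty below are stand-ins chosen for illustration, and `outside_penalty`, `shape_penalty`, `geometry_loss`, and their weights are hypothetical names and values.

```python
def outside_penalty(mean, center, size):
    """Squared distance by which a Gaussian center exceeds the layout box (0 if inside)."""
    total = 0.0
    for m, c, s in zip(mean, center, size):
        excess = abs(m - c) - s / 2
        if excess > 0:
            total += excess ** 2
    return total


def shape_penalty(scales, max_ratio=4.0):
    """Penalize needle-like ellipsoids whose longest/shortest axis ratio exceeds max_ratio."""
    ratio = max(scales) / min(scales)
    return max(0.0, ratio - max_ratio) ** 2


def geometry_loss(gaussians, center, size, w_pos=1.0, w_shape=0.1):
    """Weighted sum of both penalties over all Gaussians of one instance."""
    return sum(
        w_pos * outside_penalty(g["mean"], center, size)
        + w_shape * shape_penalty(g["scales"])
        for g in gaussians
    )


box_center, box_size = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
inside_round = {"mean": (0.0, 0.0, 0.0), "scales": (0.1, 0.1, 0.1)}
outside_round = {"mean": (0.8, 0.0, 0.0), "scales": (0.1, 0.1, 0.1)}
inside_needle = {"mean": (0.0, 0.0, 0.0), "scales": (1.0, 0.1, 0.1)}

for g in (inside_round, outside_round, inside_needle):
    print(round(geometry_loss([g], box_center, box_size), 4))
```

A well-placed, roughly isotropic Gaussian incurs no penalty, while the out-of-box and needle-shaped ones are pushed back by the gradient of their respective terms. Adding such a loss to the generation objective is one simple way to express this kind of control.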

Experimental results

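As the table below shows, GALA3D achieves better quality and consistency in 3D scene generation than existing Text-to-3D generation methods. We also conducted a user study in which 125 participants (39.2% of them experts and practitioners in related fields) evaluated the scenes generated by our method and by existing methods from multiple perspectives. The results are shown in the table below: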

[Table: quantitative comparison with existing Text-to-3D methods and user study results]

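The experimental results show that GALA3D outperforms existing methods on multi-dimensional evaluation metrics such as scene quality, geometric fidelity, text consistency, and scene consistency, achieving the best generation quality.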

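As the qualitative results in the figure below show, GALA3D can generate complex multi-object compositional 3D scenes in a zero-shot manner with good consistency.

[Figure: qualitative results for complex multi-object 3D scenes generated by GALA3D]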

The figure below shows that GALA3D supports user-friendly, conversational controllable generation and editing:

[Figure: conversational controllable generation and editing with GALA3D]

For more details on the research, please refer to the original paper.
