7 Ways to Split Data Using LangChain Text Splitters - Analytics Vidhya

LangChain Text Splitters: Optimizing LLM Input for Efficiency and Accuracy

Our previous article covered LangChain's document loaders. However, LLMs have a limited context window, measured in tokens. Input that exceeds the limit is truncated or rejected, which hurts accuracy, and sending more tokens than necessary raises cost. The solution is to send only the relevant data to the LLM, which requires splitting the data into chunks first. This is the job of LangChain's Text Splitters.


Key Concepts:

  1. The Crucial Role of Text Splitters: Understand why efficient text splitting is vital for optimizing LLM applications, balancing context window size and cost.
  2. Diverse Text Splitting Techniques: Explore various methods, including character counts, token counts, recursive splitting, and techniques tailored to HTML, code, and JSON structures.
  3. LangChain Text Splitter Implementation: Learn practical application, including installation, code examples for text splitting, and handling diverse data formats.
  4. Semantic Splitting for Enhanced Relevance: Discover how sentence embeddings and cosine similarity create semantically coherent chunks, maximizing relevance.

Table of Contents:

  • What are Text Splitters?
  • Data Splitting Methods
  • Character Count-Based Splitting
  • Recursive Splitting
  • Token Count-Based Splitting
  • Handling HTML
  • Code-Specific Splitting
  • JSON Data Handling
  • Semantic Chunking
  • Frequently Asked Questions

What are Text Splitters?

Text splitters divide large text into smaller, manageable chunks for improved LLM query relevance. They work directly on raw text or LangChain document objects. Multiple methods cater to different content types and use cases.

Data Splitting Methods

LangChain Text Splitters are crucial for processing large documents efficiently. They improve performance and contextual understanding, enable parallel processing, and make data easier to manage. Let's examine several methods:

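Prerequisites: Install the package using pip install langchain-text-splitters. The examples below also import from langchain_community, langchain_experimental, and langchain_openai, which are distributed as separate packages (langchain-community, langchain-experimental, langchain-openai).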

Character Count-Based Splitting

This method first splits the text on a specified separator, then merges the resulting pieces into chunks whose length, measured in characters, stays within chunk_size.

from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import CharacterTextSplitter

# Load data (replace with your PDF path)
loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='single')
data = loader.load()

text_splitter = CharacterTextSplitter(separator="\n", chunk_size=500, chunk_overlap=0, is_separator_regex=False)
texts = text_splitter.split_documents(data)
len(texts) # Output: Number of chunks

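This example splits the text into chunks of at most 500 characters, breaking only at newline characters. A single line longer than 500 characters cannot be broken by this splitter and ends up as one oversized chunk.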
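To make the mechanism concrete, here is a minimal pure-Python sketch of separator-then-pack splitting. It is an illustration of the idea only, not LangChain's implementation; the function name split_by_characters and the sample string are hypothetical, and chunk overlap is left out for brevity.

```python
def split_by_characters(text, separator="\n", chunk_size=500):
    """Split on `separator`, then pack pieces into chunks of at most `chunk_size` characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it starts a new chunk on its own
            # (an oversized single piece is kept whole).
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "alpha\nbeta\ngamma\ndelta"
chunks = split_by_characters(sample, separator="\n", chunk_size=11)
print(chunks)
```

With a chunk size of 11, the four lines are packed two per chunk, because a third line would push either chunk past the limit.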

Recursive Splitting

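This method tries a list of separators in order: any piece that is still larger than chunk_size is split again with the next separator in the list. It is useful when chunks should follow paragraph, line, and sentence boundaries.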

from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", r"(?...  # ... (remaining arguments truncated) ...
# Output: 293 chunks

# ... (rest of the code remains similar)
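The recursive idea can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper names merge_pieces and recursive_split are hypothetical, separators are treated as literal strings rather than regular expressions, and overlap is ignored.

```python
def merge_pieces(pieces, separator, chunk_size):
    """Greedily join neighbouring pieces while the result stays within chunk_size."""
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged


def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=30):
    """Split with the first separator; re-split oversized pieces with the next one."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or no separators left (the text is then kept as-is).
        return [text] if text else []
    head, rest = separators[0], separators[1:]
    pieces = []
    for piece in text.split(head):
        pieces.extend(recursive_split(piece, rest, chunk_size))
    return merge_pieces(pieces, head, chunk_size)


text = "one two three four\nfive six"
chunks = recursive_split(text, separators=("\n", " "), chunk_size=10)
print(chunks)
```

The first line is too long for one chunk, so it is split again on spaces and repacked, while the short second line survives intact.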

Token Count-Based Splitting

LLM context limits are measured in tokens, so splitting by token count tracks the limit more closely than splitting by character count. This example uses the o200k_base encoding (see the tiktoken GitHub repository for model/encoding mappings).

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(encoding_name='o200k_base', chunk_size=50, chunk_overlap=0)
texts = text_splitter.split_documents(data)
len(texts) # Output: Number of chunks

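Recursive splitting can also be combined with token counting, for example through RecursiveCharacterTextSplitter.from_tiktoken_encoder, which measures chunk_size in tokens instead of characters.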

For plain text, recursive splitting with character or token counting is generally preferred.
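The windowing logic behind token-based splitting can be sketched without any tokenizer library. In this hedged example, whitespace-separated words stand in for real tokens (an assumption made only to keep the sketch self-contained; a real tokenizer such as the o200k_base encoding produces different units), and split_by_tokens is a hypothetical name.

```python
def split_by_tokens(tokens, chunk_size=50, chunk_overlap=0):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):
            # The window reached the end; stop so no redundant tail chunk is emitted.
            break
        start += chunk_size - chunk_overlap
    return chunks


# Assumption: words act as stand-in "tokens" for illustration.
words = "the quick brown fox jumps over the lazy dog".split()
token_chunks = split_by_tokens(words, chunk_size=4, chunk_overlap=1)
print([" ".join(chunk) for chunk in token_chunks])
```

With an overlap of 1, the last token of each chunk is repeated as the first token of the next, which helps preserve context across chunk boundaries.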

Handling HTML

For structured data like HTML, splitting should respect the structure. This example splits based on HTML headers.

from langchain_text_splitters import HTMLHeaderTextSplitter

headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2"), ("h3", "Header 3")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on, return_each_element=True)
html_header_splits = html_splitter.split_text_from_url('https://diataxis.fr/')
len(html_header_splits) # Output: Number of chunks

HTMLSectionSplitter allows splitting based on other sections.
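As a rough illustration of header-based splitting, the sketch below uses only Python's standard html.parser module. It is not how HTMLHeaderTextSplitter is implemented; the class HeaderSplitter, the function split_html_by_headers, the dictionary output format, and the sample HTML are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Collect text into sections, starting a new section at each tracked header."""

    def __init__(self, headers=("h1", "h2", "h3")):
        super().__init__()
        self.headers = headers
        self.sections = []     # list of {"metadata": {...}, "content": str}
        self.metadata = {}     # headers currently in scope
        self.in_header = None  # tag name while inside a tracked header
        self.buffer = []       # text collected for the current section

    def flush(self):
        text = " ".join(self.buffer).strip()
        if text:
            self.sections.append({"metadata": dict(self.metadata), "content": text})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self.flush()  # a header closes the section before it
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            level = self.headers.index(self.in_header)
            # A new header drops any same-level or deeper headers from scope.
            self.metadata = {k: v for k, v in self.metadata.items()
                             if self.headers.index(k) < level}
            self.metadata[self.in_header] = text
        else:
            self.buffer.append(text)


def split_html_by_headers(html):
    parser = HeaderSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit the final section
    return parser.sections


html = ("<h1>Guide</h1><p>Intro text.</p>"
        "<h2>Setup</h2><p>Install it.</p>"
        "<h2>Usage</h2><p>Run it.</p>")
sections = split_html_by_headers(html)
for section in sections:
    print(section)
```

Each section carries the headers it sits under as metadata, so a chunk retrieved later can still be traced back to its place in the page.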

Code-Specific Splitting

Programming languages have unique structures. This example uses syntax-aware splitting for Python code.

from langchain_text_splitters import RecursiveCharacterTextSplitter, Language

# ... (Python code example) ...

python_splitter = RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, chunk_size=100, chunk_overlap=0)
python_docs = python_splitter.create_documents([PYTHON_CODE])
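The core of syntax-aware splitting is choosing break points that match the language's structure. The following sketch, which uses only the standard re module, breaks Python source before top-level def and class lines. It is a simplified stand-in for illustration, not the separator list the library uses; split_python_source and SAMPLE_CODE are hypothetical names, and decorators or size limits are not handled.

```python
import re


def split_python_source(source):
    """Split source before top-level `def`/`class` lines, keeping each keyword with its block."""
    # Zero-width lookahead: break at the start of any line that begins with
    # "def " or "class ". Indented methods do not match, so they stay with their class.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", source)
    return [part.strip() for part in parts if part.strip()]


# Hypothetical sample input for the sketch.
SAMPLE_CODE = '''import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''

blocks = split_python_source(SAMPLE_CODE)
for block in blocks:
    print(block)
    print("---")
```

The imports, the function, and the class each land in their own chunk, and the method stays attached to the class that defines it.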

JSON Data Handling

Nested JSON objects can be split while preserving key relationships.

from langchain_text_splitters import RecursiveJsonSplitter

# ... (JSON data example) ...

splitter = RecursiveJsonSplitter(max_chunk_size=200, min_chunk_size=20)
chunks = splitter.split_text(json_data, convert_lists=True)
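To show what "preserving key relationships" means in practice, here is a small standard-library sketch that splits a nested dictionary into smaller dictionaries, each keeping the full key path of its values. It is an illustration under assumptions, not RecursiveJsonSplitter's algorithm: set_path, split_json, and the sample data are hypothetical, lists are treated as indivisible values, and a single value larger than the limit is still emitted on its own.

```python
import json


def set_path(target, path, value):
    """Write `value` into nested dict `target` at the given key path."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data, max_chunk_size=200):
    """Split a nested dict into smaller dicts that keep each value's full key path."""
    chunks = [{}]

    def walk(node, path):
        for key, value in node.items():
            new_path = path + [key]
            if isinstance(value, dict) and value:
                walk(value, new_path)
                continue
            # Try the value in the current chunk; open a new chunk if it overflows.
            trial = json.loads(json.dumps(chunks[-1]))  # deep copy via JSON round trip
            set_path(trial, new_path, value)
            if len(json.dumps(trial)) > max_chunk_size and chunks[-1]:
                chunks.append({})
                set_path(chunks[-1], new_path, value)
            else:
                chunks[-1] = trial

    walk(data, [])
    return [chunk for chunk in chunks if chunk]


sample = {
    "user": {"name": "Ada", "email": "ada@example.com"},
    "settings": {"theme": "dark", "language": "en"},
}
json_chunks = split_json(sample, max_chunk_size=60)
for chunk in json_chunks:
    print(json.dumps(chunk))
```

Because every chunk repeats the parent keys, a value such as "dark" is never separated from the "settings" and "theme" keys that give it meaning.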

Semantic Chunking

This method uses sentence embeddings and cosine similarity to group semantically related sentences.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings # Requires OpenAI API key

# ... (code using OpenAIEmbeddings and SemanticChunker) ...
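The grouping step can be illustrated without an API key. In the sketch below, a toy word-count vector stands in for a real sentence embedding (a deliberate simplification; an embedding model such as OpenAIEmbeddings captures meaning far better), and a fixed similarity threshold decides where to break. The names embed, cosine_similarity, and semantic_chunks, and the threshold value, are assumptions for this example rather than SemanticChunker's actual behavior.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy embedding: word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_similarity(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def semantic_chunks(text, threshold=0.2):
    """Group consecutive sentences; start a new chunk when similarity drops below `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine_similarity(embed(previous), embed(current)) < threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(current)
    return [" ".join(chunk) for chunk in chunks]


text = ("Cats sleep most of the day. Cats also sleep at night. "
        "Stock markets fell sharply today. Markets fell again after lunch.")
result = semantic_chunks(text)
for chunk in result:
    print(chunk)
```

The two sentences about cats share vocabulary and stay together, while the switch to markets produces a similarity of zero and starts a new chunk.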

Conclusion

LangChain offers various text splitting methods, each suited for different data types. Choosing the right method optimizes LLM input, improving accuracy and reducing costs.

