LLMLingua: Integrating LlamaIndex to Compress Prompts for Efficient Large Language Model Inference
The emergence of large language models (LLMs) has stimulated innovation across many fields. However, the growing complexity of prompts, driven by strategies such as chain-of-thought (CoT) prompting and in-context learning (ICL), poses computational challenges: these lengthy prompts consume significant resources at inference time and therefore call for efficient solutions. This article introduces the integration of LLMLingua with LlamaIndex to perform efficient inference.
LLMLingua is a prompt-compression method published by Microsoft researchers at EMNLP 2023; its extension, LongLLMLingua, enhances an LLM's ability to perceive key information in long-context scenarios through prompt compression.
LLMLingua emerged as a pioneering solution to verbose prompts in LLM applications. It focuses on compressing lengthy prompts while preserving semantic integrity and increasing inference speed, combining various compression strategies to balance prompt length against computational efficiency.
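Before the LlamaIndex walkthrough below, it is worth seeing what LLMLingua does on its own. Here is a minimal sketch using the llmlingua package's PromptCompressor; the placeholder passages and the 300-token budget are illustrative, and the call and result keys follow the project's documented usage:

from llmlingua import PromptCompressor

# Loads a causal LM (LLaMA-2-7B by default) that scores how much information
# each token carries; low-information tokens are dropped during compression.
llm_lingua = PromptCompressor()

contexts = ["<long retrieved passage 1>", "<long retrieved passage 2>"]  # illustrative placeholders
result = llm_lingua.compress_prompt(
    contexts,
    instruction="Given the context, please answer the final question.",
    question="Where did the author go for art school?",
    target_token=300,  # rough token budget for the compressed prompt
)
print(result["compressed_prompt"])
print(result["origin_tokens"], "->", result["compressed_tokens"])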
The following are the advantages of integrating LLMLingua with LlamaIndex:
The integration of LLMLingua and LlamaIndex marks an important step in prompt optimization for LLMs. LlamaIndex is a specialized repository containing pre-optimized prompts tailored for a variety of LLM applications; through this integration, LLMLingua can access a rich set of domain-specific, fine-tuned prompts, thereby enhancing its prompt-compression capabilities.
LLMLingua improves the efficiency of LLM applications through synergy with LlamaIndex's library of optimized prompts. Leveraging LlamaIndex's specialized prompts, LLMLingua can fine-tune its compression strategy, ensuring that domain-specific context is preserved while prompt length is reduced. This collaboration dramatically speeds up inference while preserving key domain nuances.
LLMLingua's integration with LlamaIndex also extends its impact on large-scale LLM applications. By leveraging LlamaIndex's expert prompts, LLMLingua optimizes its compression technique, reducing the computational burden of processing lengthy prompts. This integration not only accelerates inference but also ensures the retention of critical domain-specific information.
Implementing LLMLingua with LlamaIndex follows a series of structured steps, which include using the specialized prompt library for efficient prompt compression and enhanced inference speed.
First, establish a connection between LLMLingua and LlamaIndex. This includes setting up access rights, configuring the API, and establishing a connection for prompt retrieval.
LlamaIndex serves as a specialized repository containing pre-optimized prompts tailored for various LLM applications. LLMLingua can access this repository, retrieve domain-specific prompts, and use them for compression.
LLMLingua then applies its prompt-compression methods to streamline the retrieved prompts. These techniques focus on compressing lengthy prompts while ensuring semantic consistency, thereby increasing inference speed without sacrificing context or relevance.
LLMLingua fine-tunes its compression strategy based on the specialized prompts obtained from LlamaIndex. This refinement process ensures that domain-specific nuances are preserved while prompt length is efficiently reduced.
After compression with LLMLingua's customized strategy and LlamaIndex's pre-optimized prompts, the resulting prompts can be used for LLM inference tasks. At this stage, the compressed prompts are executed within the LLM framework to enable efficient, context-aware inference.
The implementation then undergoes iterative refinement. This process includes improving the compression algorithm, optimizing prompt retrieval from LlamaIndex, and fine-tuning the integration to ensure consistency and enhanced performance between compressed prompts and LLM inference.
Finally, testing and validation can be performed to evaluate the efficiency and effectiveness of the LLMLingua-LlamaIndex integration. Performance metrics are measured to ensure that compressed prompts maintain semantic integrity and increase inference speed without compromising accuracy.
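As a concrete example of such a check, here is a minimal sketch; the evaluate_compression helper is hypothetical, written only for illustration. It uses tiktoken (installed below) to measure token reduction and a simple substring test to confirm the ground-truth answer survives compression:

import tiktoken

def evaluate_compression(original: str, compressed: str, response: str, answer: str) -> dict:
    # Count tokens with the encoding used by gpt-3.5-turbo models.
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    orig_tokens = len(enc.encode(original))
    comp_tokens = len(enc.encode(compressed))
    return {
        "original_tokens": orig_tokens,
        "compressed_tokens": comp_tokens,
        "compression_ratio": orig_tokens / max(comp_tokens, 1),
        # Crude accuracy check: does the expected answer appear in the response?
        "answer_preserved": answer.lower() in response.lower(),
    }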
Let's now delve into the code implementation of LLMLingua with LlamaIndex.
Install the packages:
# Install dependencies.
!pip install llmlingua llama-index openai tiktoken -q

# Configure the OpenAI API key.
import openai
openai.api_key = "<insert_openai_key>"
Get data:
!wget "https://www.dropbox.com/s/f6bmb19xdg0xedm/paul_graham_essay.txt?dl=1" -O paul_graham_essay.txt
Load the documents:
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    load_index_from_storage,
    StorageContext,
)

# Load documents.
documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()
Vector store:
index = VectorStoreIndex.from_documents(documents)
retriever = index.as_retriever(similarity_top_k=10)

question = "Where did the author go for art school?"
# Ground-truth answer
answer = "RISD"

contexts = retriever.retrieve(question)
context_list = [n.get_content() for n in contexts]
len(context_list)

# Output
# 10
The original prompt and its response:
# The response from the original prompt
from llama_index.llms import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-16k")
prompt = "\n\n".join(context_list + [question])
response = llm.complete(prompt)
print(str(response))

# Output
# The author went to the Rhode Island School of Design (RISD) for art school.
Set up LLMLingua. The LongLLMLinguaPostprocessor compresses the retrieved nodes down to a target token budget (here 300), reordering contexts and applying dynamic compression ratios:
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.response_synthesizers import CompactAndRefine
from llama_index.indices.postprocessor import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    instruction_str="Given the context, please answer the final question",
    target_token=300,
    rank_method="longllmlingua",
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",  # enable document reordering
        "dynamic_context_compression_ratio": 0.3,
    },
)
Compress via LLMLingua:
retrieved_nodes = retriever.retrieve(question)
synthesizer = CompactAndRefine()

from llama_index.indices.query.schema import QueryBundle

# Postprocess (compress), then synthesize.
new_retrieved_nodes = node_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=QueryBundle(query_str=question)
)

original_contexts = "\n\n".join([n.get_content() for n in retrieved_nodes])
compressed_contexts = "\n\n".join([n.get_content() for n in new_retrieved_nodes])

original_tokens = node_postprocessor._llm_lingua.get_token_length(original_contexts)
compressed_tokens = node_postprocessor._llm_lingua.get_token_length(compressed_contexts)
Print the two results for comparison:
print(compressed_contexts)
print()
print("Original Tokens:", original_tokens)
print("Compressed Tokens:", compressed_tokens)
print("Compressed Ratio:", f"{original_tokens/(compressed_tokens + 1e-5):.2f}x")
The printed output is as follows. Note that the compressed context reads as fragmented text; that is expected, since LLMLingua drops low-information tokens:
next Rtm's advice hadn' included anything that. I wanted to do something completely different, so I decided I'd paint. I wanted to how good I could get if I focused on it. the day after stopped on YC, I painting. I was rusty and it took a while to get back into shape, but it was at least completely engaging.1] I wanted to back RISD, was now broke and RISD was very expensive so decided job for a year and return RISD the fall. I got one at Interleaf, which made software for creating documents. You like Microsoft Word? Exactly That was I low end software tends to high. Interleaf still had a few years to live yet. [] the Accademia wasn't, and my money was running out, end year back to thelot the color class I tookD, but otherwise I was basically myself to do that for in993 I dropped I aroundidence bit then my friend Par did me a big A rent-partment building New York. Did I want it Itt more my place, and York be where the artists. wanted [For when you that ofs you big painting of this type hanging in the apartment of a hedge fund manager, you know he paid millions of dollars for it. That's not always why artists have a signature style, but it's usually why buyers pay a lot for such work. [6]

Original Tokens: 10719
Compressed Tokens: 308
Compressed Ratio: 34.80x
Verify the output:
response = synthesizer.synthesize(question, new_retrieved_nodes)
print(str(response))

# Output
# The author went to RISD for art school.
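Note that RetrieverQueryEngine, imported above, has not been used yet: the retrieve-compress-synthesize steps can also be wired into a single query engine. Here is a sketch against the same llama_index API used in this article, with keyword names per that library's from_args helper:

# Combine retriever, LLMLingua postprocessor, and synthesizer in one engine.
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[node_postprocessor],
    response_synthesizer=synthesizer,
)
print(str(query_engine.query(question)))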
The integration of LLMLingua with LlamaIndex demonstrates the transformative potential of such collaborations for optimizing large language model (LLM) applications. It revolutionizes prompt compression and inference efficiency, paving the way for context-aware, streamlined LLM applications.

This integration not only speeds up inference but also preserves semantic integrity in the compressed prompts. By fine-tuning compression strategies on LlamaIndex's domain-specific prompts, we balance reduced prompt length against the retention of essential context, thereby improving the accuracy of LLM inference.

In essence, the integration of LLMLingua with LlamaIndex goes beyond traditional prompt compression, laying a foundation for future LLM applications that are optimized, contextually accurate, and effectively tailored to different domains. This collaborative integration heralds a new era of efficiency and refinement for large language model applications.