I have been building LLM applications in the cloud. I have also seen many developers build LLM apps that are great as MVPs or prototypes but need some work to become production-ready. Applying one or more of the practices listed below can help your application scale effectively. This article does not cover the full software engineering side of application development; it focuses only on LLM wrapper applications. Also, the code snippets are written in Python, but the same logic can be applied to other languages.
Use middleware such as LiteLLM or LangChain to avoid vendor lock-in and to switch easily between models as they evolve.
Python:
from litellm import completion

response = completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
Middleware solutions such as LiteLLM and LangChain provide an abstraction layer between your application and the various LLM providers. This abstraction lets you switch between models or providers without changing your core application code. The AI field evolves rapidly, with new and more capable models released all the time. With middleware in place, you can quickly adopt these new models or switch providers based on performance, cost, or feature requirements, keeping your application current and competitive.
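As a minimal sketch of how this abstraction pays off, the model identifier can be read from configuration rather than hard-coded; the LLM_MODEL environment variable and the ask helper below are assumptions for illustration, not part of the original example:

import os
from litellm import completion

# Hypothetical config value: switching e.g. from "gpt-3.5-turbo" to
# "anthropic/claude-3-haiku-20240307" becomes a deployment change, not a code change.
MODEL_NAME = os.getenv("LLM_MODEL", "gpt-3.5-turbo")

def ask(prompt):
    response = completion(
        model=MODEL_NAME,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content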
Avoid rate-limit problems by implementing retry logic in your API calls.
Python:
import time
from openai import OpenAI

client = OpenAI()

def retry_api_call(max_retries=3, delay=1):
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": "Hello!"}]
            )
            return response
        except Exception as e:
            if attempt == max_retries - 1:
                raise e
            time.sleep(delay * (2 ** attempt))  # Exponential backoff
LLM providers often impose rate limits to prevent abuse and ensure fair usage. A retry mechanism with exponential backoff helps your application handle temporary failures or rate-limit errors gracefully. It improves reliability by automatically retrying failed requests, reducing the chance of service interruptions caused by transient issues. The exponential backoff strategy, which increases the delay between retries, keeps immediate re-requests from overwhelming the API and making the rate-limit problem worse.
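The same behavior can be written more declaratively with a retry library such as tenacity; the sketch below is one possible wiring, not code from the original article:

from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()

# Retry up to 3 times with exponential backoff between attempts (capped at 30s)
@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=30))
def chat(prompt):
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}]
    )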
Do not rely on a single LLM provider. Implement fallbacks to handle quota issues or service outages.
from litellm import completion

def get_llm_response(prompt):
    providers = ['openai/gpt-3.5-turbo', 'anthropic/claude-2', 'cohere/command-nightly']
    for provider in providers:
        try:
            response = completion(model=provider, messages=[{"role": "user", "content": prompt}])
            return response
        except Exception as e:
            print(f"Error with {provider}: {str(e)}")
            continue
    raise Exception("All LLM providers failed")
Relying on a single LLM provider can lead to service interruptions if that provider experiences downtime or you hit its quota limits. By implementing fallback options you keep your application running. This approach also lets you take advantage of the strengths of different providers or models for different tasks. LiteLLM simplifies this by offering a unified interface to multiple providers, making it easier to switch between them or implement fallback logic.
Use tools such as Langfuse or Helicone for LLM tracing and debugging.
from langfuse.openai import OpenAI

client = OpenAI(
    api_key="your-openai-api-key",
    langfuse_public_key="your-langfuse-public-key",
    langfuse_secret_key="your-langfuse-secret-key"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, AI!"}]
)
Benefits of implementing observability:
Observability tools provide important insights into your LLM application's performance, usage patterns, and potential issues. They let you monitor and analyze interactions with the LLM in real time, helping you optimize prompts, identify bottlenecks, and ensure the quality of AI-generated responses. This level of visibility is essential for maintaining, debugging, and improving your application over time.
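For comparison, Helicone is typically integrated as a proxy in front of the OpenAI API; the sketch below assumes Helicone's gateway URL and auth header and is not taken from the original article:

import os
from openai import OpenAI

# Assumed Helicone proxy setup: requests are routed through Helicone's gateway,
# which logs every call so it can be inspected in their dashboard.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"}
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello, AI!"}]
)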
Use a prompt management tool with version control instead of hard-coding prompts in code or text files.
from openai import OpenAI
from promptflow import PromptFlow

client = OpenAI()

pf = PromptFlow()
prompt_template = pf.get_prompt("greeting_prompt", version="1.2")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt_template.format(name="Alice")}]
)
Effective prompt management is critical for maintaining and improving your LLM application. With a dedicated prompt management tool you can version prompts, A/B test different variants, and update them easily across your application. This approach separates prompt logic from application code, making it easier to iterate on prompts without changing the core application. It also lets non-technical team members contribute to prompt improvements and enables better collaboration on refining the AI interactions.
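If adopting a dedicated tool is premature, even a thin in-house registry gives you the same separation of prompts from code; the sketch below is purely a hypothetical illustration rather than any specific product's API:

# Hypothetical prompt registry keyed by (name, version); in practice this could
# be backed by a database or a config service instead of a dict.
PROMPTS = {
    ("greeting_prompt", "1.2"): "Hello {name}, how can I help you today?",
    ("greeting_prompt", "1.3"): "Hi {name}! What can I do for you?",
}

def get_prompt(name, version):
    return PROMPTS[(name, version)]

# Application code references prompts only by name and version,
# so the wording can change without touching the core application.
template = get_prompt("greeting_prompt", "1.2")
message = template.format(name="Alice")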
Use a persistent store such as Redis for conversation history instead of an in-memory cache, which is not suited to distributed systems.
from langchain.memory import ConversationBufferMemory, RedisChatMessageHistory
from langchain.chains import ConversationChain
from langchain.llms import OpenAI

# Initialize Redis chat message history
message_history = RedisChatMessageHistory(
    url="redis://localhost:6379/0", ttl=600, session_id="user-123"
)

# Create a conversation chain whose memory is backed by Redis
conversation = ConversationChain(
    llm=OpenAI(),
    memory=ConversationBufferMemory(chat_memory=message_history),
    verbose=True
)

# Use the conversation
response = conversation.predict(input="Hi there!")
print(response)

# The conversation history is automatically stored in Redis
Storing conversation history is essential for maintaining context in ongoing interactions and providing personalized experiences. Using a persistent cache like Redis, especially in distributed systems, ensures that conversation history is reliably stored and quickly accessible. This approach allows your application to scale horizontally while maintaining consistent user experiences across different instances or servers. The use of Redis with LangChain simplifies the integration of persistent memory into your conversational AI system, making it easier to build stateful, context-aware applications.
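If you prefer not to pull in LangChain for this, the same pattern can be implemented directly with the redis client; this is a minimal sketch assuming a local Redis instance and one JSON document per message:

import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def append_message(session_id, role, content, ttl=600):
    key = f"chat:{session_id}"
    r.rpush(key, json.dumps({"role": role, "content": content}))
    r.expire(key, ttl)  # Drop idle conversations after `ttl` seconds

def get_history(session_id):
    key = f"chat:{session_id}"
    return [json.loads(m) for m in r.lrange(key, 0, -1)]

# Store both sides of an exchange, then replay it as context for the next LLM call
append_message("user-123", "user", "Hi there!")
append_message("user-123", "assistant", "Hello! How can I help?")
print(get_history("user-123"))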
Whenever possible, for example when extracting structured information, provide a JSON schema instead of relying on raw text output.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the name and age from the user's input as a JSON object."},
        {"role": "user", "content": "My name is John and I'm 30 years old."}
    ]
)

print(response.choices[0].message.content)
# Output: {"name": "John", "age": 30}
Using JSON mode for information extraction provides a structured and consistent output format, making it easier to parse and process the LLM's responses in your application. This approach reduces the need for complex post-processing of free-form text and minimizes the risk of misinterpretation. It's particularly useful for tasks like form filling, data extraction from unstructured text, or any scenario where you need to integrate AI-generated content into existing data structures or databases.
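Because the model can still return JSON that parses but is shaped differently than expected, it is worth validating the output before using it; the sketch below uses pydantic, and the Person model is a hypothetical example rather than part of the original article:

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

raw_output = '{"name": "John", "age": 30}'  # e.g. response.choices[0].message.content

try:
    person = Person.model_validate_json(raw_output)
    print(person.name, person.age)
except ValidationError as e:
    # Retry, fall back, or log when the model's JSON doesn't match the schema
    print(f"Schema validation failed: {e}")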
Implement alerts for prepaid credits and per-user credit checks, even in MVP stages.
def check_user_credits(user_id, requested_tokens):
    user_credits = get_user_credits(user_id)
    if user_credits < requested_tokens:
        raise InsufficientCreditsError(f"User {user_id} has insufficient credits")

    remaining_credits = user_credits - requested_tokens
    if remaining_credits < CREDIT_ALERT_THRESHOLD:
        send_low_credit_alert(user_id, remaining_credits)

    return True
Implementing credit alerts and per-user credit checks is crucial for managing costs and ensuring fair usage in your LLM application. This system helps prevent unexpected expenses and allows you to proactively manage user access based on their credit limits. By setting up alerts at multiple thresholds, you can inform users or administrators before credits are depleted, ensuring uninterrupted service. This approach is valuable even in MVP stages, as it helps you understand usage patterns and plan for scaling your application effectively.
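To make a per-user check like this concrete, you need an estimate of how many tokens a request will consume before sending it; the sketch below uses tiktoken for that estimate, and the surrounding credit accounting is assumed rather than taken from the article:

import tiktoken

def estimate_tokens(text, model="gpt-3.5-turbo"):
    # Rough pre-flight estimate of prompt tokens; completion tokens must be
    # budgeted separately (e.g. via the max_tokens you allow per request).
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

prompt = "Summarize our refund policy in two sentences."
requested_tokens = estimate_tokens(prompt) + 256  # assumed completion budget
# check_user_credits(user_id="user-123", requested_tokens=requested_tokens)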
Create mechanisms for users to provide feedback on AI responses, starting with simple thumbs up/down ratings.
from flask import Flask, request, jsonify

app = Flask(__name__)

def process_user_feedback(response_id, feedback):
    if feedback == 'thumbs_up':
        log_positive_feedback(response_id)
    elif feedback == 'thumbs_down':
        log_negative_feedback(response_id)
        trigger_improvement_workflow(response_id)

# In your API endpoint
@app.route('/feedback', methods=['POST'])
def submit_feedback():
    data = request.json
    process_user_feedback(data['response_id'], data['feedback'])
    return jsonify({"status": "Feedback received"})
Implementing feedback loops is essential for continuously improving your LLM application. By allowing users to provide feedback on AI responses, you can identify areas where the model performs well and where it needs improvement. This data can be used to fine-tune models, adjust prompts, or implement additional safeguards. Starting with simple thumbs up/down ratings provides an easy way for users to give feedback, while more detailed feedback options can be added later for deeper insights. This approach helps in building trust with users and demonstrates your commitment to improving the AI's performance based on real-world usage.
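A persistent store for this feedback can start very small; the sketch below uses SQLite purely for illustration, and the table and column names are assumptions:

import sqlite3

conn = sqlite3.connect("feedback.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS feedback (
        response_id TEXT,
        rating TEXT,  -- 'thumbs_up' or 'thumbs_down'
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )"""
)

def log_feedback(response_id, rating):
    conn.execute("INSERT INTO feedback (response_id, rating) VALUES (?, ?)", (response_id, rating))
    conn.commit()

def thumbs_up_rate():
    # Share of positive ratings: a simple quality signal to track over time
    total = conn.execute("SELECT COUNT(*) FROM feedback").fetchone()[0]
    ups = conn.execute("SELECT COUNT(*) FROM feedback WHERE rating = 'thumbs_up'").fetchone()[0]
    return ups / total if total else 0.0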
Use prompt guards to check for prompt injection attacks, toxic content, and off-topic responses.
import re
from better_profanity import profanity

def check_prompt_injection(input_text):
    injection_patterns = [
        r"ignore previous instructions",
        r"disregard all prior commands",
        r"override system prompt"
    ]
    for pattern in injection_patterns:
        if re.search(pattern, input_text, re.IGNORECASE):
            return True
    return False

def check_toxic_content(input_text):
    return profanity.contains_profanity(input_text)

def sanitize_input(input_text):
    if check_prompt_injection(input_text):
        raise ValueError("Potential prompt injection detected")
    if check_toxic_content(input_text):
        raise ValueError("Toxic content detected")
    # Additional checks can be added here (e.g., off-topic detection)
    return input_text  # Return sanitized input if all checks pass

# Usage
try:
    safe_input = sanitize_input(user_input)
    # Process safe_input with your LLM
except ValueError as e:
    print(f"Input rejected: {str(e)}")
Implementing guardrails is crucial for ensuring the safety and reliability of your LLM application. This example demonstrates how to check for potential prompt injection attacks and toxic content. Prompt injection attacks attempt to override or bypass the system's intended behavior, while toxic content checks help maintain a safe and respectful environment. By implementing these checks, you can prevent malicious use of your AI system and ensure that the content generated aligns with your application's guidelines and ethical standards. Additional checks can be added to detect off-topic responses or other unwanted behaviors, further enhancing the robustness of your application.
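Pattern lists and profanity filters only go so far, so a model-based check can complement them; the sketch below uses OpenAI's moderation endpoint as one example of such a check and is an addition to the original example:

from openai import OpenAI

client = OpenAI()

def is_flagged(text):
    # Ask the moderation endpoint whether the text violates content policies
    result = client.moderations.create(input=text)
    return result.results[0].flagged

if is_flagged("some user input"):
    print("Input rejected by moderation check")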
All of the points listed above can be easily integrated into your application, and they better prepare you for scaling in production. You may agree or disagree with some of them; either way, feel free to post your questions or comments.