Building a Local Voice Assistant with LLMs and Neural Networks on Your CPU Laptop
Unlock the Power of Local Voice Assistants: A Step-by-Step Guide
The rise of multimodal Large Language Models (LLMs) has revolutionized how we interact with AI, enabling voice-based interactions. While OpenAI's voice-enabled ChatGPT offers a convenient solution, building a local voice assistant provides enhanced data privacy, unlimited API calls, and the ability to fine-tune models for specific needs. This guide details the construction of such an assistant on a standard CPU-based machine.
Why Choose a Local Voice Assistant?
Three key advantages drive the appeal of local voice assistants: your audio and queries never leave your machine (data privacy), there are no API fees or rate limits, and you are free to swap in or fine-tune models for your specific needs.
Building Your Local Voice Assistant
This project comprises four core components: recording audio from the microphone, transcribing it to text with Whisper.cpp, generating a response with a local LLM served by Ollama, and converting that response back to speech with NVIDIA NeMo models.
1. Audio Recording: The sounddevice library captures audio from the microphone and saves it as a WAV file. The code snippet below demonstrates this:

```python
import sounddevice as sd
import wave
import numpy as np

sampling_rate = 16000  # matches the sample rate expected by the Whisper.cpp model
duration = 5           # recording length in seconds (choose a value that suits you)

# Record mono 16-bit audio from the default input device
recorded_audio = sd.rec(int(duration * sampling_rate),
                        samplerate=sampling_rate, channels=1, dtype=np.int16)
sd.wait()  # block until the recording is finished

# Save the recording as a WAV file
audio_file = "<path>/recorded_audio.wav"
with wave.open(audio_file, "w") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 2 bytes per sample for int16
    wf.setframerate(sampling_rate)
    wf.writeframes(recorded_audio.tobytes())
```
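As a quick sanity check (not part of the original pipeline), you can play the captured buffer back through the default output device before moving on:

```python
import sounddevice as sd

# Play the recording just captured (assumes recorded_audio and sampling_rate
# from the snippet above are still in scope)
sd.play(recorded_audio, samplerate=sampling_rate)
sd.wait()
```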
2. Speech-to-Text: Whisper.cpp transcribes the recorded audio. The base English model (ggml-base.en.bin) is utilized for this purpose.

```python
import subprocess

WHISPER_BINARY_PATH = "/<path>/whisper.cpp/main"
MODEL_PATH = "/<path>/whisper.cpp/models/ggml-base.en.bin"

try:
    # Run the whisper.cpp CLI on the recorded WAV file and capture its output
    result = subprocess.run(
        [WHISPER_BINARY_PATH, "-m", MODEL_PATH, "-f", audio_file, "-l", "en", "-otxt"],
        capture_output=True, text=True
    )
    transcription = result.stdout.strip()
except FileNotFoundError:
    print("Whisper.cpp binary not found. Check the path.")
```
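Parsing stdout works, but the -otxt flag also makes whisper.cpp write a plain-text transcript alongside the input file. Reading that file is an alternative sketch; the exact output naming is an assumption based on whisper.cpp's defaults, not something the article relies on:

```python
import os

# whisper.cpp with -otxt typically writes "<input>.txt" next to the audio file
txt_path = audio_file + ".txt"
if os.path.exists(txt_path):
    with open(txt_path) as f:
        transcription = f.read().strip()
```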
3. Response Generation: Ollama runs a small LLM locally (qwen:0.5b) to generate a textual response to the transcribed input. A utility function, run_ollama_command, handles the LLM interaction.

```python
import subprocess
import re

def run_ollama_command(model, prompt):
    """Send a prompt to a local Ollama model and return its raw text output."""
    try:
        result = subprocess.run(
            ["ollama", "run", model],
            input=prompt, text=True, capture_output=True, check=True
        )
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Ollama error: {e.stderr}")
        return None

# Whisper.cpp prints timestamped lines such as "[00:00:00.000 --> ...]  text";
# keep only the text after each closing bracket and join the segments.
matches = re.findall(r"] *(.*)", transcription)
concatenated_text = " ".join(matches)

prompt = f"""Please ignore [BLANK_AUDIO]. Given: "{concatenated_text}", answer in under 15 words."""
answer = run_ollama_command(model="qwen:0.5b", prompt=prompt)
```
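Because the subprocess call can fail or return an empty string, a small guard before the text-to-speech step keeps the pipeline from crashing. This is a minimal addition, not part of the original code:

```python
# Fall back to a canned reply if the model call failed or returned nothing
if not answer or not answer.strip():
    answer = "Sorry, I could not generate a response."
answer = answer.strip()
```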
4. Text-to-Speech: NVIDIA NeMo's FastPitch model generates a spectrogram from the answer text, and a HiFi-GAN vocoder converts the spectrogram into audio.

```python
from nemo.collections.tts.models import FastPitchModel, HifiGanModel
import torchaudio
from io import BytesIO

try:
    # FastPitch: text -> mel spectrogram
    fastpitch_model = FastPitchModel.from_pretrained("tts_en_fastpitch")
    fastpitch_model.eval()
    parsed_text = fastpitch_model.parse(answer)
    spectrogram = fastpitch_model.generate_spectrogram(tokens=parsed_text)

    # HiFi-GAN: mel spectrogram -> waveform
    hifigan_model = HifiGanModel.from_pretrained("tts_en_lj_hifigan_ft_mixerttsx")
    hifigan_model.eval()
    audio = hifigan_model.convert_spectrogram_to_audio(spec=spectrogram)

    # Write the waveform into an in-memory WAV buffer
    audio_buffer = BytesIO()
    torchaudio.save(audio_buffer, audio.cpu(), sample_rate=22050, format="wav")
    audio_buffer.seek(0)
except Exception as e:
    print(f"TTS error: {e}")
```
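To hear the result outside of a web UI, the in-memory WAV buffer can be decoded and played locally. This sketch assumes the soundfile and sounddevice packages are installed:

```python
import sounddevice as sd
import soundfile as sf

# audio_buffer holds a complete WAV stream (already rewound with seek(0) above)
audio_data, sample_rate = sf.read(audio_buffer)
sd.play(audio_data, sample_rate)
sd.wait()
```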
System Integration and Future Improvements
A Streamlit application integrates these components, providing a user-friendly interface. Further enhancements could include conversation history management, multilingual support, and source attribution for responses. Consider exploring Open WebUI for additional audio model integration capabilities. Remember to always critically evaluate AI-generated responses.
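The article's full Streamlit code is not reproduced here, but a minimal sketch of the integration might look like the following, where record_audio, transcribe, generate_answer, and synthesize are hypothetical wrappers around the four snippets above:

```python
import streamlit as st

st.title("Local Voice Assistant")

if st.button("Ask a question"):
    # Each helper below is a hypothetical wrapper around the earlier snippets
    audio_file = record_audio(duration=5)      # sounddevice recording
    transcription = transcribe(audio_file)     # whisper.cpp speech-to-text
    answer = generate_answer(transcription)    # Ollama (qwen:0.5b)
    audio_buffer = synthesize(answer)          # NeMo FastPitch + HiFi-GAN

    st.write(f"You asked: {transcription}")
    st.write(f"Answer: {answer}")
    st.audio(audio_buffer, format="audio/wav")
```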