
Building a Local Voice Assistant with LLMs and Neural Networks on Your CPU Laptop


Unlock the Power of Local Voice Assistants: A Step-by-Step Guide

The rise of multimodal Large Language Models (LLMs) has transformed how we interact with AI, making voice-based interaction practical. While OpenAI's voice-enabled ChatGPT offers a convenient solution, building a local voice assistant gives you stronger data privacy, freedom from API rate limits, and the ability to fine-tune models for your specific needs. This guide details the construction of such an assistant on a standard CPU-only machine.

Why Choose a Local Voice Assistant?

Three key advantages drive the appeal of local voice assistants:

  1. Data Privacy: Avoid transmitting sensitive information to external servers.
  2. Unrestricted API Calls: Bypass limitations imposed by proprietary APIs.
  3. Customizable Models: Fine-tune LLMs for optimal performance within your specific domain.

Building Your Local Voice Assistant

This project comprises four core components:

  1. Voice Recording: Capture audio input from your device's microphone. The sounddevice library handles the recording, and the clip is saved as a WAV file for the next step; a microphone troubleshooting tip appears after this list. The code snippet below demonstrates this:
<code class="language-python">import sounddevice as sd
import wave
import numpy as np

sampling_rate = 16000  # Matches Whisper.cpp model

recorded_audio = sd.rec(int(duration * sampling_rate), samplerate=sampling_rate, channels=1, dtype=np.int16)
sd.wait()

audio_file = "<path>/recorded_audio.wav"
with wave.open(audio_file, "w") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(sampling_rate)
    wf.writeframes(recorded_audio.tobytes())</path></code>
  2. Speech-to-Text Conversion: Transcribe the recorded audio into text. OpenAI's Whisper model (the ggml-base.en.bin checkpoint, run through whisper.cpp) handles the transcription.
<code class="language-python">import subprocess

WHISPER_BINARY_PATH = "/<path>/whisper.cpp/main"
MODEL_PATH = "/<path>/whisper.cpp/models/ggml-base.en.bin"

try:
    result = subprocess.run([WHISPER_BINARY_PATH, "-m", MODEL_PATH, "-f", audio_file, "-l", "en", "-otxt"], capture_output=True, text=True)
    transcription = result.stdout.strip()
except FileNotFoundError:
    print("Whisper.cpp binary not found. Check the path.")</path></path></code>
  3. Text-Based Response Generation: Employ a lightweight LLM (e.g., Ollama's qwen:0.5b) to generate a textual response to the transcribed input. A utility function, run_ollama_command, handles the LLM interaction.
<code class="language-python">import subprocess
import re

def run_ollama_command(model, prompt):
    try:
        result = subprocess.run(["ollama", "run", model], input=prompt, text=True, capture_output=True, check=True)
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Ollama error: {e.stderr}")
        return None

matches = re.findall(r"] *(.*)", transcription)
concatenated_text = " ".join(matches)
prompt = f"""Please ignore [BLANK_AUDIO]. Given: "{concatenated_text}", answer in under 15 words."""
answer = run_ollama_command(model="qwen:0.5b", prompt=prompt)</code>
  4. Text-to-Speech Conversion: Convert the generated text response back into audio using NVIDIA's NeMo toolkit (FastPitch and HiFi-GAN models); a short playback sketch follows this list.
<code class="language-python">import nemo_tts
import torchaudio
from io import BytesIO

try:
    fastpitch_model = nemo_tts.models.FastPitchModel.from_pretrained("tts_en_fastpitch")
    hifigan_model = nemo_tts.models.HifiGanModel.from_pretrained("tts_en_lj_hifigan_ft_mixerttsx")
    fastpitch_model.eval()
    parsed_text = fastpitch_model.parse(answer)
    spectrogram = fastpitch_model.generate_spectrogram(tokens=parsed_text)
    hifigan_model.eval()
    audio = hifigan_model.convert_spectrogram_to_audio(spec=spectrogram)
    audio_buffer = BytesIO()
    torchaudio.save(audio_buffer, audio.cpu(), sample_rate=22050, format="wav")
    audio_buffer.seek(0)
except Exception as e:
    print(f"TTS error: {e}")</code>

System Integration and Future Improvements

A Streamlit application integrates these components, providing a user-friendly interface. Further enhancements could include conversation history management, multilingual support, and source attribution for responses. Consider exploring Open WebUI for additional audio model integration capabilities. Remember to always critically evaluate AI-generated responses.
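
The snippet below is a rough sketch of what that integration could look like. The record_audio, transcribe, and synthesize_speech helpers (and the assistant module they are imported from) are hypothetical wrappers around the four snippets above, not functions from any library:
<code class="language-python">import streamlit as st

# Hypothetical module wrapping the four code snippets shown earlier
from assistant import record_audio, transcribe, synthesize_speech, run_ollama_command

st.title("Local Voice Assistant")

if st.button("Ask a question"):
    with st.spinner("Recording..."):
        audio_file = record_audio(duration=5)           # step 1: capture microphone input
    transcription = transcribe(audio_file)              # step 2: whisper.cpp speech-to-text (cleaned of timestamps)
    st.markdown(f"**You said:** {transcription}")

    answer = run_ollama_command(model="qwen:0.5b", prompt=transcription)  # step 3: LLM response
    if answer:
        st.markdown(f"**Assistant:** {answer}")
        audio_buffer = synthesize_speech(answer)         # step 4: NeMo text-to-speech
        st.audio(audio_buffer, format="audio/wav")       # built-in playback widget</code>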
