Ollama Voice Bot: Run Local LLMs for AI Phone Agents (2026)

Updated: March 2026  ·  By the AIO Orchestration team  ·  8 min read

Why Build an Ollama Voice Bot? The Case for Local LLMs in 2026

[Figure: Voice AI pipeline — microphone → STT → LLM → TTS → speaker, processed in real time]
The era of cloud-only AI is giving way to a more hybrid, privacy-conscious approach. For voice applications like AI phone agents, the arguments for running models locally are more compelling than ever. An Ollama voice bot represents the pinnacle of this trend, offering a powerful trifecta of benefits: privacy, cost-efficiency, and ultra-low latency.
  • Network latency (local): < 10 ms
  • Network latency (cloud API): 150–400 ms
  • Data sovereignty: 100% (data never leaves your infrastructure)
By leveraging Ollama, we can easily deploy, manage, and switch between various open-source models, creating a flexible and powerful foundation for our offline AI voice solution.

Choosing Your Engine: Qwen 2.5 7B vs. Llama 3 8B for Voice

The choice of LLM is the most critical decision for your voice bot's performance. For real-time conversation, we need a model that is not only intelligent but, more importantly, *fast*. The goal is to minimize the "time to first token" (TTFT) and maximize the "tokens per second" (T/s) to avoid awkward silences. In 2026, two models stand out for this specific use case: Alibaba's Qwen 2.5 7B and Meta's Llama 3 8B. While Llama 3 is an exceptional all-rounder, Qwen 2.5 has been fine-tuned with a focus on speed and conversational flow, making it a prime candidate for a Qwen voice agent. Here's a breakdown based on running a 4-bit quantized version (`q4_K_M`) on an NVIDIA L4 GPU:
| Metric | Qwen 2.5 7B (q4_K_M) | Llama 3 8B (q4_K_M) | Recommendation for Voice |
|---|---|---|---|
| Time to First Token (TTFT) | ~85 ms | ~110 ms | Qwen's lower TTFT means the bot starts "speaking" sooner, feeling more responsive. |
| Tokens per Second (T/s) | ~120 T/s | ~95 T/s | Qwen generates the rest of the response faster, crucial for short, conversational replies. |
| VRAM usage (kept alive) | ~5.1 GB | ~5.8 GB | Both are manageable on modern GPUs (like an L4 or RTX 4060 Ti), but Qwen is slightly lighter. |
| Conversational quality | Excellent; excels at short, direct answers. Strong multilingual support. | Exceptional; more detailed and nuanced, but can be too verbose for voice. | Qwen's natural brevity is an advantage for phone calls; Llama 3 may need more aggressive prompt engineering to stay concise. |
Expert Recommendation: Start with Qwen 2.5 7B Instruct (q4_K_M) for your Ollama voice bot. Its superior speed and naturally concise response style are perfectly suited for the rapid back-and-forth of a phone conversation.

Technical Setup Guide: Installing Your Ollama Voice Bot on Ubuntu

Let's build our self-hosted LLM telephony engine. This guide assumes you're using Ubuntu 22.04 LTS and have a server with a compatible NVIDIA GPU.

Prerequisites

  • Ubuntu 22.04 LTS with sudo access
  • An NVIDIA GPU with at least 8 GB of VRAM (e.g., an L4 or RTX 4060 Ti)
  • An internet connection for the initial driver and model downloads

Step 1: Install NVIDIA Drivers and CUDA Toolkit

Ollama relies on the NVIDIA driver and CUDA toolkit to run inference on your GPU.
# First, update your system
sudo apt update && sudo apt upgrade -y

# Add the official NVIDIA CUDA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update

# Install the NVIDIA driver and the CUDA toolkit
sudo apt-get -y install cuda-drivers cuda-toolkit-12-4

# Verify the installation
nvidia-smi
Running `nvidia-smi` should display a table with your GPU details and the CUDA version. If you see this, you're ready for the next step.

Step 2: Install Ollama

Ollama's one-line installation script makes this process incredibly simple.
# Download and run the Ollama installation script
curl -fsSL https://ollama.com/install.sh | sh
This command downloads the `ollama` binary, creates a systemd service to run it in the background, and sets up the command-line tool.

Step 3: Pull and Run Your First Model

Now, let's download our chosen model, the quantized version of Qwen 2.5 7B.
# Pull the qwen2.5:7b-instruct-q4_K_M model
ollama pull qwen2.5:7b-instruct-q4_K_M

# Once downloaded, run the model to test it
ollama run qwen2.5:7b-instruct-q4_K_M

>>> Hello, what is your purpose?
You are now in an interactive chat with your own local LLM voice assistant's brain! Type a message and see it respond. To exit, type `/bye`.

Optimal Configuration for Real-Time Voice

The default Ollama configuration is good, but for a high-performance voice bot, we need to tune it. This is done by creating a custom "Modelfile" or by passing parameters via the API.

Eliminating Cold Starts with `keep_alive`

By default, Ollama unloads a model from VRAM after 5 minutes of inactivity to free up resources. For a phone agent that needs to be instantly available, this is unacceptable. We can force the model to stay loaded in VRAM indefinitely. To do this via the API, set the `keep_alive` parameter in your API request body to `-1`. This ensures the model is always hot and ready, eliminating the "cold start" latency of loading it into memory for the first call after a period of inactivity.
{
  "model": "qwen2.5:7b-instruct-q4_K_M",
  "prompt": "Hello!",
  "stream": false,
  "keep_alive": -1
}
Critical for Production: Using `keep_alive: -1` is non-negotiable for any serious self-hosted LLM telephony deployment. The VRAM stays permanently allocated to the model, but eliminating cold-start latency is worth the cost.

Controlling Responses with `num_predict` and `temperature`

For voice, we need short, predictable responses. In Ollama's API, the maximum response length is set with the `num_predict` option (the equivalent of `max_tokens` in OpenAI-style APIs), and a lower `temperature` makes answers more consistent from call to call.
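Both knobs live in the `options` object of the request body. A request sketch (the prompt text is illustrative; the model tag matches the one pulled earlier):

```json
{
  "model": "qwen2.5:7b-instruct-q4_K_M",
  "prompt": "How often should I backwash my filter?",
  "stream": false,
  "keep_alive": -1,
  "options": {
    "num_predict": 80,
    "temperature": 0.7
  }
}
```

`num_predict: 80` caps replies at roughly two spoken sentences; lowering `temperature` toward 0.3 makes answers more deterministic across calls.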

Interacting with Your Local LLM: The Ollama API and Python

Ollama exposes a simple REST API on port `11434`. You can interact with your voice bot's brain from any programming language. Here’s how to do it with Python, which is perfect for integrating with a telephony platform like Asterisk. First, ensure you have the `requests` library installed: `pip install requests`.
import requests
import json

# Ollama API endpoint
OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"

# System prompt for our phone agent
SYSTEM_PROMPT = """
You are a helpful, friendly, and concise AI phone agent for 'AquaSparkle Pools'.
Your goal is to answer customer questions about pool maintenance.
- Keep your answers under 40 words.
- You are bilingual in English and French. Respond in the language of the user's question.
- Do not ask follow-up questions. Provide a direct answer.
"""

def get_llm_response(user_query: str) -> str:
    """
    Gets a response from the local Ollama voice bot.
    """
    full_prompt = f"{SYSTEM_PROMPT}\n\nUser: {user_query}\nAI:"

    payload = {
        "model": "qwen2.5:7b-instruct-q4_K_M",
        "prompt": full_prompt,
        "stream": False,
        "keep_alive": -1, # Keep the model loaded in VRAM
        "options": {
            "num_predict": 80, # Max tokens
            "temperature": 0.7
        }
    }

    try:
        response = requests.post(OLLAMA_ENDPOINT, data=json.dumps(payload), timeout=15)
        response.raise_for_status()
        
        response_data = response.json()
        return response_data.get("response", "").strip()

    except requests.exceptions.RequestException as e:
        print(f"Error communicating with Ollama: {e}")
        return "I'm sorry, I'm having trouble connecting to my brain right now."

# Example usage
if __name__ == "__main__":
    question = "How often should I check my pool's pH level?"
    answer = get_llm_response(question)
    print(f"User Question: {question}")
    print(f"AI Agent Answer: {answer}")

    question_fr = "Bonjour, comment puis-je éviter les algues dans ma piscine?"
    answer_fr = get_llm_response(question_fr)
    print(f"\nUser Question: {question_fr}")
    print(f"AI Agent Answer: {answer_fr}")
This script demonstrates how to package your prompt, set parameters, and get a clean response from your Ollama voice bot. For more complex state management, you can explore our guide on AI conversation memory.

Prompt Engineering for Natural Phone Conversations

The quality of your Ollama voice bot depends heavily on your system prompt. The example above is a great start. Let's break down the key principles:

  1. Define the Persona: "You are a helpful, friendly, and concise AI phone agent for 'AquaSparkle Pools'." This sets the tone and context.
  2. State the Goal: "Your goal is to answer customer questions about pool maintenance." This focuses the LLM's task.
  3. Impose Constraints (Very Important for Voice):
     • `Keep your answers under 40 words.` This is the most critical rule for preventing long, awkward monologues.
     • `Do not ask follow-up questions.` In many IVR scenarios, you want the LLM to answer and then wait for the next user input, not drive the conversation.
  4. Define Capabilities: "You are bilingual in English and French. Respond in the language of the user's question." Modern models like Qwen are excellent at this.

This carefully crafted prompt, combined with the API parameters, transforms a general-purpose LLM into a specialized Qwen voice agent.
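Even with the 40-word rule in the prompt, the model will occasionally overrun. A small post-processing guard on the application side can enforce the limit before the text reaches TTS. This helper is a hypothetical sketch, not part of Ollama:

```python
def enforce_word_limit(text: str, max_words: int = 40) -> str:
    """Truncate an LLM reply to at most max_words, preferring to cut at the
    last complete sentence so the spoken audio does not end mid-thought.
    (Illustrative helper — not part of the Ollama API.)"""
    words = text.split()
    if len(words) <= max_words:
        return text
    truncated = " ".join(words[:max_words])
    # Prefer to end on the last full sentence inside the limit
    for stop in (". ", "! ", "? "):
        idx = truncated.rfind(stop)
        if idx != -1:
            return truncated[: idx + 1].strip()
    return truncated + "..."
```

Call it on the value returned by `get_llm_response()` before handing the text to your TTS engine.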

Performance Tuning: Quantization and Monitoring

To run efficiently, we need to optimize our model and keep an eye on resources.

The Power of Quantization (q4_K_M)

Quantization is the process of reducing the precision of the model's weights, which dramatically shrinks its size and VRAM footprint, and often increases speed. We're using `q4_K_M`, a popular 4-bit quantization method.
  • `q4` means it's a 4-bit quantization.
  • `_K` is a "K-quants" strategy, an improved method over older techniques.
  • `_M` means it's the "medium" size variant, offering a great balance of performance and quality.
For a 7B parameter model, the difference is stark:
  • Unquantized (FP16): ~14 GB VRAM
  • Quantized (q4_K_M): ~5 GB VRAM
This is why quantization is not just an option but a necessity for deploying a cost-effective offline AI voice system on consumer or prosumer-grade hardware.
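The VRAM figures above follow from simple arithmetic. A back-of-envelope estimate (the ~4.5 effective bits per weight for `q4_K_M` is an approximation, and KV cache plus runtime overhead typically add another 1–2 GB on top):

```python
def estimate_weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough VRAM needed for model weights alone: params * bits / 8 bytes.
    KV cache and runtime overhead come on top of this figure."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# FP16: 7B params at 16 bits/weight
print(round(estimate_weight_vram_gb(7, 16), 1))   # 14.0
# q4_K_M at ~4.5 bits/weight -> ~3.9 GB of weights;
# overhead brings the observed total to ~5 GB
print(round(estimate_weight_vram_gb(7, 4.5), 1))  # 3.9
```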

Monitoring VRAM Usage with `nvidia-smi`

While your bot is running, you need to monitor its resource consumption. The `nvidia-smi` command is your best friend.
# Watch GPU usage in real-time
watch -n 1 nvidia-smi
You will see a process named `ollama` appear in the process list, consuming VRAM. When you use the `keep_alive: -1` parameter, you should see this process consistently occupying ~5.1GB of VRAM for the Qwen 2.5 7B model. This confirms the model is loaded and ready for instant inference.

Bringing it to Life: Integration with Asterisk EAGI

The final step is to connect your Python script to a telephony platform. Asterisk, the open-source PBX, is a perfect choice. Using the **Enhanced AGI (EAGI)** protocol, you can create a powerful interactive voice response (IVR) system. The workflow looks like this:
  1. Incoming Call: A call arrives at your Asterisk server.
  2. Asterisk Dialplan: The dialplan executes an `AGI` script.
  3. Speech-to-Text (STT): The AGI script uses an STT engine (e.g., Vosk, Whisper) to capture the caller's audio and transcribe it to text.
  4. Call Ollama API: The transcribed text is passed to your Python script, which calls the local Ollama API (as shown in our example).
  5. Get LLM Response: The Ollama voice bot processes the text and returns a concise answer.
  6. Text-to-Speech (TTS): The Python script sends the LLM's text response to a TTS engine (e.g., Piper, Coqui AI) to generate audio.
  7. Stream Audio: The generated audio is streamed back to the caller via Asterisk.
This entire loop happens in a fraction of a second, enabled by the low latency of your local LLM voice assistant. The Python script acts as the central orchestrator, bridging the gap between telephony, STT/TTS, and the LLM. For a deep dive, see our upcoming article on full Asterisk and LLM integration.
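The orchestration described above can be sketched as a single function per caller turn. The three engine callables are injected stubs here — real integrations would wrap Vosk/Whisper, the `get_llm_response()` function shown earlier, and Piper:

```python
from typing import Callable

def handle_turn(
    audio_in: bytes,
    transcribe: Callable[[bytes], str],   # STT engine, e.g. a Vosk/Whisper wrapper
    ask_llm: Callable[[str], str],        # e.g. get_llm_response() from earlier
    synthesize: Callable[[str], bytes],   # TTS engine, e.g. a Piper wrapper
) -> bytes:
    """One caller turn: audio in -> text -> LLM reply -> audio out.
    The three stages are injected so each can be swapped or tested alone."""
    text = transcribe(audio_in)
    reply = ask_llm(text)
    return synthesize(reply)

if __name__ == "__main__":
    # Wiring with stub engines for illustration
    out = handle_turn(
        b"fake-audio",
        transcribe=lambda audio: "What is your opening time?",
        ask_llm=lambda q: "We open at 9 AM every day.",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(out.decode("utf-8"))  # We open at 9 AM every day.
```

Keeping the stages as plain callables makes it easy to test the LLM step against the local Ollama API before any telephony wiring exists.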

Frequently Asked Questions (FAQ)

What is an Ollama voice bot?

An Ollama voice bot is a conversational AI system, typically used for phone calls or voice assistants, that uses the Ollama platform to run a Large Language Model (LLM) on local hardware. This approach prioritizes privacy, cost control, and low latency by avoiding reliance on cloud-based AI services.

Can I run this on a Raspberry Pi or a computer without a GPU?

While you can run Ollama and smaller models on a CPU or a Raspberry Pi, it is not recommended for real-time voice applications. The inference speed would be too slow, leading to long, unacceptable delays in conversation. A dedicated NVIDIA GPU with at least 8GB of VRAM is strongly recommended for a responsive local LLM voice assistant.

How does this compare to cloud services like Google Dialogflow or Amazon Lex?

Cloud services offer a managed, easier-to-set-up platform but come with per-use costs, potential data privacy issues, and higher network latency. An Ollama-based solution requires a higher initial technical investment but provides superior privacy, predictable costs (fixed hardware/server expenses), and the lowest possible latency for more natural conversations.

What are the best open-source STT and TTS engines to pair with this?

For a fully offline AI voice system, you'll need local STT/TTS. Great open-source options include:

  • Speech-to-Text (STT): Vosk, Whisper (via whisper.cpp for CPU/GPU efficiency).
  • Text-to-Speech (TTS): Piper (very fast, low resource), Coqui TTS (high quality, more complex).
Piper is often favored for its speed, which is critical in a telephony feedback loop.

How do I handle multiple concurrent calls?

Handling concurrent calls requires careful resource management. A single powerful GPU (like an L40S or A100) can handle multiple models or concurrent requests to the same model. Ollama's architecture is evolving to better support this. For high-volume scenarios, you might run multiple Ollama instances on different GPUs and use a load balancer to distribute requests from your Asterisk servers.
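A minimal sketch of that distribution layer, assuming one Ollama instance per GPU at hypothetical addresses — a production setup would put a real load balancer such as HAProxy or nginx in front instead:

```python
import itertools

class OllamaPool:
    """Round-robin over several Ollama base URLs (one per GPU/instance).
    Illustrative only; HAProxy or nginx is the usual production choice."""

    def __init__(self, base_urls: list[str]):
        self._cycle = itertools.cycle(base_urls)

    def next_endpoint(self) -> str:
        # Each call hands out the next instance's generate endpoint
        return f"{next(self._cycle)}/api/generate"

pool = OllamaPool(["http://10.0.0.11:11434", "http://10.0.0.12:11434"])
print(pool.next_endpoint())  # http://10.0.0.11:11434/api/generate
print(pool.next_endpoint())  # http://10.0.0.12:11434/api/generate
```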

Why is Qwen 2.5 7B recommended over Llama 3 8B for voice?

While both are excellent, Qwen 2.5 7B has demonstrated a lower Time to First Token (TTFT) and higher tokens-per-second generation speed in many benchmarks. For voice, starting the response quickly and finishing it fast is more important than the slightly higher reasoning complexity Llama 3 might offer. Qwen's tendency towards more concise answers is also a natural fit for phone conversations.

Is it difficult to set up the Asterisk integration?

It requires familiarity with both Asterisk dialplans and scripting (like Python or Perl). The concept of using AGI is straightforward, but the implementation details—managing audio streams, calling external processes for STT/TTS/LLM, and handling errors—can be complex. However, the modularity of the approach allows you to build and test each component (STT, LLM, TTS) independently.

What are the real-world hardware costs?

As of 2026, a capable server with a GPU like an NVIDIA L4 or an RTX 4060 Ti can be purchased or rented for a reasonable price. A dedicated server with an L4 GPU might cost around $200-$400/month. A self-built machine with a consumer GPU could have an upfront cost of $1500-$2500. This is a fixed cost, which can be significantly cheaper than paying per-minute fees for thousands of call-minutes per month on a cloud platform.

How can I make my Qwen voice agent support more languages?

The Qwen model family has strong multilingual capabilities out of the box. The key is in your prompt. By instructing the model to "respond in the language of the user's question," you leverage its inherent ability to detect the input language and generate a response in the same one. Ensure your STT engine also supports the languages you wish to serve.

What does `keep_alive: -1` actually do?

The `keep_alive` parameter tells the Ollama server how long to keep a model loaded in the GPU's VRAM after it has finished processing a request. By default, it's 5 minutes. Setting it to `-1` tells Ollama to *never* unload the model. This dedicates the VRAM to that model but ensures that every single API call is a "hot" call, with zero time spent loading the model into memory, which is essential for a responsive self-hosted LLM telephony system.

Ready to Deploy Your AI Voice Agent?

On-premise solution, 335 ms latency, 100% GDPR-compliant. Deployment in 2-4 weeks.
