While it's easy to build a local Assist pipeline and get away from Google Assistant or Alexa, it's not that easy to go fully local and still get decent performance. I'll show you how you can go completely local at every stage - and still have blazing fast performance.
Before you start configuring the local LLM, make sure to check out a more recent article on the topic.
With the introduction of Voice Assistants, Home Assistant has made it really easy to use voice commands without relying on the likes of Google Assistant or Alexa. You can set up your own pipeline, choosing a custom conversation agent, text-to-speech, speech-to-text, and you can even pick your own wake word. I'll show you how you can go completely local - and still have blazing fast performance. The caveat is: you need something a little beefier than a Raspberry Pi. I run Home Assistant on an RPi4 myself, but offload compute-heavy stuff like Frigate, Whisper, Piper and the local LLM to an external server. This external server doesn't need to be much, but you do need a CUDA-compatible GPU with a decent amount of VRAM. I have an RTX 3060 with 12 GB of VRAM to play with, and that's enough to follow this tutorial.
To start with, you need to follow my local Whisper and Piper tutorial from the other day. This will set you up with a Proxmox LXC that has CUDA support to take advantage of your GPU. You'll set up the Whisper and Piper services locally - and they don't require much in terms of compute (although Whisper without a GPU is a painful experience!).
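Before moving on, it's worth a quick sanity check that the LXC actually sees the GPU and that the two Wyoming services are listening. The port numbers below are the Wyoming defaults (10300 for Whisper, 10200 for Piper) - adjust them if your setup from that tutorial uses something else:

nvidia-smi                 # the GPU should show up with its VRAM
nc -zv localhost 10300     # Wyoming Whisper (speech-to-text)
nc -zv localhost 10200     # Wyoming Piper (text-to-speech)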
When you've gotten Whisper and Piper to work, you are ready to move on to the local LLM. I've found that LocalAI is a great way to expose a custom conversation agent to Home Assistant. Basically, you download the latest LocalAI container with CUDA support, download a model that understands Home Assistant and OpenAI functions, and configure it to run on your GPU.
Start by extending your docker-compose.yml file, adding the following service:
  local-ai:
    image: quay.io/go-skynet/local-ai:master-cublas-cuda12
    ports:
      - "8080:8080"
    environment:
      - DEBUG=true
      - MODELS_PATH=/models
      - THREADS=1
    volumes:
      - "./models:/models"
    runtime: nvidia
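The runtime: nvidia line only works if the NVIDIA Container Toolkit is installed and configured, which the Whisper and Piper tutorial already covers. If you want to double-check that Docker can reach the GPU before pulling the rather large LocalAI image, something like this should print the familiar nvidia-smi table (the CUDA image tag here is just an example):

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi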
Now, create a folder named "models" in the same directory where your docker-compose.yml file is located.
Inside the "models" directory, create a file named luna.yaml containing:
name: luna
parameters:
  model: luna-ai-llama2-uncensored.Q6_K.gguf
  top_k: 90
  temperature: 0.2
  top_p: 0.7
context_size: 4096
threads: 4
gpu_layers: 50
f16: false
mmap: true
backend: llama
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion
It seems this particular model has somewhere around 30-odd layers. I'm not sure if the optimal setting depends on the GPU, but with gpu_layers set to 50, the whole model fits into my VRAM. The number of threads depends on your CPU - I've allocated 4 vCPUs to my LXC.
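Once the container is up and the model loads for the first time (see the docker compose up step further down), you can keep an eye on VRAM usage and lower gpu_layers if the model doesn't fit:

watch -n 1 nvidia-smi      # watch the memory usage climb while the model loads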
Then create a lunademo-chat.tmpl file containing:
USER: {{.Input}}
ASSISTANT:
Finally, create a lunademo-completion.tmpl file containing:
Complete the following sentence: {{.Input}}
Honestly, I'm not so sure you need to define the templates in the luna.yaml file at all; if you don't, the last two .tmpl files aren't necessary.
Now, download the model itself into the "models" directory (you may need to extend your disk space!):
wget https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF/resolve/main/luna-ai-llama2-uncensored.Q6_K.gguf -O luna-ai-llama2-uncensored.Q6_K.gguf
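When the download finishes, the "models" directory (seen from where your docker-compose.yml lives) should contain the YAML config, the two templates and the model itself - the Q6_K quantization weighs in at several gigabytes:

ls -lh models/
# luna.yaml  lunademo-chat.tmpl  lunademo-completion.tmpl  luna-ai-llama2-uncensored.Q6_K.gguf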
Before launching, check your vCPU and memory settings. You may need to bump the vCPU count and assigned RAM.
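If you're running this in a Proxmox LXC like I am, both can be adjusted from the Proxmox host (the container ID 200 and the values below are just examples):

pct set 200 --cores 4 --memory 8192    # 4 vCPUs and 8 GB RAM for the LXC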
Now launch the LLM via the LocalAI container you just added to your docker-compose.yml:
docker compose up -d
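The CUDA-enabled LocalAI image is pretty big, so the first start takes a while. You can follow along in the logs and verify that LocalAI has picked up the model - the /v1/models endpoint is part of its OpenAI-compatible API:

docker compose logs -f local-ai        # follow the container while it starts
curl http://localhost:8080/v1/models   # "luna" should be listed here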
Try it:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "luna",
  "messages": [{"role": "user", "content": "How are you doing?"}],
  "temperature": 0.1
}'
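The answer comes back in the standard OpenAI chat completion format, with the text under choices[0].message.content. If you have jq installed, you can pull out just the reply:

curl -s http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "luna",
  "messages": [{"role": "user", "content": "How are you doing?"}],
  "temperature": 0.1
}' | jq -r '.choices[0].message.content'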
The first call takes a while, but subsequent calls are as fast as ChatGPT on my machine - if not faster. I can ask the model anything - even in Norwegian, and it answers in Norwegian. But it clearly hallucinates more than ChatGPT! The Luna model is capable of generating correctly formatted Home Assistant function calls, but honestly it struggles with choosing the correct domains and entity_ids. I find that gpt-3.5-turbo-1106 is a lot more reliable and smarter. Still, the Luna model may work great for you.
Now that your model is working on your server, we can make it available to Home Assistant as a conversation agent. Install Extended OpenAI Conversation via HACS, and add the newly installed integration under Settings > Devices & Services.
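When adding the integration, it asks for an API key and a base URL. Point the base URL at your LocalAI instance - typically including the /v1 suffix - and since LocalAI doesn't require an API key by default, any placeholder value should do. Roughly like this (the host name is of course your own):

Base URL: http://<your-llm-server>:8080/v1
API key:  sk-anything-goes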
You still need to configure your new conversation agent. Swap out the default "gpt-3.5-turbo-1106" model name for "luna" (corresponding to the "name" parameter in your luna.yaml file). You may also want to optimize the prompt and reduce the context size. I'm not sure about the maximum context of this Luna model, but I've set it to 4096. The worst that can happen is that you send too large a prompt and the model crashes on you.
Now, you can use your local LLM in your own local pipeline. Take a look at this quite popular video to see the stages in setting up your own local pipeline. The difference between the YouTube video and your newly set up local pipeline is that yours is not slow - it's blazing fast.
If you want to add a wake word, you'll need a configurable microphone, for example the M5Stack ATOM Echo. It's not super stable, but it's good for testing. Another alternative is a Wyoming satellite based on a Raspberry Pi.