ITT we are discussing LocalAI and its sister applications, LocalAGI and LocalRecall, as used on eom.dev. These services offer APIs for using self-hosted artificial intelligence models, and this thread will be used to document and discuss their usage on this platform.
Self-hosted AI models with LocalAI and the Newelle virtual assistant
Self-hosted artificial intelligence has been a long-standing goal of eom.dev; however, it has been a difficult journey getting the hardware and software configured correctly (see the Dell Tower Plus and PowerEdge Server threads for details). After numerous iterations, a usable platform has finally been established. Naturally, I am immediately thinking of ways to improve this setup through additional services and custom applications. In this live stream I will be demoing my self-hosted LocalAI deployment.
LocalAI
LocalAI is a platform for running artificial intelligence models locally on one’s own hardware. It functions as a drop-in replacement for the OpenAI API, is compatible with NVIDIA GPUs, and runs well in Kubernetes.
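Because the API is OpenAI-compatible, the standard openai Python client works against it unchanged. The sketch below is illustrative only; the hostname, port, and model name are placeholders rather than the actual eom.dev values.

```python
# Minimal sketch: pointing the standard OpenAI client at a LocalAI instance.
from openai import OpenAI

client = OpenAI(base_url="http://localai.local:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="llama3-8b-instruct",  # the base LLM mentioned later in this post
    messages=[{"role": "user", "content": "Hello from eom.dev!"}],
)
print(reply.choices[0].message.content)
```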
While I have actually had LocalAI running on my network for quite some time now, using it would cause my server fans to spin up and produce quite a lot of noise. This would have been fine if they only ran while I was actively using the API; however, the model engine remained loaded on the GPU after a single use, so the fans stayed on until the GPU process was killed manually. Given that I run this server from my apartment, I need the noise level to be tolerable without manual intervention. Fortunately, the following environment variables were described in the documentation, and they produce exactly the effect I needed:
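Judging by the behaviour, these were most likely LocalAI's idle watchdog settings, which unload a model's backend after a period of inactivity. The exact variable names below are an assumption based on the LocalAI watchdog documentation:

```
LOCALAI_WATCHDOG_IDLE=true
LOCALAI_WATCHDOG_IDLE_TIMEOUT=5m
```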
With these set, the fans spin up when a model is first loaded, and they shut down automatically after five minutes of inactivity. It may be worth increasing this to fifteen minutes to avoid reload delays when the API is called less frequently. These and other variables are defined in the Ansible role created for this deployment.
This Ansible role simply deploys the Helm charts published by go-skynet (which is an organization created by mudler).
LocalAGI
LocalAGI extends the capabilities of LocalAI, acting as a drop-in replacement for the OpenAI Responses API.
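Assuming a LocalAGI endpoint at a placeholder URL, the same openai client should be able to drive an agent through the Responses API; a hedged sketch:

```python
# Sketch only: the endpoint URL and the agent name "my-agent" are hypothetical.
from openai import OpenAI

client = OpenAI(base_url="http://localagi.local:8080/v1", api_key="not-needed")

response = client.responses.create(
    model="my-agent",  # presumably maps to an agent configured in LocalAGI
    input="Summarize the latest threads on the forum.",
)
print(response.output_text)
```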
I have not deployed it to my own infrastructure yet, but an existing pull request served as a starting point for creating an Ansible role to do so.
LocalRecall
LocalRecall is a further extension of LocalAI, adding memory and storage to the AI stack.
I have no starting point for this deployment, and I am not sure LocalAGI fully supports the connection (see this GitHub issue). This is what is preventing me from deploying the full stack at this time.
Newelle
Newelle is a free and open source AI assistant for the Linux desktop. I was able to configure it to use the LocalAI API, and it looks like it should work with LocalAGI and LocalRecall as well; it also offers some local mechanisms of its own for these features. I will have to use it regularly for a while before I can comment on its utility.
Self-Hosted AI Agents with LocalAI, LocalAGI, and LocalRecall
LocalAI is now integrated with Matrix on eom.dev! You can now chat with open source AI models by mentioning @LocalAI in Matrix rooms.
Simple Matrix Bot
This integration was originally powered by Baibot, an open source project created by etke.cc.
I had actually tried deploying a couple other bots for this purpose before finding Baibot. Unfortunately, these were difficult to configure in a Kubernetes environment, produced mysterious error messages that were difficult to diagnose, or were not able to support encryption.
I found Baibot easy to set up, well-documented, and featureful. I will be interested to explore other works from this organization.
We’re a group of individuals from Europe that have been active and recognized members of the Matrix community for ~7 years.
The etke.cc service was founded in 2021 by Nikita Chernyi, based on Slavi Pantaleev’s free-software work: the matrix-docker-ansible-deploy Ansible playbook - the most popular and sane way to deploy Matrix on your own infrastructure. [1]
That said, Baibot only interfaces with the LocalAI API. Much more powerful options are available through LocalAGI connectors. Once I realized this, Baibot was removed after only a short period of time. Its deployment definition was saved in the ansible-role-localai repository for reference.
AI Agents
Agents are effectively workflows that combine AI output with certain actions, such as performing searches, using APIs, or executing code. An agent may loop between consulting the LLM and taking actions several times before producing its final output.
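To make the loop concrete, here is a rough, framework-free sketch in Python; the endpoint, model name, and the SEARCH: convention are all assumptions for illustration, not how LocalAGI is implemented.

```python
# Toy agent loop: consult the LLM, optionally run an action, feed the result back.
from openai import OpenAI

client = OpenAI(base_url="http://localai.local:8080/v1", api_key="not-needed")

def web_search(query: str) -> str:
    # Placeholder action; a real agent might call DuckDuckGo or an internal API here.
    return f"(search results for {query!r})"

messages = [{"role": "user", "content": "What is LocalAGI?"}]
for _ in range(3):  # bound the loop so the agent cannot iterate forever
    reply = client.chat.completions.create(model="llama3-8b-instruct", messages=messages)
    text = reply.choices[0].message.content
    if text.startswith("SEARCH:"):  # crude convention for "the model wants an action"
        observation = web_search(text.removeprefix("SEARCH:").strip())
        messages += [
            {"role": "assistant", "content": text},
            {"role": "user", "content": f"Observation: {observation}"},
        ]
    else:
        print(text)  # final answer
        break
```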
LocalAGI
Agents are defined through LocalAGI.
Connectors
Connectors define the user interface for an agent. The Matrix integration with LocalAI on eom.dev is now powered by a LocalAGI connector. In some ways, I actually preferred Baibot: it replied in threads, formatted messages properly, and supported E2E encryption. The LocalAGI Matrix connector lacks these features; however, the agentic workflows are far more interesting, so I will have to hope for an update or write one myself. There are additional connectors for email, IRC, Twitter, and more. For my purposes, I would be interested in a Mastodon connector; unfortunately, one does not yet exist.
Prompts
There are several different prompts that can be configured for the agent. The system prompt seems to be the most impactful, as it is sent alongside each chat completion; however, identity guidance and long-term goals are available as well. In truth, I’m not certain what these do just yet.
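As a rough illustration (reusing the hypothetical endpoint from earlier), the system prompt simply becomes the first message of every completion request:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localai.local:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="llama3-8b-instruct",
    messages=[
        # The agent's system prompt is prepended to every exchange.
        {"role": "system", "content": "You are the eom.dev assistant. Be concise."},
        {"role": "user", "content": "How do I join the Matrix room?"},
    ],
)
```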
Retrieval Augmented Generation
Retrieval augmented generation (RAG) is a technique for providing additional contextual information to large language models as a way of improving their output. Search results from Wikipedia, Google, or a local knowledge base may be injected into the user’s prompt to provide the model with up-to-date and relevant information to inform its response.
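In Python terms, the core of RAG is just string assembly around a retrieval step; the retrieve() helper below is hypothetical and stands in for whatever knowledge base is used:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localai.local:8080/v1", api_key="not-needed")

def retrieve(query: str) -> list[str]:
    # Placeholder lookup; in this stack, LocalRecall plays this role.
    return ["eom.dev hosts Matrix, Discourse, and Gitea services."]

question = "What services does eom.dev host?"
context = "\n".join(retrieve(question))
prompt = (
    "Use the context below to answer the question.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
reply = client.chat.completions.create(
    model="llama3-8b-instruct",
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```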
LocalAGI utilizes some RAG techniques on its own through the use of actions and MCP; however, the more advanced RAG functionality using vector databases and word embeddings is implemented in LocalRecall.
Actions
An action is something the agent is able to do. They are basically scripts that the agent can execute under certain circumstances. These actions can be anything from searching DuckDuckGo to executing trades on the Kraken cryptocurrency exchange. Anything that we can define programmatically can be used as an action in the workflow.
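For a sense of scale, an action can be as small as a single function. The sketch below wraps DuckDuckGo's public Instant Answer API; it is an illustration, not the action LocalAGI actually ships.

```python
import json
import urllib.parse
import urllib.request

def duckduckgo_search(query: str) -> str:
    """Return the abstract text from DuckDuckGo's Instant Answer API."""
    url = "https://api.duckduckgo.com/?" + urllib.parse.urlencode(
        {"q": query, "format": "json", "no_html": 1}
    )
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data.get("AbstractText") or "(no instant answer)"

print(duckduckgo_search("Model Context Protocol"))
```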
MCP Servers
Model Context Protocol (MCP) servers are an intermediate layer between agents and web services, providing a standardized interface through which models can use those services’ APIs.
MCP (Model Context Protocol) is an open-source standard for connecting AI applications to external systems. [2]
I have not set up any MCP connections for my agent just yet, though I have been browsing some of the available ones for my existing services such as Discourse, Grafana, and Gitea.
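For reference, a minimal MCP server built with the official Python SDK looks roughly like this; the search_forum tool is hypothetical, and a real Discourse or Gitea server would wrap those APIs instead.

```python
# pip install "mcp[cli]" -- minimal MCP server sketch with one hypothetical tool.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("example-tools")

@mcp.tool()
def search_forum(query: str) -> str:
    """Search the (hypothetical) forum for threads matching a query."""
    # A real implementation would call the Discourse search API here.
    return f"No results for {query!r} (stub)"

if __name__ == "__main__":
    mcp.run()
```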
LocalRecall
LocalRecall handles memory for the AI agent through the use of vector databases and word embedding.
In short, sources from the knowledge base are transformed into vectors using the embeddings endpoint of LocalAI. These vectors are stored in a vector database that comes packaged with LocalRecall and is queried for additional context on chat completions. I am using the llama3.2-1b-instruct model (exposed under the name bert-embeddings) for embeddings, llama3-8b-instruct as my base LLM, and chromem as the vector database.
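The embedding step itself goes through LocalAI's OpenAI-compatible embeddings endpoint; a sketch, again with a placeholder hostname:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localai.local:8080/v1", api_key="not-needed")

result = client.embeddings.create(
    model="bert-embeddings",  # the embeddings model described above
    input=["How do I join the Matrix room on eom.dev?"],
)
vector = result.data[0].embedding  # list of floats used for similarity search
print(len(vector))
```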
Sources can be local files, web pages, git repositories, sitemaps, etc., and are stored in a collection that maps to the agent name from LocalAGI. When the agent is created and access to the knowledge base is enabled, the collection is created automatically through the LocalRecall API, and sources can be added subsequently.
Results
The end result of this effort is a functioning chatbot in the Matrix room that is aware of threads on Discourse and Gitea. It can answer questions about the platform, remembers its chat history, and can learn and improve over time. At the time of writing, this is still a prototype. It honestly doesn’t produce the greatest responses just yet; however, with some tweaking of the models, prompts, and other features, I am hoping this can become quite a useful feature of this platform.