Open-source and Local LLMs
BioChatter currently supports two self-hosted/local LLM solutions out of the box.
Below, we provide installation and usage instructions for both.
Xorbits Inference (Xinference)
Xorbits Inference is an open-source toolkit for running open-source models, particularly large language models. To support BioChatter applications in local and protected contexts, we provide API access via BioChatter classes in a unified way. Briefly, this module allows connecting to any open-source model supported by Xinference via the widely used, easy-to-use OpenAI API. This enables local and remote access to essentially all relevant open-source models, including Xinference's built-in models, at very little setup cost.
Usage
Usage is essentially the same as when calling the official OpenAI API, but uses the XinferenceConversation class under the hood. Interaction with the class is possible in exactly the same way as with the standard class.
Connecting to the model from BioChatter
All that remains once Xinference has started your model is to tell BioChatter the API endpoint of your deployed model via the base_url parameter of the XinferenceConversation class. For instance:
from biochatter.llm_connect import XinferenceConversation

conversation = XinferenceConversation(
    base_url="http://localhost:9997",
    prompts={},
    correct=False,
)
response, token_usage, correction = conversation.query("Hello world!")
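Follow-up questions go through the same query method on the same conversation object. A minimal sketch continuing the example above (the second question is an arbitrary example):

# follow-up question in the same conversation
response, token_usage, correction = conversation.query(
    "Can you say that in one sentence?"
)
print(response)
print(token_usage)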
Deploying locally via Docker
We have created a Docker workflow that allows the deployment of built-in Xinference models, here. It will soon be available via Docker Hub. There is another workflow that allows mounting (potentially) any compatible model from HuggingFace, here. Note that, due to graphics driver limitations, this currently only works for Linux machines with a dedicated NVIDIA graphics card. If you have a different setup, please check below for deploying Xinference without the Docker workflow.
Deploying locally without Docker
Installation
To run Xinference locally on your computer or a workstation available on your network, follow the official instructions for your type of hardware. Briefly, this includes installing the xinference and ctransformers Python libraries into an environment of your choice, as well as a hardware-specific installation of the llama-cpp-python library.
Deploying your model
After installation, you can run the model either locally (using xinference) or in a distributed fashion.
After startup, you can visit the local server address in your browser (the default is http://localhost:9997) and select and start your desired model. There is a large selection of predefined models to choose from, as well as the possibility to add your own favourite models to the framework. Once they have started, your running models appear in the Running Models tab.
Alternatively, you can deploy (and query) your model via the Xinference Python client:
from xinference.client import Client

client = Client("http://localhost:9997")
# download model from HuggingFace and deploy
model_uid = client.launch_model(model_name="chatglm2")
model = client.get_model(model_uid)

chat_history = []
prompt = "What is the largest animal?"
model.chat(
    prompt,
    chat_history,
    generate_config={"max_tokens": 1024},
)
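Once a model is running, regardless of whether it was started via the web interface or the Python client, BioChatter connects to it in the same way as shown above. A minimal sketch, assuming the server from the snippet above is still running on port 9997:

from biochatter.llm_connect import XinferenceConversation

# connect BioChatter to the Xinference server hosting the model launched above
conversation = XinferenceConversation(
    base_url="http://localhost:9997",
    prompts={},
    correct=False,
)
response, token_usage, correction = conversation.query("What is the largest animal?")
print(response)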
Ollama
Ollama is arguably the biggest open-source project for local LLM hosting at the moment. Compared to Xinference, it lacks the complete freedom of running any HuggingFace model in a simple fashion, but it offers higher stability for the models it supports. The list of supported models is diligently updated by the Ollama community. BioChatter support is implemented via the LangChain ChatOllama and OllamaEmbeddings classes, which connect to the Ollama API.
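For orientation, these underlying LangChain classes can also be used directly, outside of BioChatter. A minimal sketch, assuming a local Ollama instance on the default port and a model name that is available in your installation (import paths may differ between LangChain versions):

from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings

# chat model served by the local Ollama instance (default port 11434)
llm = ChatOllama(base_url="http://localhost:11434", model="llama3")
print(llm.invoke("Hello world!").content)

# embedding model, e.g. for retrieval-augmented generation workflows
embedder = OllamaEmbeddings(base_url="http://localhost:11434", model="llama3")
vector = embedder.embed_query("Hello world!")
print(len(vector))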
Usage
Usage is essentially the same as when calling the official OpenAI API, but uses the OllamaConversation class under the hood. Interaction with the class is possible in exactly the same way as with the standard class.
Connecting to the model from BioChatter
Once Ollama has been set up (see below), you can directly use BioChatter to connect to the API endpoint and start any available model; it will be downloaded and launched on demand. You can then configure the OllamaConversation instance by setting the base_url and model_name parameters. For example:
from biochatter.llm_connect import OllamaConversation

conversation = OllamaConversation(
    base_url="http://localhost:11434",
    prompts={},
    model_name="llama3",
    correct=False,
)
response, token_usage, correction = conversation.query("Hello world!")
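Any other model from the Ollama model library can be used by changing the model_name parameter; as noted above, it is downloaded and launched on demand. A minimal sketch with a different example model:

from biochatter.llm_connect import OllamaConversation

conversation = OllamaConversation(
    base_url="http://localhost:11434",
    prompts={},
    model_name="mistral",  # example model name; any supported Ollama model works
    correct=False,
)
response, token_usage, correction = conversation.query("Hello world!")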
Deploying locally via Docker
Deploying Ollama with Docker is straightforward and well documented. You can follow the official Ollama Docker blog post, or check the Ollama Docker Hub page, which also helps with the installation of the nvidia-container-toolkit, required if you want to use GPUs from Docker containers.
Deploying locally without Docker
Installation
You can also download and run Ollama directly on your computer. To do so, visit the official website, which provides an installer for every major OS. More information on the setup and startup process can be found in the GitHub README.