To connect to a vector database for semantic similarity search and retrieval-augmented generation (RAG), we provide an implementation that connects to a Milvus instance (local or remote). These functions are provided by the modules `vectorstore.py` (for performing embeddings) and `vectorstore_agent.py` (for maintaining the connection and search).
This is implemented in the ChatGSE Docker workflow and the BioChatter Docker compose found in this repository. To start Milvus on its own in these repositories, you can run `docker compose up -d standalone` (`standalone` being the Milvus endpoint, which starts two other services alongside it).
To connect to a vector DB host, we can use the corresponding class:
This establishes a connection with the vector database (via a host IP and port) and uses two collections: one for the embeddings and one for the metadata of the embedded text (e.g. the title and authors of the paper that was embedded).
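A minimal sketch of such a connection is shown below. The class name `VectorDatabaseAgentMilvus`, its constructor parameters, and the `connect` method are assumptions based on the module names above; consult the API reference for the exact signature.

```python
from biochatter.vectorstore_agent import VectorDatabaseAgentMilvus

# hypothetical setup: collection names and connection details are examples
dbHost = VectorDatabaseAgentMilvus(
    embedding_collection_name="DocumentEmbeddings",
    metadata_collection_name="DocumentMetadata",
)

# connect to a locally running Milvus instance (default Milvus port)
dbHost.connect(host="127.0.0.1", port="19530")
```

Once connected, the `dbHost` object is used for all subsequent embedding, search, and maintenance operations.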
To embed text from documents, we use the LangChain and BioChatter functionalities for processing and passing the text to the vector database.
```python
from biochatter.vectorstore import DocumentReader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# read and split document at `pdf_path`
reader = DocumentReader()
docs = reader.load_document(pdf_path)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    separators=[" ", ",", "\n"],
)
split_docs = text_splitter.split_documents(docs)

# embed and store embeddings in the connected vector DB
doc_id = dbHost.store_embeddings(split_docs)
```
The `dbHost` object takes care of calling an embedding model, storing the embedding in the database, and returning a document ID that can be used to refer to the stored document.
To perform a semantic similarity search, all that is left to do is pass a question or statement to the `dbHost`, which will be embedded and compared to the stored embeddings, returning the `k` most similar text fragments.
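A similarity search might look like the following sketch; the method name `similarity_search` and its parameters are assumptions, as is the example query.

```python
# embed the query and retrieve the k most similar stored fragments
results = dbHost.similarity_search(
    query="Which molecular pathways are discussed in the document?",
    k=3,
)

# each result is a text fragment with its associated metadata
for doc in results:
    print(doc.page_content)
```

The returned fragments can then be injected into an LLM prompt as context for retrieval-augmented generation.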
Using the collections we created at setup, we can delete entries in the vector database by their IDs. We can also return a list of all stored documents to determine which we want to delete.
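A sketch of this maintenance workflow is below; the method names `get_all_documents` and `remove_document` are assumptions and should be checked against the API reference.

```python
# list all stored documents (with metadata) to find candidates for deletion
stored_docs = dbHost.get_all_documents()
for doc in stored_docs:
    print(doc)

# remove a document (and its embeddings) by the ID returned at storage time
dbHost.remove_document(doc_id)
```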