Custom BioChatter Light and Next: Cancer Genetics Use Case
This example is part of the BioChatter manuscript supplement. We demonstrate the resulting applications as web apps at https://decider-light.biochatter.org and https://decider-next.biochatter.org. Find more information on how to build and use these apps below.
Background
Personalised medicine tailors treatment to a patient's unique genetic makeup. In cancer care, this approach helps categorize patients and assign them to specific treatment groups in clinical trials. However, interpreting and making decisions based on this data is challenging due to the complexity of genetic variations, the interaction between genes and environmental factors, tumor diversity, patient histories, and the vast amount of data produced by advanced technologies.
In the DECIDER consortium, we aim to improve clinical decisions by providing support systems, for instance for the geneticists working on these cases. The code for the use case lives at https://github.com/biocypher/decider-genetics.
Below, we show how we build a support application for this use case.
Sources of knowledge
We integrate knowledge from diverse resources, using BioCypher to build a knowledge graph of:
- Processed whole genome sequencing data of ovarian cancer patients (synthetic data)
    - genomic changes
        - classified by consequence (protein truncation, amino acid change)
        - algorithmic prediction of deleteriousness
        - variant identifiers
    - allele dosages
        - gene allele copy number (amplifications, deletions, loss of heterozygosity)
        - mutation pervasiveness (estimate of number of affected alleles, or suspected subclonality)
    - proportion of cancer cells in the sample (tumour purity)
- the patients' clinical history (synthetic data)
    - personal information (age at diagnosis, BMI, etc.)
    - treatment history, known side effects, clinical response
    - lab test results (blood, imaging, histopathology)
    - common treatment-relevant mutations (BRCA), HR deficiency, PARP-inhibitor maintenance
- data from open resources (real data)
    - variant annotations (as provided by the genetics pipeline of the DECIDER consortium)
    - gene annotations (as provided by the genetics pipeline of the DECIDER consortium)
    - pathway / process annotations (from public databases such as Gene Ontology)
    - drug annotations (from OncoKB)
In addition, we provide access to more resources via the RAG and API agents:
- relevant publications from PubMed (real data) embedded in a vector database
- relevant knowledge streamed live from OncoKB (see below) via API access through BioChatter's API agent
The geneticist's workflow
Personalised cancer therapy is guided by identifying somatic genomic driver events in specific genes, particularly when these involve well-known hotspot mutations. However, unique somatic events in the same genes or pathways can create a "grey zone" that requires manual geneticist analysis to determine their clinical significance.
To address this, a comprehensive BioCypher backend processes whole-genome sequencing data to catalog somatic changes, annotating their consequences and potential actionability. These data can then be linked to external resources for clinical interpretation. For example, certain mutations in the BRCA1 or ERBB2 genes can indicate sensitivity to specific treatments such as PARP inhibitors or trastuzumab.
To fully leverage actionable data, the integration of patient-specific information with literature on drug targets and mechanisms of action or resistance is essential. OncoKB is the primary resource for this information, accessible through drug annotations added to the knowledge graph (KG) and via the BioChatter API calling mechanism.
Additionally, semantic search tools facilitate access to relevant biomedical literature, enabling geneticists to quickly verify findings against established treatments or resistance mechanisms.
In summary, the main contributions of our use case to the productivity of this workflow are:
- making processed and analysed genomic data locally available in a centralised resource by building a custom KG
- allowing comparison to literature via semantic search inside a vector database with relevant publications
- providing live access to external resources via the API agent
Building the application
We will explain how to use the BioCypher ecosystem, specifically, BioCypher and BioChatter, to build a decision support application for a cancer geneticist.
The code base for this use case, including all details on how to set up the KG and the applications, is available at https://github.com/biocypher/decider-genetics.
You can find live demonstrations of the application at links provided in the README of the repository.
The build procedures can be reproduced by cloning the repository and running docker compose up -d (or the equivalent for the Next app) in the root directory. Note that the default configuration requires authentication with OpenAI services.
The process involves the following steps:
- Identifying data sources and creating a knowledge graph schema
- Building the KG with BioCypher from the identified sources
- Using BioChatter Light to develop and troubleshoot the KG application
- Customising BioChatter Next to yield an integrated conversational interface
- Deploying the applications
Identifying data sources and creating a knowledge graph schema
We examine the data sources described above and design a KG schema that can accommodate the data.
The configuration file, schema_config.yaml, can be seen in the config directory of the repository.
The schema should also be designed with LLM access in mind; performance in generating specific queries can be adjusted for in step three (troubleshooting using BioChatter Light).
We created bespoke adapters for the genetics data of the DECIDER cohort according to the output format of the genetics pipeline, and reused existing adapters for the open resources.
They can be found in the decider_genetics/adapters directory of the repository.
For this use case, we created synthetic data to stand in for the real data for privacy reasons; the synthetic data are available in the data directory.
This is the schema of our KG:
graph TD;
Patient[Patient] -->|PatientToSequenceVariantAssociation| SequenceVariant[SequenceVariant]
Patient[Patient] -->|PatientToCopyNumberAlterationAssociation| CopyNumberAlteration[CopyNumberAlteration]
SequenceVariant[SequenceVariant] -->|SequenceVariantToGeneAssociation| Gene[Gene]
CopyNumberAlteration[CopyNumberAlteration] -->|CopyNumberAlterationToGeneAssociation| Gene[Gene]
Gene[Gene] -->|GeneToBiologicalProcessAssociation| BiologicalProcess[BiologicalProcess]
Gene[Gene] -->|GeneDruggabilityAssociation| Drug[Drug]
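To make the schema easier to reason about, the following minimal sketch restates the edges of the diagram as plain Python data and lists the directed paths between two node types. The helper name find_paths is hypothetical and not part of BioCypher or the repository; the sketch assumes the schema is acyclic, as it is in the diagram.

```python
from collections import deque

# Edges copied from the schema diagram: (source, relationship, target).
SCHEMA_EDGES = [
    ("Patient", "PatientToSequenceVariantAssociation", "SequenceVariant"),
    ("Patient", "PatientToCopyNumberAlterationAssociation", "CopyNumberAlteration"),
    ("SequenceVariant", "SequenceVariantToGeneAssociation", "Gene"),
    ("CopyNumberAlteration", "CopyNumberAlterationToGeneAssociation", "Gene"),
    ("Gene", "GeneToBiologicalProcessAssociation", "BiologicalProcess"),
    ("Gene", "GeneDruggabilityAssociation", "Drug"),
]


def find_paths(start, goal, edges=SCHEMA_EDGES):
    """Return all directed paths from start to goal as lists of hops.

    Hypothetical helper; assumes the schema contains no cycles.
    """
    outgoing = {}
    for source, relationship, target in edges:
        outgoing.setdefault(source, []).append((relationship, target))

    paths = []
    queue = deque([(start, [])])
    while queue:
        node, hops = queue.popleft()
        if node == goal and hops:
            paths.append(hops)
            continue
        for relationship, target in outgoing.get(node, []):
            queue.append((target, hops + [(node, relationship, target)]))
    return paths


# Every route by which a question about a patient can reach a drug.
for path in find_paths("Patient", "Drug"):
    print(" -> ".join([path[0][0]] + [hop[2] for hop in path]))
```

Listing the paths this way shows that a question linking a patient to a drug must traverse either a sequence variant or a copy number alteration, and then a gene.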
Building the KG with BioCypher
In the dedicated adapters for the DECIDER genetics data, we pull the data from the synthetic data files and build the KG. We perform simplifying computations, as described above, to facilitate standard workflows (such as counting alleles, identifying pathogenic variants, and calculating tumour purity). We mold the data into the specified schema in a transparent and reproducible manner by configuring the adapters (see the decider_genetics/adapters directory).
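As an illustration of what such an adapter does, here is a minimal sketch that reads a small in-memory table and yields node and edge tuples. It is not the code from decider_genetics/adapters: the class name, column names, labels, and table rows are assumptions made for this example, and the tuple shapes (id, label, properties for nodes; id, source, target, label, properties for edges) follow what we understand to be the usual BioCypher adapter convention.

```python
import csv
import io

# Hypothetical stand-in for a synthetic variant table. Column names and rows
# are assumptions for illustration, not the real files in the data directory.
SYNTHETIC_VARIANTS = """patient_id,variant_id,gene,consequence
patient1,var1,BRCA1,protein_truncation
patient1,var2,ERBB2,amino_acid_change
patient2,var1,BRCA1,protein_truncation
"""


class VariantAdapter:
    """Toy adapter that turns table rows into node and edge tuples."""

    def __init__(self, handle):
        self.rows = list(csv.DictReader(handle))

    def get_nodes(self):
        """Yield (id, label, properties) once per distinct node."""
        seen = set()
        for row in self.rows:
            candidates = (
                (row["patient_id"], "patient", {}),
                (row["variant_id"], "sequence variant",
                 {"consequence": row["consequence"]}),
                (row["gene"], "gene", {}),
            )
            for node_id, label, properties in candidates:
                if node_id not in seen:
                    seen.add(node_id)
                    yield node_id, label, properties

    def get_edges(self):
        """Yield (id, source, target, label, properties) once per distinct edge."""
        seen = set()
        for row in self.rows:
            candidates = (
                (row["patient_id"], row["variant_id"],
                 "patient to sequence variant association"),
                (row["variant_id"], row["gene"],
                 "sequence variant to gene association"),
            )
            for edge in candidates:
                if edge not in seen:
                    seen.add(edge)
                    source, target, label = edge
                    yield None, source, target, label, {}


adapter = VariantAdapter(io.StringIO(SYNTHETIC_VARIANTS))
nodes = list(adapter.get_nodes())
edges = list(adapter.get_edges())
```

Deduplicating inside the adapter keeps shared entities, such as a gene hit in several patients, as a single node with several incoming edges.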
After creating the schema and adapters, we run the build script to populate the KG.
BioCypher is configured using the biocypher_config.yaml file in the config directory.
Using the Docker Compose workflow included in the BioCypher template repository, we build a containerised version of the KG.
We can inspect the KG in the Neo4j browser at http://localhost:7474 after running the build script.
Any changes, if needed, can be made to the configuration of schema and adapters.
Using BioChatter Light to develop and troubleshoot the KG application
Upon deploying the KG via Docker, we can use a custom BioChatter Light application to interact with the KG. Briefly, we remove all components except the KG interaction panel via environment variables in the docker-compose.yml file (see also the corresponding vignette). This allows us to start the KG and interact with it using an LLM in a reproducible manner with just one command. We can then test the LLM-KG interaction by asking questions and examining the generated queries and their results from the KG. Once we are satisfied with the KG schema and LLM performance, we can advance to the next step.
OpenAI API key needed
In the standard configuration, we use the OpenAI API to generate queries.
Provide your OPENAI_API_KEY in the shell environment, or modify the application to call a different LLM.
The BioChatter Light application, including the KG creation, can be built using docker compose up -d in the root directory of the repository.
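Because the default configuration depends on the OPENAI_API_KEY variable, it can be convenient to check the environment before starting the containers. The following is a minimal sketch of such a check; the function name missing_settings is hypothetical and the check is not part of the repository.

```python
import os


def missing_settings(required=("OPENAI_API_KEY",), env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical preflight helper; pass a dict as env to test without
    touching the real environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Set these variables first: " + ", ".join(missing))
    print("Environment looks complete.")
```

Running the script before the Docker Compose command fails early with a readable message instead of a failed LLM call inside the container.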
An online demonstration of this application can be found at https://decider-light.biochatter.org. You can use this demonstration to test the KG - LLM interaction, asking questions such as:
- How many patients do we have on record, and what are their names?
- What was patient1's response to previous treatment, and which treatment did they receive?
- Which patients have HR deficiency but have not received PARP inhibitors?
- How many patients had severe adverse reactions, and to which drugs?
- Does patient1 have a sequence variant in a gene that is druggable? Which drug, and what evidence level has the association?
- Does patient1 have a sequence variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
- Does patient1 have a copy number variant in a gene that is druggable with evidence level "1"? Which drug? Return unique values.
The query returned by the model can also be modified and rerun without an additional call to the LLM, allowing for easy troubleshooting and exploration of the KG. The schema information of the KG is displayed in the lower section of the page for reference.
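To give an idea of what a generated query for one of the questions above might look like, the following sketch builds a parameterised Cypher query from the node and relationship names in the schema. It is an assumption-laden illustration, not the output of the LLM or code from the repository: the function name, the property names id and evidence_level, and the exact labels as stored in Neo4j are hypothetical and should be checked against the schema information shown in the app.

```python
def druggable_variant_query(patient_id, evidence_level=None):
    """Build a Cypher query and its parameters for druggable sequence variants.

    Hypothetical helper: property names (id, evidence_level) are assumptions.
    """
    conditions = ["p.id = $patient_id"]
    params = {"patient_id": patient_id}
    if evidence_level is not None:
        # Only filter on evidence level when the question asks for it.
        conditions.append("r.evidence_level = $evidence_level")
        params["evidence_level"] = evidence_level

    lines = [
        "MATCH (p:Patient)-[:PatientToSequenceVariantAssociation]->(:SequenceVariant)",
        "      -[:SequenceVariantToGeneAssociation]->(g:Gene)",
        "      -[r:GeneDruggabilityAssociation]->(d:Drug)",
        "WHERE " + " AND ".join(conditions),
        "RETURN DISTINCT g.id AS gene, d.id AS drug",
    ]
    return "\n".join(lines), params


query, params = druggable_variant_query("patient1", evidence_level="1")
print(query)
print(params)
```

Passing values as parameters rather than pasting them into the query text keeps the query reusable when it is edited and rerun by hand.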
Customising BioChatter Next to yield an integrated conversational interface
We can further customise the Docker workflow to start the BioChatter Next application, including its REST API middleware biochatter-server.
In addition to deploying all software components, we can also customise its appearance and functionality.
Using the biochatter-next.yaml configuration file (in config, like all other configuration files), we can adjust the welcome message, how-to-use section, the system prompts for the LLM, which tools can be used by the LLM agent, the connection details of externally hosted KG or vectorstore, and other parameters.
We then start BioChatter Next using a dedicated Docker Compose file, which includes the biochatter-server middleware and the BioChatter Next application.
OpenAI API key needed
In the standard configuration, we use the OpenAI API to generate queries.
Provide your OPENAI_API_KEY in the .bioserver.env file, or modify the application to call a different LLM.
The BioChatter Next application, including the customisation of the LLM and the integration of the KG, can be built using docker compose -f docker-compose-next.yml up -d in the root directory of the repository.
An online demonstration of this application can be found at https://decider-next.biochatter.org.
Deploying the applications
The final step is to deploy one or both applications on a server. Using the Docker Compose workflow, we can deploy the applications in many different environments, from local servers to cloud-based solutions. The containerised environment provided by Docker makes deployments reproducible and straightforward to scale. The BioChatter Light app can be used for testing, but also to provide a simple one-way interface to the KG for users who do not need the full conversational interface. The BioChatter Next app can be configured to connect to KG and vectorstore deployments on different servers, allowing for a distributed architecture and dedicated maintenance of components. For smaller setups or local use, all components can also be deployed together from a single Docker Compose file.