Biological systems are inherently interconnected, and understanding their mechanisms often requires examining data from multiple angles—genomic, proteomic, chemical, clinical—yet these data reside in specialized databases with different identifiers, formats, and interfaces. We present BioBTree v2, a biomedical graph database that unifies more than fifty primary databases—across genes, proteins, chemical compounds, pathways, diseases, and clinical data, with complementary sources in key domains to maximize coverage—into a queryable graph with billions of connections. Users traverse this graph using an intuitive chain query syntax, replacing multi-step manual workflows with single-line queries. Through a Model Context Protocol (MCP) server, Large Language Models can directly query and reason over this graph, grounding their responses in structured database records. We evaluate this approach through four use cases comparing BioBTree-augmented LLM responses against three baseline LLMs (ChatGPT, Gemini, Claude) across drug safety, drug mechanism, clinical gene annotation, and transcription factor network analysis. Across all four use cases, BioBTree provided quantitative data, verified identifiers, and specialized database content, demonstrating that combining structured database access with LLM reasoning yields more comprehensive results for biomedical research. BioBTree v2 is open source (GPL-v3) and available at https://github.com/tamerh/biobtree. A publicly accessible instance with MCP and REST API endpoints is hosted at https://sugi.bio.
Biomedical research depends on hundreds of specialized databases covering genes, proteins, variants, drugs, diseases, and pathways. Yet the boundaries between these databases create substantial friction for researchers whose questions span multiple domains. For example, a single drug discovery query may require traversing MONDO for disease classification, UniProt for protein products, ChEMBL for compound activity, and ClinicalTrials.gov for trial status—each with different identifiers (BRCA1 vs. ENSG00000012048 vs. HGNC:1100), APIs, and data formats. Navigating these boundaries manually is time-consuming and may cause researchers to miss connections that a more integrated view of the data could reveal.
Existing integration efforts address parts of this problem. BioMart [1] provides a general-purpose query framework for biological databases. BioThings [2] aggregates multiple APIs for programmatic access across entity types. Knowledge graphs such as Hetionet [3] and PrimeKG [4] provide valuable network representations for computational analyses. Open Targets [5] offers rich disease-target associations with pre-computed evidence scores. BridgeDb [6] provides robust identifier mapping across biological databases. The NCATS Biomedical Data Translator [7] employs a sophisticated federated architecture for multi-source reasoning. gget [8] provides a convenient Python and command-line interface for querying individual genomic databases. These tools each serve important roles and are used by the research community; however, a gap remains for a unified query interface that spans these domains with complementary sources and native LLM integration.
Large Language Models have demonstrated impressive and rapidly advancing capabilities in biomedical question answering [9], with particular strength in reasoning, literature synthesis, and contextual interpretation. Nevertheless, biomedical databases update continuously, and the scale and fragmentation of this data across many specialized sources remain a challenge; LLMs also remain susceptible to hallucinations in biomedical contexts [10]. Equipping LLMs with direct structured access to authoritative databases offers a complementary path. The Model Context Protocol (MCP) provides a standardized framework for this integration [11], and community efforts such as BioContextAI are building registries of biomedical MCP servers [12]. By exposing database access through MCP, specialized knowledge systems become directly usable by any MCP-compatible LLM. We demonstrate this complementarity through four use cases in Section 3.
We present BioBTree v2, a biomedical graph database that addresses both the fragmentation and AI integration challenges. BioBTree builds a unified index from more than fifty primary databases, including complementary sources for major domains such as Ensembl and NCBI Gene for genomics or ChEMBL and PubChem for chemistry, connecting them through billions of cross-reference edges into a queryable graph. Users can traverse this graph using an intuitive chain query syntax:
glioblastoma >> mondo >> gencc >> ensembl >> uniprot >> chembl_target
This single query traverses from a disease name through the Mondo disease ontology, to gene-disease associations in GenCC, to gene annotations in Ensembl, to protein sequences in UniProt, and finally to drug targets in ChEMBL—a workflow that would otherwise require navigating five different databases with their distinct interfaces and identifier systems.
BioBTree v2 introduces several key innovations. The system fetches data directly from each database’s authoritative source and processes it through an efficient pipeline, enabling the entire graph to be rebuilt as sources release updates. The chain query syntax enables intuitive cross-database traversal with filtering at each step. An MCP server exposes three tools—biobtree_search, biobtree_map, and biobtree_entry—enabling any MCP-compatible LLM to access the biomedical graph; rather than implementing complex retrieval-augmented generation pipelines, the system provides the graph schema to the LLM and relies on its reasoning capabilities to plan query traversals. A direct REST API is also available for programmatic access independent of LLMs.
This paper describes the design and implementation of BioBTree v2: the unified integration of diverse biomedical databases, the chain query syntax and MCP server architecture, and an empirical evaluation through four use cases comparing BioBTree-augmented LLM responses against baseline LLMs across drug safety, drug mechanism, clinical gene annotation, and transcription factor network analysis. Figure 1 illustrates the system architecture and Table 1 lists the integrated databases.
| Domain | Databases |
| Genomics & Genes | Ensembl [13], HGNC [14], NCBI Gene [15], RefSeq [16], dbSNP [17], ENCODE cCREs [18] |
| Proteins & Structure | UniProt [19], AlphaFold [20], AlphaMissense [21], PDB [22], InterPro [23], ESM2 similarity [24], DIAMOND similarity [25], JASPAR [26] |
| Chemistry & Drugs | ChEMBL [27], PubChem [28], ChEBI [29], HMDB [30], BindingDB [31], BRENDA [32], Rhea [33], SwissLipids [34], LIPID MAPS [35], SureChEMBL [36] |
| Pathways & Networks | Reactome [37], STRING [38], IntAct [39], BioGRID [40], SIGNOR [41], CollecTRI [42], CellPhoneDB [43], CORUM [44] |
| Disease & Phenotype | ClinVar [45], MONDO [46], HPO [47], Orphanet [48], GWAS Catalog [49], GenCC [50], Clinical Trials [51], PharmGKB [52] |
| Ontologies | GO [53], EFO [54], UBERON [55], Cell Ontology [56], MeSH [57], ECO [58], BAO [59] |
| Expression | Bgee [60], CELLxGENE [61], Expression Atlas [62], FANTOM5 [63], RNAcentral [64], miRDB [65] |
| Taxonomy & Other |
BioBTree v2 builds upon the identifier mapping framework introduced in our previous work [69], extending it with a chain query syntax, expanded dataset coverage, and native integration with Large Language Models. The system fetches data directly from primary biomedical databases, builds bidirectional cross-references between entries, and produces a unified graph index that enables multi-hop traversal queries across data sources. This section describes the data model, processing pipeline, query interface, and AI integration components.
The core data model represents biological entities as nodes in a cross-reference graph. Each entry consists of an identifier serving as the primary key, a dataset identifier indicating the source database, a set of typed attributes specific to that dataset, and cross-references linking to entries in other datasets with optional evidence codes and relationship types. Critically, the index stores both forward references (from source entries to their cross-references) and reverse references (from target entries back to all sources that reference them), enabling bidirectional traversal. For example, when UniProt is processed and declares a cross-reference to an Ensembl gene, the system records both the forward link (UniProt → Ensembl) and the reverse link (Ensembl → UniProt). This bidirectional indexing means that a query starting from a gene symbol can reach proteins, and a query starting from a protein can reach genes—without requiring both databases to declare the relationship.
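The bidirectional indexing described above can be sketched in a few lines. This is an illustrative in-memory model only; the actual system persists forward and reverse links in sorted, compressed index files, and the class and method names here are hypothetical:

```python
from collections import defaultdict

# Illustrative in-memory model of the bidirectional cross-reference index;
# the real system persists these links in sorted, compressed index files.
class XrefIndex:
    def __init__(self):
        # node = (dataset, identifier); each node maps to its linked nodes
        self.refs = defaultdict(set)

    def add_xref(self, src, dst):
        """Record the declared forward link and the derived reverse link."""
        self.refs[src].add(dst)  # forward: e.g. UniProt -> Ensembl
        self.refs[dst].add(src)  # reverse: e.g. Ensembl -> UniProt

    def neighbors(self, node, dataset=None):
        """Entries one hop away, optionally restricted to a target dataset."""
        hits = self.refs.get(node, set())
        return {n for n in hits if dataset is None or n[0] == dataset}

idx = XrefIndex()
# Only UniProt declares the link, yet both directions become queryable.
idx.add_xref(("uniprot", "P38398"), ("ensembl", "ENSG00000012048"))
```

A query starting from the gene now reaches the protein even though only UniProt declared the relationship, which is what allows chains to be traversed in either direction.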
The processing pipeline consists of three phases—update, sort, and merge—designed to handle datasets of arbitrary size with bounded memory usage and support for incremental rebuilds.
Update phase. Data is acquired from primary databases through a mix of approaches: many datasets are streamed directly from their authoritative sources via FTP or HTTP, while others require pre-computation before integration. For example, ESM2 protein structural similarity scores are computed from Meta AI’s protein language model embeddings, and DIAMOND sequence similarity results are generated via all-versus-all BLASTP searches on UniProt entries; both produce pre-computed similarity results that are then ingested as datasets. Source data in various formats is parsed incrementally and distributed to bucket files based on identifiers using bucketing strategies tailored to different identifier systems, enabling efficient storage, parallel processing, and state management during updates. During parsing, identifiers and terms undergo dataset-specific normalization, such as expanding medical term abbreviations and resolving naming variations, to maximize cross-reference matching across sources. Each dataset writes both forward cross-references to its own buckets and reverse cross-references to the buckets of target datasets, so that each dataset’s processing enriches the graph for other datasets. Additionally, during processing of later datasets, the pipeline can query already-indexed entries from earlier datasets to resolve identifiers and enrich cross-references that are not explicitly declared in the source data. A three-state tracking system (processing, processed, merged) records each dataset’s status, enabling automatic crash recovery and incremental updates—only datasets whose sources have changed need reprocessing.
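The distribution of parsed entries to bucket files can be illustrated with a minimal hash-based scheme. The function, bucket count, and hashing choice below are assumptions for illustration; the paper states only that bucketing strategies are tailored per identifier system:

```python
import hashlib

# Hypothetical hash-based bucketing; the real pipeline uses per-dataset
# strategies tailored to each identifier system.
BUCKET_COUNT = 64

def bucket_for(identifier: str) -> int:
    """Deterministically assign an identifier to a bucket file, so the same
    key always lands in the same bucket and buckets can be sorted and
    processed in parallel without coordination."""
    digest = hashlib.md5(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % BUCKET_COUNT
```

Determinism is the key property: forward references written by one dataset and reverse references written later by another land in the same bucket file for a given key, so the sort and merge phases see them together.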
Sort phase. Bucket files are sorted independently in parallel across CPU cores, then concatenated per dataset into compressed index files. Cross-references are sorted by per-dataset configurable criteria including species priority (human entries first, then model organisms) and relevance scores (expression levels, interaction confidence), ensuring that query results are pre-ranked by biological relevance without requiring runtime sorting.
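The per-dataset ranking criteria can be sketched as a composite sort key. The species table, field names, and scores below are illustrative assumptions, not BioBTree's actual configuration:

```python
# Illustrative composite sort key for cross-references; the species list,
# field names, and priorities are assumptions, not BioBTree's configuration.
SPECIES_PRIORITY = {"human": 0, "mouse": 1, "rat": 2}

def sort_key(xref):
    # Human entries first, then model organisms; within a species, higher
    # relevance (e.g. expression level or interaction confidence) first.
    return (SPECIES_PRIORITY.get(xref["species"], 99), -xref["score"])

xrefs = [
    {"id": "A", "species": "mouse", "score": 0.9},
    {"id": "B", "species": "human", "score": 0.4},
    {"id": "C", "species": "human", "score": 0.8},
]
ranked = sorted(xrefs, key=sort_key)  # pre-ranked once, at build time
```

Because this ordering is baked into the index files, query results come back pre-ranked by biological relevance with no sorting work at query time.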
Merge phase. All sorted index files across all datasets are combined using a k-way merge algorithm implemented with a min-heap data structure. A worker pool with configurable workers reads keys on demand from potentially thousands of files while sharing a buffer pool, reducing peak memory to a few gigabytes for the complete dataset. Entries are serialized using Protocol Buffers and written in batches to LMDB (Lightning Memory-Mapped Database), a high-performance key-value store based on B+ tree indexing with memory-mapped I/O. Keys are inserted in lexicographic order using append mode for optimal write performance. LMDB provides fast point lookups at query time, where each identifier key maps to entries across multiple datasets—for example, a gene symbol may resolve to entries in Ensembl and HGNC simultaneously. Checkpoint files track progress through the merge, enabling recovery from interruptions without reprocessing completed portions. The choice of LMDB over a general-purpose graph database reflects the read-heavy access pattern of biomedical reference data: the index is rebuilt periodically but queried continuously. By building a custom graph index on a lightweight key-value store, BioBTree avoids the performance overhead and operational complexity of a full database engine while retaining full flexibility over algorithmic choices, data layout, and domain-specific optimizations. The system also supports local federation, where exceptionally large datasets such as dbSNP can be partitioned into separate indexes that are queried transparently alongside the main index; this allows independent updates of individual datasets without rebuilding the entire graph.
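The k-way merge itself is a standard min-heap algorithm; Python's `heapq.merge` gives a compact sketch. The runs below stand in for sorted per-dataset index files, and the keys are made up for illustration:

```python
import heapq
from itertools import groupby

# Minimal sketch of the merge phase: each run stands in for one sorted
# per-dataset index file, keyed by identifier (keys here are made up).
run_a = [("BRCA1", "ensembl"), ("TP53", "ensembl")]
run_b = [("BRCA1", "hgnc"), ("SCN9A", "hgnc")]
run_c = [("ALK", "uniprot"), ("TP53", "uniprot")]

# heapq.merge keeps a min-heap with one cursor per run, so memory stays
# bounded no matter how many files participate in the merge.
merged = list(heapq.merge(run_a, run_b, run_c))

# Adjacent equal keys are grouped, so one identifier key can map to entries
# across multiple datasets (e.g. a gene symbol in both Ensembl and HGNC).
grouped = {k: [d for _, d in g] for k, g in groupby(merged, key=lambda kv: kv[0])}
```

In the production pipeline the merged, already-sorted stream is what makes LMDB's append-mode batch writes possible, since keys arrive in lexicographic order.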
BioBTree v2 introduces a chain query syntax that expresses multi-hop traversals as a sequence of dataset names connected by the >> operator. The first operator performs a lookup in the specified dataset, while subsequent operators follow cross-references to target datasets:
BRCA1 >> ensembl >> uniprot >> chembl_target
This query looks up BRCA1 in Ensembl, maps the resulting gene entries to their corresponding UniProt proteins, and then follows cross-references to ChEMBL drug targets. Each step can include filter expressions using Common Expression Language (CEL) syntax:
BRCA1 >> hgnc >> clinvar[germline_classification=="Pathogenic"]
TP53 >> ensembl >> uniprot >> pdb[resolution<2.0]
The first query retrieves pathogenic ClinVar variants for BRCA1; the second finds high-resolution PDB structures for TP53. Filters can combine multiple conditions and apply at any step in the chain.
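A minimal sketch of how such a chain query string could be tokenized on the client side follows. The parser is hypothetical, splits only steps and bracketed filter expressions, and does not model CEL evaluation, which happens server-side:

```python
import re

# Hypothetical client-side tokenizer for the chain query syntax: dataset
# names joined by ">>", each optionally followed by a [filter] expression.
# CEL evaluation itself happens server-side and is not modeled here.
STEP = re.compile(r"^(?P<name>\w+)(?:\[(?P<filter>.+)\])?$")

def parse_chain(query: str):
    """Split a chain query into the initial search term and (dataset, filter) steps."""
    parts = [p.strip() for p in query.split(">>")]
    term, steps = parts[0], []
    for part in parts[1:]:
        m = STEP.match(part)
        if m is None:
            raise ValueError(f"malformed step: {part!r}")
        steps.append((m["name"], m["filter"]))
    return term, steps

term, steps = parse_chain("TP53 >> ensembl >> uniprot >> pdb[resolution<2.0]")
# term == "TP53"; the last step carries the filter "resolution<2.0"
```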
The query engine uses response formats tailored to each operation. The biobtree_search and biobtree_map tools return compact, token-efficient responses that deliver maximum informational content with minimal payload, while biobtree_entry returns complete entry details including all attributes and nested cross-reference information. This design is important for AI agent integration, where context window limits make compact responses essential for effective multi-turn reasoning.
To enable integration with Large Language Models, BioBTree v2 includes a Model Context Protocol (MCP) server that exposes the query interface through a standardized tool-calling interface. The MCP server provides three tools: biobtree_search for identifier lookup across all datasets, biobtree_map for chain query execution, and biobtree_entry for retrieving complete entry details. Rather than implementing a complex retrieval-augmented generation (RAG) pipeline, the MCP server embeds the graph schema—the complete list of dataset connections and query syntax—directly into the tool descriptions. The LLM receives this structural context and uses its own reasoning capabilities to plan multi-hop traversals for a given question, effectively acting as the query planner over the graph. This design is deliberately simple: it avoids the overhead of embedding-based retrieval or custom retrieval logic, and naturally improves as LLM reasoning capabilities advance. The prompts can be tuned for specific use cases if needed, but in practice the default schema context is sufficient for diverse biomedical queries.
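The schema-in-tool-description approach can be sketched as follows. The three tool names come from the paper, but the description wording, the schema string, and the field layout are assumptions, not BioBTree's actual MCP definitions:

```python
# Illustrative sketch of embedding the graph schema directly in MCP tool
# descriptions. Tool name is from the paper; description wording, the
# schema string, and the field layout are hypothetical.
DATASET_GRAPH = (
    "ensembl <-> uniprot, uniprot <-> chembl_target, "
    "hgnc <-> clinvar, mondo <-> gencc"
)

TOOLS = [
    {
        "name": "biobtree_map",
        "description": (
            "Execute a chain query such as 'BRCA1 >> ensembl >> uniprot'. "
            "Known dataset connections: " + DATASET_GRAPH
        ),
        "inputSchema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
]

# The LLM reads the connections from the description and plans the traversal
# itself; no embedding index or retriever sits between model and database.
```

Because the schema travels inside the tool description, any MCP-compatible client gets the planning context for free, with no server-side retrieval machinery to maintain.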
To evaluate BioBTree’s MCP server in realistic biomedical research scenarios, we designed four use cases spanning drug safety, drug mechanism analysis, clinical gene annotation, and transcription factor networks. These represent a selected subset of the question types the system can address, chosen to demonstrate cross-database traversal across diverse biomedical domains; the full range of integrated databases (Table 1) and direct API access enables many additional query types beyond those presented here. For each question, the user submits a natural language query through the Claude CLI (Claude Opus 4.5); the LLM autonomously decides which BioBTree MCP tools to invoke, retrieves structured data, and synthesizes the response. We compare these results against three baseline LLMs used without BioBTree: ChatGPT (GPT-5.2), Gemini (Pro 3), and Claude (Opus 4.6 Extended Thinking). All queries were performed on 3–4 March 2026. Baseline LLMs were used with their default settings, which may include built-in web search and tool-use capabilities. The comparison assesses whether BioBTree’s structured database access provides information that LLMs cannot reliably provide in their current state. Full responses are provided in the Supplementary Materials (S1–S4).
Motivation. Before entering clinical trials, drug developers must assess tissue expression profiles of their target to anticipate off-target effects. SCN9A encodes the Nav1.7 voltage-gated sodium channel, a well-characterized pain target with multiple clinical programs. This use case tests whether database-backed tissue expression data can surface safety-relevant findings that literature-trained models miss.
I’m developing a drug that inhibits SCN9A (Nav1.7 sodium channel) for chronic pain treatment. Before clinical trials, I need to understand the expression profile of SCN9A to anticipate potential off-target effects: (1) Which tissues express SCN9A most highly? (2) Are there any non-pain-related tissues with significant expression that could indicate safety concerns? (3) What phenotypes are associated with SCN9A dysfunction in humans?
All three baseline LLMs performed well on qualitative information, correctly identifying dorsal root ganglia and trigeminal ganglia as primary expression sites, and citing anosmia, erythromelalgia, and congenital insensitivity to pain as associated phenotypes. For non-pain tissues with significant expression, all three cited testis, placenta, and colon—tissues frequently discussed in published reviews of Nav1.7.
BioBTree retrieved quantitative expression data from Bgee, providing ranked expression scores for 187 tissues (Table 2). This revealed a finding missed by all three baseline LLMs: endometrial stromal cells rank as the third highest-expressing tissue (expression score 83.65), above trigeminal ganglion (80.31), male germ cells (78.17), and colonic epithelium (77.13)—all tissues that baseline LLMs did mention. This represents a potential reproductive safety concern for women that would not be identified through literature synthesis alone, as endometrial Nav1.7 expression is less frequently discussed in published reviews than the neural and gastrointestinal expression patterns that dominate the literature.
Beyond the expression data, BioBTree also returned verifiable disease identifiers from Orphanet (ORPHA:88642 for congenital insensitivity to pain with anosmia, ORPHA:90026 for primary erythromelalgia) and HPO phenotype terms (HP:0007021 for pain insensitivity, HP:0000458 for anosmia), providing traceable links to clinical databases that baseline LLMs did not include.
We also compared BioBTree against the Open Targets Platform [5], accessed via its own MCP server, on the same question. Open Targets reported bone marrow (expression value 1,574) and neutrophilic metamyelocytes (1,546) as the highest-expressing tissues, with tibial nerve ranked eighth (795). Endometrial stromal cells did not appear among the top-ranked tissues. The two platforms thus produce substantially different expression rankings for the same gene, reflecting differences in their underlying expression data sources and normalization methods. For SCN9A—a voltage-gated sodium channel canonically expressed in peripheral sensory neurons—BioBTree’s Bgee-derived rankings placing dorsal root ganglia and sural nerve at the top are concordant with the established literature. Open Targets provided complementary data unavailable in BioBTree, including mouse knockout phenotypes (anosmia, neonatal lethality), genetic constraint scores (observed/expected ratio 0.72), and quantitative disease association scores. For comprehensive drug safety assessment, integrating multiple expression sources and database tools may surface findings that individual approaches miss.
| Tissue | Score | Rank | ChatGPT | Gemini | Claude |
| Sural nerve | 90.90 | 1 | ✓ | ✓ | ✓ |
| Dorsal root ganglion | 88.05 | 2 | ✓ | ✓ | ✓ |
| Endometrial stromal cells | 83.65 | 3 | — | — | — |
| Trigeminal ganglion | 80.31 | 4 | ✓ | ✓ | ✓ |
| Male germ cells | 78.17 | 5 | ✓ | ✓ | ✓ |
| Colonic epithelium | 77.13 | 6 | ✓ | ✓ | ✓ |
| Tibial nerve | 76.87 | 7 | — | — | ✓ |
| ... | ... | ... | ... | ... | ... |
| Islets of Langerhans | 72.88 | 10 | — | — | — |
Motivation. Understanding a drug’s complete target profile—including off-target interactions, quantitative potency data, and clinical trial landscape—is essential for anticipating adverse effects and identifying repurposing opportunities. This use case tests whether structured database access provides a more comprehensive mechanism analysis than literature-trained models for a clinically approved kinase inhibitor.
Alectinib is a breakthrough targeted therapy for ALK-positive lung cancer. For a comprehensive drug mechanism analysis: (1) How many UniProt protein targets does Alectinib have? List all with names and classify by type. (2) How many clinical trials involve Alectinib? Break down by phase. (3) What diseases is it approved or being investigated for? (4) How many ChEMBL and PubChem bioactivity records exist? Include IC50/Kd values. (5) What other approved or investigational molecules share targets with Alectinib?
BioBTree returned 33 UniProt protein targets from ChEMBL, including the complete off-target profile: primary kinase targets (ALK, RET, ROS1), secondary kinases (GAK, CLK1/2/3, SRPK1/2), GPCRs (5-HT2B, dopamine receptors), transporters (ABCG2, bile salt export pump), and nuclear receptors (estrogen, progesterone, androgen receptors). Baseline LLMs reported only 2–7 targets, focusing on primary kinase targets from literature (Table 3). The comprehensive off-target profile matters for anticipating drug-drug interactions and adverse effects. Quantitative discrepancies were also notable: BioBTree returned 66 clinical trials (with NCT identifiers) and 96 ChEMBL activity records, while baseline LLMs reported widely diverging counts (50+ to 150+ trials, 300–675 activities) that could not be verified against the source databases. All activity records returned by BioBTree included quantitative IC50 values (ALK: 0.59–1.9 nM, RET: 2–4.8 nM) with traceable identifiers, enabling independent verification.
This use case also revealed identifier errors across all three baseline LLMs. Gemini reported ChEMBL ID CHEMBL2178382 and PubChem CID 4258113 for Alectinib; the correct identifiers are CHEMBL1738797 and PubChem CID 49806720. These incorrect identifiers point to entirely different compounds in their respective databases. ChatGPT cited P29353 as the UniProt accession for RET kinase; P29353 is actually the SHC-transforming protein 1—a completely different protein. The correct RET accession is P07949. Such identifier inaccuracies, while likely to improve as LLMs advance, illustrate the current risk of relying on generated identifiers without database verification.
| Metric | BioBTree + Claude | ChatGPT | Gemini | Claude |
| ChEMBL ID | CHEMBL1738797 | ✓ | ×a | ✓ |
| PubChem CID | 49806720 | ✓ | ×b | ✓ |
| RET UniProt ID | P07949 | ×c | — | ✓ |
| UniProt targets | 33 | 5–6 | 2–3 | 7 |
| Clinical trials | 66 | 80–120 | 150+ | 50+ |
| ChEMBL activities | 96 | 400–600 | 300+ | 675 |
aReported CHEMBL2178382 (different compound). bReported CID 4258113 (different compound). cReported P29353 (SHC-transforming protein 1, not RET kinase).
Motivation. Clinical gene annotation for genetic testing requires up-to-date variant data, structural coverage, and access to AI-based prediction tools. BRCA1 is among the most clinically important genes, with its variant database growing continuously as more patients undergo genetic testing. This use case tests data currency and access to specialized prediction databases.
BRCA1 is critical for hereditary breast/ovarian cancer genetic testing. For a comprehensive gene annotation: (1) How many transcripts in Ensembl and RefSeq? How many exons in the canonical? (2) How many UniProt entries and InterPro domains? (3) How many PDB structures exist? (4) How many ClinVar variants? How many pathogenic? (5) What Reactome pathways involve BRCA1? (6) AI variant predictions: How many SpliceAI splice-altering predictions and AlphaMissense missense pathogenicity scores exist? Show variants with scores.
Baseline LLMs performed well on basic gene annotation: all three correctly reported 47 Ensembl transcripts, two of the three reported 23 exons in the canonical transcript, and all identified key InterPro domains (RING zinc finger, BRCT). Claude notably produced a comprehensive structured report with clinical annotations and founder mutation context, demonstrating LLMs’ advancing capabilities in synthesizing biomedical information. However, three differences emerged (Table 4).
First, data currency: BioBTree returned 15,445 ClinVar variants from its graph database, approximately 2,000 more than the 12,700–13,444 reported by baseline LLMs from their training data. For a gene where new variant submissions occur regularly, this difference has direct clinical implications for variant interpretation pipelines. Similarly, BioBTree reported 33 PDB structures compared to 32 listed in UniProt’s own cross-references at the time of writing, as BioBTree integrates PDB cross-references from both PDB and UniProt sources independently. Second, specialized database access: BioBTree retrieved 3,719 SpliceAI splice-altering predictions and 12,463 AlphaMissense missense pathogenicity scores, each with genomic coordinates and quantitative scores. All three baseline LLMs acknowledged these databases exist but could not retrieve the actual prediction data. Third, identifier accuracy: ChatGPT listed sequential RefSeq transcript accessions (NM_007299.4 through NM_007343.4) as BRCA1 transcripts; BRCA1 has only five curated RefSeq mRNA transcripts (NM_007294 through NM_007300). The fabricated accessions represent a hallucination that could mislead downstream analysis.
| Metric | BioBTree + Claude | ChatGPT | Gemini | Claude |
| Ensembl transcripts | 47 | 47 | 47 | 47 |
| Canonical exons | 23 | 23 | 22 | 23 |
| ClinVar variants | 15,445 | 13,444 | 13,444 | ∼12,700 |
| PDB structures | 33 | 33 | “>100” | “∼80+” |
| SpliceAI predictions | 3,719 | — | — | — |
| AlphaMissense scores | 12,463 | — | — | — |
Motivation. Understanding the regulatory network of a transcription factor requires integrating data from specialized databases that catalog TF-target relationships, binding motifs, protein similarity, and protein-protein interactions. TP53 is the most studied gene in cancer biology, making it an excellent test of whether BioBTree adds value beyond what LLMs can synthesize from extensive training data.
TP53 is the master tumor suppressor regulating cell fate decisions. For a comprehensive transcription factor network analysis: (1) How many genes does TP53 regulate? List key targets with regulation type. (2) Which targets encode cell surface receptors? (3) How many JASPAR binding profiles exist? (4) How many evolutionarily similar proteins exist? (5) How many proteins share sequence homology? (6) How many high-confidence protein interactions? (7) What transcription factors regulate TP53 itself?
All baseline LLMs performed well on core TP53 biology, correctly identifying key targets (CDKN1A, MDM2, BAX, BBC3/PUMA), death receptor targets (FAS, DR5), and three JASPAR binding profiles (MA0106.1–3). Claude provided particularly detailed mechanistic context, including the indirect repression mechanism via the p21-DREAM/RB-E2F pathway and evolutionary insights. The most substantial differences involved specialized databases that BioBTree integrates (Table 5).
The most striking finding involves upstream transcription factors that regulate TP53 itself. BioBTree identified 165 transcription factors via CollecTRI, with classified activation and repression relationships (e.g., E2F1, MYC, HIF1A as activators; ESR1, JUN as repressors). Baseline LLMs cited only 7–20 well-known regulators from published literature. This 8–24-fold difference in regulatory coverage illustrates the complementary value of specialized database access.
BioBTree also provided quantitative data unavailable to baseline LLMs: ESM2 structural similarity scores for 57 proteins (average similarity 0.98–0.99), DIAMOND sequence identity percentages for 40 homologs (TP63: 96.3%, TP73: 97.5%, mouse p53: 99.6%), and precise STRING interaction counts with confidence scores (14,764 total, ∼197 high-confidence at score ≥0.9). While baseline LLMs correctly identified TP63 and TP73 as paralogs, they could not provide similarity metrics or systematic enumeration of homologs. For cell surface receptor targets, BioBTree identified 14 receptors including growth factor receptors (EGFR, PDGFRB, FLT1/VEGFR1) and repressed receptors (IGF1R, NOTCH1, MET), whereas baseline LLMs focused primarily on death receptors that dominate the published literature.
| Metric | BioBTree + Claude | ChatGPT | Gemini | Claude |
| Upstream TF regulators | 165 | 10–20 | 8 | 7–9 |
| Cell surface receptor targets | 14+ | 3–6 | 2 | 7–9 |
| ESM2 similar proteins | 57 (with scores) | 3 | 3 | “hundreds” |
| DIAMOND homologs | 40 (with metrics) | 3 | 3 | “hundreds” |
| STRING interactions (≥0.9) | ∼200 | 40–50 | 11–40 | 50–100 |
| STRING total | 14,764 | — | “thousands” | — |
| JASPAR profiles | 3 | 3 | 3 | 3 |
Across all four use cases, three systematic patterns emerged. First, identifier accuracy: baseline LLMs produced identifier errors in two of four use cases—fabricated RefSeq transcripts (BRCA1), incorrect ChEMBL and PubChem identifiers (Alectinib), and wrong UniProt accessions (Alectinib). BioBTree returned verified identifiers from its graph database in all cases. Second, quantitative data access: BioBTree consistently provided quantitative metrics (expression scores, IC50 values, similarity percentages, interaction confidence scores) and verifiable counts (66 clinical trials, 96 activity records for Alectinib) while baseline LLMs provided qualitative descriptions, approximate ranges, or counts that could not be traced to source databases. Third, specialized database coverage: several data types were accessible only through BioBTree, including SpliceAI splice predictions, AlphaMissense pathogenicity scores, CollecTRI TF-target networks, ESM2 structural similarity, DIAMOND sequence homology, and Bgee quantitative expression data (Table 6).
| Capability | Use Case | BioBTree + Claude | Baseline LLMs |
| Quantitative expression scores | SCN9A | Ranked scores (0–100) | Qualitative only |
| Safety-relevant tissue discovery | SCN9A | Endometrium rank #3 | Missed by all |
| Verified compound identifiers | Alectinib | 100% accurate | Errors observed |
| Complete off-target profile | Alectinib | 33 targets | 2–7 targets |
| Current variant counts | BRCA1 | 15,445 | 12,700–13,444 |
| AI variant predictions | BRCA1 | SpliceAI + AlphaMissense | Inaccessible |
| TF regulatory network | TP53 | 165 upstream TFs | 7–20 TFs |
| Sequence/structural similarity | TP53 | ESM2 + DIAMOND scores | Family names only |
Baseline LLMs demonstrated clear strengths in biological interpretation, mechanistic reasoning, and literature synthesis—from Claude’s detailed clinical annotations and mechanistic context to ChatGPT’s pathway-level explanations and Gemini’s evolutionary framing. These capabilities complement BioBTree’s structured data access, suggesting that the most effective workflow combines both approaches: BioBTree for verified, quantitative data retrieval and LLMs for biological interpretation and synthesis. Notably, all four use cases involved well-characterized genes and drugs with extensive literature coverage. For less-studied targets, where LLM training data is sparser, the advantage of direct database access is likely to be even greater.
The four use cases demonstrate that BioBTree’s structured database access provides information that current LLMs—even those with web search and tool-use capabilities—do not reliably surface. The endometrial expression finding for SCN9A (Use Case 1), the comprehensive off-target profile and identifier errors for Alectinib (Use Case 2), the data currency advantages for BRCA1 (Use Case 3), and the 8–24-fold difference in upstream TF regulators for TP53 (Use Case 4) each illustrate a distinct category of complementary value: quantitative database content, complete target coverage, recently updated records, and curated interaction networks, respectively.
These findings align with an emerging body of work showing that grounding LLMs with structured biomedical data improves accuracy. KG-RAG demonstrated that augmenting LLMs with the SPOKE knowledge graph significantly improves performance on biomedical benchmarks, with Llama-2 accuracy increasing by 71% on multiple-choice questions [70]. BTE-RAG showed that federated knowledge graph retrieval raises LLM accuracy from 51% to 75.8% on drug mechanism benchmarks [71]. AutoBioKG highlighted that LLMs achieve only 87% extraction accuracy from literature, underscoring the need for verified database sources [72]. The BioContextAI initiative reflects the same trend, establishing a community registry for biomedical MCP servers in recognition that web search alone cannot reach specialized databases and that standardized tool interfaces are needed [12]. While the RAG-based approaches above augment LLM prompts with retrieved context, BioBTree takes a complementary approach: it exposes a unified graph index over diverse databases through an MCP interface, allowing the LLM to directly query structured data as a tool rather than retrieving pre-processed context.
A key architectural decision is the use of the LLM itself as the query planner. Rather than implementing embedding-based retrieval or complex RAG pipelines, BioBTree provides the graph schema—dataset connections and query syntax—directly to the LLM through MCP tool descriptions. The LLM then reasons about which traversal paths to follow for a given question. This design has several implications. First, it is simple to maintain: the system requires no vector database, no embedding model, and no custom retrieval logic. Second, it naturally improves as LLM reasoning capabilities advance, without requiring changes to the data infrastructure. Third, it keeps all data grounded in authoritative sources—the LLM orchestrates queries but cannot fabricate the results, since every returned record traces to a specific database entry with a verifiable identifier.
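To make the traversal idea concrete, the following is a minimal, self-contained sketch of what a chain query over a graph index looks like. The arrow syntax, dataset names, and toy graph here are illustrative stand-ins; they are not BioBTree's actual query grammar or data, only an assumed simplification of the single-line traversal the LLM composes.

```python
# Illustrative sketch (NOT BioBTree's actual implementation): a toy graph
# index keyed by (dataset, identifier), and a minimal interpreter for a
# hypothetical chain syntax "dataset:id -> dataset -> dataset".

TOY_GRAPH = {
    ("hgnc", "BRCA1"): {"ensembl": ["ENSG00000012048"]},
    ("ensembl", "ENSG00000012048"): {"uniprot": ["P38398"]},
    ("uniprot", "P38398"): {"pdb": ["1JM7", "1T15"]},
}

def run_chain_query(query: str) -> list[tuple[str, str]]:
    """Follow each hop of the chain, expanding the frontier of matched nodes."""
    head, *hops = [part.strip() for part in query.split("->")]
    dataset, ident = head.split(":")
    frontier = [(dataset, ident)]
    for target in hops:
        next_frontier = []
        for node in frontier:
            for linked_id in TOY_GRAPH.get(node, {}).get(target, []):
                next_frontier.append((target, linked_id))
        frontier = next_frontier
    return frontier

# A single chain replaces three manual cross-database lookups:
print(run_chain_query("hgnc:BRCA1 -> ensembl -> uniprot -> pdb"))
# → [('pdb', '1JM7'), ('pdb', '1T15')]
```

In this model, the LLM's role as query planner reduces to choosing the hop sequence (`ensembl -> uniprot -> pdb`) from the schema it sees in the tool description; the traversal itself, and hence every returned identifier, comes from the index.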
The bidirectional cross-referencing architecture provides a subtle but important advantage for data currency. Because BioBTree indexes both forward and reverse references independently from each source database, the merged graph can contain cross-references that individual databases have not yet incorporated. In the BRCA1 use case, BioBTree reported one additional PDB structure compared to the current UniProt release, because PDB/SIFTS had deposited the structure before UniProt’s next cross-reference update cycle. This is a direct consequence of the bidirectional indexing: each source contributes its own perspective on the relationships, and the merge combines them into a more complete picture.
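The effect described above can be sketched in a few lines. This is an assumed simplification, not BioBTree's merge code: `pdb:NEWX` is a deliberately fake placeholder identifier standing in for a structure one source has deposited but the other has not yet cross-referenced.

```python
# Illustrative sketch of bidirectional cross-reference merging (toy data;
# "pdb:NEWX" is a hypothetical placeholder, not a real PDB entry).
# Each source contributes the links it knows; the merged graph is their
# union, so a link known to only one side still appears in the result.

uniprot_refs = {("uniprot:P38398", "pdb:1JM7")}
# The structure side already lists a newer entry the protein side lacks:
pdb_refs = {("pdb:1JM7", "uniprot:P38398"), ("pdb:NEWX", "uniprot:P38398")}

def merge_bidirectional(*sources):
    """Normalize every edge to a canonical undirected pair, then union."""
    merged = set()
    for edges in sources:
        for a, b in edges:
            merged.add(tuple(sorted((a, b))))
    return merged

merged = merge_bidirectional(uniprot_refs, pdb_refs)
structures = sorted({x for pair in merged for x in pair if x.startswith("pdb:")})
print(structures)  # the merged view includes the not-yet-propagated pdb:NEWX
```

The key property is that the union is order-independent: whichever source deposits a link first, the merged graph carries it, which is exactly why the BRCA1 query surfaced a structure ahead of the next cross-reference release cycle.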
The comparison with the Open Targets Platform on the SCN9A use case (Section 3.1) illustrates an important point about complementary database tools. The two platforms produced substantially different expression rankings for the same gene, reflecting differences in their underlying data sources and normalization methods. Open Targets provided valuable pre-computed assessments including mouse knockout phenotypes and genetic constraint scores that BioBTree does not offer. Conversely, BioBTree’s Bgee-derived expression data surfaced the endometrial finding that Open Targets did not rank highly. This suggests that for comprehensive analyses, researchers benefit from integrating multiple database tools rather than relying on any single source.
At the same time, we note that LLMs demonstrated genuine strengths in our evaluation. Claude produced comprehensive, well-structured reports with mechanistic context—such as explaining TP53’s indirect repression via the p21-DREAM/RB-E2F pathway—that go beyond what any database can provide. All baseline LLMs correctly identified core biological facts and provided useful literature context. The value of BioBTree lies not in replacing this reasoning capacity but in complementing it with structured data that the LLM cannot generate from its training knowledge or web access alone.
Several limitations should be noted. First, the use cases were selected to demonstrate scenarios where structured database access adds value; they do not represent a comprehensive benchmark of all possible biomedical queries. Questions requiring primarily literature synthesis or mechanistic interpretation may benefit less from database augmentation. Second, all MCP evaluations were performed with Claude Opus 4.5; while the MCP interface is model-agnostic, performance may vary across LLMs with different reasoning capabilities. Third, because the LLM acts as the query planner, it naturally selects the traversal paths it considers most relevant—for example, returning Ensembl transcripts by default rather than also querying RefSeq. This is generally appropriate, but for comprehensive analyses users may need to guide the LLM toward alternative databases or deeper exploration through follow-up questions or more specific prompts. Additionally, due to the non-deterministic nature of LLMs, repeated queries may produce varying levels of detail, although the underlying database results remain consistent; users who need deterministic results can query the BioBTree API directly. Fourth, BioBTree provides a snapshot of databases at build time. The pipeline tracks which source databases have been updated and supports incremental rebuilds that process only changed datasets, enabling individual database updates within hours rather than requiring a full rebuild; nevertheless, some latency between source updates and index updates is inherent. Fifth, cross-reference completeness depends on what source databases provide—some mappings may be incomplete, and not all biomedical databases are currently integrated.
Sixth, while the pipeline supports full-species builds, the public instance prioritizes 16 model organisms for the largest datasets (Ensembl, RefSeq, STRING, and AlphaFold) and indexes Swiss-Prot reviewed entries for UniProt; UniParc, UniRef, and TrEMBL are supported by the pipeline but not included in the current public deployment to keep costs manageable while covering the most commonly queried species.
Several directions could extend this work. Expanding dataset coverage to include additional resources such as Guide to Pharmacology and UniProt TrEMBL would broaden the system’s utility, particularly for non-model organisms. The current evaluation focuses on single-turn queries with multiple sub-questions; shorter, focused queries with iterative follow-up may yield deeper analysis on specific topics. Testing with additional LLM platforms (OpenAI, Gemini) via their MCP integrations would validate the generalizability of the approach. Systematic quantitative benchmarking against standardized biomedical question-answering datasets would complement the current use case evaluations, though designing benchmarks that adequately capture the value of structured database grounding remains an open challenge. Finally, while we deliberately avoid complex scoring or ranking systems—preferring to let the LLM reason over complete data—domain-specific prompt tuning for specialized workflows such as drug repurposing or rare disease diagnosis could further improve response quality.
BioBTree v2 is open source software released under the GPL-v3 license. BioBTree v1 was previously published in F1000Research [69]; v2 represents a major update with expanded dataset coverage, chain query syntax, token-efficient response formats, and the MCP server for LLM integration. The system includes a comprehensive test suite with over a thousand test cases to verify data integrity and cross-reference correctness across code changes and data updates.
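One kind of check such a test suite must perform is cross-reference correctness: every forward mapping should have a matching reverse mapping after a build. The sketch below is hypothetical—it is written in the spirit of the tests described above, over toy data, and does not reproduce BioBTree's actual test code (which is part of the Go codebase).

```python
# Hypothetical sketch of a cross-reference integrity check (toy data, not
# BioBTree's actual tests): verify that every forward edge in the built
# index has its reverse counterpart, so traversals work in both directions.

forward = {
    "hgnc:1100":  {"ensembl:ENSG00000012048"},   # BRCA1
    "hgnc:11998": {"ensembl:ENSG00000141510"},   # TP53
}
reverse = {
    "ensembl:ENSG00000012048": {"hgnc:1100"},
    "ensembl:ENSG00000141510": {"hgnc:11998"},
}

def missing_reverse_links(forward, reverse):
    """Return forward edges whose reverse counterpart is absent."""
    missing = []
    for src, targets in forward.items():
        for tgt in sorted(targets):
            if src not in reverse.get(tgt, set()):
                missing.append((src, tgt))
    return missing

# A consistent build yields no missing reverse links:
assert missing_reverse_links(forward, reverse) == []
```

Running such a check after every data update is what lets code changes and incremental rebuilds proceed without silently breaking bidirectional traversal.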
Some initial exploratory computational work was performed using the Scientific Compute Cluster at the University of Konstanz.
[1] A. Kasprzyk. “BioMart: driving a paradigm change in biological data management”. In: Database: The Journal of Biological Databases and Curation 2011 (2011), bar049. issn: 1758-0463. doi: 10.1093/database/bar049.
[2] S. Lelong et al. “BioThings SDK: a toolkit for building high-performance data APIs in biomedical research”. In: Bioinformatics 38.7 (Mar. 2022), pp. 2077–2079. issn: 1367-4803. doi: 10.1093/bioinformatics/btac017.
[3] D. S. Himmelstein, A. Lizee, C. Hessler, L. Brueggeman, S. L. Chen, D. Hadley, A. Green, P. Khankhanian, and S. E. Baranzini. “Systematic integration of biomedical knowledge prioritizes drugs for repurposing”. In: eLife 6 (Sept. 2017). Ed. by A. Valencia, e26726. issn: 2050-084X. doi: 10.7554/eLife.26726.
[4] P. Chandak, K. Huang, and M. Zitnik. “Building a knowledge graph to enable precision medicine”. In: Scientific Data 10.1 (Feb. 2023), p. 67. issn: 2052-4463. doi: 10.1038/s41597-023-01960-3.
[5] D. Ochoa et al. “The next-generation Open Targets Platform: reimagined, redesigned, rebuilt”. In: Nucleic Acids Research 51.D1 (Jan. 2023), pp. D1353–D1359. issn: 0305-1048. doi: 10.1093/nar/gkac1046.
[6] M. P. van Iersel, A. R. Pico, T. Kelder, J. Gao, I. Ho, K. Hanspers, B. R. Conklin, and C. T. Evelo. “The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services”. In: BMC Bioinformatics 11 (Jan. 2010), p. 5. issn: 1471-2105. doi: 10.1186/1471-2105-11-5.
[7] K. Fecho et al. “Announcing the Biomedical Data Translator: Initial Public Release”. In: Clinical and Translational Science 18.7 (July 2025), e70284. issn: 1752-8054. doi: 10.1111/cts.70284.
[8] L. Luebbert and L. Pachter. “Efficient querying of genomic reference databases with gget”. In: Bioinformatics 39.1 (Jan. 2023), btac836. issn: 1367-4811. doi: 10.1093/bioinformatics/btac836.
[9] K. Singhal et al. “Large language models encode clinical knowledge”. In: Nature 620.7972 (Aug. 2023), pp. 172–180. issn: 1476-4687. doi: 10.1038/s41586-023-06291-2.
[10] Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, and M. Z. Shou. Hallucination of Multimodal Large Language Models: A Survey. Apr. 2025. doi: 10.48550/arXiv.2404.18930.
[11] X. Hou, Y. Zhao, S. Wang, and H. Wang. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. Mar. 2025. doi: 10.48550/arXiv.2503.23278.
[12] M. Kuehl et al. “BioContextAI is a community hub for agentic biomedical systems”. In: Nature Biotechnology 43.11 (Nov. 2025), pp. 1755–1757. issn: 1546-1696. doi: 10.1038/s41587-025-02900-9.
[13] S. C. Dyer, O. Austine-Orimoloye, A. G. Azov, et al. “Ensembl 2025”. In: Nucleic Acids Research 53.D1 (2025), pp. D948–D957. doi: 10.1093/nar/gkae1071.
[14] R. L. Seal, B. Braschi, K. Gray, T. E. M. Jones, S. Tweedie, L. Haim-Vilmovsky, and E. A. Bruford. “Genenames.org: the HGNC resources in 2023”. In: Nucleic Acids Research 51.D1 (2023), pp. D1003–D1009. doi: 10.1093/nar/gkac888.
[15] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. “Entrez Gene: gene-centered information at NCBI”. In: Nucleic Acids Research 39.Database issue (2011), pp. D52–D57. doi: 10.1093/nar/gkq1237.
[16] N. A. O’Leary, M. W. Wright, J. R. Brister, et al. “Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation”. In: Nucleic Acids Research 44.D1 (2016), pp. D733–D745. doi: 10.1093/nar/gkv1189.
[17] S. T. Sherry, M. H. Ward, M. Kholodov, J. Baker, L. Phan, E. M. Smigielski, and K. Sirotkin. “dbSNP: the NCBI database of genetic variation”. In: Nucleic Acids Research 29.1 (2001), pp. 308–311. doi: 10.1093/nar/29.1.308.
[18] J. E. Moore et al. “Expanded encyclopaedias of DNA elements in the human and mouse genomes”. In: Nature 583.7818 (July 2020), pp. 699–710. issn: 1476-4687. doi: 10.1038/s41586-020-2493-4.
[19] The UniProt Consortium. “UniProt: the Universal Protein Knowledgebase in 2025”. In: Nucleic Acids Research 53.D1 (2025), pp. D609–D617. doi: 10.1093/nar/gkae1010.
[20] M. Varadi, S. Anyango, M. Deshpande, et al. “AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models”. In: Nucleic Acids Research 50.D1 (2022), pp. D439–D444. doi: 10.1093/nar/gkab1061.
[21] J. Cheng, G. Novati, J. Pan, et al. “Accurate proteome-wide missense variant effect prediction with AlphaMissense”. In: Science 381.6664 (2023), eadg7492. doi: 10.1126/science.adg7492.
[22] D. R. Armstrong et al. “PDBe: improved findability of macromolecular structure data in the PDB”. In: Nucleic Acids Research 48.D1 (Jan. 2020), pp. D335–D343. issn: 0305-1048. doi: 10.1093/nar/gkz990.
[23] M. Blum, A. Andreeva, L. C. Florentino, et al. “InterPro: the protein sequence classification resource in 2025”. In: Nucleic Acids Research 53.D1 (2025), pp. D444–D456. doi: 10.1093/nar/gkae1082.
[24] Z. Lin et al. “Evolutionary-scale prediction of atomic-level protein structure with a language model”. In: Science 379.6637 (Mar. 2023), pp. 1123–1130. doi: 10.1126/science.ade2574.
[25] B. Buchfink, K. Reuter, and H.-G. Drost. “Sensitive protein alignments at tree-of-life scale using DIAMOND”. In: Nature Methods 18.4 (Apr. 2021), pp. 366–368. issn: 1548-7105. doi: 10.1038/s41592-021-01101-x.
[26] I. Rauluseviciute et al. “JASPAR 2024: 20th anniversary of the open-access database of transcription factor binding profiles”. In: Nucleic Acids Research 52.D1 (Jan. 2024), pp. D174–D182. issn: 0305-1048. doi: 10.1093/nar/gkad1059.
[27] B. Zdrazil, E. Felix, F. Hunter, et al. “The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods”. In: Nucleic Acids Research 52.D1 (2024), pp. D1180–D1192. doi: 10.1093/nar/gkad1004.
[28] S. Kim, J. Chen, T. Cheng, et al. “PubChem 2025 update”. In: Nucleic Acids Research 53.D1 (2025), pp. D1516–D1525. doi: 10.1093/nar/gkae1059.
[29] K. Degtyarenko, P. de Matos, M. Ennis, et al. “ChEBI: a database and ontology for chemical entities of biological interest”. In: Nucleic Acids Research 36.Database issue (2008), pp. D344–D350. doi: 10.1093/nar/gkm791.
[30] D. S. Wishart, A. Guo, E. Oler, et al. “HMDB 5.0: the Human Metabolome Database for 2022”. In: Nucleic Acids Research 50.D1 (2022), pp. D622–D631. doi: 10.1093/nar/gkab1062.
[31] T. Liu, L. Hwang, S. K. Burley, C. I. Nitsche, C. Southan, W. P. Walters, and M. K. Gilson. “BindingDB in 2024: a FAIR knowledgebase of protein-small molecule binding data”. In: Nucleic Acids Research 53.D1 (2025), pp. D1633–D1644. doi: 10.1093/nar/gkae1075.
[32] J. Hauenstein, L. Jeske, A. Jäde, M. Krull, K. Dümmer, J. Koblitz, A. Tietz, D. Jahn, L. C. Reimer, and B. Bunk. “BRENDA in 2026: a Global Core Biodata Resource for functional enzyme and metabolic data within the DSMZ Digital Diversity”. In: Nucleic Acids Research 54.D1 (Jan. 2026), pp. D527–D534. issn: 1362-4962. doi: 10.1093/nar/gkaf1113.
[33] P. Bansal, A. Morgat, K. B. Axelsen, et al. “Rhea, the reaction knowledgebase in 2022”. In: Nucleic Acids Research 50.D1 (2022), pp. D693–D700. doi: 10.1093/nar/gkab1016.
[34] L. Aimo, R. Liechti, N. Hyka-Nouspikel, et al. “The SwissLipids knowledgebase for lipid biology”. In: Bioinformatics 31.17 (2015), pp. 2860–2866. doi: 10.1093/bioinformatics/btv285.
[35] M. Sud, E. Fahy, D. Cotter, et al. “LMSD: LIPID MAPS structure database”. In: Nucleic Acids Research 35.Database issue (2007), pp. D527–D532. doi: 10.1093/nar/gkl838.
[36] G. Papadatos, M. Davies, N. Dedman, et al. “SureChEMBL: a large-scale, chemically annotated patent document database”. In: Nucleic Acids Research 44.D1 (2016), pp. D1220–D1228. doi: 10.1093/nar/gkv1253.
[37] M. Milacic, D. Beavers, P. Conley, et al. “The Reactome Pathway Knowledgebase 2024”. In: Nucleic Acids Research 52.D1 (2024), pp. D672–D678. doi: 10.1093/nar/gkad1025.
[38] D. Szklarczyk, R. Kirsch, M. Koutrouli, et al. “The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest”. In: Nucleic Acids Research 51.D1 (2023), pp. D638–D646. doi: 10.1093/nar/gkac1000.
[39] N. Del Toro, A. Shrivastava, E. Ragueneau, et al. “The IntAct database: efficient access to fine-grained molecular interaction data”. In: Nucleic Acids Research 50.D1 (2022), pp. D648–D653. doi: 10.1093/nar/gkab1006.
[40] R. Oughtred, C. Stark, B.-J. Breitkreutz, et al. “The BioGRID interaction database: 2019 update”. In: Nucleic Acids Research 47.D1 (2019), pp. D529–D541. doi: 10.1093/nar/gky1079.
[41] P. Lo Surdo et al. “SIGNOR 4.0: the 2025 update with focus on phosphorylation data”. In: Nucleic Acids Research 54.D1 (Jan. 2026), pp. D682–D690. issn: 1362-4962. doi: 10.1093/nar/gkaf1237.
[42] S. Müller-Dott, E. Tsirvouli, M. Vazquez, R. O. Ramirez Flores, P. Badia-i-Mompel, R. Fallegger, D. Türei, A. Lægreid, and J. Saez-Rodriguez. “Expanding the coverage of regulons from high-confidence prior knowledge for accurate estimation of transcription factor activities”. In: Nucleic Acids Research 51.20 (Nov. 2023), pp. 10934–10949. issn: 0305-1048. doi: 10.1093/nar/gkad841.
[43] K. Troulé, R. Petryszak, B. Cakir, J. Cranley, A. Harasty, M. Prete, Z. K. Tuong, S. A. Teichmann, L. Garcia-Alonso, and R. Vento-Tormo. “CellPhoneDB v5: inferring cell-cell communication from single-cell multiomics data”. In: Nature Protocols 20.12 (Dec. 2025), pp. 3412–3440. issn: 1750-2799. doi: 10.1038/s41596-024-01137-1.
[44] G. Tsitsiridis, R. Steinkamp, M. Giurgiu, B. Brauner, G. Fobo, G. Frishman, C. Montrone, and A. Ruepp. “CORUM: the comprehensive resource of mammalian protein complexes–2022”. In: Nucleic Acids Research 51.D1 (Jan. 2023), pp. D539–D545. issn: 0305-1048. doi: 10.1093/nar/gkac1015.
[45] M. J. Landrum, J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church, and D. R. Maglott. “ClinVar: public archive of relationships among sequence variation and human phenotype”. In: Nucleic Acids Research 42.Database issue (2014), pp. D980–D985. doi: 10.1093/nar/gkt1113.
[46] N. A. Vasilevsky, S. Toro, N. Matentzoglu, et al. “Mondo: Integrating Disease Terminology Across Communities”. In: Genetics (2025), iyaf215. doi: 10.1093/genetics/iyaf215.
[47] M. A. Gargano, N. Matentzoglu, B. Coleman, et al. “The Human Phenotype Ontology in 2024: phenotypes around the world”. In: Nucleic Acids Research 52.D1 (2024), pp. D1333–D1346. doi: 10.1093/nar/gkad1005.
[48] A. Rath, A. Olry, F. Dhombres, M. M. Brandt, B. Urbero, and S. Ayme. “Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users”. In: Human Mutation 33.5 (May 2012), pp. 803–808. issn: 1098-1004. doi: 10.1002/humu.22078.
[49] E. Sollis, A. Mosaku, A. Abid, et al. “The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource”. In: Nucleic Acids Research 51.D1 (2023), pp. D977–D985. doi: 10.1093/nar/gkac1010.
[50] M. T. DiStefano, S. Goehringer, L. Babb, et al. “The Gene Curation Coalition: A global effort to harmonize gene-disease evidence resources”. In: Genetics in Medicine 24.8 (2022), pp. 1732–1742. doi: 10.1016/j.gim.2022.04.017.
[51] D. A. Zarin, T. Tse, R. J. Williams, R. M. Califf, and N. C. Ide. “The ClinicalTrials.gov Results Database – Update and Key Issues”. In: New England Journal of Medicine 364.9 (2011), pp. 852–860. doi: 10.1056/NEJMsa1012065.
[52] L. Gong, M. Whirl-Carrillo, and T. E. Klein. “PharmGKB, an Integrated Resource of Pharmacogenomic Knowledge”. In: Current Protocols 1.8 (2021), e226. doi: 10.1002/cpz1.226.
[53] M. Ashburner, C. A. Ball, J. A. Blake, et al. “Gene ontology: tool for the unification of biology”. In: Nature Genetics 25.1 (2000), pp. 25–29. doi: 10.1038/75556.
[54] J. Malone, E. Holloway, T. Adamusiak, et al. “Modeling sample variables with an Experimental Factor Ontology”. In: Bioinformatics 26.8 (2010), pp. 1112–1118. doi: 10.1093/bioinformatics/btq099.
[55] C. J. Mungall, C. Torniai, G. V. Gkoutos, S. E. Lewis, and M. A. Haendel. “Uberon, an integrative multi-species anatomy ontology”. In: Genome Biology 13.1 (2012), R5. doi: 10.1186/gb-2012-13-1-r5.
[56] A. D. Diehl, T. F. Meehan, Y. M. Bradford, et al. “The Cell Ontology 2016: enhanced content, modularization, and ontology interoperability”. In: Journal of Biomedical Semantics 7.1 (2016), p. 44. doi: 10.1186/s13326-016-0088-7.
[57] C. E. Lipscomb. “Medical Subject Headings (MeSH)”. In: Bulletin of the Medical Library Association 88.3 (2000), pp. 265–266.
[58] S. Nadendla et al. “ECO: the Evidence and Conclusion Ontology, an update for 2022”. In: Nucleic Acids Research 50.D1 (Jan. 2022), pp. D1515–D1521. issn: 0305-1048. doi: 10.1093/nar/gkab1025.
[59] S. Abeyruwan et al. “Evolving BioAssay Ontology (BAO): modularization, integration and applications”. In: Journal of Biomedical Semantics 5.1 (June 2014), S5. issn: 2041-1480. doi: 10.1186/2041-1480-5-S1-S5.
[60] F. B. Bastian, J. Roux, A. Niknejad, et al. “The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals”. In: Nucleic Acids Research 49.D1 (2021), pp. D831–D847. doi: 10.1093/nar/gkaa793.
[61] CZI Cell Science Program, S. Abdulla, et al. “CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data”. In: Nucleic Acids Research 53.D1 (2025), pp. D886–D900. doi: 10.1093/nar/gkae1142.
[62] P. Madrigal, A. S. Thanki, S. Fexova, et al. “Expression Atlas in 2026: enabling FAIR and open expression data through community collaboration and integration”. In: Nucleic Acids Research 54.D1 (2026), pp. D147–D157. doi: 10.1093/nar/gkaf1238.
[63] T. Nobusada et al. “Update of the FANTOM web resource: enhancement for studying noncoding genomes”. In: Nucleic Acids Research 53.D1 (Jan. 2025), pp. D419–D424. issn: 1362-4962. doi: 10.1093/nar/gkae1047.
[64] RNAcentral Consortium. “RNAcentral in 2026: genes and literature integration”. In: Nucleic Acids Research 54.D1 (2026), pp. D303–D313. doi: 10.1093/nar/gkaf1329.
[65] Y. Chen and X. Wang. “miRDB: an online database for prediction of functional microRNA targets”. In: Nucleic Acids Research 48.D1 (Jan. 2020), pp. D127–D131. issn: 0305-1048. doi: 10.1093/nar/gkz757.
[66] S. Federhen. “The NCBI Taxonomy database”. In: Nucleic Acids Research 40.D1 (2012), pp. D136–D143. doi: 10.1093/nar/gkr1178.
[67] A. P. Davis, T. C. Wiegers, D. Sciaky, et al. “Comparative Toxicogenomics Database’s 20th anniversary: update 2025”. In: Nucleic Acids Research 53.D1 (2025), pp. D1328–D1334. doi: 10.1093/nar/gkae883.
[68] A. Liberzon, C. Birger, H. Thorvaldsdóttir, M. Ghandi, J. P. Mesirov, and P. Tamayo. “The Molecular Signatures Database (MSigDB) hallmark gene set collection”. In: Cell Systems 1.6 (2015), pp. 417–425. doi: 10.1016/j.cels.2015.12.004.
[69] T. Gür. “Biobtree: A tool to search, map and visualize bioinformatics identifiers and special keywords”. In: F1000Research 8 (2019), ISCB Comm J–145. issn: 2046-1402. doi: 10.12688/f1000research.17927.4.
[70] K. Soman et al. “Biomedical knowledge graph-optimized prompt generation for large language models”. In: Bioinformatics 40.9 (Sept. 2024), btae560. issn: 1367-4811. doi: 10.1093/bioinformatics/btae560.
[71] J. Joy and A. I. Su. “Federated knowledge retrieval elevates large language model performance on biomedical benchmarks”. In: GigaScience 15 (Jan. 2026), giag007. issn: 2047-217X. doi: 10.1093/gigascience/giag007.
[72] Y. Zheng, W. Liu, B. Zeng, Y. Feng, X. Du, L. Zhou, and Y. Li. Automating Biomedical Knowledge Graph Construction For Context-Aware Scientific Inference. Jan. 2026. doi: 10.64898/2026.01.14.699420.