Method
Following is the Sugi Atlas preprint: the methodology, build pipeline, and worked examples.
Sugi Atlas: A Comprehensive, Deterministic Biomedical Catalog from a Knowledge Graph for Humans and AI Agents
Abstract
Biomedical knowledge is spread across hundreds of specialized databases, each with its own identifiers, formats, and update cycles. Each is authoritative within its domain, yet most research needs a connected view across them, so assembling a consolidated, current, and trustworthy picture of any single gene, drug, or disease still takes considerable manual effort. Such reference content is increasingly consumed not only by researchers but by AI agents, raising the bar on how correct and how current it must be. Large language models can write it, but doing so at the scale of an entire catalog is costly and carries a hallucination surface; and even when grounded against authoritative data through a protocol such as MCP, the path a model takes through a large graph varies between runs and models and is hard to reproduce. We present Sugi Atlas, a comprehensive catalog of genes (together with their protein products), drugs, and diseases built by deterministically mining BioBTree, a graph that unifies around seventy primary databases into billions of cross-references. Rather than let a model explore the graph at query time, Sugi Atlas traverses it along fixed query plans, hundreds to thousands of chained graph queries per page, so the catalog is built the same way on each run: every entity is covered at uniform breadth, every value comes straight from the source data, and, for a given BioBTree snapshot, the whole corpus is reproducible. A curated cross-entity mesh links the three corpora into one connected resource, so a relationship recorded for a gene is navigable from the drug or disease it implicates. By consolidating otherwise scattered data into one openly published, refreshable place, with a stable page structure and schema.org records that ease programmatic use, Sugi Atlas offers researchers and AI agents alike a grounded, current reference. Sugi Atlas comprises more than 52,000 pages and is openly available at https://sugi.bio/atlas/.
1 Introduction
Understanding a gene, a drug, or a disease means assembling facts that live in many different places. Genomic coordinates and transcripts sit in one database, reviewed protein products and their domains in another, variants and their clinical interpretations in a third, and drug targets, trials, and pharmacogenomics in still others, each with its own identifiers, formats, and release schedule. Building a complete picture of a single entity means visiting and reconciling a dozen such resources by hand, and because each is updated on its own cycle, any consolidated view begins to go stale the moment it is written down. Producing reference content for biomedical entities that is at once comprehensive, current, and trustworthy therefore remains an open challenge.
This content is increasingly read by machines as well as people. Large language models encode a substantial body of biomedical and clinical knowledge [1] and have become a common interface for asking such questions, and a growing ecosystem of autonomous agents consumes biomedical reference data directly [2]. Both raise the bar on the same two properties: the content must be correct, and, because the data beneath it changes continually, current. An agent has even less scope than a person to notice when a fact is stale or wrong, so what it most needs from a reference is that the facts be grounded in authoritative data and kept up to date. Language models are an imperfect fit for producing such content at catalog scale. Generating deep reference content for an entire catalog with a model is costly, and that cost recurs each time the underlying data changes; models also retain a hallucination surface, producing plausible but unsupported statements [3]. Grounding a model against authoritative records, for instance through the Model Context Protocol [4], substantially mitigates fabrication, and in early prototyping we found that a model querying a biomedical graph through such an interface produces well-grounded results. It does not, however, yield a catalog: the traversal a model takes through a large graph varies from run to run and from model to model, so coverage is neither complete nor reproducible, and the per-entity cost makes rebuilding thousands of pages on every refresh impractical. This mode of access suits interactive, human-steered analysis of a single question, where follow-ups fill gaps, rather than the construction of a consistent, versioned reference catalog.
Several lines of work consolidate biomedical data so that it need not be gathered by hand each time. Per-entity reference resources aggregate many primary sources into a single summary page for each gene, drug, or disease [5–8]; integrated evidence platforms assemble and pre-score relationships within a domain, such as target-disease associations; and biomedical knowledge graphs encode entities and their relationships as networks for computational analysis. BioBTree [9] addresses the integration layer itself, unifying around seventy primary databases into a single queryable graph with a chain-query syntax, and is engineered to re-ingest its sources efficiently so the graph stays current. These efforts resolve the fragmentation of the underlying data, but a gap remains for a reference layer built on top of such a graph: one that is at once comprehensive across genes, drugs, and diseases; deterministic, and therefore reproducible and inexpensive to refresh; and connected, linking those entities to one another, published openly in one place that researchers and AI agents can rely on.
We present Sugi Atlas, a comprehensive, openly published catalog of human genes, drugs, and diseases produced by deterministically mining BioBTree [9], a biomedical graph database that integrates around seventy primary sources, spanning sequence, structure, expression, pathway, variant, disease, and pharmacology, through billions of cross-references, and stays current as those sources release updates. A pipeline first identifies the entities that constitute each corpus, then builds one page per entity by issuing hundreds to thousands of chained graph queries that traverse this network, each query resolving a specific set of source datasets; the query plans were drafted with model assistance and then expanded and fixed, so that mining runs deterministically and presents every entity at the same breadth. For a gene, the traversals assemble its molecular, structural, functional, and clinical profile, such as its transcripts and expression, protein products and structures, interactions and pathways, regulation, variants, and disease and drug associations, with analogous plans for drugs and diseases; relationships captured along the way are woven into a cross-entity mesh, so an association recorded for a gene is navigable from the drug or disease it implicates and is also emitted as machine-readable annotation for AI agents. The collected data is rendered into a markdown page per entity, and an integration test suite checks the built corpus for structural, linkage, and data-quality problems before publication; because the catalog is a deterministic function of BioBTree’s contents, it can be regenerated and refreshed as those sources are re-ingested. The catalog comprises more than 52,000 pages (around 29,000 genes, 4,700 drugs, and 18,600 diseases). Section 3 presents example pages and the breadth of information they carry, and Figure 1 shows the overall pipeline.
2 Methods
Sugi Atlas is built on the BioBTree knowledge graph [9], consuming it through its API. It traverses the graph by chain queries, which follow cross-reference edges from a starting record through a sequence of datasets, optionally filtering at a step, and return the records reached. Each page is assembled by issuing a fixed plan of such queries from one resolved entity. Because the plan is fixed in advance rather than chosen at query time, every entity of a type is traversed along the same chains and assembled the same way on each build, so coverage is uniform across the corpus and, for a given BioBTree snapshot, the build is reproducible. The pipeline (Figure 1) identifies each corpus, collects every entity, links the entities to one another, renders each as a page, validates the result, and publishes the catalog; the following subsections describe these phases in turn.
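The idea of a fixed query plan can be illustrated with a minimal sketch. It does not use the BioBTree API: the in-memory graph, the identifiers, the plan contents, and the function names `run_chain` and `build_page` are all hypothetical, and a chain is simplified to a sequence of target datasets written in a `>>a>>b` style.

```python
from typing import Dict, List, Tuple

# Toy cross-reference graph: (record id, target dataset) -> linked record ids.
# Identifiers and edges are invented for illustration only.
Graph = Dict[Tuple[str, str], List[str]]

def run_chain(graph: Graph, start: str, chain: str) -> List[str]:
    """Follow a '>>a>>b' style chain of datasets from one starting record."""
    frontier = [start]
    for dataset in [step for step in chain.split(">>") if step]:
        reached: List[str] = []
        for record in frontier:
            for target in graph.get((record, dataset), []):
                if target not in reached:  # keep first-seen order, no duplicates
                    reached.append(target)
        frontier = reached
    return frontier

# A fixed plan (hypothetical): the same chains are issued for every gene.
GENE_PLAN = {
    "pathways": ">>ensembl>>reactome",
    "proteins": ">>uniprot",
}

def build_page(graph: Graph, anchor: str) -> Dict[str, List[str]]:
    """Run every chain of the plan, in a fixed order, from one anchor record."""
    return {section: run_chain(graph, anchor, chain)
            for section, chain in sorted(GENE_PLAN.items())}

toy_graph: Graph = {
    ("HGNC:1", "ensembl"): ["ENSG1"],
    ("ENSG1", "reactome"): ["R-HSA-2", "R-HSA-1"],
    ("HGNC:1", "uniprot"): ["P1"],
}
page = build_page(toy_graph, "HGNC:1")
```

Because nothing in the plan is chosen at run time, calling `build_page` twice on the same graph returns identical sections, which is the property the text relies on for reproducibility.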
2.1 Identify corpus
Sugi Atlas is organised as three corpora, one page per entity: genes, drugs, and diseases. A page consolidates more than its title entity, so a gene page also profiles related content such as the gene’s protein products, variants, and disease and drug associations; how each page is assembled is described in the steps that follow. The first step is to fix which entities each corpus contains.
Each corpus is seeded by stable identifier rather than by name, drawing on the BioBTree-ingested source that defines its entity type. Every BioBTree entry is keyed by a unique, stable identifier that addresses it precisely; a name, though searchable, can match a related entry rather than the intended one, so seeding by identifier pins each entity exactly.
Genes are the complete HGNC catalogue. Each gene page is keyed by its HGNC symbol and resolves through HGNC, so that catalogue serves as both the complete and the canonical gene corpus; a locus with no approved symbol is not included.
Diseases are drawn from Mondo and seeded by Mondo identifier, since name resolution is unreliable for diseases: a search for cardiomyopathy, for instance, returns a specific dilated-cardiomyopathy subtype rather than the umbrella term. A Mondo term is admitted if it carries at least one disease-relevant evidence cross-reference, such as a curated gene-disease validity record, a clinical or somatic-cancer annotation, a genome-wide or variant association, or a clinical trial. Terms carrying none, together with the disease-characteristic qualifier subtree and the non-human and veterinary subtrees, are dropped, restricting the corpus to evidenced human disease.
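The admission rule above can be sketched as a simple filter. This is an illustration under stated assumptions, not the pipeline's code: the evidence labels, the subtree labels, the `MondoTerm` structure, and the placeholder identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List

# Hypothetical labels for the evidence classes and excluded subtrees named in the text.
EVIDENCE_TYPES = frozenset({
    "gene_disease_validity", "clinical_annotation", "somatic_cancer_annotation",
    "gwas_association", "variant_association", "clinical_trial",
})
EXCLUDED_SUBTREES = frozenset({"disease_characteristic", "non_human", "veterinary"})

@dataclass(frozen=True)
class MondoTerm:
    mondo_id: str
    subtrees: FrozenSet[str] = frozenset()  # subtrees the term belongs to
    evidence: FrozenSet[str] = frozenset()  # evidence cross-reference types present

def is_admitted(term: MondoTerm) -> bool:
    """Admit a term only if it lies outside excluded subtrees and has evidence."""
    if term.subtrees & EXCLUDED_SUBTREES:
        return False
    return bool(term.evidence & EVIDENCE_TYPES)

def seed_diseases(terms: Iterable[MondoTerm]) -> List[str]:
    """Return admitted identifiers in sorted order so the seed list is stable."""
    return sorted(t.mondo_id for t in terms if is_admitted(t))
```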
Drugs are drawn from ChEMBL and likewise seeded by identifier: ChEMBL holds far more experimental compounds than approved drugs, many without a recognised name, so a name match is especially unreliable here. The seed is the set of molecules that are approved or in late-stage (phase 3) clinical development; from it we remove non-therapeutic reagents and salt-form children whose parent compound is already present.
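A minimal sketch of this seeding rule follows. It assumes the development phase is available as an integer in which 3 denotes phase 3 and 4 denotes approval, and that a salt form carries a reference to its parent compound; the `Molecule` structure, its field names, and the identifiers in the usage are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

@dataclass(frozen=True)
class Molecule:
    chembl_id: str
    max_phase: int                   # assumed encoding: 3 = phase 3, 4 = approved
    is_reagent: bool = False         # non-therapeutic reagent flag (hypothetical)
    parent_id: Optional[str] = None  # set only for salt-form children (hypothetical)

def seed_drugs(molecules: Iterable[Molecule]) -> List[str]:
    """Keep approved or phase-3 therapeutics; drop salt forms whose parent is kept."""
    candidates = [m for m in molecules if m.max_phase >= 3 and not m.is_reagent]
    present = {m.chembl_id for m in candidates}
    return sorted(
        m.chembl_id for m in candidates
        if m.parent_id is None or m.parent_id not in present
    )
```

Note that a salt form whose parent is not itself in the seed is retained in this sketch, since the text removes a child only when its parent compound is already present.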
Together, the three seed lists define the full corpus, one page per admitted entity: 29,337 genes, 18,616 diseases, and 4,701 drugs (Table 1).
| Entity type | Pages | Share |
| --- | --- | --- |
| Genes | 29,337 | 55.7% |
| Diseases | 18,616 | 35.4% |
| Drugs | 4,701 | 8.9% |
| Total | 52,654 | 100.0% |
2.2 Collect content
Collection turns each entity from the identified corpora into a page. The entity’s identifier is first resolved to a single canonical record, the anchor, which every section of its page then shares, so the same identifier is not looked up again for each section. An identifier that resolves to nothing is recorded and skipped rather than aborting the build.
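The resolve-once, skip-on-failure behaviour can be sketched as below. The `resolver` callable stands in for whatever lookup the pipeline performs against the graph; its signature and the record shape are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]

def resolve_anchors(
    identifiers: Iterable[str],
    resolver: Callable[[str], Optional[Record]],
) -> Tuple[Dict[str, Record], List[str]]:
    """Resolve each identifier once; record failures instead of raising."""
    anchors: Dict[str, Record] = {}
    unresolved: List[str] = []
    for identifier in identifiers:
        record = resolver(identifier)
        if record is None:
            unresolved.append(identifier)  # logged and skipped, build continues
            continue
        anchors[identifier] = record       # shared by every section of the page
    return anchors, unresolved
```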
A page is organised into sections, each grouping related information under a heading, often with subsections: a gene’s Gene structure section, for example, holds subsections such as transcripts, expression profiles, and regulation. These sections are curated to read like a structured reference entry, and every entity of a type is organised the same way. Each section is populated by its own chain queries, issued from the anchor. Because the anchor fixes the entity’s identifier, the records related to it are reached directly across the graph, so a section that needs an exact count, for example the number of ClinVar variants in each clinical-significance class, or of interaction partners or pathway memberships, reads that total directly rather than reconstructing it record by record. These steps apply to genes, diseases, and drugs alike; what differs is the query plan each entity type follows.
Gene content mining
A gene is described together with its protein products, in sections spanning its structure, protein, function, clinical, and pharmacology data. Most of this content is reached from two hubs in the graph: UniProt, for the protein and its structure, function, and drug targets, and Ensembl, for genomic and expression data, while HGNC carries the direct clinical, regulatory, and functional-genomics links. Figure 2 traces these chains for TP53. Where a gene encodes more than one protein product, the union across products is taken so that no product’s domains, structures, or pathway memberships are dropped. A non-coding gene often overlaps a protein-coding gene on the genome; variants and disease associations tied to that shared position belong to the coding gene, so they are not carried onto the non-coding gene’s page as if they were its own.
A gene page also reports a functional readout of its neighbourhood: the Reactome pathways and Gene Ontology processes over-represented among the gene’s physical interaction partners, identified by the same over-representation test used for disease cohorts (described under Disease content mining), restricted to gene sets of between fifteen and five hundred members and ranked by fold enrichment. Because an interactome is biased toward well-studied proteins, the result is offered as a set of themes rather than as proof; a gene with too few partners carries no enrichment, and a promiscuous hub yields a diffuse signal rather than a sharp one.
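The over-representation readout can be sketched with the standard library alone. The sketch assumes the test is the hypergeometric test with Benjamini-Hochberg control that the paper describes for disease cohorts, applies the fifteen-to-five-hundred size filter, and ranks by fold enrichment. The function names, and the simplification of adjusting only over gene sets that overlap the query, are assumptions.

```python
from math import comb
from typing import Dict, List, Set, Tuple

def hypergeom_tail(k: int, K: int, n: int, N: int) -> float:
    """P(X >= k): n draws from N genes, K of which belong to the gene set."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def enrich(partners: Set[str], gene_sets: Dict[str, Set[str]], background: Set[str],
           min_size: int = 15, max_size: int = 500) -> List[Tuple[str, float, float]]:
    """Return (set name, fold enrichment, adjusted p) ranked by fold enrichment."""
    query = partners & background
    rows = []
    for name in sorted(gene_sets):
        members = gene_sets[name] & background
        overlap = len(query & members)
        if not (min_size <= len(members) <= max_size) or overlap == 0:
            continue
        expected = len(query) * len(members) / len(background)
        p = hypergeom_tail(overlap, len(members), len(query), len(background))
        rows.append((name, overlap / expected, p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    ranked = [(name, fold, q) for (name, fold, _), q in zip(rows, adjusted)]
    return sorted(ranked, key=lambda r: (-r[1], r[0]))
```

Ranking by fold enrichment rather than by raw overlap is what lets a small, specific pathway outrank a large generic one that captures many genes by chance.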
Disease content mining
A disease page covers a disease’s clinical features, genetics and variants, associated genes and proteins, function, therapeutics, and clinical trials, together with its place in the disease ontology. Most of its molecular content is not read from the disease directly but assembled from the genes associated with it: the disease is resolved to a cohort of associated genes, over which the gene query plan is re-applied (Figure 3).
This cohort is the union of several typed, curated evidence routes (genome-wide association, curated gene-disease validity, germline variants, and somatic evidence), with each gene ranked by the number of routes in which it appears. Two cohorts are drawn from the ranked list. A profiled cohort of roughly seventy-five genes is mined with the full gene query plan and reported per gene, an evidence floor retaining every curated or multi-route gene before the ranked remainder fills the rest; because that plan is re-issued for each profiled gene, it is what makes a disease page a matter of thousands of graph queries rather than the hundreds a gene requires. A larger aggregate cohort of roughly two hundred and fifty genes is queried with only the one or two chains each cohort-level statistic requires (pathway enrichment, for example, via >>hgnc>>ensembl>>reactome), so that enrichment and druggability are computed over a broader set without enlarging the per-gene tables or the build. Pathway and Gene Ontology enrichment is reported as a statistical over-representation test, a hypergeometric test against a genome-wide background with Benjamini-Hochberg false-discovery-rate control, so the cohort’s significantly enriched terms are surfaced rather than the largest generic categories; the background is fixed per BioBTree release, keeping the result deterministic.
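The cohort selection described above can be sketched as follows. The route names are hypothetical, and two details are assumptions rather than statements from the text: ties in route count are broken alphabetically, and the evidence floor is always kept in full even if it exceeds the nominal profiled size.

```python
from typing import Dict, List, Set, Tuple

def select_cohorts(
    routes: Dict[str, Set[str]],   # evidence route name -> genes it supports
    curated_routes: Set[str],      # routes treated as curated (hypothetical labels)
    profiled_size: int = 75,
    aggregate_size: int = 250,
) -> Tuple[List[str], List[str]]:
    """Rank genes by the number of routes they appear in, then draw two cohorts."""
    counts: Dict[str, int] = {}
    for genes in routes.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    ranked = sorted(counts, key=lambda g: (-counts[g], g))

    curated: Set[str] = set()
    for name in curated_routes:
        curated |= routes.get(name, set())

    # Evidence floor: every curated or multi-route gene is retained first.
    floor = [g for g in ranked if g in curated or counts[g] >= 2]
    kept = set(floor)
    remainder = [g for g in ranked if g not in kept]
    room = max(0, profiled_size - len(floor))
    profiled = floor + remainder[:room]

    aggregate = ranked[:aggregate_size]
    return profiled, aggregate
```

The profiled cohort would then be passed through the full gene query plan, while the aggregate cohort would only feed cohort-level statistics such as enrichment.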
Because a disease’s molecular content comes from this cohort, its depth varies with how gene-defined the disease is (Table 2). About 55% of diseases are gene-defined, such as cancers and Mendelian disorders, and carry the full molecular sections. About 20% resolve no cohort, such as antibody-mediated, autoimmune, or idiopathic conditions; these still present their clinical features, genome-wide susceptibility, and trials, while their molecular sections are shown as an explicit “no data” state rather than inferred from elsewhere. The remaining 25% are understudied terms reported with identifiers and ontology context only. Even these thin pages are not isolated: because annotation in a disease ontology concentrates on broader terms, each disease is placed within its Mondo neighbourhood, a path of broader ancestors with the parent term, sibling subtypes, and its own subtypes, so that a sparsely annotated page resolves toward the richer terms around it rather than inheriting their evidence as its own.
| Disease page | Pages | Share |
| --- | --- | --- |
| Rich (associated-gene cohort, full molecular sections) | 10,267 | 55% |
| Clinical / genetic only (no cohort; e.g. antibody-mediated) | 3,748 | 20% |
| Thin (identifiers + ontology family only) | 4,601 | 25% |
Drug content mining
A drug page covers a drug’s targets, indications, clinical trials, clinical evidence, and pharmacology. The step that most shapes it is target selection: primary targets are restricted to the mechanism targets curated by the Guide to Pharmacology, reached through a curated chain (ChEMBL molecule → GtoPdb ligand → interaction → target → UniProt → HGNC) that resolves to a gene and carries an action and an affinity (Figure 4). The much larger set of ChEMBL bioactivity measurements is kept separately as a labelled secondary signal and never promoted to a target, so that an assay hit is not mistaken for a mechanism of action and a drug does not appear to target proteins it merely binds in an assay. Indications are mapped to the corresponding disease pages, and clinical trials and curated clinical evidence are reported alongside. ChEMBL is the primary source throughout; PubChem supplements it, contributing related molecules, the approved drugs that act on the same targets.
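The separation between curated mechanism targets and assay-derived bioactivity can be sketched as below. The record shapes, field names, and sample size are hypothetical; the point of the sketch is only that the two sets are kept in separate fields and the second never feeds the first.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

@dataclass(frozen=True)
class CuratedInteraction:
    gene: str          # gene symbol reached through the curated chain
    action: str        # curated action, e.g. "inhibition"
    p_affinity: float  # curated affinity value

def build_targets(
    curated: Iterable[CuratedInteraction],
    bioactivity_genes: Iterable[str],
    sample_size: int = 5,
) -> Dict[str, object]:
    """Primary targets come only from curated interactions; assay hits stay separate."""
    primary = sorted(curated, key=lambda i: i.gene)
    assay_hits: List[str] = sorted(set(bioactivity_genes))
    return {
        "primary_targets": [
            {"gene": i.gene, "action": i.action, "p_affinity": i.p_affinity}
            for i in primary
        ],
        "secondary_bioactivity": {  # labelled secondary signal, never promoted
            "count": len(assay_hits),
            "sample": assay_hits[:sample_size],
        },
    }
```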
2.3 Link entities
Collection treats each entity independently; linking is the step that connects them, weaving the three corpora into a cross-entity mesh. A curated relationship recorded on one page, such as a drug’s target or a disease’s associated gene, is inverted into a matching link on the page it refers to, so that a relationship recorded once is navigable from both ends. A single function performs every inversion, so that the human-readable links, the structured-data cross-references, and the reverse index remain mutually consistent, and it runs only after the full set of pages is known, so that every link resolves to a page that exists. Only curated relationships are inverted, each labelled by the relationship as read from the target’s end, so that a gene that is a biomarker for a drug is not recorded as a target of it.
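A single inversion function of the kind described can be sketched as follows. The predicate names and the page keys are hypothetical; the sketch shows the three properties the text names: only listed (curated) predicates are inverted, each is relabelled as read from the target's end, and a link is emitted only when both pages exist.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source page, predicate, target page)

# Hypothetical curated predicates and their labels as read from the target's end.
INVERSE = {
    "targets": "targeted_by",
    "has_biomarker": "biomarker_for_drug",
    "associated_gene": "associated_disease",
}

def invert_links(edges: Iterable[Edge], pages: Set[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Build the reverse index once the full set of pages is known."""
    reverse: Dict[str, List[Tuple[str, str]]] = {}
    for source, predicate, target in sorted(edges):
        if predicate not in INVERSE:
            continue  # uncurated relationship: never inverted
        if source not in pages or target not in pages:
            continue  # would dangle: no link without an existing page
        reverse.setdefault(target, []).append((INVERSE[predicate], source))
    return reverse
```

Keeping `targeted_by` and `biomarker_for_drug` as distinct labels is what prevents a biomarker gene from being listed as a drug target on its own page.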
A second relationship is constructed in the same pass. The diseases for which a drug is indicated are recorded on the drug side, in ChEMBL; inverting them yields, for each disease, the drugs indicated directly for it, tiered by development stage: approved and late-stage (phase 3) indications as established uses, phase-2 candidates separately as investigational, and earlier stages omitted. Because this relationship does not pass through the gene cohort, it provides registered therapeutics even for diseases that resolve no cohort; in the current build, 1,623 diseases (9%) carry at least one such directly indicated drug. Its scope is registration: a use that is not a recorded indication, such as an off-label treatment, does not appear, and the graph records no link between an autoantibody and its antigen, so such connections remain explicit absences rather than inferences.
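The tiering of inverted indications can be sketched as below, assuming each indication is available as a (drug, disease, phase) triple with an integer phase in which 4 denotes approval. The tuple layout and tier names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Indication = Tuple[str, str, int]  # (drug, disease, phase); 4 = approved (assumed)

def tier_indications(indications: Iterable[Indication]) -> Dict[str, Dict[str, List[str]]]:
    """Invert drug-side indications into per-disease tiers by development stage."""
    by_disease: Dict[str, Dict[str, List[str]]] = {}
    for drug, disease, phase in sorted(set(indications)):
        if phase >= 3:
            tier = "established"      # approved and phase 3
        elif phase == 2:
            tier = "investigational"  # phase 2, listed separately
        else:
            continue                  # earlier stages are omitted
        tiers = by_disease.setdefault(disease, {"established": [], "investigational": []})
        if drug not in tiers[tier]:
            tiers[tier].append(drug)
    return by_disease
```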
Finally, each entity is assigned an evidence score: a percentile rank, within its own entity type, under a composite that weights curated and clinical evidence above broad or readily inflated counts. The percentile requires the corpus-wide distribution, so it is computed in this same pass; linking and scoring together form the one step that runs over the complete corpus, between the parallel collection of entities and the parallel rendering of their pages.
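A percentile rank under a weighted composite can be sketched as follows. The paper does not state the weights or the exact percentile definition, so everything specific here is an assumption: the three evidence classes, their weights, the logarithmic damping of counts, and the definition of percentile as the share of same-type entities scoring at or below the entity.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical weights: curated and clinical evidence count for more than broad counts.
WEIGHTS = {"curated": 3.0, "clinical": 2.0, "broad": 0.5}

def composite(counts: Dict[str, int]) -> float:
    """Weighted sum of log-damped counts, so a huge raw count cannot dominate."""
    return sum(w * math.log1p(counts.get(k, 0)) for k, w in WEIGHTS.items())

def evidence_percentiles(entities: Dict[str, Tuple[str, Dict[str, int]]]) -> Dict[str, float]:
    """Percentile of each entity within its own entity type."""
    by_type: Dict[str, List[Tuple[float, str]]] = {}
    for entity_id, (entity_type, counts) in entities.items():
        by_type.setdefault(entity_type, []).append((composite(counts), entity_id))
    result: Dict[str, float] = {}
    for scored in by_type.values():
        values = sorted(score for score, _ in scored)
        for score, entity_id in scored:
            result[entity_id] = 100.0 * bisect_right(values, score) / len(values)
    return result
```

The second function needs every entity of a type before it can rank any of them, which is why a step of this kind has to run over the complete corpus rather than per page.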
2.4 Render pages
Every page of a type presents the same canonical sequence of top-level sections, in a fixed order, each under a stable anchor identifier. A section for which no data resolves is shown as an explicit “no data” state rather than omitted, so that every page of a type has identical structure and a deep link to any section resolves on every page. Where a section would instead list a great many records, the page shows the top-ranked entries rather than the full set; this is a current limitation, the complete list remaining reachable through the BioBTree API, and surfacing more of it within the page is planned for future releases. These anchor identifiers are treated as a stable interface, since external references and the page’s own structured metadata depend on them.
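The fixed page contract can be sketched as a renderer that always emits every section in the same order. Only four of the gene sections are listed here, the anchor strings and the `{#anchor}` heading syntax are assumptions, and the top-N cutoff of three is chosen purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical subset of the canonical gene sections: (stable anchor, heading).
GENE_SECTIONS: Sequence[Tuple[str, str]] = (
    ("gene-structure", "Gene structure"),
    ("protein", "Protein"),
    ("disease-and-clinical", "Disease and clinical"),
    ("drugs-and-pharmacology", "Drugs and pharmacology"),
)
TOP_N = 3  # illustrative cutoff for long record lists

def render_page(title: str, data: Dict[str, List[str]]) -> str:
    """Emit every section in canonical order, with an explicit empty state."""
    lines = [f"# {title}", ""]
    for anchor, heading in GENE_SECTIONS:
        lines.append(f"## {heading} {{#{anchor}}}")
        rows = data.get(anchor, [])
        if not rows:
            lines.append("No data.")  # section is shown, never omitted
        else:
            lines.extend(f"- {row}" for row in rows[:TOP_N])
            if len(rows) > TOP_N:
                lines.append(f"- ({len(rows) - TOP_N} more not shown)")
        lines.append("")
    return "\n".join(lines)
```

Because the section list is iterated rather than the data, a deep link to any anchor resolves on every page of the type, whether or not that section has content.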
Beyond its sections, each page carries a one-sentence description and a short list of headline facts, both composed by template from the collected values. Every element of a page is produced in this way, by deterministic composition from the record’s data; no part of a page is generated by a language model. Each page is written as a Markdown document with a machine-readable metadata header and carries a schema.org record (a Gene, Drug, or MedicalCondition) whose federated cross-references allow a consumer to judge whether two records denote the same entity. Where a gene encodes more than one protein product, each protein is also carried as an individually addressable typed node, so that a consumer may refer to a specific product rather than the gene alone.
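Template composition and the structured record can be sketched together. The values for BRAF are those quoted elsewhere in the paper; the template wording, the input field names, and the choice of schema.org properties (`name`, `identifier`, `description`, `url`) are assumptions about how such a record might be laid out, not the Atlas's actual schema.

```python
import json
from typing import Dict

def lead_sentence(rec: Dict[str, str]) -> str:
    """Compose the description purely from collected values, with no model involved."""
    return (f"{rec['symbol']} ({rec['hgnc_id']}) is a {rec['biotype']} gene "
            f"on chromosome {rec['locus']}.")

def schema_org_record(rec: Dict[str, str]) -> Dict[str, object]:
    """Build a schema.org Gene record with cross-reference identifiers."""
    return {
        "@context": "https://schema.org",
        "@type": "Gene",
        "name": rec["symbol"],
        "identifier": [rec["hgnc_id"], f"UniProt:{rec['uniprot']}"],
        "description": lead_sentence(rec),
        "url": f"https://sugi.bio/atlas/gene/{rec['symbol']}/",
    }

braf = {
    "symbol": "BRAF", "hgnc_id": "HGNC:1097", "biotype": "protein-coding",
    "locus": "7q34", "uniprot": "P15056",
}
# sort_keys keeps the serialised record byte-identical across builds.
jsonld = json.dumps(schema_org_record(braf), sort_keys=True, indent=2)
```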
2.5 Validate and publish
Validation is layered, and begins upstream: BioBTree re-runs over a thousand test cases on every release to check the integrity of the underlying data and its cross-references [9]. On top of that, Sugi Atlas validates its own build at two scales. At the level of the individual entity, each freshly collected record is compared against a stored snapshot from the previous build, and a regression, such as a populated field falling empty or a count collapsing, is flagged rather than published silently. At the level of the whole corpus, a suite of 180 unit tests covers the renderers, the evidence and percentile logic, the declarative leads, and the structured-data builders, while 47 integration checks run over the entire corpus rather than a sample, verifying the rendered output against the frozen page contract, the integrity of the cross-entity mesh (no link without a target, and every inverted link resolving to a page that exists), the validity of the structured metadata, and a set of data-quality invariants: no numeric artefacts, no unescaped markup, no duplicated rows, and no raw ontology identifier shown where a label belongs. All checks pass over the full 52,654-page build. The validated corpus is then published as a single open collection of gene, drug, and disease pages linked by the cross-entity mesh, and regenerated in full as BioBTree re-ingests its sources, so its currency tracks the underlying databases rather than a manual release cycle.
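The two kinds of check described above can be sketched as follows: a per-entity comparison against the previous snapshot, and a corpus-wide mesh integrity check. The record shape, the field names, and the threshold at which a count is said to have collapsed are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

EMPTY = (None, "", [], 0)

def find_regressions(previous: Dict[str, object], current: Dict[str, object],
                     collapse_ratio: float = 0.5) -> List[str]:
    """Flag fields that fell empty or whose count dropped sharply (ratio is assumed)."""
    problems: List[str] = []
    for field in sorted(previous):
        old, new = previous[field], current.get(field)
        if old not in EMPTY and new in EMPTY:
            problems.append(f"{field}: populated field fell empty")
        elif (isinstance(old, (int, float)) and isinstance(new, (int, float))
              and new < old * collapse_ratio):
            problems.append(f"{field}: count collapsed from {old} to {new}")
    return problems

def dangling_links(links: Iterable[Tuple[str, str]], pages: Set[str]) -> List[Tuple[str, str]]:
    """Return every (source, target) link whose target page does not exist."""
    return sorted((s, t) for s, t in links if t not in pages)
```

A build in which either function returns a non-empty list would be held back for inspection rather than published.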
3 BRAF, Dabrafenib, and Melanoma in Sugi Atlas
We illustrate the catalog, and the cross-entity mesh that connects it, through one connected example: the gene BRAF, the drug dabrafenib, and the disease melanoma. The three form a single precision-oncology thread, the activating BRAF V600E mutation, the BRAF inhibitors developed against it, and the melanoma in which they are used, so the same relationship can be entered from any of the three pages. Every other gene, drug, and disease in Sugi Atlas follows the same page contract; this thread is a representative slice, not a special case.
3.1 The gene page: BRAF
The BRAF page opens with a deterministic lead sentence assembled from the collected fields: BRAF (HGNC:1097) is a protein-coding gene on chromosome 7q34 encoding the serine/threonine-protein kinase B-raf (UniProt P15056), a RAF-family kinase in the MAPK/ERK pathway. The same sentence states the top curated precision-oncology verdict: BRAF V600E confers sensitivity to a dabrafenib/trametinib regimen at CIViC evidence level A. An at-a-glance digest then gives the figures a reader checks first, each an exact count read from BioBTree rather than an estimate: 1,561 ClinVar variants (60 pathogenic, 39 likely-pathogenic), 391 HPO phenotypes, 10 GWAS associations, the MANE-Select transcript NM_004333, and an intOGen cancer-driver call (activating, across 28 cancer types).
Below the summary the page lays out the eight canonical zones common to every gene page. The Disease and clinical zone carries the curated cancer interpretation in full, including the CIViC narrative that distinguishes the three classes of oncogenic BRAF mutation: the RAS-independent V600 mutants that signal as monomers and respond to current RAF inhibitors, the class-2 mutants that signal as constitutive dimers and resist vemurafenib, and the RAS-dependent class-3 mutants with low kinase activity. The same zone gives the ClinVar breakdown by clinical significance, AlphaMissense pathogenicity scored on the canonical transcript, and SpliceAI predictions. The Protein zone gives the reviewed product’s domains and experimental and predicted structures; the Gene structure zone its transcripts, expression, and regulation.
The Drugs and pharmacology zone is where the thread to the rest of the catalog begins. It tables the molecules with BRAF bioactivity by development phase and, separately, the curated CIViC evidence as drug-variant-indication triples ranked by evidence level: BRAF V600E with vemurafenib or dabrafenib in melanoma, with encorafenib plus cetuximab in colorectal cancer, and with a dabrafenib plus trametinib regimen in several gliomas and in anaplastic thyroid carcinoma. Each therapy and each indication is a link to its own Atlas page. The Related Atlas pages block then summarises the curated cross-entity edges for the gene: the diseases it is associated with, the drugs that target it, and, kept separate, the drugs for which BRAF is only a biomarker. The page is at https://sugi.bio/atlas/gene/BRAF/.
3.2 The drug page: dabrafenib
Dabrafenib is one of the BRAF inhibitors the gene page points to. Its page is anchored at a ChEMBL molecule id and leads, again deterministically, with a sentence describing it as an approved small-molecule antineoplastic agent (ATC L01EC02) that targets BRAF and is recorded against 22 indications. The lead also names its precision-oncology evidence directly: CIViC clinical evidence for 75 variant-indication associations, for example BRAF V600E in melanoma.
The Targets zone is where the drug page’s central discipline shows. Its primary table lists only the GtoPdb-curated mechanism target, BRAF, resolved to its Atlas gene page and annotated with the curated action (inhibition), the affinity (pAffinity 8.49), and the DepMap cancer-dependency signal. The broader, assay-derived ChEMBL bioactivity set, 55 targets here, is reported separately as a count and a sample and is never promoted to a mechanism target, because that promiscuous cloud would otherwise make a kinase inhibitor appear to “target” unrelated proteins. The Indications zone cross-walks the labelled uses to MONDO and EFO so each links to its Atlas disease page (melanoma and metastatic melanoma, several thyroid carcinomas, non-small-cell lung carcinoma, and others), and the clinical zone records the trial-phase distribution together with the CIViC precision-oncology evidence as drug-variant-indication triples, the BRAF V600E sensitivities in melanoma and other cancers. The Related Atlas pages block lists the curated target (BRAF) and, kept separate, the biomarker genes whose variants the CIViC evidence associates with the drug (AKT1 and NF1), encoding the distinction that a biomarker for a drug is not the same as a target of it. The page is at https://sugi.bio/atlas/drug/dabrafenib/.
3.3 The disease page: melanoma
Melanoma is the cancer in which the BRAF V600E mutation is treated with dabrafenib. Its page is seeded by MONDO id and leads with a deterministic sentence: a cancer (an umbrella term over 15 Mondo subtypes) with 75 cohort genes drawn from 201 GWAS associations across 44 studies, 52 CIViC somatic drivers, and 21 ClinVar predisposition records, together with 2,417 clinical trials in which dacarbazine, dabrafenib, and vemurafenib are among the top interventions. The schema.org record carries the federated identifiers (MONDO, EFO, MeSH, Orphanet, DOID, NCIt) that let a consumer confirm this is the same disease across resources.
Most of the page’s biology is assembled by reusing the gene collectors over the associated-gene cohort, then aggregating. The cohort itself (BRAF, CDKN2A, TERT, MC1R, and others) is profiled gene by gene, while the cohort-level analyses run over the wider aggregate set: ranked by statistical over-representation, the cohort’s top enriched Reactome pathways are coherent for melanoma, led by the RAF/MAP-kinase cascade and MITF-M-regulated melanocyte development, while the broad umbrella pathways that would top a raw count fall away. The Therapeutics zone gives two views: the drugs indicated directly for melanoma, 15 approved and 58 in late-stage (phase 3) trials drawn from ChEMBL rather than the cohort, and a cohort druggability view, how many cohort genes carry an approved, phased, or no drug, which carry the most molecules (BRAF leads), and which drugs hit cohort genes. The Clinical trials zone gives the disease-level trials and the CIViC subtype map. The Related Atlas pages block labels the gene group “cohort genes” rather than causal genes, so a polygenic association is not read as causation. The page is at https://sugi.bio/atlas/disease/melanoma/.
3.4 One relationship, three pages
The three pages are not independent write-ups; they are three views of the same curated relationships, built in one pass and linked by the cross-entity mesh. The BRAF V600E to dabrafenib to melanoma association, recorded as CIViC evidence on both the gene and the drug page, is the same edge that lists melanoma among dabrafenib’s indications and dabrafenib among melanoma’s drugs, so a reader (or an agent) can enter the thread from any of the three and reach the other two. The mesh keeps the predicates distinct as it inverts them: on the gene page BRAF is targeted by its inhibitors and associated with its diseases; on the drug page BRAF is a target while AKT1 and NF1 are only biomarkers; on the disease page BRAF is one of the cohort genes. The same contract, and the same kind of connection, holds across every gene, drug, and disease in the catalog.
4 Discussion
Sugi Atlas mines BioBTree deterministically into a comprehensive catalog of genes, drugs, and diseases. Each page is assembled by issuing a fixed plan of chain queries that traverse the graph and rendering the records they return, and the curated relationships gathered along the way are woven into a cross-entity mesh that links the three corpora. We took this deterministic route for what a reference catalog most needs: each value is exactly what the source data records, every entity of a type is covered at the same breadth, and, for a given BioBTree snapshot, the corpus is reproducible and inexpensive to regenerate as that graph is refreshed. The result, as the BRAF, dabrafenib, and melanoma pages illustrate, is pages that consolidate curated content in depth together with a graph of curated links by which a single relationship is reachable from the gene, the drug, and the disease alike.
This design grew out of an LLM-driven prototype. Querying the same graph through an MCP interface produced grounded, well-formed pages, but it did not produce a catalog: the traversal a model chose varied between runs and models, coverage was neither uniform nor reproducible, and the per-entity cost made rebuilding the whole corpus on every data refresh impractical. Fixing the query plans and removing the model from the build trades the model’s flexibility for the properties a reference catalog needs. The two modes are complementary rather than competing. An interactive model querying the graph is well suited to a single, human-steered question with follow-ups [9], whereas the deterministic catalog provides the stable, linkable, browsable reference that such exploration does not, and that both people and agents can consult.
Determinism alone does not make a catalog trustworthy; the curation discipline does. The cross-entity mesh is built only from curated relationships and keeps its predicates distinct, so a gene that is a biomarker for a drug is not recorded as a target of it. Where an edge is contaminated or a value is absent, the page shows an honest empty state rather than a confident guess. A fine-grained disease subtype shows this at work: a granular term such as IDH-wildtype glioblastoma carries almost no annotations of its own, so its molecular sections render honest no-data placeholders instead of borrowing the abundant evidence curated on its broader parent, glioblastoma. Rather than copy that evidence down to a subtype the sources do not support, the Disease family section places the term among its parent, broader ancestors, and sibling subtypes and routes the reader to where the evidence is curated. These are deliberate, opinionated choices, and they are what separate a deterministic catalog from a large but indiscriminate scrape.
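The subtype behaviour can be sketched as follows. This is a toy under stated assumptions, not the Atlas renderer: the `ANNOTATIONS` and `PARENT` tables hold illustrative values (including a placeholder sibling and a simplified hierarchy), and `render_section`, `disease_family`, and the `NO_DATA` text are hypothetical names. It shows the two halves of the rule: a section reads only the term's own annotations, and the family listing is what points the reader to the parent where evidence is curated.

```python
NO_DATA = "No data recorded for this entity."

# Toy annotation table; values are illustrative, not catalog content.
ANNOTATIONS = {
    "glioblastoma": {"genes": ["PTEN", "EGFR"]},
    "IDH-wildtype glioblastoma": {},
}

# Toy child -> parent hierarchy; the sibling entry is a placeholder.
PARENT = {
    "IDH-wildtype glioblastoma": "glioblastoma",
    "hypothetical sibling subtype": "glioblastoma",
    "glioblastoma": "glioma",
}


def render_section(annotations, term, section):
    """Render only the term's own values; never inherit from a parent term."""
    values = annotations.get(term, {}).get(section, [])
    return sorted(values) if values else NO_DATA


def disease_family(parent_of, term):
    """List the parent, all ancestors, and sibling subtypes for navigation."""
    ancestors = []
    node = term
    while node in parent_of:
        node = parent_of[node]
        ancestors.append(node)
    parent = ancestors[0] if ancestors else None
    siblings = sorted(
        child for child, p in parent_of.items() if p == parent and child != term
    ) if parent else []
    return {"parent": parent, "ancestors": ancestors, "siblings": siblings}


subtype_genes = render_section(ANNOTATIONS, "IDH-wildtype glioblastoma", "genes")
parent_genes = render_section(ANNOTATIONS, "glioblastoma", "genes")
family = disease_family(PARENT, "IDH-wildtype glioblastoma")
```

The design choice is that `render_section` has no access to the hierarchy at all, so evidence cannot leak downward by accident; navigation and evidence are handled by separate functions.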
Several established resources also publish per-entity reference pages by aggregating many primary sources [5–8], and integrated evidence platforms and biomedical knowledge graphs consolidate relationships within a domain. These are valuable and offer much that Sugi Atlas does not; equally, Sugi Atlas offers what most of them do not, bringing genes, drugs, and diseases into one comprehensive view, mined deterministically from a single integrated graph, linked across the three by a curated mesh, and refreshable as the graph updates. Taken together, these properties make it a strong complementary reference, useful to researchers and to the agents that increasingly consult such data.
4.1 Future work
Several directions would extend Sugi Atlas. The first is freshness in practice: because the catalog is a deterministic function of BioBTree and inexpensive to rebuild, we plan to regenerate it on a monthly cadence so that it tracks the underlying sources as they release updates. The second is coverage: Sugi Atlas currently builds human pages, whereas BioBTree already spans many model organisms, so the same pipeline can extend the catalog to those species, and broaden it further as BioBTree integrates additional datasets. The third is depth: the per-entity pages can be enriched with further curated content as it becomes available, within the same deterministic page contract. The catalog can also extend beyond per-entity pages: the same deterministic mining can assemble specialised analyses, built over the BioBTree graph for specific research questions and refreshed alongside the gene, drug, and disease pages. Finally, the catalog is a natural source of grounding for AI agents. BioBTree showed that giving a language model structured access to authoritative databases improves its responses [9]; because each Atlas page is a consolidated, current view of a single entity, supplying the relevant page as additional context to an agent is a promising way to improve responses further, which we leave to future evaluation.
5 Availability
The Sugi Atlas pipeline is open-source software, released under the MIT license; the source code and documentation are available in the project repository. The catalog it produces is published openly at https://sugi.bio/atlas/, where every gene, drug, and disease has a stable, browsable page (for example https://sugi.bio/atlas/gene/BRAF/). Sugi Atlas is built on BioBTree [9], the biomedical graph it mines, and can be regenerated as that graph is refreshed.
- Sugi Atlas pipeline (source and documentation): https://github.com/tamerh/sugi-atlas
- Published catalog: https://sugi.bio/atlas/
- BioBTree: https://github.com/tamerh/biobtree, public instance https://sugi.bio
References
- [1] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. “Large language models encode clinical knowledge”. In: Nature 620.7972 (2023), pp. 172–180. doi: 10.1038/s41586-023-06291-2.
- [2] M. Kuehl, D. P. Schaub, F. Carli, L. Heumos, M. Hellmig, C. Fernández-Zapata, et al. “BioContextAI is a community hub for agentic biomedical systems”. In: Nature Biotechnology 43.11 (2025), pp. 1755–1757. doi: 10.1038/s41587-025-02900-9.
- [3] Z. Bai, P. Wang, T. Xiao, T. He, Z. Han, Z. Zhang, and M. Z. Shou. Hallucination of Multimodal Large Language Models: A Survey. Apr. 2025. doi: 10.48550/arXiv.2404.18930.
- [4] X. Hou, Y. Zhao, S. Wang, and H. Wang. Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions. Mar. 2025. doi: 10.48550/arXiv.2503.23278.
- [5] G. Stelzer et al. “The GeneCards Suite: From Gene Data Mining to Disease Genome Sequence Analyses”. In: Current Protocols in Bioinformatics 54.1 (2016), pp. 1.30.1–1.30.33. doi: 10.1002/cpbi.5.
- [6] C. Knox, M. Wilson, C. M. Klinger, et al. “DrugBank 6.0: the DrugBank Knowledgebase for 2024”. In: Nucleic Acids Research 52.D1 (2024), pp. D1265–D1275. doi: 10.1093/nar/gkad976.
- [7] K. J. Kelleher et al. “Pharos 2023: an integrated resource for the understudied human proteome”. In: Nucleic Acids Research 51.D1 (2023), pp. D1405–D1416. doi: 10.1093/nar/gkac1033.
- [8] T. E. Putman, K. Schaper, N. Matentzoglu, V. P. Rubinetti, et al. “The Monarch Initiative in 2024: an analytic platform integrating phenotypes, genes and diseases across species”. In: Nucleic Acids Research 52.D1 (2024), pp. D938–D949. doi: 10.1093/nar/gkad1082.
- [9] T. Gür. BioBTree v2: Grounding LLM Responses with Large-Scale Structured Biomedical Data. 2026. doi: 10.5281/zenodo.18962899.