RTX-KG2 knowledge graph build system

research assistant · Apr 2025 – present · github.com/RTXteam/RTX-KG2

Python, Bash, Snakemake, Neo4j, AWS S3, Docker, Biolink Model, OWL

Figure 1 | the Snakemake build DAG for RTX-KG2. a Validate row sits at the top, and the second row holds about 20 per-source converter stages, one per upstream knowledge base (UMLS, UniProtKB, SemMedDB, ChEMBL, Ensembl, UniChem, NCBIGene, DGIdb, RepoDB, DrugBank, SMPDB, HMDB, GO Annotations, Reactome, miRBase, jensenLab, DrugCentral, IntAct, DisGeNET, KEGG), plus an Ontologies-and-TTL bundle. all converters feed into a single Merge node, after which the Stats, Simplify, Simplify_Nodes, Slim, Simplify_Stats, TSV, and Finish stages produce the final canonicalized release artifacts. the build runs on a dedicated EC2 instance and is parallelized at the per-source converter level by Snakemake.

RTX-KG2 is the build system that produces the biomedical knowledge graph (called KG2) used by the ARAX reasoning agent inside the NIH Biomedical Data Translator project (Wood et al. 2022; BMC Bioinformatics 23:400). the pipeline pulls about 20 source databases (drugs, genes, proteins, diseases, pathways, ontologies, literature), normalizes their identifiers through the project's shared identifier service, applies the Biolink Model's category and predicate vocabulary, and produces a single canonicalized graph (KG2c). the latest production release is KG2.10.3 (August 2025) with about 8.7 million nodes and 56 million edges.
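the shape of the build DAG in Figure 1 can be sketched with the standard library's graphlib. this is an illustration, not the project's Snakefile: it lists only a handful of the converters, assumes every converter waits on Validate, and treats the post-merge stages as a strict linear chain, which is how the figure reads but may differ from the real rule graph.

```python
from graphlib import TopologicalSorter

# A subset of the per-source converters named in Figure 1.
CONVERTERS = ["UMLS", "UniProtKB", "SemMedDB", "ChEMBL", "DrugBank"]

# Post-merge stages, in the order Figure 1 lists them.
POST_MERGE = ["Stats", "Simplify", "Simplify_Nodes", "Slim",
              "Simplify_Stats", "TSV", "Finish"]


def build_dag(converters, post_merge):
    """Map each stage to the set of stages it depends on."""
    deps = {"Validate": set()}
    for name in converters:
        # Assumption: every converter waits on the Validate stage.
        deps[name] = {"Validate"}
    # Merge is the single fan-in point for all converters.
    deps["Merge"] = set(converters)
    previous = "Merge"
    for stage in post_merge:
        # Assumption: the post-merge stages form a linear chain.
        deps[stage] = {previous}
        previous = stage
    return deps


def execution_waves(deps):
    """Group stages into waves; stages in one wave could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

calling execution_waves(build_dag(CONVERTERS, POST_MERGE)) puts all converters in one wave between Validate and Merge, which is the only level where this sketch exposes any parallelism.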

it was originally authored by a team led by Erica C. Wood and Amy K. Glen at Oregon State University. i joined the maintainer team in April 2025. my focus is the build pipeline itself, the cross-repo rollout that propagates a new graph version through the rest of the project, and the integrity checks that decide whether a build can ship at all.

Why this work was needed

Figure 2 | high-level data flow from upstream sources through KG2pre to the canonicalized KG2c served via PloverDB to ARAX. the source databases feed a JSON-format RTX-KG2pre knowledge graph, which is exported as a KGX-format TSV file set and loaded into a Neo4j endpoint. the Node Synonymizer SQLite database, fed by the SRI Node Normalizer, is what turns KG2pre's source-specific identifiers into the canonical concepts used by KG2c. KG2c is then exported as KGX TSV and JSON archives, one feeding the NCATS Knowledge Graph Exchange Archive and the other feeding the RTX-KG2c PloverDB API endpoint, which serves ARAX. Reproduced from Wood et al. 2022 (BMC Bioinformatics 23:400), CC-BY 4.0.

a new release of KG2 is not just one rebuild. each release rebuilds the precursor graph (KG2pre) from sources, runs identifier normalization through the Translator's shared identifier service, derives the canonicalized graph (KG2c) and a node-synonymizer SQLite database from KG2pre, then propagates the new artifacts through PloverDB for serving and through ARAX for reasoning. several downstream databases that depend on KG2 (the citation-distance index, the path-finder index, two trained models) need to be rebuilt in the right order, four OpenAPI specs need their version strings bumped, and five named endpoints need to be updated by SSH-ing into Docker containers and running a per-endpoint pull-and-restart. each new release used to be an undocumented multi-day sequence held together by tribal knowledge.

my contributions split between this rollout work and changes inside the RTX-KG2 build that prevent unfit graphs from leaving the pipeline in the first place.

Co-leading the KG2.10.2c rollout

the first major release i worked on was KG2.10.2c, tracked in Issue-2456 on the cross-repo tracker. i co-led the rollout with another Ramsey Lab member. the sequence walks through three repositories (RTX-KG2, RTX, PloverDB). it starts by building the canonicalized graph plus the node-synonymizer database on a dedicated build host (buildkg2c.rtx.ai), loading the new graph into Neo4j, and verifying it with sample Cypher queries. next comes deploying PloverDB on the matched graph and running its regression tests, then rebuilding four downstream databases in dependency order (the citation-distance database, the path-finder index, an explainable drug-treats-disease model, and a chemical-gene regulation-graph model). after that, four OpenAPI version strings are bumped, the central paths-and-versions config is updated so every endpoint can find the new artifacts, and the new database files are uploaded to the production host and to the project's central file-transfer server. the rollout then walks through five named endpoints (devED, devLM, beta, test, production): on each one, SSH into a Docker container, run a git pull plus a service restart, and run the full ARAX test suite after the restart. the last step is waiting for the central Jenkins pipeline to build the matched PloverDB and pick up the new version.
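the endpoint walk at the end of that sequence can be sketched as an ordered loop. this is a sketch under assumptions, not the runbook itself: deploy and run_tests are hypothetical callables standing in for the SSH pull-and-restart and the ARAX test suite, and halting at the first failing endpoint is an assumption about how a failed test run would be handled.

```python
# Endpoint names and their order are taken from the rollout description.
ENDPOINTS = ["devED", "devLM", "beta", "test", "production"]


def roll_out(endpoints, deploy, run_tests):
    """Deploy to each endpoint in order and stop at the first test failure.

    deploy(name) and run_tests(name) are hypothetical hooks supplied by the
    caller; run_tests returns True when the endpoint passes.
    Returns (completed, failed), where failed is None on full success.
    """
    completed = []
    for name in endpoints:
        deploy(name)
        if not run_tests(name):
            # Assumption: a failing endpoint blocks everything after it.
            return completed, name
        completed.append(name)
    return completed, None
```

keeping the two hooks as parameters means the ordering logic can be exercised with stubs before anything touches a real host.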

i documented every step as we went, and the previously-tribal sequence is now the runbook that lives in the issue templates of the cross-repo tracker.

Build integrity for required Translator metadata

the Translator project requires every edge in the graph to declare two attribution fields: knowledge_level (whether the assertion is, for example, an automated text-mining call or a curator-asserted fact) and agent_type (which agent or pipeline produced it). before this change, the build's edge filter would emit a console warning when those fields were missing on an incoming edge but continue building anyway, which meant edges without the required attribution could quietly reach production (Issue-441).

i changed the filter in process/filter_kg_and_remap_predicates.py so that any edge missing its knowledge-source CURIE or its required metadata fails the build immediately with a non-zero exit code instead of printing a warning (commit f0379a3). the first build run after the change surfaced several real upstream gaps that the warning had been hiding for months; they were fixed in the source ingest before the next release shipped.
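a minimal sketch of the fail-fast behaviour, not the code in process/filter_kg_and_remap_predicates.py. the field names knowledge_level and agent_type come from the text above; primary_knowledge_source is an assumed name for the knowledge-source CURIE field, and the function and exception names are hypothetical.

```python
import sys

# "primary_knowledge_source" is an assumed field name for the
# knowledge-source CURIE; the other two fields are named in the text.
REQUIRED_FIELDS = ("primary_knowledge_source", "knowledge_level", "agent_type")


class MissingEdgeMetadata(Exception):
    """Raised when an edge lacks a required attribution field."""


def check_edge(edge):
    """Raise if any required field is absent or empty on this edge."""
    missing = [field for field in REQUIRED_FIELDS if not edge.get(field)]
    if missing:
        edge_id = edge.get("id", "<no id>")
        raise MissingEdgeMetadata(f"edge {edge_id} is missing: {', '.join(missing)}")


def filter_edges(edges):
    """Return the edges unchanged, failing on the first incomplete one."""
    kept = []
    for edge in edges:
        check_edge(edge)  # previously this would only have warned
        kept.append(edge)
    return kept


def main(edges):
    """Return a process exit code: 0 on success, 1 on missing metadata."""
    try:
        filter_edges(edges)
    except MissingEdgeMetadata as err:
        print(f"ERROR: {err}", file=sys.stderr)
        return 1
    return 0
```

a build script would end with sys.exit(main(edges)), so a workflow runner that checks exit codes stops at this stage instead of carrying incomplete edges forward.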

SemMedDB edge property: rename and revert

a second change tightened how SemMedDB edges carry their supporting text. the Biolink Model's evolving conventions suggested renaming the edge publication-text property from sentence to supporting_text, lining the property up with how every other knowledge source attaches its evidence. i made the rename in the SemMedDB converter (commit 2df6579, tracked under Issue-452), then reverted it the next day (commit 47a03b2) when the upstream Biolink schema update turned out to still be in draft. the property is back to sentence until the upstream change is finalized, and the rename and revert paths are both documented for whoever picks the change back up later.

A reproducible KGX-validation workflow

the released KG2 artifacts are KGX node and edge files (line-delimited JSON) plus build metadata. the graph is validated against the Biolink Model at build time, but the team did not have a step-by-step procedure for re-running that validation against an arbitrary released artifact, which made spot-checks during a rollout, or independent inspection of a different upstream pipeline's output, ad-hoc and error-prone. i wrote a step-by-step KGX-validation guide (Issue-487, commit 01677d2) that walks through running the validator from the translator-ingests repository against KG2.10.3c, reading the resulting validation report, and triaging the typical failure classes (CURIE-format issues, unrecognized categories or predicates, missing knowledge-source attribution, oversized edge properties). the guide ships in the repository's docs/ folder and is now the QA reference for new builds.
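the failure classes above can be illustrated with a small stand-alone check over a line-delimited JSON edges file. this is not the translator-ingests validator and does not reproduce its report format. the CURIE pattern is deliberately loose, the size threshold is a made-up number, primary_knowledge_source is an assumed field name, and the set of known predicates has to be supplied by the caller.

```python
import json
import re
from collections import Counter

# Loose CURIE shape (prefix, colon, non-empty local part). An assumption
# for illustration, not the Biolink Model's own validation rule.
CURIE_RE = re.compile(r"^[A-Za-z][\w.\-]*:\S+$")

# Hypothetical threshold for flagging an oversized string property.
MAX_PROPERTY_CHARS = 10_000

# "primary_knowledge_source" is an assumed field name.
ATTRIBUTION_FIELDS = ("primary_knowledge_source", "knowledge_level", "agent_type")


def classify_edge(edge, known_predicates, max_chars=MAX_PROPERTY_CHARS):
    """Return the list of failure classes that apply to one edge dict."""
    problems = []
    if not all(CURIE_RE.match(str(edge.get(key, ""))) for key in ("subject", "object")):
        problems.append("curie_format")
    if edge.get("predicate") not in known_predicates:
        problems.append("unrecognized_predicate")
    if not all(edge.get(field) for field in ATTRIBUTION_FIELDS):
        problems.append("missing_attribution")
    if any(isinstance(value, str) and len(value) > max_chars for value in edge.values()):
        problems.append("oversized_property")
    return problems


def triage(lines, known_predicates):
    """Count failure classes across an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts.update(classify_edge(json.loads(line), known_predicates))
    return counts
```

passing an open file handle as lines keeps memory flat on a large edges file, since each line is parsed and discarded in turn.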

Smaller fixes and predicate remaps

a handful of smaller tickets cleaned up the edges of the graph as the Biolink Model evolved. the DRUGBANK ingest had been mapping its drug-interaction link onto biolink:physically_interacts_with, which conflated drug-drug pharmacological interactions with the much narrower notion of physical interaction; i remapped it to a more accurate predicate (Issue-340). a similar predicate remap landed for the DisGeNET ingest (Issue-434). a KG2.9.0pre build had emitted incorrect subclass_of edges, traced to a mislabeled converter pass (Issue-377). pruning the build's requirements file (Issue-433) dropped the now-unused ontobio library and its transitive dependencies, shrinking the build image and removing a long chain of pinned versions that no longer constrained anything. coordination with the upstream Biolink Model team on adding an in_taxon node slot landed in February 2026 (Issue-468).

Reading list

RTXteam/RTX-KG2 · kg2cploverdb.ci.transltr.io · BMC Bioinformatics 2022 (DOI) · PMC9520835 · KGX validation guide · Biolink Model