Research project
NCATS Translator Ingest Pipelines
Python, uv, Koza, ORION, Biolink Model, AWS S3, AWS EC2, Flask, Gunicorn, nginx, Docker
NCATS Translator Ingests is the consolidated data-ingest pipeline for the NIH Biomedical Data Translator project, run by the project's data-ingest working group. The repository replaces the per-knowledge-provider Phase 2 ingests with one reviewed pipeline that produces a standardized graph format (KGX, the Knowledge Graph Exchange format: nodes and edges as line-delimited JSON) for about 30 biomedical sources. Outputs are validated against the Biolink Model (the project's shared vocabulary for biological categories and predicates), normalized through the SRI Node Normalizer (the project's shared identifier service), released as compressed archives, and uploaded to a shared S3 bucket.
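As a rough illustration of that format (the identifiers and properties below are simplified placeholders, not records from an actual release), each node and each edge serializes as one JSON object per line in separate files:

    import json

    # Hypothetical, trimmed-down records; real KGX output carries more Biolink properties.
    node = {"id": "NCBIGene:3845", "category": ["biolink:Gene"], "name": "KRAS"}
    edge = {
        "subject": "NCBIGene:3845",
        "predicate": "biolink:located_in",
        "object": "GO:0005886",
        "primary_knowledge_source": "infores:goa",
    }

    # KGX keeps nodes and edges in separate line-delimited JSON files.
    with open("nodes.jsonl", "w") as nodes_file:
        nodes_file.write(json.dumps(node) + "\n")
    with open("edges.jsonl", "w") as edges_file:
        edges_file.write(json.dumps(edge) + "\n")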
I joined the team in summer 2025. My focus is end-to-end ownership of three of the largest sources (the Gene Ontology Annotation database, SemMedDB, and PathBank), plus the upload and orchestration layer that ships builds to the rest of the project.
Why this work was needed
The previous Phase 2 design had each knowledge provider maintaining its own ingest scripts, with no shared validation step and no shared release format. Two providers ingesting the same upstream source could produce subtly different graphs, and downstream consumers had to know each provider's quirks. Phase 3 collapses that into one pipeline with one validation layer, one normalization layer, and one release artifact format. Each ingest is one self-contained folder under src/translator_ingest/ingests/, with a download specification, a transform written against a small graph-transform domain-specific language (called Koza, from the Monarch Initiative), a per-source configuration file, unit tests, and a Resource Ingest Guide (a YAML document that describes scope, modeling choices, and meta-knowledge-graph fields in a way the project leadership can review).
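As a sketch of that layout (file names are illustrative, not copied from the repository), a single ingest folder looks roughly like this:

    src/translator_ingest/ingests/<source_name>/
        download.yaml            # where to fetch the upstream release
        <source_name>.py         # the Koza transform that emits KGX nodes and edges
        <source_name>.yaml       # per-source Koza configuration
        <source_name>_rig.yaml   # the Resource Ingest Guide
        test_<source_name>.py    # unit tests for the transform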
My work covers three of those self-contained ingests end-to-end, plus the upload step, plus the orchestrator that runs everything.
Building the GOA ingest
The Gene Ontology Annotation database (GOA) ties gene products to Gene Ontology terms with evidence codes. It was my first contribution, shipped as PR-53 in August 2025. The first version covered the Resource Ingest Guide draft and the working transform code. A second pass added a versioning fallback so the ingest can recover when the upstream release file is missing or its version string fails to parse (PR-108). A third pass landed the manual-QA fixes the project leadership flagged during the GOA Tier-1 review (PR-288).
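A minimal sketch of the idea behind that fallback (the function name, date format, and fallback choice are my illustration, not the repository's actual code):

    from datetime import date, datetime

    def resolve_goa_version(raw_version: str | None) -> str:
        """Prefer the upstream release date string; otherwise fall back to the
        ingest date so a missing or malformed version cannot stall the build."""
        if raw_version:
            try:
                return datetime.strptime(raw_version.strip(), "%Y-%m-%d").date().isoformat()
            except ValueError:
                pass  # unparseable version string, fall through to the fallback
        return date.today().isoformat()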
The QA pass is where I learned how to read the project's reviewer comments and turn them into specific code changes: aligning the predicates to the canonical Biolink terms, attaching the right knowledge-source attribution, and making sure every edge carried the evidence-code metadata the downstream tools expect.
SemMedDB end-to-end ownership
SemMedDB is the Semantic MEDLINE Database, a corpus of about 100 million predications mined automatically from PubMed abstracts by the National Library of Medicine. It is the largest single source in the pipeline by edge count, and it is the one that demands the most modeling judgment: every edge is a noisy machine-extracted statement, the raw dump is gated behind a UMLS license that prevents public redistribution, and the predicate vocabulary needs mapping onto the project's standard predicate list before any downstream tool can use it.
I owned the SemMedDB ingest from initial scaffolding through every subsequent revision. The arc spans eight pull requests over six months. The initial pair shipped the working ingest (PR-103) and added retrieval from a shared S3 bucket so the team could distribute the licensed dump without redistributing it publicly (PR-122), with a follow-up landing the remaining transform fixes (PR-169).
The manual-QA cycle ran from January through February 2026 across three pull requests (PR-253, PR-287, PR-306). The reviewer flagged four issues: every edge needed an explicit knowledge-level and agent-type attribution (text-mining agent, not-provided knowledge level, since the predications are machine-extracted); the preventative_for_condition predicate fell outside the project's standard list and had to be remapped to treats_or_applied_or_studied_to_treat; variant-form qualifiers were missing on the gene-mention edges; and the supporting text snippets the upstream extractor produces should not be dropped during transform but routed through a nested has_supporting_studies and has_study_results chain so they reach the final graph as inspectable evidence.
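A minimal sketch of the shape of those fixes (the mapping table and helper are illustrative; the enum values come from the Biolink Model, and the real transform also handles the qualifiers and the evidence chain):

    # Hypothetical post-processing applied to each SemMedDB edge during transform.
    PREDICATE_REMAP = {
        "biolink:preventative_for_condition": "biolink:treats_or_applied_or_studied_to_treat",
    }

    def finalize_edge(edge: dict) -> dict:
        # Remap out-of-model predicates onto the project's standard list.
        edge["predicate"] = PREDICATE_REMAP.get(edge["predicate"], edge["predicate"])
        # Machine-extracted predications: text-mining agent, no curated knowledge level.
        edge["agent_type"] = "text_mining_agent"
        edge["knowledge_level"] = "not_provided"
        return edge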
A separate problem surfaced once the validated graph started flowing into downstream tools: a small fraction of SemMedDB edges carry tens of thousands of supporting PubMed identifiers, which blew through the project's per-edge size limits and caused some downstream consumers to fail on load. The fix (PR-343) introduced a configurable cap (default on, threshold 200 publications per edge, configurable through environment variables) that keeps the union of the top 100 supporting publications by minimum subject-object confidence and the top 100 by publication year, so the highest-confidence and most-recent evidence both survive. That pull request was later superseded by PR-353, which folded the cap into a larger refactor that aligned the SemMedDB qualifiers with the predicate scheme used by another Translator agent (BTE, the BioThings Explorer), expanded the reader's allowed-predicate filter from 10 entries to 20 to admit Biolink predicates the previous pass had been silently dropping, and emitted specific Biolink association subclasses (ChemicalAffectsGeneAssociation, GeneAffectsChemicalAssociation, GeneRegulatesGeneAssociation, CausalGeneToDiseaseAssociation) instead of generic associations.
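A rough sketch of the capping rule (the field names and the way confidence and year are stored are assumptions; the thresholds mirror the defaults described above):

    def cap_publications(pubs: list[dict], threshold: int = 200, keep: int = 100) -> list[dict]:
        """Keep the union of the most confident and the most recent supporting
        publications when an edge carries more than `threshold` of them."""
        if len(pubs) <= threshold:
            return pubs
        by_confidence = sorted(pubs, key=lambda p: p["min_subject_object_confidence"], reverse=True)
        by_year = sorted(pubs, key=lambda p: p["publication_year"], reverse=True)
        kept = {p["pmid"]: p for p in by_confidence[:keep]}
        kept.update({p["pmid"]: p for p in by_year[:keep]})
        return list(kept.values())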
PathBank with strict normalization
PathBank is a curated database of about 110,000 small-molecule pathways. The challenge with PathBank is that it uses its own private identifier scheme (the PWML namespace for pathway markup) that the project's identifier-normalization service does not recognize, so a naive ingest loses every pathway node the moment normalization runs. I built the ingest using a converter from a sister project as a reference (PR-138), then added an identifier-mapping step that rewrites the PWML identifiers to the Small Molecule Pathway Database (SMPDB) namespace before normalization runs (PR-197), since SMPDB is one of the namespaces the normalizer recognizes. After the mapping step landed, every pathway node survived normalization and the previously empty pathway subgraph populated correctly.
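A rough sketch of that mapping step (the example identifier pair and the shape of the mapping table are illustrative; the real mapping is derived from PathBank's own cross-references):

    # Hypothetical rewrite of private PathBank pathway identifiers into the SMPDB
    # namespace, applied before the SRI Node Normalizer runs.
    PWML_TO_SMPDB = {
        "PW000001": "SMPDB:SMP0000001",  # illustrative pairing only
    }

    def remap_pathway_id(curie: str) -> str:
        local_id = curie.split(":", 1)[-1]
        return PWML_TO_SMPDB.get(local_id, curie)  # leave anything unmapped untouched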
A documentation pass (PR-289) recorded the longer-term open modeling questions in the Resource Ingest Guide, so a future maintainer can pick up the thread without re-discovering the PathBank quirks. The manual-QA pass (PR-315) landed the remaining reviewer fixes and brought the validation report to zero errors. With validation clean, I then turned on strict normalization for PathBank (PR-355), which makes the pipeline drop edges whose endpoints fail to normalize instead of letting them through as warnings.
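In practice, strict normalization amounts to a filter like the following sketch (the names are assumptions, not the pipeline's actual code):

    def filter_edges_strict(edges: list[dict], normalized_ids: set[str]) -> list[dict]:
        # With strict normalization on, an edge survives only if both of its
        # endpoints resolved to a normalizer-recognized identifier.
        return [
            edge for edge in edges
            if edge["subject"] in normalized_ids and edge["object"] in normalized_ids
        ]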
The S3 upload and EBS cleanup pipeline
The Phase 3 pipeline stores released artifacts in an S3 bucket so downstream consumers can pull them by source release version. The build host (an AWS EC2 instance with a 200-gigabyte attached EBS volume) accumulates intermediate artifacts as the pipeline runs, and previous versions of the team's upload tooling did not clean up after themselves, so the EBS volume filled within a few release cycles and stalled the next build.
I authored the upload module from scratch (PR-217): about 960 new lines of Python split between a command-line entry point (upload_s3.py) and a reusable library (util/storage/s3.py). The behavior is rsync-like: every upload always overwrites whatever is already at the destination key, so a partial upload from an earlier failed run cannot leave stale content behind. After a successful upload the cleanup step removes every local copy of the source except the latest version, which keeps the EBS volume from filling. Older versions remain in S3 and can be pulled back to the host on demand. The destructive S3-side cleanup path (the one that actually deletes objects from the bucket) requires a typed two-step confirmation, since one accidental run there would lose the team's release history. The module also auto-discovers what is sitting under data/ and releases/ on the build host and treats them as two independent upload lists, so the operator does not have to know the layout in advance.
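A condensed sketch of the two behaviors that matter most, the overwrite-always upload and the typed confirmation gate on the destructive path (function names, prompts, and layout are illustrative, not the module's actual interface):

    import boto3

    s3 = boto3.client("s3")

    def upload_release(local_path: str, bucket: str, key: str) -> None:
        # upload_file always overwrites the destination key, so a partial
        # earlier upload cannot leave stale content behind.
        s3.upload_file(local_path, bucket, key)

    def delete_remote_prefix(bucket: str, prefix: str) -> None:
        # Destructive S3-side cleanup requires a typed two-step confirmation.
        if input(f"Type the prefix to delete ({prefix}): ").strip() != prefix:
            raise SystemExit("confirmation failed, nothing deleted")
        if input("Type DELETE to confirm: ").strip() != "DELETE":
            raise SystemExit("confirmation failed, nothing deleted")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                s3.delete_object(Bucket=bucket, Key=obj["Key"])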
A small follow-up (PR-232) wired two configuration variables (INGESTS_STORAGE_URL and INGESTS_RELEASES_URL) so the pipeline knows the public URL where its uploaded artifacts will be served. Those two variables are the seam between this repository and the companion web-server repository, RTXteam/kgx-storage, described next.
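A minimal illustration of how those variables get consumed (the default values and the path shape are placeholders, not the repository's actual URL layout):

    import os

    # Hypothetical defaults; the real values point at the kgx-storage server described next.
    storage_url = os.environ.get("INGESTS_STORAGE_URL", "https://kgx-storage.rtx.ai/data")
    releases_url = os.environ.get("INGESTS_RELEASES_URL", "https://kgx-storage.rtx.ai/releases")

    # Illustrative shape of a public artifact URL built from the release base.
    artifact_url = f"{releases_url}/<source>/<version>/edges.jsonl.gz"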
The companion web server
The pipeline writes its release artifacts to S3, but team members and downstream consumers also need a browser-friendly way to inspect those artifacts without AWS credentials. The companion repository (RTXteam/kgx-storage) is a small Flask, Gunicorn, and nginx web server that provides that browsing view, deployed as the public site at kgx-storage.rtx.ai. I built and maintain it solo; every commit is mine.
The server runs on a single Ubuntu EC2 instance, reads the bucket through an EC2 instance IAM role (so no credentials sit on disk), and exposes folder-style URLs that mirror the S3 layout. Downloads use one-hour presigned S3 URLs so the browser never sees AWS keys directly. A precomputed metrics file keeps folder size and file count fast to render on the page; a cron job recomputes it and signals Gunicorn to reload. TLS comes from Let's Encrypt via certbot with auto-renewal. The systemd unit hardens the process with no-new-privileges, a read-only home, a private temporary directory, and a half-gigabyte memory ceiling.
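A minimal sketch of the download path (the route name, bucket name, and URL shape are assumptions; the real server also renders folder listings and the precomputed metrics):

    import boto3
    from flask import Flask, redirect

    app = Flask(__name__)
    s3 = boto3.client("s3")  # credentials come from the EC2 instance IAM role, not from disk
    BUCKET = "example-kgx-bucket"

    @app.route("/download/<path:key>")
    def download(key: str):
        # Hand the browser a one-hour presigned URL instead of exposing AWS keys.
        url = s3.generate_presigned_url(
            "get_object", Params={"Bucket": BUCKET, "Key": key}, ExpiresIn=3600
        )
        return redirect(url)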
This server matters for the orchestrator described in the next section because the project's central deployment does not allow SSH onto the build host, so the only way to inspect a build that ran while no human was logged in is to click through to kgx-storage.rtx.ai for the report and log files the orchestrator uploads. The two repositories are deployed independently so the pipeline and the UI can ship on different cadences, but together they form one workflow.
The make build orchestrator
The pipeline ships four canonical Make targets: make run to download and transform a source, make merge to deduplicate across sources into a single named graph, make release to package the merged graph as compressed archives, and make upload to push to S3. Running them by hand for about 30 sources is fragile. Each source has different memory requirements, some can run in parallel and some cannot, and the project's central deployment does not allow SSH onto the build host, so a hung build is invisible until the next morning.
I am the author of an open pull request, PR-361, that adds a fifth target, make build, sitting on top of the four existing ones as a pure orchestrator with no changes to the underlying pipeline code. The orchestrator runs each stage in sequence with wall-clock and resident-memory tracking per source. It executes the parallelizable sources concurrently through Python's ProcessPoolExecutor with a configurable worker count, and runs the two memory-heavy sources (the Comparative Toxicogenomics Database and SemMedDB) sequentially before the parallel batch so they do not fight each other for RAM. It also watches system memory and triggers a graceful shutdown if usage crosses 95 percent, polls the per-process I/O counters under /proc/<pid>/io to detect a stalled subprocess, and writes a structured build report at the end.
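A condensed sketch of the orchestration pattern (the source lists, worker count, memory check, and make invocation shape are illustrative, not the values or interface in PR-361):

    import subprocess
    import time
    from concurrent.futures import ProcessPoolExecutor

    SEQUENTIAL = ["ctd", "semmeddb"]   # memory-heavy sources, run one at a time
    PARALLEL = ["goa", "pathbank"]     # illustrative subset of the remaining sources

    def run_source(source: str) -> None:
        # Each source still goes through the existing make target.
        subprocess.run(["make", "run", f"SOURCE={source}"], check=True)

    def memory_used_fraction() -> float:
        # Read /proc/meminfo directly rather than assuming an extra dependency.
        fields = {}
        with open("/proc/meminfo") as fh:
            for line in fh:
                key, value = line.split(":", 1)
                fields[key] = int(value.split()[0])  # values are reported in kB
        return 1.0 - fields["MemAvailable"] / fields["MemTotal"]

    if __name__ == "__main__":
        for source in SEQUENTIAL:
            run_source(source)
        with ProcessPoolExecutor(max_workers=4) as pool:
            futures = [pool.submit(run_source, s) for s in PARALLEL]
            while not all(f.done() for f in futures):
                if memory_used_fraction() > 0.95:
                    # Crossed the memory ceiling: stop scheduling new work and wind down.
                    pool.shutdown(wait=False, cancel_futures=True)
                    break
                time.sleep(10)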
The reports and logs are uploaded to S3 alongside the release artifacts and surfaced through the companion web server, so a build that ran while no human was logged in can still be inspected the next morning by clicking through the kgx-storage.rtx.ai page. The suggestion to auto-generate the reports came from another Ramsey Lab member; I expanded that into the orchestrator.