Research project

ARAX Biomedical Reasoning Agent

Python, Flask, uWSGI, MySQL, SQLite, Neo4j, Docker, AWS EC2, Kubernetes

ARAX is the modular reasoning agent of the NIH Biomedical Data Translator program (Glen et al. 2023; Bioinformatics 39(3):btad082). It accepts a structured biomedical question, dispatches it to about 40 knowledge providers, merges the answers, applies statistical overlays, prunes and ranks the results, and returns a TRAPI response with the supporting evidence attached.

It was originally authored by a team led by Amy K. Glen at Oregon State University and David Koslicki at Penn State, with Eric W. Deutsch at the Institute for Systems Biology and Stephen A. Ramsey at OSU as senior authors. I joined the maintainer team in early 2025. My focus is the runtime, the release path, and the test infrastructure: the layer that decides whether a new graph version actually lands cleanly in production and whether the test suite gives the team a reliable signal.

Figure 1 | ARAX architecture. Left: how a query graph evolves as ARAX answers it, gathered by Expand, annotated by Overlay, pruned by Filter, enumerated by Resultify, and scored by Ranker. Right: the modular system, accepting three query input forms (a query graph, the ARAXi domain-specific language, or a TRAPI workflow), routing through the query-graph interpreter into the five core modules, and dispatching out to about 40 knowledge providers. Reproduced from Glen et al. 2023 (Bioinformatics 39(3):btad082), CC-BY 4.0.

Why this work was needed

ARAX is real production software running at arax.ncats.io (and on three ITRB-hosted environments at arax.ci.transltr.io, arax.test.transltr.io, and arax.transltr.io). Each new release of the underlying knowledge graph (called KG2, refreshed every few months) is supposed to roll out cleanly, but in practice the rollout was an undocumented multi-day sequence held together by tribal knowledge: multiple downstream databases to rebuild in the right order, four OpenAPI version strings to bump, and five named endpoints to update one by one inside a Docker container. At the same time, the runtime stack itself was aging: Python 3.9 was nearing end of security support, an old cryptography dependency had stopped receiving Python-version updates, and the test runner was forcing every CI run to download about 200 gigabytes of databases by default, regardless of whether the test in question actually needed them.

What ARAX actually answers, briefly

Figure 2 | An ARAX query in three pieces. The top is the question (which proteins interact with acetaminophen?). The middle is the larger knowledge graph the question is matched against. The bottom is the three subgraphs that fit the question pattern. Reproduced from Glen et al. 2023 (Bioinformatics 39(3):btad082), CC-BY 4.0.

A user (or a downstream tool) sends ARAX a small graph that defines the question they want answered. One node in that graph is pinned to a specific concept, such as a particular drug or disease, and the remaining nodes and edges define the pattern to look for. ARAX reads the question, pulls in candidate matches from its knowledge providers, and returns the subgraphs that match the pattern, scored and ranked. Figure 2 shows a minimal example: the question asks which proteins interact with a specific drug; the system finds three matching proteins in a larger knowledge graph and returns each as a result.
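
As a concrete illustration, here is roughly what that minimal question looks like as a TRAPI message posted to ARAX. This is a sketch from memory: the endpoint path, API version, and the acetaminophen CURIE are illustrative and may not match the current deployment.

```python
# Sketch of the Figure 2 question ("which proteins interact with acetaminophen?")
# as a TRAPI query graph. Endpoint path and CURIE are illustrative.
import requests

query = {
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {"ids": ["CHEMBL.COMPOUND:CHEMBL112"]},   # pinned node: the drug
                "n1": {"categories": ["biolink:Protein"]},       # unpinned node: any protein
            },
            "edges": {
                "e0": {
                    "subject": "n1",
                    "object": "n0",
                    "predicates": ["biolink:physically_interacts_with"],
                }
            },
        }
    }
}

response = requests.post("https://arax.ncats.io/api/arax/v1.4/query", json=query, timeout=300)
for result in response.json()["message"]["results"]:
    print(result["node_bindings"])   # each result binds n1 to one matching protein
```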

This mattered for my work because ARAX is interesting as a research system precisely because it composes many moving pieces (the parser, the dispatcher, the per-knowledge-provider clients, the overlay statistics, the ranker), and each of those pieces ages independently. My job is to keep the composition stable through dependency churn, library upgrades, and graph-version rollouts.

Co-leading the KG2.10.2c rollout

The first major piece was the KG2.10.2c rollout, tracked in Issue-2456, which I co-led with other Ramsey Lab members. The rollout touches three repositories (RTX, RTX-KG2, PloverDB) and walks through the following sequence:

- Build the canonicalized graph and node synonymizer on a dedicated build host.
- Load the new graph into Neo4j and verify it with Cypher queries (see the sketch after this list).
- Deploy PloverDB on the matched graph and run its regression tests.
- Rebuild four downstream databases in dependency order: the citation-distance database, the path-finder index, the explainable-drug-treats-disease model, and the chemical-gene regulation-graph model.
- Bump four OpenAPI version strings (one for the ARAX spec, one for the KG2 knowledge-provider spec, plus a working pair).
- Update the central paths-and-versions config so every endpoint can find the new artifacts.
- Upload the new database files to the production host and to the ITRB file-transfer server.
- Walk the rollout through five named endpoints (devED, devLM, beta, test, production), SSH-ing into each endpoint's Docker container for a git pull plus a service restart.
- Run the full ARAX test suite on each endpoint after its restart.
- Wait for the ITRB Jenkins pipeline to build the matched PloverDB and pick up the new version.
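
As one example of the verification step called out above, here is a minimal sketch of the kind of post-load Neo4j sanity check; the connection details and the count threshold are placeholders, not the runbook's actual values.

```python
# Minimal post-load sanity check against the new KG2 Neo4j instance.
# Host, credentials, and thresholds are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    node_count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    edge_count = session.run("MATCH ()-[r]->() RETURN count(r) AS n").single()["n"]
    # A fresh KG2 load should land in the same order of magnitude as the previous
    # build; wildly different counts usually mean a broken or partial load.
    print(f"nodes: {node_count:,}  edges: {edge_count:,}")
    assert node_count > 1_000_000, "suspiciously few nodes; check the load"

driver.close()
```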

I documented every step and turned the previously tribal sequence into the runbook that now lives in the repository's issue templates.

Migrating Python from 3.9 to 3.12

The Python migration shipped as PR-2596 in December 2025 and resolved Issue-2521. The diff itself is small (requirements.txt, the Dockerfile, two .start scripts, the per-graph paths config, and a one-line fix in ARAX_expander.py for a 3.12 incompatibility), but it required navigating a thicket of pinned-version compatibility constraints. The biggest move was removing simple-crypt (a wrapper around the long-abandoned pycrypto, which does not support Python 3.12) and replacing it with pycryptodome 3.23, the maintained drop-in replacement for pycrypto. Beyond that I bumped pandas (1.5.3 to 2.3.1), NumPy (1.24.2 to 1.26.4), lxml (to a version with prebuilt wheels for 3.12), SciPy, scikit-learn, boto3 (1.24.59 to 1.40.16, picking up an urllib3 2.5+ requirement that several other dependencies needed), and PyYAML (6.0 to 6.0.2 for wheel availability). I also removed an asyncio==3.4.3 pin that had been redundant since asyncio entered the standard library in Python 3.4. The migration was coordinated with the matching PloverDB image upgrade so the two services moved together.
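
To make the crypto swap concrete, here is a hypothetical sketch of the replacement surface, not ARAX's actual wrapper: simple-crypt exposed roughly encrypt(password, data) and decrypt(password, data), and a pycryptodome-based equivalent can cover the same surface with PBKDF2 key derivation plus AES-GCM. (This sketch is not format-compatible with data previously encrypted by simple-crypt.)

```python
# Hypothetical pycryptodome replacement for simple-crypt's encrypt/decrypt surface.
# Not ARAX's actual wrapper and not compatible with simple-crypt's on-disk format.
from Crypto.Cipher import AES
from Crypto.Protocol.KDF import PBKDF2
from Crypto.Random import get_random_bytes

def encrypt(password: str, data: bytes) -> bytes:
    salt = get_random_bytes(16)
    key = PBKDF2(password, salt, dkLen=32, count=200_000)   # derive a 256-bit key
    cipher = AES.new(key, AES.MODE_GCM)
    ciphertext, tag = cipher.encrypt_and_digest(data)
    return salt + cipher.nonce + tag + ciphertext            # 16 + 16 + 16 byte header

def decrypt(password: str, blob: bytes) -> bytes:
    salt, nonce, tag, ciphertext = blob[:16], blob[16:32], blob[32:48], blob[48:]
    key = PBKDF2(password, salt, dkLen=32, count=200_000)
    cipher = AES.new(key, AES.MODE_GCM, nonce=nonce)
    return cipher.decrypt_and_verify(ciphertext, tag)        # raises if tampered with
```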

A cached dynamic meta-knowledge-graph endpoint

The /meta_knowledge_graph endpoint exposes a summary of which node categories and edge predicates each knowledge provider can answer, used by tools that plan TRAPI queries against the federation. It used to be served from a static file that was hand-regenerated on each release, which meant it was reliably wrong by some amount until someone remembered to refresh it. The replacement, shipped as PR-2557 resolving Issue-2504, builds the meta-graph dynamically by fetching from PloverDB and merging in each knowledge provider's own meta-graph, caches the result for one hour, and refreshes the cache on a background schedule with automatic fallback to the previous cached copy if the upstream is briefly unavailable.
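
The caching pattern itself is simple; below is a minimal Flask sketch of it (illustrative only: the URL, helper names, and wiring are placeholders, and the real endpoint merges in every knowledge provider's meta-graph rather than just PloverDB's).

```python
# Minimal sketch of a TTL-cached /meta_knowledge_graph endpoint with background
# refresh and fallback to the last good copy. Names and URL are placeholders.
import threading
import time

import requests
from flask import Flask, jsonify

app = Flask(__name__)
PLOVER_META_URL = "https://example.org/plover/meta_knowledge_graph"  # placeholder
TTL_SECONDS = 3600
_cache = {"meta_kg": None}

def merge_provider_meta_graphs(plover_meta: dict) -> dict:
    # Placeholder: the real merge unions node categories and edge predicates
    # from every knowledge provider's own meta-graph.
    return plover_meta

def rebuild_cache() -> None:
    try:
        plover_meta = requests.get(PLOVER_META_URL, timeout=30).json()
        _cache["meta_kg"] = merge_provider_meta_graphs(plover_meta)
    except requests.RequestException:
        pass  # upstream briefly unavailable: keep serving the previous copy

def refresh_loop() -> None:
    while True:
        rebuild_cache()
        time.sleep(TTL_SECONDS)

@app.route("/meta_knowledge_graph")
def meta_knowledge_graph():
    if _cache["meta_kg"] is None:
        rebuild_cache()  # first request after startup builds synchronously
    return jsonify(_cache["meta_kg"])

threading.Thread(target=refresh_loop, daemon=True).start()
```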

Flipping a 200-gigabyte default

Running the ARAX test suite used to download about 200 gigabytes of per-graph databases on every fresh clone, regardless of which tests you intended to run. The opt-out flag was --nodatabases, which everyone forgot half the time. PR-2674 flipped the default: the database download is now skipped unless you pass --withdatabases explicitly. The reasoning is that the unit-test subset and most of the integration tests do not exercise the per-graph artifacts directly, so the download was wasted bandwidth and disk most of the time. Stephen Ramsey had flagged this as a longstanding papercut.
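
Assuming a pytest-style harness, the flip amounts to an opt-in flag whose default is False; this is an illustrative conftest.py sketch, not the actual ARAX test plumbing.

```python
# Illustrative conftest.py sketch of the flipped default: per-graph databases
# are only downloaded when --withdatabases is passed explicitly.
def pytest_addoption(parser):
    parser.addoption(
        "--withdatabases",
        action="store_true",
        default=False,
        help="download the per-graph databases (~200 GB) before running tests",
    )

def pytest_configure(config):
    if config.getoption("--withdatabases"):
        download_databases()

def download_databases():
    # Stand-in for the real download step, which fetches the artifacts listed
    # in the paths-and-versions config.
    print("downloading per-graph databases...")
```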

Fixing the citation-distance build and sizing up the build host

The citation-distance database (built on a dedicated host called ngdbuild2.rtx.ai) generates a per-curie distance metric over a corpus of about 25 million PubMed records. Two things had broken it. First, the build was being killed by the kernel's out-of-memory killer at the "loading data into the intermediate database" step, once it had grown to about 64 gigabytes of resident memory on a 64-gigabyte machine. We provisioned a new host (32 vCPUs, 128 gigabytes of RAM, 400 gigabytes of disk, Ubuntu 22.04), tracked in Issue-2466, and the build now finishes well below the new memory ceiling.

Second, another Ramsey Lab member and I tracked down a silent data-dropping bug (Issue-2470). A previous build-time optimization had introduced a dictionary re-initialization between the baseline-corpus pass and the dump-to-disk pass, so the baseline records were written into the in-memory dict and then immediately discarded. The fix was a one-line move to keep the baseline pass's results, but identifying it required walking through the build's intermediate artifact files and confirming that the missing records had been present in older builds.
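
The bug pattern is easy to reconstruct in miniature (function and variable names here are hypothetical, not the actual build script): whatever the first pass accumulates is silently thrown away when the dict is re-initialized before the second pass.

```python
# Toy reconstruction of the silent data-dropping bug; names are hypothetical.
def load_baseline_corpus(mapping: dict) -> None:
    mapping["CURIE:baseline"] = ["PMID:1", "PMID:2"]   # stands in for ~25M records

def load_update_files(mapping: dict) -> None:
    mapping["CURIE:update"] = ["PMID:3"]

curie_to_pmids: dict = {}
load_baseline_corpus(curie_to_pmids)

curie_to_pmids = {}            # BUG: re-initialization discards the baseline pass
load_update_files(curie_to_pmids)

print(curie_to_pmids)          # only the update records survive; the fix is to
                               # drop the re-initialization so both passes persist
```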

Reading list