Research project

PloverDB In-Memory Knowledge Graph Service

Python, Flask, uWSGI, nginx, Docker, AWS EC2, Kubernetes

PloverDB is the in-memory database service that holds the Translator project's main knowledge graph (about 7 million biological concepts and 30 million relationships among them) entirely in computer memory and answers questions against it as a web service (Glen, Deutsch, and Ramsey 2025; Bioinformatics 41(7):btaf380). It was originally written by Amy K. Glen and is now maintained by Stephen Ramsey, Frankie Hodges, and me.

My focus is the runtime layer: the parts of the system that decide whether the service stays up under sustained real-world load. The legacy stack was an old prebuilt container image, abandoned by its upstream maintainer and pinned to a Python interpreter that no longer received security patches. Worse, the way it spawned worker processes was incompatible with how much data this service holds.

Figure 1 | High-level architecture. A graph file (nodes and edges in tab-separated format) and a small configuration file are folded into a Docker image at build time. Running the container exposes the graph as a web service with three standard endpoints used by the rest of the Translator project. Reproduced from Glen, Deutsch, Ramsey 2025 (Bioinformatics 41(7):btaf380), CC-BY 4.0.

Why this work was needed

Loading the graph into memory takes about 90 gigabytes of RAM. When a Python program forks its worker processes, the operating system normally lets all the workers share the same physical memory pages until one of them writes to the data (copy-on-write). But CPython quietly rewrites bookkeeping bytes inside objects (reference counts whenever an object is touched, and garbage-collector headers whenever the collector scans the heap), which dirties those pages and forces the kernel to give each worker its own private copy. In practice that meant 16 workers could each end up with their own 90-gigabyte copy, blowing past the memory limit within seconds.

On the project's central deployment the workers were also entering a permanent deadlock under bursts of incoming traffic: a single dying worker holding an internal lock would freeze every other worker for hours, and the only way out was a manual restart of the pod. My runtime hardening addresses all three problems: the unsupported base image, the memory blow-up, and the deadlock.

Migrating Python and restoring shared memory

The image swap from the old base to a clean Python 3.12 with a tuned web server (PR-72) is small in lines of code, but it required restoring the memory semantics that the old image used to give for free. The fix is one line: a call to a Python garbage-collector function that moves all currently allocated objects into a permanent generation the collector ignores from then on. Without it, the collector's normal scanning would touch every shared page and force a private copy. With it, all the workers share the same physical pages, and each worker's private memory stays at about 2 gigabytes instead of 90, a reduction of more than 95 percent.
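
That description matches Python's standard gc.freeze(). Below is a minimal sketch of where such a call sits in a fork-based startup path, with the graph load reduced to a placeholder; the exact spot the call occupies in PloverDB's code is an assumption here.

    import gc

    def build_graph_indexes():
        """Placeholder for the real load step: parse the nodes and edges files
        and build the in-memory lookup structures (roughly 90 GB resident)."""
        ...

    # Load the graph once, in the master process, before any worker is forked.
    graph = build_graph_indexes()

    # Move every object allocated so far into the collector's permanent
    # generation. The collector never scans that generation again, so it stops
    # rewriting per-object bookkeeping bytes, and the copy-on-write pages the
    # forked workers inherit stay shared instead of being copied per worker.
    gc.freeze()

    # Workers forked after this point (by the application server's master
    # process) read the shared pages without triggering private copies.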

The configuration file documents the trade-off: a popular automatic-restart feature that recycles workers when their memory grows too large is intentionally turned off, because every worker reports about 88 gigabytes of resident memory (almost all of it shared) and any threshold under that would trigger an infinite restart loop. Long-term recycling is handled instead by capping each worker's lifetime at four hours, with a small jitter so the workers never recycle in lockstep.
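
In uWSGI terms (the application server in the stack listed above), the trade-off looks roughly like the following. reload-on-rss and max-worker-lifetime are real uWSGI options, but the values and layout are a sketch, not the repository's actual configuration file.

    [uwsgi]
    # reload-on-rss is intentionally NOT set: every worker reports about 88 GB
    # of resident memory (almost all shared pages), so any threshold below
    # that would put the worker pool into a permanent restart loop.

    # Recycle on age instead: cap each worker's lifetime at four hours.
    max-worker-lifetime = 14400

    # A small per-worker jitter keeps the recycles from lining up; the exact
    # option used for that in the repository's config is not reproduced here.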

The thunder-lock deadlock

Figure 2 | Query duration versus answer size on the same knowledge graph, PloverDB (orange) versus a similar service called Plater (blue), across 82 real-world queries on a log-log scale. PloverDB completes most small queries in under a tenth of a second, and the per-query gap widens with smaller answer sizes. Reproduced from Glen, Deutsch, Ramsey 2025 (Bioinformatics 41(7):btaf380), CC-BY 4.0.

The instability tracked in Issue-87 turned out to be a configuration bug in the web server itself, not a Python or graph problem. The server's thunder-lock flag serializes how worker processes accept incoming connections by passing a single internal lock among them; under normal conditions that eliminates a small contention cost on busy servers. But our workers recycle frequently (each worker is replaced every few minutes after handling a fixed number of requests), and a worker dying while it holds the lock leaves the lock permanently held. Every other worker then blocks on it indefinitely. The pod looks alive from the outside (health checks pass), but every actual query times out.

The fix (PR-88) removes the flag entirely. With the in-container reverse proxy I added (described below) absorbing connection bursts upstream, the contention cost the flag was protecting against is negligible. The same change raises the worker recycle threshold from 100 requests to 10,000 (the previous value was forcing near-constant forks of a 90-gigabyte process) and adds a jitter so workers never recycle at the same instant. Worker-deadlock incidents on the central deployment dropped from multiple per month to zero in the three months since the rollout.
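
Sketched as uWSGI options (thunder-lock and max-requests are the real option names; everything else about the file is an assumption):

    [uwsgi]
    # thunder-lock is simply no longer enabled, so there is no shared accept
    # lock for a dying worker to take down with it.

    # Recycle workers far less often: 100 requests per worker was forcing
    # near-constant forks of a roughly 90 GB process.
    max-requests = 10000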

PR-88 also routes the proxy and application logs to standard output so the cluster log viewer can see them, since the central deployment does not grant me a shell into the pods.

In-container reverse proxy as a connection buffer

Once the deadlock was fixed, the next failure mode was the application server's connection queue saturating at its limit of 100 in-flight connections under bursts of health-check traffic from the load balancer (Issue-73). The legacy image had bundled a small reverse proxy in front of the application; the bare Python image did not. I restored an in-container reverse proxy (PR-77) that buffers incoming connections before they reach the application server, raised the listen queue roughly fivefold (from 100 to 512), and configured response buffering so slow downstream readers no longer hold up worker processes.

I also added a dedicated health-check route with a short timeout, so the load balancer's probes succeed even when worker processes are busy on slow real queries. That was the path that had been triggering the cascading-failure mode on every busy day.
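
One plausible shape of that in-container proxy configuration, written for nginx (the reverse proxy in the stack listed above). The directives are standard nginx, but the port, paths, timeout values, and whether nginx speaks HTTP or the uwsgi protocol to the application server are assumptions.

    # Hypothetical port and paths; the real values live in the repository.
    server {
        # The connection queue raised from 100 to 512; shown here on the
        # nginx listener, though it may equally be the application server's
        # own listen queue.
        listen 9990 backlog=512;

        # Dedicated probe route with short timeouts, so load-balancer health
        # checks are answered quickly instead of queuing behind slow queries.
        location /healthcheck {
            proxy_pass            http://127.0.0.1:8080;
            proxy_connect_timeout 2s;
            proxy_read_timeout    5s;
        }

        # Real queries: buffer connection bursts before they reach the
        # application server, and buffer responses so slow downstream readers
        # drain from nginx instead of holding an application worker open.
        location / {
            proxy_pass      http://127.0.0.1:8080;
            proxy_buffering on;
        }
    }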

Operational hardening

A second-order pass (PR-79) made the container survive in production for arbitrary uptime. The startup script now spawns a watchdog that polls the reverse proxy every 30 seconds and restarts it if it dies. Earlier versions of that check were fooled by zombie processes (a process can be technically dead but still listed as running until its parent acknowledges it), so the new check verifies the process status flag explicitly. The watchdog also runs the system log rotation tool on every cycle, so the application and proxy log files cannot grow without bound on multi-day pods. The application server itself was hardened with a 45-second hard timeout per request and a flag that suppresses noisy traceback prints when clients disconnect mid-response.
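
The watchdog itself is small. Here is its logic sketched in Python for concreteness; the real implementation may be a shell loop, and the process name, restart command, and logrotate path below are assumptions.

    import subprocess
    import time

    CHECK_INTERVAL_S = 30  # poll cadence described above

    def proxy_is_healthy() -> bool:
        """True only if an nginx process exists and is not a zombie.
        A zombie still appears in the process table, which is what fooled
        earlier versions of this check."""
        # 'ps -o stat= -C nginx' prints one status flag per nginx process;
        # a leading 'Z' marks a process that has exited but not been reaped.
        result = subprocess.run(["ps", "-o", "stat=", "-C", "nginx"],
                                capture_output=True, text=True)
        states = result.stdout.split()
        return any(not s.startswith("Z") for s in states)

    while True:
        if not proxy_is_healthy():
            subprocess.run(["nginx"])  # restart command is an assumption
        # Rotate the application and proxy logs on every cycle so multi-day
        # pods never fill their filesystem (config path is an assumption).
        subprocess.run(["logrotate", "/etc/logrotate.conf"])
        time.sleep(CHECK_INTERVAL_S)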

Versioning the deployed binary

Verifying which build is live used to require parsing a free-form date string. I exposed a structured version endpoint (PR-91, PR-93) that returns the deployed Git commit identifier, the commit timestamp in Pacific Time, and a separate field for the version of the shared vocabulary the project uses to label biological relationships (PR-81). The downstream tools and the project dashboard now consume those fields directly instead of running a regular expression over an info blob, so confirming "is this version actually rolled out?" is a one-line check.
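
A minimal Flask sketch of such an endpoint. The route name, field names, and the way the commit metadata gets baked into the image are placeholders rather than PloverDB's exact response shape.

    from flask import Flask, jsonify

    app = Flask(__name__)

    # In the real image these values are captured at build time (for example,
    # written into a file by the Dockerfile); the literals are placeholders.
    GIT_COMMIT = "abc1234"
    COMMIT_TIMESTAMP = "2025-06-30 14:05:00 PDT"  # commit time, Pacific Time
    VOCAB_VERSION = "x.y.z"  # version of the shared relationship vocabulary

    @app.route("/version")  # placeholder route name
    def version():
        # Structured fields, so downstream tools can compare them directly
        # instead of running a regular expression over a free-form info blob.
        return jsonify({
            "commit": GIT_COMMIT,
            "commit_timestamp": COMMIT_TIMESTAMP,
            "vocab_version": VOCAB_VERSION,
        })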

Reading list