SignChat real-time ASL-to-voice

lead frontend engineer · May 2026 · github.com/nlevites/signchat

Next.js, TypeScript, MediaPipe, ONNX, LiveKit, ElevenLabs

[Image: the SignChat landing page: a person in a hoodie signing to a webcam over a dusk-gradient background, the headline 'Sign with your hands, they hear your voice', the subline 'Real-time ASL-to-voice and live captions. Free, in your browser, no install.', a Start a call button, and floating panels showing the live sign classifier output and the reconstructed sentence.]

Figure 1 | the product in one screen: the signer's webcam feeds an in-browser classifier, recognized signs are stitched into a sentence, and the hearing peer hears a synthetic voice with live captions on both sides.

SignChat is a video chat where one person signs in American Sign Language and the other hears it as natural-sounding speech, in real time and entirely in the browser. a 250-sign classifier runs locally on the signer's webcam feed, a language model stitches the recognized signs into fluent English, and a text-to-speech service streams it back as voice, about 0.6 seconds from sign to first audio at the median. it is live at signchat.org.

i built it with a small team at BeaverHacks 2026 (Oregon State University) as the lead frontend engineer and the top contributor by commit count. i built the whole web app: the call room and everything around it.

What I built

the whole product is one URL: you open it, start a call, and share the link. i built the full meeting flow behind that. the lobby and preflight screens check the camera and microphone and let you pick a synthetic voice before joining. the call room is a tile layout with click-to-swap tiles, live microphone and camera state, captions on both sides, and a toast system for connection events. i also built the landing and marketing pages and wired the browser-side voice-to-text so the hearing side gets transcribed too.
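a minimal sketch of the kind of check a preflight screen could run, not SignChat's actual code. it assumes the device list comes from the browser's `navigator.mediaDevices.enumerateDevices()`, and that a signer needs a camera while a hearing peer needs a microphone. the `Role` type, the `DeviceLike` shape, and `summarizePreflight` are hypothetical names made up for illustration.

```typescript
// Hypothetical shape that mirrors the `kind` and `deviceId` fields of MediaDeviceInfo.
interface DeviceLike {
  kind: "audioinput" | "audiooutput" | "videoinput";
  deviceId: string;
}

// Assumption: each side of the call has a different required device.
type Role = "signer" | "hearing";

interface PreflightResult {
  hasCamera: boolean;
  hasMicrophone: boolean;
  canJoin: boolean;
  problems: string[]; // human-readable list of missing required devices
}

function summarizePreflight(devices: DeviceLike[], role: Role): PreflightResult {
  const hasCamera = devices.some((d) => d.kind === "videoinput");
  const hasMicrophone = devices.some((d) => d.kind === "audioinput");

  const problems: string[] = [];
  // The signer cannot be classified without a camera.
  if (role === "signer" && !hasCamera) problems.push("no camera found");
  // The hearing peer cannot be transcribed without a microphone.
  if (role === "hearing" && !hasMicrophone) problems.push("no microphone found");

  return { hasCamera, hasMicrophone, canJoin: problems.length === 0, problems };
}
```

in a browser, the input could be the result of `await navigator.mediaDevices.enumerateDevices()`. keeping the decision in a pure function makes the lobby logic testable without a real camera.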

i also built a live debug overlay. the signer's hand and pose landmarks go over the call's data channel, so the hearing peer sees the same tracking the classifier works from, in real time.
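a rough sketch of how landmarks could be packed into a compact binary payload for a data channel. the wire format here (a 4-byte frame sequence number, a 2-byte landmark count, then three little-endian float32 values per landmark) is an assumption for illustration, not the format SignChat uses. `encodeLandmarks` and `decodeLandmarks` are hypothetical helpers.

```typescript
interface Landmark {
  x: number;
  y: number;
  z: number;
}

const HEADER_BYTES = 6; // uint32 sequence number + uint16 landmark count
const BYTES_PER_LANDMARK = 12; // three float32 values

// Pack one frame of landmarks into a byte array suitable for a data channel.
function encodeLandmarks(seq: number, landmarks: Landmark[]): Uint8Array {
  const buffer = new ArrayBuffer(HEADER_BYTES + landmarks.length * BYTES_PER_LANDMARK);
  const view = new DataView(buffer);
  view.setUint32(0, seq, true);
  view.setUint16(4, landmarks.length, true);
  let offset = HEADER_BYTES;
  for (const p of landmarks) {
    view.setFloat32(offset, p.x, true);
    view.setFloat32(offset + 4, p.y, true);
    view.setFloat32(offset + 8, p.z, true);
    offset += BYTES_PER_LANDMARK;
  }
  return new Uint8Array(buffer);
}

// Reverse of encodeLandmarks; rejects payloads whose length does not match the header.
function decodeLandmarks(bytes: Uint8Array): { seq: number; landmarks: Landmark[] } {
  if (bytes.byteLength < HEADER_BYTES) throw new Error("payload too short");
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const seq = view.getUint32(0, true);
  const count = view.getUint16(4, true);
  if (bytes.byteLength !== HEADER_BYTES + count * BYTES_PER_LANDMARK) {
    throw new Error("payload length does not match landmark count");
  }
  const landmarks: Landmark[] = [];
  for (let i = 0; i < count; i++) {
    const offset = HEADER_BYTES + i * BYTES_PER_LANDMARK;
    landmarks.push({
      x: view.getFloat32(offset, true),
      y: view.getFloat32(offset + 4, true),
      z: view.getFloat32(offset + 8, true),
    });
  }
  return { seq, landmarks };
}
```

a single MediaPipe hand has 21 landmarks, so one hand would fit in 258 bytes under this layout. with the LiveKit client, a payload like this could plausibly be sent with `room.localParticipant.publishData(payload, { reliable: false })`. lossy delivery is an assumption that suits a debug overlay, where a dropped frame matters less than a late one.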

How it works under the hood

the translation pipeline is a chain of browser-direct hops with no SignChat-operated relay in the middle. webcam frames run through MediaPipe (in-browser landmark tracking) into a roughly 1.7-million-parameter classifier exported to ONNX and run in WebAssembly, so no GPU is needed on the client. stabilized sign tokens go to a hosted language model that reconstructs a sentence, which a streaming text-to-speech service turns into voice. the audio and video transport runs over LiveKit, and a small server surface only mints the short-lived credentials each provider needs.
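one plausible way to get "stabilized" sign tokens out of noisy per-frame predictions is a streak filter: a label is emitted only once it has won a fixed number of consecutive frames above a confidence threshold. the sketch below illustrates that idea under assumed thresholds. `SignStabilizer`, `minFrames`, and `minConfidence` are hypothetical names and values, and the real pipeline may stabilize differently.

```typescript
interface Prediction {
  label: string;
  confidence: number; // assumed to be in [0, 1]
}

class SignStabilizer {
  private candidate: string | null = null;
  private streak = 0;
  private readonly minFrames: number;
  private readonly minConfidence: number;

  // Defaults are illustrative guesses, not tuned values.
  constructor(minFrames = 5, minConfidence = 0.8) {
    this.minFrames = minFrames;
    this.minConfidence = minConfidence;
  }

  // Feed one per-frame prediction; returns a token when it becomes stable, else null.
  push(p: Prediction): string | null {
    if (p.confidence < this.minConfidence) {
      // A low-confidence frame breaks the current streak.
      this.candidate = null;
      this.streak = 0;
      return null;
    }
    if (p.label === this.candidate) {
      this.streak += 1;
    } else {
      this.candidate = p.label;
      this.streak = 1;
    }
    // Emit exactly once per streak, on the frame where it reaches minFrames.
    return this.streak === this.minFrames ? p.label : null;
  }
}

// Usage: collect the tokens that survive stabilization.
function stabilize(frames: Prediction[], minFrames: number, minConfidence: number): string[] {
  const stabilizer = new SignStabilizer(minFrames, minConfidence);
  const tokens: string[] = [];
  for (const frame of frames) {
    const token = stabilizer.push(frame);
    if (token !== null) tokens.push(token);
  }
  return tokens;
}
```

the tradeoff in this design is latency against flicker: a larger `minFrames` suppresses more spurious labels but delays every token by that many frames. a single noisy frame in the middle of a held sign restarts the streak, so the same sign can be emitted twice.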

the numbers the team measured: a median of about 0.6 seconds from sign to first audible output, a 250-sign vocabulary, and zero relay servers on the per-turn path.
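for reference, a median latency like the one quoted above can be computed from per-turn timestamps as sketched below. the `TurnSample` shape and its field names are hypothetical, and this is not the team's measurement harness.

```typescript
// Hypothetical record of one translated turn, with timestamps in milliseconds.
interface TurnSample {
  signStableAtMs: number; // when the sign token was accepted
  firstAudioAtMs: number; // when the first audio was heard
}

// Median of a non-empty list; averages the two middle values for even lengths.
function median(values: number[]): number {
  if (values.length === 0) throw new Error("median of an empty list is undefined");
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Median sign-to-audio latency across a set of turns.
function medianLatencyMs(samples: TurnSample[]): number {
  return median(samples.map((s) => s.firstAudioAtMs - s.signStableAtMs));
}
```

the median is a reasonable headline number for interactive latency because a few slow turns, for example a cold connection to a provider, do not drag it the way they would drag a mean.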

Reading list

signchat.org
3-min demo
nlevites/signchat
BeaverHacks