Beyond Captions: A Practical Pipeline for Live AI-Generated Sign Language Avatars
Description
We all treat closed captions as the baseline for accessibility, but for millions of users, they’re not enough. Sign languages possess unique grammar, emotion, and context that plain text simply can’t capture. While human interpreters are the gold standard, they don’t scale for the massive world of live streaming. So, how can we do better?
This talk chronicles our journey to build a system that renders expressive, real-time sign language avatars directly into a live HTTP Adaptive Streaming (HAS) feed. We broke this enormous challenge down into a four-stage AI pipeline: Audio-to-Text, Text-to-Gloss (glosses being the lexical units of sign language), Gloss-to-Pose, and finally Pose-to-Avatar.
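As a mental model, the four stages form a simple chain of transforms, and each stage boundary is a candidate split point between cloud, edge, and client. The sketch below is purely illustrative: the function names, types, and placeholder bodies are assumptions for exposition, not the system's actual API.

```python
def audio_to_text(audio_chunk: bytes) -> str:
    # Stage 1: speech recognition on a chunk of the live audio feed.
    # Placeholder: pretend the chunk decodes directly to text.
    return audio_chunk.decode("utf-8")

def text_to_gloss(text: str) -> list[str]:
    # Stage 2: translate text into sign-language glosses (lexical units).
    # Placeholder: uppercased tokens stand in for real glosses.
    return [word.upper() for word in text.split()]

def gloss_to_pose(glosses: list[str]) -> list[dict]:
    # Stage 3: map each gloss to skeletal pose keyframes.
    # Placeholder: one empty keyframe per gloss (real systems emit many).
    return [{"gloss": g, "joints": []} for g in glosses]

def pose_to_avatar(poses: list[dict]) -> int:
    # Stage 4: render the poses as avatar video frames.
    # Placeholder: return a frame count instead of pixels.
    return len(poses)

def pipeline(audio_chunk: bytes) -> int:
    # The full chain; in production, any of these arrows could cross
    # a network boundary (cloud -> edge -> client).
    return pose_to_avatar(gloss_to_pose(text_to_gloss(audio_to_text(audio_chunk))))
```

Chaining the stages as pure functions like this makes the placement question concrete: moving a stage between cloud, edge, and client changes which arrow carries data over the network, and therefore where latency accumulates.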
But the real engineering puzzle wasn’t just the AI models; it was where to run them. Every decision in a distributed system is a trade-off. This talk is a deep dive into the system architecture and those tough decisions. We’ll explore questions like:
- Should Audio-to-Text run in the cloud for maximum accuracy, or can the edge handle it to reduce latency?
- How can we use edge servers to handle Text-to-Gloss translation to support regional sign language dialects?
- What are the performance-versus-personalization trade-offs of offloading pose generation and avatar rendering to the client device?
This isn’t an AI-magic talk. It’s a practical look at the system-level challenges of integrating a complex, real-time AI workload into the video pipelines we work with every day. Attendees will leave with a framework for thinking about how to build and deploy novel, real-time accessibility features, balancing latency, scalability, and user experience.
This talk was presented at Demuxed 2025 in London, a conference by and for engineers working in video. Every year we host a conference with lots of great new talks like this – learn more at https://demuxed.com
Other Proceedings
Here are some other proceedings that you might find interesting.
What Codec Should I Use?
Alan Resnick
Doing Server-Side Ad Insertion on Live Sports for 25.3M Concurrent Users
Ashutosh Agrawal
Is now the time to solve the deepfake threat?
Roderick Hodgson
Super Resolution: The scaler of tomorrow, here today!
Nick Chadwick
The do's and don'ts about Streaming security
Javier Brines Garcia
Modeling the conceptual structure of FFmpeg in JavaScript
Ryan Harvey
Objectionable Uses of Objective Quality Metrics
Richard Fliam
RTMP: web video innovation or Web 1.0 hack… how did we get to now?
Sarah Allen
Large-Scale Media Archive Migration to the Cloud
Konstantin Wilms
HEVC Upload Experiments
Chris Ellsworth