Why Video Captioning Needs Built-In Viewer Feedback in 2022 (and How We Do It)

Conference Proceedings

Description

We’ve all heard the hype, excitement, and fear about how AI systems are getting smarter and smarter, developing sentience, and generally taking over the world, creating a future of subjugation and despair for the human race. However, I am fairly confident that this bleak picture is not in our near future, because there is one major problem: AI doesn’t even understand us that well.

Anyone who has used voice recognition on their phone or in their car will recognize that speech-to-text technology still has a long way to go. In the video world, this is nowhere more obvious than in auto-generated video captioning. Auto-generated captions are better than no captions at all, but incorrect spellings, wrong words, bad punctuation, and misplaced phrase breaks, among other discrepancies, mean that human review and improvement are still needed for captions to accurately represent what is said and heard in videos. (If you’re watching a video for any length of time and not noticing any errors, that is thanks to human review!) Accuracy in captioning is not a trivial matter, and captioning errors are more than a minor annoyance. ADA accessibility compliance demands 99% accurate captions, speaker labels, and phrase breaks, among other requirements that none of the auto-generated captioning services on the market today meet.

Yet most auto-generated caption errors can be fixed by far more people than just costly transcribers. That’s why I propose that, while the speech recognition wizards keep improving their methods and services, it is on us video engineers to give interested viewers, those who are already watching and want to fix the errors they see, an easy way to give feedback that improves transcriptions of both recorded and live video. The goal is to increase accuracy and watchability for fellow viewers while also giving the machines better and better data to keep improving.

In this talk, I will give a brief review of current speech-to-text technology, where it is limited, and why it will remain limited until completely new techniques come along. Then I will outline both high-level ideas and actionable steps for video developers to add more feedback systems to their video players. This includes a demo that proposes updates to video player UIs so viewers can easily give feedback, a backend that handles the inputs of an open-ended crowdsourced system in a productive manner, and updates to the caption file formats we use so that they capture this feedback effectively. One day in the future, every video will be captioned 100% correctly by automation. But until that day, it’s on us to incorporate simple feedback systems so that every video has the chance to be captioned correctly!

This talk was presented at Demuxed ’22, a conference for video nerds in San Francisco featuring amazing talks like this one. Demuxed ’22 was made possible by sponsors like our Platinum sponsor Daily (https://daily.co) and organized by people from Mux (https://mux.com). For more information about the conference and community, see https://2022.demuxed.com.
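
To make the backend idea from the description a little more concrete, here is a minimal sketch of one way crowdsourced caption corrections could be aggregated. It is not the system shown in the talk. The `Suggestion` type, its fields, the `accepted_corrections` function, and the `min_votes` threshold are all assumptions made for illustration: each viewer gets one vote per caption cue, and a correction is accepted only once enough distinct viewers agree on the same text.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Suggestion:
    """One viewer-submitted correction (all field names are hypothetical)."""
    cue_id: str     # identifier of the caption cue being corrected
    viewer_id: str  # identifier of the viewer who submitted the correction
    text: str       # the viewer's proposed replacement text


def normalize(text: str) -> str:
    """Collapse whitespace so trivially different submissions count together."""
    return " ".join(text.split())


def accepted_corrections(suggestions, min_votes=3):
    """Return {cue_id: text} for cues where one correction has >= min_votes.

    Each viewer counts once per cue; a viewer's latest submission replaces
    any earlier one, so repeated submissions cannot stack votes.
    """
    latest = {}
    for s in suggestions:
        latest[(s.cue_id, s.viewer_id)] = normalize(s.text)

    # Tally how many distinct viewers back each candidate text per cue.
    votes = defaultdict(Counter)
    for (cue_id, _viewer_id), text in latest.items():
        votes[cue_id][text] += 1

    accepted = {}
    for cue_id, counter in votes.items():
        text, count = counter.most_common(1)[0]
        if count >= min_votes:
            accepted[cue_id] = text
    return accepted


if __name__ == "__main__":
    submissions = [
        Suggestion("cue-12", "viewer-a", "Welcome to  Demuxed"),
        Suggestion("cue-12", "viewer-b", "Welcome to Demuxed"),
        Suggestion("cue-12", "viewer-c", "Welcome to Demuxed"),
        Suggestion("cue-40", "viewer-a", "adaptive bitrate"),
    ]
    print(accepted_corrections(submissions))
```

A threshold like this is only one possible policy. A real system would likely also need moderation, spam protection, and a way to write accepted corrections back into the caption file, none of which is covered here.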