Conference Proceedings
Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text
Description
tl;dr:
We ask the question: “Can we compress AV content generated via webcams to just text and recover videos with a similar Quality-of-Experience compared to standard codecs in the low-bitrate regime?” and answer it in the affirmative using state-of-the-art deep learning models.

Long version: Video represents the majority of internet traffic today, leading to a continuous technological arms race between generating higher-quality content, transmitting larger file sizes, and supporting network infrastructure. Adding to this is the surge in the use of video conferencing tools fueled by the recent COVID-19 pandemic. Since videos take up substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can have a substantial impact on billions of people in developing countries or other locations with limited or unreliable broadband connectivity. Moreover, a reduction in required bandwidth can significantly improve global network performance by decreasing the network load for live and pre-recorded content, providing broader access to multimedia content worldwide.

In this talk, we present a novel video compression pipeline, called Txt2Vid, which substantially reduces data transmission rates by compressing webcam videos (“talking-head videos”) to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep-learning-based voice cloning and lip-syncing models. Our generative pipeline achieves a two-to-three order-of-magnitude reduction in bitrate compared to standard audio-video codecs, while maintaining an equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The code for this work is available as an open-source project on GitHub (https://github.com/tpulkit/txt2vid.git).

The focus of our work is on audio-video (AV) content transmitted from webcams during video conferencing or webinars.
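A rough back-of-the-envelope check makes these bitrate figures concrete; the speaking rate and characters-per-word below are assumed typical values, not figures from the talk:

```python
# Transcript bitrate of conversational speech, sent as plain 8-bit text.
# Assumed figures: ~150 words/minute, ~6 characters/word (incl. spaces).
words_per_min = 150
chars_per_word = 6
bits_per_char = 8

text_bps = words_per_min * chars_per_word * bits_per_char / 60
print(f"transcript bitrate: {text_bps:.0f} bps")  # prints: transcript bitrate: 120 bps

# Typical conferencing AV bitrates mentioned in the talk: ~100 Kbps to a few Mbps.
codec_bps_low, codec_bps_high = 100e3, 3e6
print(f"reduction: {codec_bps_low / text_bps:.0f}x to {codec_bps_high / text_bps:.0f}x")
```

So an uncompressed text transcript already lands near the ~100 bps figure, and entropy coding the text would push it lower still.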
Current compression codecs (such as H.264 or AV1 for video, and AAC for audio) lossily compress the input AV content by discarding the details that have the least impact on user experience. However, the distortion measures targeted by these codecs are low-level: they penalize deviation from the original pixel values or audio samples, whereas what matters most is the final Quality-of-Experience (QoE) when the media stream is shown to a human end-consumer. Thus, in our proposed pipeline, instead of working with pixel-wise fidelity metrics, we directly approximate the original content such that the QoE is maintained. By compressing to text, we can achieve bitrates of ~100 bps at a QoE similar to that of a standard codec. The pipeline uses a state-of-the-art voice cloning model to convert text to speech (TTS), and a lip-syncing model to convert the audio into a reconstructed video using a driving video at the decoder. Our pipeline can be used to store webcam AV content as a text file or to stream this content on the fly. We evaluated the pipeline with a subjective study on Amazon Mechanical Turk, comparing user preferences between Txt2Vid-generated videos and videos compressed with standard codecs at varying levels of compression, across multiple pieces of content.

The Sell: We believe the proposed framework has the potential to change the landscape of video storage and streaming. It can enable several applications with great potential for social good, expanding the reach of video communication technology. Examples include better accessibility in areas with poor internet availability, transmission of pedagogical content for remote learning, and real-time machine translation of talks. It can also enable some fun applications, such as joining an AV call by just typing your input instead of speaking. While we used specific tools in our pipeline to demonstrate its capabilities, we envision significant progress in these components over the coming years.
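The encoder/decoder split described above can be sketched as follows. The three model calls are hypothetical placeholders (stubbed here so the data flow runs end to end), standing in for the actual components: a speech-to-text step on the sender side, and the voice-cloning TTS and lip-syncing models on the receiver side:

```python
from dataclasses import dataclass

@dataclass
class DecoderState:
    driving_video: bytes   # short reference clip of the speaker, held at the decoder
    voice_profile: bytes   # one-time voice embedding for cloning, held at the decoder

def encode(av_stream: bytes) -> str:
    """Sender side: reduce the webcam AV stream to a text transcript.
    Only this text is transmitted over the network."""
    return speech_to_text(av_stream)

def decode(transcript: str, state: DecoderState) -> bytes:
    """Receiver side: resynthesize audio in the speaker's voice, then
    lip-sync the stored driving video to that audio."""
    audio = clone_voice_tts(transcript, state.voice_profile)
    return lip_sync(state.driving_video, audio)

# --- stubs standing in for the deep-learning components ---
def speech_to_text(av: bytes) -> str:
    return av.decode()       # placeholder: treat the stream as its own transcript

def clone_voice_tts(text: str, profile: bytes) -> bytes:
    return text.encode()     # placeholder for cloned-voice speech synthesis

def lip_sync(video: bytes, audio: bytes) -> bytes:
    return video + audio     # placeholder for the lip-syncing model
```

The key design point is that the driving video and voice profile are transmitted once (or stored ahead of time) at the decoder, so the per-call channel carries nothing but text.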
We would like to highlight that the current implementation is a prototype of the proposed pipeline, and now is an ideal time for the community to get involved in making it more practical and accessible. Potential improvements to the current framework and implementation include reducing computational complexity and latency for streaming, improving Quality-of-Experience by incorporating more non-verbal cues, and addressing ethical concerns over the usage of such a technology. We call upon the community to build upon the current implementation and adapt it for different applications.

Presented at Demuxed 2021.
Other Proceedings
Here are some other proceedings that you might find interesting.
What Codec Should I Use?
Alan Resnick
Doing Server-Side Ad Insertion on Live Sports for 25.3M Concurrent Users
Ashutosh Agrawal
Is now the time to solve the deepfake threat?
Roderick Hodgson
Super Resolution: The scaler of tomorrow, here today!
Nick Chadwick
The do's and don'ts about Streaming security
Javier Brines Garcia
Modeling the conceptual structure of FFmpeg in JavaScript
Ryan Harvey
Objectionable Uses of Objective Quality Metrics
Richard Fliam
RTMP: web video innovation or Web 1.0 hack… how did we get to now?
Sarah Allen
Large-Scale Media Archive Migration to the Cloud
Konstantin Wilms
HEVC Upload Experiments
Chris Ellsworth