Conference Proceedings

Txt2Vid: Ultra-Low Bitrate Compression of Talking-Head Videos via Text

Description

tl;dr:

We ask the question: “Can we compress AV content generated via webcams to just text and recover videos with similar Quality-of-Experience compared to standard codecs in a low bitrate regime?” and answer it in the affirmative using state-of-the-art deep learning models.

Long Version:

Video represents the majority of internet traffic today, leading to a continuous technological arms race between generating higher-quality content, transmitting larger file sizes, and supporting network infrastructure. Adding to this is the recent surge in the use of video conferencing tools fueled by the COVID-19 pandemic. Since videos take up substantial bandwidth (~100 Kbps to a few Mbps), improved video compression can have a substantial impact on billions of people in developing countries or other locations with limited or unreliable broadband connectivity. Moreover, a reduction in required bandwidth can have a significant impact on global network performance by decreasing the network load for live and pre-recorded content, providing broader access to multimedia content worldwide.

In this talk, we present a novel video compression pipeline, called Txt2Vid, which substantially reduces data transmission rates by compressing webcam videos (“talking-head videos”) to a text transcript. The text is transmitted and decoded into a realistic reconstruction of the original video using recent advances in deep learning-based voice cloning and lip-syncing models. Our generative pipeline achieves a two to three orders of magnitude reduction in bitrate compared to standard audio-video codecs, while maintaining equivalent Quality-of-Experience based on a subjective evaluation by users (n=242) in an online study. The code for this work is available as an open-source project on GitHub (https://github.com/tpulkit/txt2vid.git).

The focus of our work is on audio-video (AV) content transmitted from webcams during video conferencing or webinars.
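To get a feel for why a text transcript is so cheap to transmit, the following back-of-the-envelope sketch estimates the bitrate of a transcript streamed in real time. The speaking rate (150 words per minute), the average word length (6 characters including the trailing space), and the 8-bit character encoding are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
import math


def text_bitrate_bps(words_per_minute: float,
                     chars_per_word: float,
                     bits_per_char: float = 8.0) -> float:
    """Average bitrate of an uncompressed transcript sent in real time."""
    chars_per_second = words_per_minute * chars_per_word / 60.0
    return chars_per_second * bits_per_char


def orders_of_magnitude(reference_bps: float, candidate_bps: float) -> float:
    """How many powers of ten separate two bitrates."""
    return math.log10(reference_bps / candidate_bps)


if __name__ == "__main__":
    # Assumed values: 150 words/min, 6 characters per word, 8 bits per character.
    text_bps = text_bitrate_bps(words_per_minute=150, chars_per_word=6)
    print(f"transcript bitrate: {text_bps:.0f} bps")  # prints "transcript bitrate: 120 bps"

    # Compare against the ~100 Kbps low end of video bandwidth quoted above.
    gap = orders_of_magnitude(reference_bps=100_000, candidate_bps=text_bps)
    print(f"gap vs. 100 Kbps: {gap:.1f} orders of magnitude")
```

Under these assumptions the transcript needs on the order of a hundred bits per second, before any text compression is applied.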
Current compression codecs (such as H.264 or AV1 for video, and AAC for audio) lossily compress the input AV content by discarding the details that have the least impact on user experience. However, the distortion measures targeted by these codecs are often low-level and penalize deviation from the original pixel values or audio samples. What matters most is the final quality-of-experience (QoE) when the media stream is shown to a human end-consumer. Thus, in our proposed pipeline, instead of working with pixel-wise fidelity metrics, we directly approximate the original content such that the QoE is maintained. By compressing to text, we can achieve bitrates of ~100 bps at similar QoE compared to a standard codec.

The pipeline uses a state-of-the-art voice cloning model to convert text to speech (TTS), and a lip-syncing model to convert the audio to a reconstructed video using a driving video at the decoder. Our pipeline can be used for storing webcam AV content as a text file or for streaming this content on the fly.

We evaluated our pipeline in a subjective study on Amazon Mechanical Turk (MTurk), comparing user preferences between Txt2Vid-generated videos and videos compressed with standard codecs at varying levels of compression, across multiple pieces of content.

The Sell:

We believe the proposed framework has the potential to change the landscape of video storage and streaming. It can enable several applications with great potential for social good by expanding the reach of video communication technology. Some examples include better accessibility in areas with poor internet availability, transmission of pedagogical content for remote learning, and real-time machine translation of talks. It can also enable some fun applications, such as joining an AV call and typing your input instead of speaking. While we used specific tools in our pipeline to demonstrate its capabilities, we envision significant progress in the components used over the coming years.
We would like to highlight that the current implementation is just a prototype of the proposed pipeline, and now is a good time for the community to get involved in making it more practical and accessible. Potential improvements to the current framework and implementation include reducing computational complexity and latency for streaming, improving Quality-of-Experience by including more non-verbal cues, and addressing ethical concerns over the use of such technology. We call upon the community to build upon the current implementation and adapt it for different applications.

Presented at Demuxed 2021
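The data flow described in the abstract (only text is sent, and the receiver rebuilds audio with a text-to-speech model and video with a lip-syncing model plus a driving video) can be sketched structurally as below. This is a minimal illustration, not the API of the txt2vid repository: `encode`, `Decoder`, `TextToSpeech`, and `LipSync` are hypothetical names, and the models are represented by plain callables so the sketch runs without any deep learning dependencies.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stand-ins for the learned models (not the real project's API).
TextToSpeech = Callable[[str], bytes]      # transcript -> synthesized audio
LipSync = Callable[[bytes, bytes], bytes]  # (audio, driving video) -> video


def encode(transcript: str) -> bytes:
    """Sender side: the only payload transmitted is the UTF-8 transcript."""
    return transcript.encode("utf-8")


@dataclass
class Decoder:
    """Receiver side: rebuilds audio and video from the received text."""
    tts: TextToSpeech
    lip_sync: LipSync
    driving_video: bytes  # assumed to be available at the decoder in advance

    def decode(self, payload: bytes) -> Tuple[bytes, bytes]:
        text = payload.decode("utf-8")
        audio = self.tts(text)                            # text -> speech
        video = self.lip_sync(audio, self.driving_video)  # speech -> lip-synced video
        return audio, video


if __name__ == "__main__":
    # Toy callables that only tag bytes, standing in for real models.
    decoder = Decoder(
        tts=lambda text: b"AUDIO:" + text.encode("utf-8"),
        lip_sync=lambda audio, driving: driving + b"|" + audio,
        driving_video=b"DRIVING",
    )
    payload = encode("hello everyone")
    audio, video = decoder.decode(payload)
    print(len(payload), "bytes sent;", len(video), "bytes of (toy) video rebuilt")
```

Keeping the models behind simple callables reflects the abstract's point that the specific tools are interchangeable and expected to improve over time.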

Conference

Demuxed 2021

Speakers

Pulkit Tandon

Student

Learning Categories

Content Creation
Encoding
AI
AI Encoding
Machine Learning

Other Proceedings

Here are some other proceedings that you might find interesting.

What Codec Should I Use?

Alan Resnick

Doing Server-Side Ad Insertion on Live Sports for 25.3M Concurrent Users

Ashutosh Agrawal

Is now the time to solve the deepfake threat?

Roderick Hodgson

Super Resolution: The scaler of tomorrow, here today!

Nick Chadwick

The do's and don'ts about Streaming security

Javier Brines Garcia

Modeling the conceptual structure of FFmpeg in JavaScript

Ryan Harvey

Objectionable Uses of Objective Quality Metrics

Richard Fliam

RTMP: web video innovation or Web 1.0 hack… how did we get to now?

Sarah Allen

Large-Scale Media Archive Migration to the Cloud

Konstantin Wilms

HEVC Upload Experiments

Chris Ellsworth

© Copyright Streaming Video Technology Alliance (SVTA).
