Conference Proceedings

Content Aware Encoding for low latency live streaming encoders using deep learning

Description

Livestreaming has emerged as a captivating medium that is reshaping how we engage, communicate, and entertain. Platforms like Twitch, YouTube Live, Facebook Live, and others have become go-to destinations for audiences seeking real-time experiences, immediate interaction, and the thrill of being part of a live event. Real-time transcoding plays a crucial role in delivering high-quality, compatible, and optimized video content to these audiences across a wide range of devices and platforms.

Most of this transcoding today occurs using a fixed adaptive bitrate (ABR) ladder: a predetermined set of bitrate/resolution combinations for each encoded stream. This static, one-size-fits-all approach is blind to content type, leading to inefficient use of bandwidth and suboptimal video quality (VQ). To overcome these limitations, Netflix pioneered content-aware encoding (also called per-shot encoding) for the VOD use case. The video content is analyzed offline for every shot, and efficient encoding decisions such as bitrate, resolution, and quantization level are chosen from the convex hull of that shot (the best quality-bitrate points obtained from an ocean of encodes with different parameters) to maximize VQ while minimizing bandwidth requirements.

Live streaming does not have the luxury of the "infinite" latency that VOD offers, and real-time transcoding at scale puts an additional cap on processing capacity. This is a challenging problem that has attracted quite a bit of research in recent times, and there are several approaches to content-aware encoding for low-latency encoding. Finding the best possible quality-bitrate trade-off in real time with the available compute, while maintaining latency, is the name of this game.

In this talk, we showcase our work using deep learning (DL) to predict the "optimal" bitrate for incoming video in real time using data from the input and the encoder lookahead. We train a fully connected regression network on input statistics (luma histogram) and encoder lookahead statistics (SAD, motion-vector, and activity histograms). The ground truth for our purpose is the bitrate that achieves a minimum VMAF value of 90 (our minimum quality bar) for each chosen shot during training. This regression network is very light on compute and can run efficiently without affecting the real-time performance or density of the encoder.
The network needs a minimum of four frames of lookahead data to produce high prediction accuracy. We trained it to achieve maximum savings for low-complexity content with negligible loss in video quality, and to bypass very high-complexity content. We tested the algorithm using a variety of video clips downloaded from Twitch.tv, with the following results:

1. Bitrate savings of more than 30%, with less than 1 VMAF point of degradation, for easy content such as talking heads and other low-complexity material.
2. Bitrate savings of 9% on average for medium-complexity content, again with less than 1 VMAF point of degradation.
3. Negligible savings for high-complexity content (the algorithm knows that lowering the bitrate would cause VQ degradation).

The reasons for choosing a deep learning-based approach to predict the CAE bitrate over traditional approaches are twofold. First, the nonlinear function learned by the DL model delivers precise bitrate savings without degrading VQ. Second, DL models can be trained or retrained by the content distributor on a proprietary, specific content set to maximize bitrate savings while maintaining high VQ. This approach is applicable to both hardware- and software-based encoders that have access to the encoder lookahead statistics mentioned above, and the resulting bitrate savings can translate into substantial savings on CDN bandwidth and storage costs for content distributors.

This talk was presented at Demuxed ’23, a conference for video nerds in San Francisco featuring amazing talks like this one.
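The ground-truth labeling described above (the lowest bitrate that still reaches VMAF 90 for a shot) can be sketched as a search over a bitrate ladder. The `vmaf_of` callback, the ladder values, and the toy quality model below are all hypothetical stand-ins for a real encode-and-score pipeline:

```python
def ground_truth_bitrate(shot, bitrate_ladder_kbps, vmaf_of, target=90.0):
    """Return the lowest ladder bitrate whose encode reaches VMAF >= target.

    vmaf_of(shot, kbps) is a stand-in for encoding the shot at that bitrate
    and scoring the result with VMAF. Because quality is monotone in bitrate,
    the first ladder rung (ascending) that clears the bar is the label.
    Falls back to the highest bitrate if nothing reaches the target.
    """
    for kbps in sorted(bitrate_ladder_kbps):
        if vmaf_of(shot, kbps) >= target:
            return kbps
    return max(bitrate_ladder_kbps)

# Toy monotone quality model standing in for a real encode + VMAF run.
fake_vmaf = lambda shot, kbps: min(100.0, 60.0 + 10.0 * (kbps / 1000.0))
ladder = [1000, 2000, 3000, 4000, 6000]
print(ground_truth_bitrate("shot_001", ladder, fake_vmaf))  # → 3000
```

In a real training pipeline this search is run offline per shot, so its cost does not affect the live path; only the trained regression network runs in real time.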

Conference

Demuxed 2023

Speakers

Ramdas Satyan

Learning Categories

Low Latency
ABR
CAE
VMAF

Other Proceedings

Here are some other proceedings that you might find interesting.

What Codec Should I Use?

Alan Resnick

Doing Server-Side Ad Insertion on Live Sports for 25.3M Concurrent Users

Ashutosh Agrawal

Is now the time to solve the deepfake threat?

Roderick Hodgson

Super Resolution: The scaler of tomorrow, here today!

Nick Chadwick

The do's and don'ts about Streaming security

Javier Brines Garcia

Modeling the conceptual structure of FFmpeg in JavaScript

Ryan Harvey

Objectionable Uses of Objective Quality Metrics

Richard Fliam

RTMP: web video innovation or Web 1.0 hack… how did we get to now?

Sarah Allen

Large-Scale Media Archive Migration to the Cloud

Konstantin Wilms

HEVC Upload Experiments

Chris Ellsworth


© Copyright Streaming Video Technology Alliance (SVTA).
