- Applications that require real time encoding of higher resolution and higher frame rate video, such as 4Kp60
- Applications that require real time or fast 1080p or 4K encoding at higher quality than the real-time presets provide
The current generation of Intel processors with on-chip graphics, the Intel® Xeon® Processor E3 v5 family, is limited to a maximum of four CPU cores. So, the only way to scale up performance in terms of resolution, frame rate or quality is to distribute the encoding process across multiple processors. This is exactly what chunked encoding enables (see blog). However, the queuing, distribution and re-aggregation overheads, combined with the minimum chunk/GOP size delay, result in latency on the order of several seconds. This is prohibitive for most live or real time applications, and it cannot be addressed by reducing the chunk size significantly, since that would impact the video quality (refer to blog). The only effective alternative to chunking is frame level distribution of video encoding, which significantly reduces latency.
Frame Level, Tile Based Encoding
Frame level, tile based encoding entails sub-dividing each picture into a specified number of rectangular blocks that can be coded independently. This can be done using a toolset called tiles, which the HEVC standard readily supports. Tile based encoding has two key advantages:
- The compression loss across tile boundaries is typically much lower than the efficiency loss across chunk/GOP boundaries, and can be further minimized.
- Each tile requires a smaller reference area for motion estimation.
For example, assume a 2 x 2 tile grid as shown in Figure 1, where the entire ultra-high definition (UHD) frame is sub-divided into four HD (1920x1080) regions.
Figure 1: Video frame divided into 2 x 2 or 3 x 2 configurations
Tile based encoding restricts the reference area on each device. In the 2 x 2 tile grid, each device handles roughly one-fourth of the total bandwidth. Tiling does not reduce the overall bandwidth; in fact, it increases it because of the additional overlap regions. However, tiles make it possible to partition a single large picture into independent smaller pictures that fit the bandwidth available on each device.
Such a design presents a scalable option to enable higher resolution, quality or bit depth processing by using additional devices. We can thus realize a design across six devices instead of four, by using 3 x 2 tiles as shown in Figure 1.
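To make the 2 x 2 and 3 x 2 layouts concrete, here is a minimal sketch that computes tile rectangles for a UHD frame and estimates how much the overlap regions add to the total reference data. It assumes 64x64 coding tree units and an even split of CTU columns and rows, which is why the tile heights come out close to, but not exactly, 1080 lines. The 128-pixel overlap margin and the function names are hypothetical, chosen only for illustration.

```python
def uniform_tile_grid(width, height, cols, rows, ctu=64):
    """Return (x, y, w, h) pixel rectangles for a cols x rows tile grid.

    Tile boundaries are placed on CTU boundaries (assumed 64x64 here) and
    spread as evenly as integer division allows.
    """
    w_ctus = -(-width // ctu)   # ceiling division: picture width in CTUs
    h_ctus = -(-height // ctu)  # picture height in CTUs
    col_bounds = [(i * w_ctus) // cols for i in range(cols + 1)]
    row_bounds = [(j * h_ctus) // rows for j in range(rows + 1)]
    tiles = []
    for j in range(rows):
        for i in range(cols):
            x = col_bounds[i] * ctu
            y = row_bounds[j] * ctu
            # Clip the last column/row to the actual picture size.
            w = min(col_bounds[i + 1] * ctu, width) - x
            h = min(row_bounds[j + 1] * ctu, height) - y
            tiles.append((x, y, w, h))
    return tiles


def reference_overhead(tiles, margin, width, height):
    """Ratio of total padded tile area to picture area.

    Each tile is grown by `margin` pixels on every side (clipped to the
    picture) to model the overlap a device needs beyond its own tile.
    """
    total = 0
    for x, y, w, h in tiles:
        x0, y0 = max(0, x - margin), max(0, y - margin)
        x1, y1 = min(width, x + w + margin), min(height, y + h + margin)
        total += (x1 - x0) * (y1 - y0)
    return total / (width * height)


WIDTH, HEIGHT = 3840, 2160
MARGIN = 128  # hypothetical overlap margin in pixels

for cols, rows in [(2, 2), (3, 2)]:
    grid = uniform_tile_grid(WIDTH, HEIGHT, cols, rows)
    ratio = reference_overhead(grid, MARGIN, WIDTH, HEIGHT)
    print(f"{cols} x {rows}: {len(grid)} tiles, first tile {grid[0]}, "
          f"total reference data {ratio:.2f}x the picture")
```

A ratio above 1.0 reflects the point made above: each device handles a fraction of the picture, but the sum over all devices exceeds a single frame because of the overlap.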
Harnessing Platforms with High Bandwidth Efficiency
A key requirement to achieve high performance tile based distributed encoding is a powerful grid of interconnected, low power Intel processors with on-chip graphics. Apart from this, the most critical factor for efficient, distributed encoding with tiles is the data bandwidth across the grid of processors. Tile based encoding requires every frame to be decoded prior to distribution, which means every processor receives uncompressed (raw) video input. While an uncompressed UHD/4K frame is roughly 12 Mbytes in size, high quality encoding needs 4-5x this amount per frame to account for inter-processor data exchange between encoding modules. This translates into a data bandwidth of over 4 Gbytes/second for real time 4Kp60 encoding.
Moreover, compared to the typical bit-rates of high quality compressed inputs (up to 75 Mbytes/second), tile based distributed encoding consumes more than 50 times the inter-processor bandwidth of chunked encoding. So, instead of typical cloud architectures connected over the internet, we require a specialized, very high throughput interconnect between the processors (10 Gigabit Ethernet, PCIe, etc.).
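The bandwidth figures above can be reproduced approximately with a few lines of arithmetic. This is a rough sketch that assumes 8-bit 4:2:0 sampling (1.5 bytes per pixel), which matches the roughly 12 Mbyte frame size quoted; the function names are illustrative only. Under these assumptions the 4-5x multiplier gives about 3.0 to 3.7 Gbytes/second, the same order as the figure in the text, and storing 10-bit samples in two bytes would double it.

```python
def raw_frame_bytes(width, height, bytes_per_sample=1, samples_per_pixel=1.5):
    """Size of one uncompressed frame; 1.5 samples/pixel models 4:2:0."""
    return int(width * height * samples_per_pixel * bytes_per_sample)


def bandwidth_bytes_per_sec(frame_bytes, fps, multiplier):
    """Data moved per second when each frame costs `multiplier` frame sizes."""
    return frame_bytes * fps * multiplier


frame = raw_frame_bytes(3840, 2160)            # assumed 8-bit 4:2:0 UHD frame
low = bandwidth_bytes_per_sec(frame, 60, 4)    # 4x multiplier from the text
high = bandwidth_bytes_per_sec(frame, 60, 5)   # 5x multiplier from the text

# Ratio of the article's two quoted figures: 4 Gbytes/s vs. 75 Mbytes/s.
ratio = 4e9 / 75e6

print(f"frame size: {frame / 1e6:.1f} Mbytes")
print(f"4Kp60 bandwidth: {low / 1e9:.2f} to {high / 1e9:.2f} Gbytes/s")
print(f"tile based vs. compressed input: about {ratio:.0f}x")
```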
Now, let us consider a few options that meet these two requirements:
- HP Moonshot System with Intel® Xeon® Processor E3 v4 or v5 family-based cartridges
- Artesyn MaxCore Server with Sharp Streamer Pro Cards
- Multi-socket Intel-based servers with Intel® Visual Compute Accelerator 2 (Intel® VCA 2) add-in cards (from vendors like Supermicro, Dell, HP, 2CRSI, and Advantech)
Using a tile based design, very high quality 4Kp60 HDR encoding can be performed in real time on the platforms listed above, using 6 Intel Xeon Processor E3 v4 or v5 family-based HP Moonshot cartridges, 3 Sharp Streamer Pro cards or 2 Intel VCA 2 cards, respectively. Depending on the complexity of the algorithm (or the speed vs. quality trade-off), we can use multiple such interconnected devices. This way, we can leverage all the available CPU + GPU computational power and realize high quality encoding at high resolutions and frame rates, in real time or faster than real time.
Gearing Up For 8K
The bottom line is that distributed encoding and the associated platforms provide two main advantages:
- Low Power: Since it is specifically designed for graphics and video delivery, the Intel Xeon Processor E3-1585L v5 included in the Intel VCA 2 card consumes just 45W of power. Multi-socket, general-purpose Intel® Xeon® Processor E5 family-based processors that deliver equivalent performance for video encoding consume up to 6x the power.
- Total Cost of Ownership (TCO): All the above examples are available in 1RU/2RU boxes that deliver lower operational costs. The platforms are also competitively positioned in terms of ownership costs.
That brings us to a key question: Does this solution scale up to 8K? Soon. The number of processors interconnected in the server examples we have provided does not yet aggregate to sufficient processing power for 8Kp120. But the good news is that the processor architectures and server designs are capable of scaling up to more processors, and could be critical to initial deployments of real time 8K encoding.
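To give a sense of the gap, the short sketch below compares raw pixel rates for 4Kp60 and 8Kp120. Pixel rate is only a rough proxy for encoding effort, and treating the two as proportional is an assumption made here, not a claim from the platforms' specifications.

```python
def pixel_rate(width, height, fps):
    """Pixels that must be processed per second."""
    return width * height * fps


rate_4kp60 = pixel_rate(3840, 2160, 60)
rate_8kp120 = pixel_rate(7680, 4320, 120)

# 8K has 4x the pixels of 4K, and 120 fps doubles the frame rate.
scale = rate_8kp120 / rate_4kp60
print(f"8Kp120 carries {scale:.0f}x the pixel rate of 4Kp60")
```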
_________________________________________________________
For more insights into distributed encoding, contact us at mkt@ittiam.com
Explore Ittiam’s i265 family of H.265 codecs at https://www.ittiam.com/products/software-ips/video/h265-hevc/