Does H.264 in Chrome 50 put an end to the need for transcoding WebRTC video sessions?
NO
From a computing perspective, the hard part of video communications is encoding and decoding the video stream. Decoding a stream and re-encoding it in another format is what we usually call transcoding, although sometimes decoding and encoding are required for other purposes as well.
The secret sauce of video codecs is on the encoding side. A video encoder (e.g. an H.264 encoder) can make various decisions about how to compress the video, but one constraint must hold: the encoded stream must be decodable by a standard reference decoder. The efficiency of a codec implementation therefore lies in the encoder.
As efficient as a codec and its implementation may be, encoding and decoding video streams is a CPU-intensive task. Even when utilizing Intel's HW acceleration, only a handful of high resolution (1080p or 720p) 30 FPS (Frames Per Second) H.264/VP8 sessions are possible per Xeon processor. Things don't get much better when moving to DSPs.
The rule of thumb is:
Do your utmost to avoid server-side decode and encode
Encode and decode on the client side
When 2 clients are engaged in a video session, video is encoded at the sending side and decoded at the receiving side. When 2 clients are in a 1 to 1 communication, there is no need for transcoding even if media relay is required. This assumes there is a common codec (and codec format) both support. Even if some server-side intervention is needed, for example when encryption needs to be terminated or added at the edge of the enterprise network, this requires handling only the encryption part; there is no need to touch the media itself.
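As a minimal sketch of that common-codec check, assuming browsers that implement the static RTCRtpSender.getCapabilities() API, an application could verify up front that a 1 to 1 video call can run without any server-side transcoding. The `signaling` object here is a hypothetical channel (e.g. a WebSocket wrapper) used to exchange capability lists before the offer/answer exchange.

```typescript
// Minimal sketch: check whether both sides of a 1:1 call share a video codec.
// Assumes browsers with RTCRtpSender.getCapabilities(); `signaling` is hypothetical.

function localVideoCodecs(): string[] {
  const caps = RTCRtpSender.getCapabilities("video");
  return caps ? caps.codecs.map((c) => c.mimeType.toLowerCase()) : [];
}

function commonCodecs(local: string[], remote: string[]): string[] {
  return local.filter((mime) => remote.includes(mime));
}

async function canDoVideoWithoutTranscoding(
  signaling: { send(msg: object): void; onRemoteCodecs(): Promise<string[]> }
): Promise<boolean> {
  const local = localVideoCodecs();
  signaling.send({ type: "codecs", codecs: local });
  const remote = await signaling.onRemoteCodecs();
  // Ignore non-media entries such as video/rtx, video/red and video/ulpfec.
  const shared = commonCodecs(local, remote).filter(
    (m) => !["video/rtx", "video/red", "video/ulpfec"].includes(m)
  );
  return shared.length > 0; // otherwise: default to audio only, or transcode on a server
}
```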
The need for server side media handling
The simple case requiring server-side video processing is when there is no common codec the 2 clients can agree on. In such a case, the options are to default to audio only or have the server bridge the codec gap and transcode the video.
There are other cases where video processing may be required even if both clients support the same codec.
There may be differences in supported video formats. For example, Edge supports H.264UC (the Microsoft version of H.264), which will not interoperate with the common H.264 AVC, so server-side transcoding will be required.
Advanced processing that requires changing the video stream. Some of this can be done on the client side and managed through signaling. For example, changes of resolution and FPS (the dependency between video frames makes simply dropping frames a problem) are typically requested through signaling and applied by the sending client, as in the sketch below. Other tweaks to the video, such as removing/adding elements or analyzing video content for compliance purposes, may require work in the uncompressed domain and hence require decoding and encoding the video even if both sides use the same codec.
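A minimal sketch of such a client-side adjustment, assuming a browser that supports maxFramerate and scaleResolutionDownBy in RTCRtpSendParameters (the function name is mine): the far end or a conferencing server requests lower FPS/resolution through signaling, and the sending client applies it locally, with no server-side decode or encode.

```typescript
// Minimal sketch: lower the frame rate and resolution of an outgoing video
// stream on the client, e.g. in response to a signaling request from the far end.
async function throttleVideo(sender: RTCRtpSender): Promise<void> {
  const params = sender.getParameters();
  if (!params.encodings || params.encodings.length === 0) {
    params.encodings = [{}];
  }
  params.encodings[0].maxFramerate = 15;          // cap FPS at the encoder
  params.encodings[0].scaleResolutionDownBy = 2;  // e.g. 720p -> 360p
  await sender.setParameters(params);
}
```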
Recording can be done by directing a copy of the video stream to a recording server and processing the stream there, packaging audio and video into a playable format such as MP4. Implementations of video call recording vary based on requirements and approach; some put more responsibility on the client side, as in the sketch below, while others do everything on the server side.
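A minimal sketch of the client-heavy approach, assuming MediaRecorder support and a hypothetical recording-server upload endpoint. Note that Chrome's MediaRecorder outputs WebM; repackaging to MP4 or another playable format would happen on the server and is not shown here.

```typescript
// Minimal sketch: record a MediaStream on the client and upload chunks to a
// hypothetical recording server for later packaging into a playable file.
function recordStream(stream: MediaStream, uploadUrl: string): () => void {
  const recorder = new MediaRecorder(stream, { mimeType: "video/webm" });

  recorder.ondataavailable = (event: BlobEvent) => {
    if (event.data.size > 0) {
      // Ship each chunk to the recording server as it becomes available.
      fetch(uploadUrl, { method: "POST", body: event.data });
    }
  };

  recorder.start(5000); // emit a chunk every 5 seconds
  return () => recorder.stop();
}
```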
Another capability that in many cases requires server-side processing is conferencing.
The different options for video conferencing implementation
Once we go beyond 1 to 1 communication there is a need to mix the video and dynamically decide which participants are displayed and in what layout.
There are 3 common approaches to this. Here they are in brief, more on this in Tsahi Levent-Levi’s post.
Full Mesh
In this case each participant sends their audio and video to all other participants. This is a serverless mode of operation where each client builds the video layout for itself and decides which audio streams to play (or just plays them all).
When the number of participants is small it works out pretty well.
When the number of participants starts to increase (typically beyond 4), it becomes more of a full mess than a full mesh, and the upstream bandwidth required from each participant becomes a real issue.
The big advantage of this mode is that there is no server involved. Cost effective and private.
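A minimal full-mesh sketch, assuming a hypothetical Signaling interface for offer/answer and ICE exchange and a hypothetical attachToLayout() helper for rendering. It also shows why upstream bandwidth grows with the group size: the local media is encoded and sent once per remote peer.

```typescript
// Minimal full-mesh sketch: one RTCPeerConnection per remote participant.
// `Signaling` and `attachToLayout` are hypothetical placeholders.

interface Signaling {
  sendOffer(peerId: string, offer: RTCSessionDescriptionInit): void;
  sendCandidate(peerId: string, candidate: RTCIceCandidate): void;
}

async function joinFullMesh(
  localStream: MediaStream,
  remotePeerIds: string[],
  signaling: Signaling
): Promise<Map<string, RTCPeerConnection>> {
  const connections = new Map<string, RTCPeerConnection>();

  for (const peerId of remotePeerIds) {
    const pc = new RTCPeerConnection();
    // The local media is sent separately to every remote peer (N-1 uplinks).
    localStream.getTracks().forEach((t) => pc.addTrack(t, localStream));

    pc.onicecandidate = (e) => {
      if (e.candidate) signaling.sendCandidate(peerId, e.candidate);
    };
    pc.ontrack = (e) => {
      // Each client builds its own layout from the incoming streams.
      attachToLayout(peerId, e.streams[0]);
    };

    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    signaling.sendOffer(peerId, offer);

    connections.set(peerId, pc);
  }
  return connections;
}

// Hypothetical helper that places a remote stream in the page layout.
declare function attachToLayout(peerId: string, stream: MediaStream): void;
```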
MCU
At the other end there is the MCU topology. This is typically used in high quality enterprise video communication. Think board rooms with big screens.
In this topology each video and audio stream is terminated (decoded) by the MCU server and a dedicated layout is built for each participant.
Naturally this is not a cost-effective option (except for the equipment vendors) and does not scale for web environments or high-capacity services.
SFU
The third option is a combination of the 2 above. The Selective Forwarding Unit (SFU) combines the smart client's ability to build its own layout with smart forwarding decisions made by the server. The SFU receives audio and video streams from all participants and, based on its algorithms, performs audio mixing and decides which video streams to send to which participant. It can, for example, request 2 resolutions of video from each participant (high and low) and, based on each client's capabilities, forward the best-fitting resolution to each client.
While this option lacks some of the slick capabilities of an MCU it is powerful, scalable and cost effective.
SFU is what Google Hangouts uses for large video conferences.
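A minimal sketch of that two-resolution (simulcast) idea, assuming a browser that honors sendEncodings in addTransceiver(); the "hi"/"lo" rid labels and the bitrates are arbitrary values an SFU deployment might choose.

```typescript
// Minimal simulcast sketch: publish the local video track in two encodings
// so an SFU can forward whichever layer best fits each receiver.
function publishSimulcast(pc: RTCPeerConnection, videoTrack: MediaStreamTrack): void {
  pc.addTransceiver(videoTrack, {
    direction: "sendonly",
    sendEncodings: [
      { rid: "hi", maxBitrate: 1500000 },                        // full resolution
      { rid: "lo", maxBitrate: 200000, scaleResolutionDownBy: 4 } // quarter resolution
    ],
  });
}
```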
The impact of H.264 in Chrome on the need for transcoding
At the end of March Google published the release notes of the Chrome 50 beta channel. Among other features and fixes, it now supports H.264 (behind a flag). This was an expected addition to Chrome given the WebRTC Mandatory to Implement (MTI) video codec decision at the IETF that both VP8 and H.264 should be supported by browsers.
Does this put an end to the need for WebRTC video transcoding and to server side decode and encode?
As can be concluded from this post, server side video processing is required not only for codec compatibility but also for other needs.
Looking through the WebRTC prism, and given the growing usage of Chrome, it is safe to assume that the need for transcoding WebRTC video sessions is decreasing. Even as new codecs are introduced (VP9 and H.265), devices adding the latest and greatest codecs still support the previous ones (H.264 and VP8), so in cases where the newer codecs are not supported by all participants it will usually be possible to fall back to the H.264/VP8 option.
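As a rough illustration of that fallback, an application can inspect the SDP after negotiation to see which codecs both sides ended up with. This is a naive string check over the remote description, meant only to show that the common-subset fallback happens in the SDP negotiation itself, without any server help.

```typescript
// Minimal sketch: list which well-known video codecs appear in the negotiated
// remote SDP (naive string match, for illustration only).
function negotiatedVideoCodecs(pc: RTCPeerConnection): string[] {
  const sdp = pc.remoteDescription?.sdp ?? "";
  const names = ["VP9", "VP8", "H264", "H265"];
  return names.filter((codec) => sdp.includes(codec));
}
```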
Video transcoding is still required in some cases but the need is becoming less frequent.
Alan Percy says
Amir, while your post gives a good overview of the various video conferencing techniques, it misses one important issue – mobile bandwidth. Not all mobile devices (smartphones, tablets, etc.) will have continuous unlimited bandwidth or processing power. A mobile device on a congested network may require a lower resolution or frame rate from all the other participants – a scenario that the MCU architecture handles easily, delivering a stream with the codec/resolution/frame rate that best fits the real-world situation. Full mesh is unworkable for mobile devices. An SFU architecture would force the device into choosing between one of many poor compromises.
As long as the conference is going to be recorded, server-side processing is going to be required, why not give the best experience to the user?
Amir Zmora says
Alan,
The full mesh option is workable as long as the number of participants is small. As mentioned in the post, the main problem is bandwidth.
MCU looks like an easy solution but it has scale and cost issues. When you talk about high-capacity web scale, this option is out the door. For example, Google Hangouts gives you 15 participants. At Google's scale an MCU is not an option. An SFU is.
The solution for supporting low FPS and low resolution is to request 2 streams – high and low – from high resolution participants and forward the right stream based on each client's capabilities.
A video conferencing service architecture may combine all 3 options, or some of them, and use each based on the service logic and the type/number of participants.
Going back to the Google Hangouts example, they use a combination of full mesh and SFU.
Amir