2017-10-25

FOMS and Demuxed

On October 3rd and 4th I attended the FOMS workshop in San Francisco, then Demuxed on the 5th. There were a lot of discussions about video, mostly distribution and playback via web browsers. It was interesting as it’s a different take from my daily work on VLC. Vendors have developed very specific techniques targeted at their particular use cases, often to work around bogus past decisions or competing solutions.

As Matroska (used by WebM) was primarily designed for playback over network connections (which were slow at the time it was designed), it was interesting to see whether we can cover all these use cases in an optimal way. It is especially important to remain relevant as the AV1 codec is coming soon. It seems to be getting huge traction already and might end up being the main codec everyone uses in the years to come, especially for web videos. Even though it’s targeted at high quality, it seems people want to use it ASAP for very low bitrates. I suppose the quality gain for the same bitrate is even more significant there.

FOMS 2017

Two subjects particularly caught my attention in terms of challenges for the container.

Extremely low latency

It seems a lot of companies are looking at reducing the time between the moment something happens and the moment it’s displayed on your screen. In the age of Twitter it sucks to see a goal or other (e)sport event pop up in your feed before you actually get to see it. For games it also means streamers can interact in real time with what their viewers are seeing.

Due to the nature of video encoding you can hardly get lower than one frame of delay (17 ms at 60 fps) plus the transmission latency (10 ms if you have an incredible ping). But right now the target is more around a few seconds, or a single second. One of the issues here is how adaptive streaming is currently used. The server encodes a bunch of frames and then tells the user they are available (in various bitrates). That’s because the container needs to know all the frames it contains before it can actually be used. So they wrap about 1 s of video at a time, which sets the minimum latency.

Matroska and EBML have a mode called live streaming. It allows writing frames as they come in, without ever rewriting the beginning of the file to state how much data it contains or where the data actually is. So you can start reading the file even while it’s being written. Many years ago GStreamer was used to stream conferences that way (without even an actual file being written), and that’s how VLC 3.0 sends videos to the Chromecast. This is also how most Matroska/WebM muxers work: they write in “live streaming” mode by default, putting a special “unknown” value in the length field, and overwriting this value once the size is known. So a streamer can create files on the fly that people can start reading immediately, and once the file is done, write the proper values so that later readers get real sizes they can use to seek.
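To make this concrete, here is a minimal sketch in C of how such a muxer handles the Segment size, assuming an 8-byte size field is reserved; the file name and the encode_size_vint8 helper are illustrative, not an existing API. An 8-byte EBML VINT with all data bits set means “size unknown”, which is what lets readers consume the file while it grows.

```c
#include <stdio.h>
#include <stdint.h>

/* Matroska Segment element ID as it appears on the wire. */
static const uint8_t SEGMENT_ID[4] = { 0x18, 0x53, 0x80, 0x67 };

/* An 8-byte EBML VINT with all data bits set: the "unknown size" marker. */
static const uint8_t UNKNOWN_SIZE[8] =
    { 0x01, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

/* Encode a known size as an 8-byte VINT (56 usable data bits). */
static void encode_size_vint8(uint8_t out[8], uint64_t size)
{
    out[0] = 0x01;                        /* length descriptor: 8-byte VINT */
    for (int i = 0; i < 7; i++)
        out[1 + i] = (uint8_t)(size >> (8 * (6 - i)));
}

int main(void)
{
    FILE *f = fopen("stream.mkv", "wb");  /* illustrative output file */
    if (!f)
        return 1;

    /* ... the EBML header would be written here ... */

    fwrite(SEGMENT_ID, 1, sizeof SEGMENT_ID, f);
    long size_pos = ftell(f);             /* remember where the size lives */
    fwrite(UNKNOWN_SIZE, 1, sizeof UNKNOWN_SIZE, f);

    /* Stream Clusters as frames arrive; readers can consume the file
       right away because "unknown size" means "read until it stops". */
    /* ... write Clusters here ... */

    /* Once the stream ends, patch in the real size so later readers
       get a fully formed file they can seek in. */
    long end_pos = ftell(f);
    uint8_t vint[8];
    encode_size_vint8(vint, (uint64_t)(end_pos - size_pos - 8));
    fseek(f, size_pos, SEEK_SET);
    fwrite(vint, 1, sizeof vint, f);

    fclose(f);
    return 0;
}
```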

I hope the web people take a look at this, as it would allow going way below the 1 s latency target they currently have. It would also work for adaptive streaming, as you still get Clusters that can be cut in many parts on a CDN, as currently done for WebM. This solution is already compatible with most Matroska/WebM readers; it’s been in our basic test suite for at least 7 years.

CMAF

I learned of the existence of a new MP4 variant called CMAF (Common Media Application Format). It’s an ISOBMFF profile based on Fragmented MP4 (fMP4), developed by Microsoft and Apple. The goal was to use a similar format between DASH and HLS to reduce the cost of storage on CDNs and get better caching. In the end it might not be of much use, because the different vendors don’t support the same DRM systems, so at least two variants of the same content will still be needed.

This is an interesting challenge for Matroska, as with AV1 coming there will be a battle over which container to use to distribute videos. Native support is not the main adoption hurdle anymore, though. For example, Apple only supported HLS with MPEG-TS until iOS 10, so many JavaScript frameworks remux incoming fMP4 to TS on the fly and feed that to iOS.

Regular MP4 files were not designed for progressive download, nor for the fragmented playback needed for adaptive streaming: the index is required for playback, so it has to be loaded beforehand, and it is not necessarily at the front of the file. The overhead (the amount of data the container adds on top of the actual codec data) wasn’t great either. So far this was a key advantage for Matroska/WebM, as these were two of the main criteria when the format was designed 15 years ago. There were cases where MP4 could be smaller, unless Matroska used its header compression. The situation changes with fMP4 and CMAF: in fact the overhead is slightly lower than Matroska/WebM’s. And that’s pretty much the only advantage it has over Matroska.

On a 25 MB file at 44 kbps (where overhead really hurts), the difference between the fMP4 file and the equivalent Matroska file passed through mkclean is 77 KB, or 0.3%. That may seem like peanuts, especially at such a low bitrate, but I think Matroska should do better.

Looking at the fMP4 file, the frames are all packed in one blob and the boundaries between frames are stored in a separate blob (the ‘trun’ box). And that’s about it. It must only work with fixed frame rates and probably allows no frame drop. But that’s efficient for the use case of web video over CDNs, where the files were encoded and muxed for that special purpose. There’s hardly any overhead apart from the regular track header.
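For reference, this is a simplified view of the ‘trun’ layout from ISO/IEC 14496-12, written as illustrative C structs (the names are mine, not from any library). Each field is present only when the matching bit is set in the box flags; when per-sample durations are omitted, a default duration from the ‘tfhd’ box applies, which fits the fixed-frame-rate observation above.

```c
#include <stdint.h>

/* One entry per sample; each field exists only when the corresponding
   bit is set in the trun flags (noted on the right). */
struct trun_sample {
    uint32_t duration;            /* 0x000100: sample-duration-present   */
    uint32_t size;                /* 0x000200: sample-size-present       */
    uint32_t flags;               /* 0x000400: sample-flags-present      */
    int32_t  cts_offset;          /* 0x000800: composition time offset
                                     (signed in version 1 of the box)    */
};

/* Track Fragment Run box: one size (and optionally one duration) per
   frame, while the frame payloads sit back-to-back in 'mdat'. */
struct trun_box {
    uint8_t  version;
    uint32_t flags;               /* 24-bit flags field                  */
    uint32_t sample_count;
    int32_t  data_offset;         /* 0x000001: offset into 'mdat'        */
    uint32_t first_sample_flags;  /* 0x000004: e.g. marks a sync sample  */
    struct trun_sample *samples;  /* sample_count entries                */
};
```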

One way Matroska could be improved for such a case would be to allow frame lacing for video. Lacing is already used heavily for audio to reduce overhead, since audio doesn’t need a timestamp on each block: the sampling rate is enough (except when there are drops during recording, in which case lacing is not used). We could allow lacing video frames as long as the default duration of the track is set (the equivalent of a frame rate) and each laced frame has the same characteristics in the Matroska Block, especially the keyframe flag. So keyframes would stand alone, and many other video frames could be laced to reduce the overhead, the same way it’s done for audio. At such a small bitrate it could make a significant difference. At higher bitrates not really, but there the overhead difference between fMP4 and Matroska is probably small, if not in Matroska’s favour (thanks to header compression).
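For illustration, this is roughly how Xiph lacing already packs several frames into one SimpleBlock; the proposal would simply permit the same packing for video frames. The helper name and buffer handling are hypothetical, but the encoding itself (sizes stored as runs of 0xFF plus a final byte, the last size implicit) is what current demuxers already parse.

```c
#include <stdint.h>
#include <string.h>

/* SimpleBlock flags: bit 0x80 marks a keyframe, bits 0x06 select the
   lacing mode (0x02 = Xiph lacing). */
#define FLAG_KEYFRAME  0x80
#define FLAG_XIPH_LACE 0x02

/* Append a Xiph-laced payload for `count` frames into `out` and return
   the number of bytes written. The caller is assumed to have written
   the SimpleBlock header already (track number VINT, 16-bit relative
   timestamp, flags byte with lacing set). */
static size_t write_xiph_lacing(uint8_t *out,
                                const uint8_t *const *frames,
                                const size_t *sizes, unsigned count)
{
    size_t n = 0;
    out[n++] = (uint8_t)(count - 1);          /* laced frame count - 1 */

    /* Sizes of all frames but the last, each coded as runs of 255
       plus a final byte; the last size is implied by the Block size. */
    for (unsigned i = 0; i + 1 < count; i++) {
        size_t s = sizes[i];
        while (s >= 255) { out[n++] = 255; s -= 255; }
        out[n++] = (uint8_t)s;
    }

    /* The frame payloads simply follow back-to-back. */
    for (unsigned i = 0; i < count; i++) {
        memcpy(out + n, frames[i], sizes[i]);
        n += sizes[i];
    }
    return n;
}
```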

I will submit the proposal to the CELLAR working group of the IETF (Internet Engineering Task Force), a group currently working on properly specifying EBML and Matroska, but also FFV1 and FLAC. This is not a big change; it’s just something that we didn’t allow before. And because lacing is already used for audio in just about every Matroska/WebM file in existence, the parsing code already exists in current players and may work out of the box with laced video frames. It doesn’t add any new element.

The advantages of Matroska over MP4 remain the same for fMP4.

Demuxed 2017

TL;DR

Matroska has a lot to offer for web distribution: one-frame latency at scale, which is not possible with ISOBMFF formats; no need for new designs to cover current and future use cases; and it is the most open and free solution.