Mentioned in my previous two posts, Smooth Streaming  is Microsoft’s completion of HTTP-based adaptive streaming, which is a truly a hybrid media delivery method. It may act like streaming, but it is in fact based on HTTP progressive download. The HTTP downloads are performed in a series of small chunks, allowing the media to get easily and cheaply cached along the edge of the network, closer to the end users. The Expression Encoder provides multiple encoded bitrates of the same media source also allows Silverlight clients to seamlessly and dynamically switch between bitrates depending on network conditions and CPU power. The end users experience is one of reliable, consistent playback without stutter, buffering or “last mile” pipe-line clog.

A Microsoft media format and no ASF? That’s right. Why MP4 and not ASF? Well, here are a few reasons:
•    MP4 is a lighter container and uses less overhead.
•    MP4 are easier to “digest” by .NET.
•    MP4 is more widely supported.
•    MP4 was architected with H.264 video codec support, although ASF can be as well, it’s not as native.
•    MP4 was designed to natively support payload fragmentation

There are actually two parts to the Smooth Streaming format: the wire format and the disk file format. In Smooth Streaming, a video is recorded in full length to the disk as a single file (one file per encoded bit rate), but it’s transferred to the client as a series of small file chunks. The wire format defines the structure of the chunks that are sent by IIS to the client, whereas the file format defines the structure of the contiguous file on disk, enabling better file management. Fortunately, the MP4 specification allows MP4 to be internally organized as a series of fragments, which means that in Smooth Streaming the wire format is a direct subset of the file format.

So exactly what are these “fragments”? The basic unit of an MP4 file is called a “box.” These boxes can contain both data and metadata. The MP4 specification allows for various ways to organize data and metadata boxes within a file. In most media scenarios, it’s considered useful to have the metadata written before the data so that a player client application can have more information about the video/audio that it’s about to play before it plays it. However, in live streaming scenarios it’s often not possible to write the metadata up-front about the whole data stream because it’s simply not yet fully known.

The following figure is a high-level overview of what a Smooth Streaming file looks like on the inside:

Smooth Stream pic1

In a nutshell, the file starts with file-level metadata (‘moov‘) which generically describes the file, but the bulk of the payload is actually contained in the fragment boxes which also carry more accurate fragment-level metadata (‘moof‘) and media data (‘mdat‘). (The diagram only shows 2 fragments, but a typical Smooth Streaming file has a fragment per every 2 seconds of video/audio.) Closing the file is a ‘mfra‘ index box which allows easy and accurate seeking within the file.
When a Silverlight client requests a video time slice from the IIS Smooth Streaming server, the server simply seeks to the approriate starting fragment in the MP4 file and then lifts the fragment out of the file and sends it over the wire to the client. This is why we refer to the fragments as the “wire format.” This technique greatly enhances the efficiency of the IIS server because it requires no remuxing or rewriting overhead.
Here is what an MP4 fragment looks like in more detail:

File Fragment

We say that the Smooth Streaming format is based on the MP4 file format because even though we’re following the ISO specification, we specify our own box organization schema and some custom boxes. In order to differentiate Smooth Streaming files from “vanilla” MP4 files, we use new file extensions: *.ismv (video+audio) and *.isma  (audio only). I keep forgetting to ask the IIS Media team what the acronyms exactly stand for, but my best guess would be “IIS Smooth Streaming Media Video (Audio)”.

Smooth Streaming Media Assets

A typical Smooth Streaming media asset therefore consists of the following files:

  • MP4 files containing video/audio
    • *.ismv – contains video and audio, or only video
      • 1 ISMV file per encoded video bitrate
    • *.isma – contains only audio
      • In videos with audio, the audio track can be muxed into an ISMV file instead of a separate ISMA file
  • Server manifest file
    • *.ism
    • Describes the relationships between media tracks, bitrates and files on disk
    • Only used by the IIS Smooth Streaming server – not by client
  • Client manifest file
    • *.ismc
    • Describes to the client the available streams, codecs used, bitrates encoded, video resolutions, markers, captions, etc.
    • It’s the first file delivered to the client

Both manifest file formats are based on XML. The server manifest file format is based specifically on the SMIL 2.0 XML format specification.

A folder containing a single Smooth Streaming media asset might look something like this:

Smooth Streaming media assets

A folder containing a Smooth Streaming media assets

In this particular case the audio track is contained in the NBA_3000000.ismv file.

Smooth Streaming Manifest Files

The Smooth Streaming Wire/File Format specification defines the manifest XML language as well as the MP4 box structure. Because the manifests are based on XML they are highly extensible. Among the features already included in the current Smooth Streaming format specification is support for:

  • VC-1, WMA, H.264 and AAC codecs
  • Text streams
  • Multi-language audio tracks
  • Alternate video and audio tracks (i.e. multiple camera angles, director’s commentary, etc.)
  • Multiple hardware profiles (i.e. same bitrates targeted at different playback devices)
  • Script commands, markers/chapters, captions
  • Client manifest Gzip compression
  • URL obfuscation
  • Live encoding and streaming

Smooth Streaming Playback: Bringing It All Home

Microsoft’s adaptive streaming prototype (used for NBC Olympics 2008) relied on physically chopping up long video files into small file chunks. In order to retrieve the chunks for the web server, the player client simply needed to download files in a logical sequence: 00001.vid, 00002.vid, 00003.vid, etc.

As I’ve explained in this and previous posts, Smooth Streaming uses a more sophisticated file format and server design. The videos are no longer split up into thousands of file chunks, but are instead “virtually” split up into fragments (typically 1 fragment per video GOP) and stored within a single contiguous MP4 file. This implies two significant changes in server and client design too:

  1. The server must be able to translate URL requests into exact byte range offsets within the MP4 file, and
  2. The client can request chunks in a more developer-friendly manner, such as by timecode instead of by index number

The first thing a Silverlight client requests from the Smooth Streaming server is the *.ismc client manifest. The manifest tells it which codecs were used  to compress the content (so that the Silverlight runtime can initialize the correct decoder and build the playback pipeline), which bitrates and resolutions are available, and a list of all the available chunks and either their start times or durations.

With IIS7 Smooth Streaming, a client is expected to request fragments in the form of RESTful URLs:

http://video.foo.com/NBA.ism/QualityLevels(400000)/Fragments(video=610275114)

http://video.foo.com/NBA.ism/QualityLevels(64000)/Fragments(audio=631931065)

The values passed in the URL represent encoded bitrate (i.e. 400000) and the fragment start offset (i.e. 610275114) expressed in an agreed-upon time unit (usually 100 ns). These values are known from the client manifest.

Upon receiving a request like this, the IIS7 Smooth Streaming component looks up the quality level (bitrate) in the corresponding *.ism server manifest and maps it to a physical *.ismv or *.isma file on disk. It then goes and reads the appropriate MP4 file, and based on its ‘tfra’ index box figures out which fragment box (‘moof’ + ‘mdat’) corresponds to the requested start time offset. It then extracts the said fragment box and sends it over the wire to the client as a standalone file. This is a particularly important part of the overall design because the sent fragment/file can now be automatically cached further down the network, potentially saving the origin server from sending the same fragment/file again to another client requesting the same RESTful URL.

As you can see, requesting chunks of video/audio from the server is easy. But what about dynamic bitrate switching that makes adaptive streaming so effective? This part of the Smooth Streaming experience is implemented entirely in client-side Silverlight application code – the server plays no part in the bitrate switching process. The client-side code looks at chunk download times, buffer fullness, rendered frame rates, and other factors – and based on them decides when to request higher or lower bitrates from the server. Remember, if during the encoding process we ensure that all bitrates of the same source are perfectly frame aligned (same length GOPs, no dropped frames), then switching between bitrates is completely seamless – and Smooth.

We say that the Smooth Streaming format is based on the MP4 file format because even though we’re following the ISO specification, we specify our own box organization schema and some custom boxes. In order to differentiate Smooth Streaming files from “vanilla” MP4 files, we use new file extensions: *.ismv (video+audio) and *.isma (audio only). I keep forgetting to ask the IIS Media team what the acronyms exactly stand for, but my best guess would be “IIS Smooth Streaming Media Video (Audio)”.

Smooth Streaming Media Assets

A typical Smooth Streaming media asset therefore consists of the following files:

  • MP4 files containing video/audio
    • *.ismv – contains video and audio, or only video
      • 1 ISMV file per encoded video bitrate
    • *.isma – contains only audio
      • In videos with audio, the audio track can be muxed into an ISMV file instead of a separate ISMA file
  • Server manifest file
    • *.ism
    • Describes the relationships between media tracks, bitrates and files on disk
    • Only used by the IIS Smooth Streaming server – not by client
  • Client manifest file
    • *.ismc
    • Describes to the client the available streams, codecs used, bitrates encoded, video resolutions, markers, captions, etc.
    • It’s the first file delivered to the client

Both manifest file formats are based on XML. The server manifest file format is based specifically on the SMIL 2.0 XML format specification.

A folder containing a single Smooth Streaming media asset might look something like this:

VN:F [1.9.3_1094]
Rating: 0.0/10 (0 votes cast)
VN:F [1.9.3_1094]
Rating: 0 (from 0 votes)
  • Share/Bookmark